[07:40:04] I want to test-drive the 7.2.21 packages on mwdebug (independent of whether we also add the gc patch on top). Is the PHP7 flag from the X-Wikimedia-Debug browser extension still working? (IIRC there were some changes around the PHP 7 beta option being dropped)
[08:13:41] <_joe_> yes
[08:13:55] <_joe_> it still works and you can pick the right interpreter from it
[08:18:33] thx, gonna do some tests on them in a bit
[08:50:17] Hi, since we deployed termbox yesterday we've noticed that on a semi-regular basis its requests to the mediawiki api app servers seem to time out. I opened a ticket at https://phabricator.wikimedia.org/T230976. It looks to me like something happens semi-regularly, at around 22-28 min intervals. Is there anything like that on the k8s setup that springs to mind? Some kind of GC or something that happens around then?
[08:50:31] Or maybe even something that regularly happens on the app servers?
[09:00:15] <_joe_> nope
[09:00:35] <_joe_> and I won't have time to look into it right now
[09:33:24] _joe_: cool! understood :). No hurry; it's *probably* not even noticeable to the users
[10:11:36] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?
[10:12:00] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Joe) p: Triage→Normal a: Joe
[10:12:10] serviceops, Operations, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (Joe) p: Triage→Normal
[11:19:48] serviceops, Operations, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[11:27:31] serviceops, Operations, PHP 7.2 support, PHP 7.3 support: PHP 7.2 is very slow on an allocation-intensive benchmark - https://phabricator.wikimedia.org/T230861 (jijiki) We are on our way to finishing the migration to PHP7; my opinion is to try the PHP 7.2 bandaid rather than upgrading production to P...
[13:20:56] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) I'm looking into what it would take to monitor a celery worker pool on a specific machine....
[13:45:17] serviceops, Operations, Patch-For-Review, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (jijiki)
[13:47:51] serviceops, Scap, PHP 7.2 support, Patch-For-Review, and 3 others: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 (jijiki) @thcipriani When a deployer runs scap pull on mwdebug, they get `sudo: [/usr/local/sbin/check-and-restart-php,: command not fo...
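(Re the mwdebug/PHP7 testing discussed at 07:40 above, a minimal sketch of pinning a request to a debug backend: the backend= header attribute follows the documented X-Wikimedia-Debug syntax, while the PHP_ENGINE cookie for picking the PHP7 interpreter is an assumption based on how the beta opt-in worked.)

    # Pin a request to a specific mwdebug host (documented X-Wikimedia-Debug
    # syntax; mwdebug1002 is an arbitrary pick)
    curl -sv -o /dev/null \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      https://en.wikipedia.org/wiki/Special:BlankPage
    # Select the PHP7 interpreter on that host; the browser extension's flag is
    # assumed here to set the same opt-in cookie the PHP 7 beta used
    curl -sv -o /dev/null \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      -b 'PHP_ENGINE=php7' \
      https://en.wikipedia.org/wiki/Special:BlankPage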
[14:08:45] <_joe_> jijiki, apergos shall we skip the discussion meeting this week? I could use a break :P
[14:08:56] we're missing mark of course
[14:09:17] who else is not here? is akosiaris also out?
[14:09:23] <_joe_> yes
[14:09:24] <_joe_> and daniel
[14:09:31] <_joe_> it's the three of us
[14:09:42] seems fine by me, you have no burning topics I guess?
[14:09:50] <_joe_> yeah
[14:10:09] <_joe_> unless you want a 30 min whining session on all-hands dates :P
[14:10:17] I've been keeping up on the comments
[14:10:38] as someone who's never been involved with FOSDEM, my voice would have no weight here, but I do hope it gets sorted
[14:11:04] <_joe_> I think it's more about being heard than the actual conference :)
[14:12:05] I think it's both: being heard doesn't necessarily mean the problem gets resolved!
[14:12:30] <_joe_> indeed :D
[14:17:40] serviceops, Operations, Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Jdlrobson)
[14:18:16] <_joe_> tarrow: what was your task about termbox timeouts?
[14:20:16] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (Joe)
[14:20:36] <_joe_> found it, nevermind
[14:20:51] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (Joe) p: Triage→High
[14:23:59] serviceops, Operations, PHP 7.2 support: Mysterious, coordinated slowdowns every ~25 minutes on mw1347,mw1348 (php7 api servers) - https://phabricator.wikimedia.org/T231011 (jijiki)
[14:36:21] serviceops, Operations, Traffic, Patch-For-Review: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (MoritzMuehlenhoff) a: MoritzMuehlenhoff New VMs (failoid1001 and failoid2001) have been set up and are in active use now. I'll keep the old jessie VMs around f...
[14:50:31] serviceops, ORES, Operations, Scoring-platform-team: celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (Halfak) So, I've been trying to explore the behaviors of `celery -A ores_celery inspect ping` to see if...
[15:24:44] <_joe_> urandom: my idea is basically to go to kubestage and run perf on the process to register the syscalls
[15:25:18] _joe_: +1
[15:25:42] <_joe_> I'm a bit rusty on how to calculate the duration of each
[15:26:07] <_joe_> but the premise should be that we don't need to take the horrendous perf hit that strace gives you
[15:27:18] _joe_: I dunno if that hit would be an issue or not; the staging pod is sufficiently small that it's already quite constrained
[15:28:19] <_joe_> oh strace basically serializes your syscalls, and ptrace is kinda dangerous
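(A minimal sketch of the perf-based approach _joe_ outlines above, assuming perf trace's summary mode; <PID> is a placeholder and the trailing sleep bounds the run, as with perf record.)

    # Per-syscall counts plus min/avg/max latency -- answering "the duration of
    # each" -- without strace's overhead: strace ptrace-stops the target at every
    # syscall entry/exit, effectively serializing it, while perf uses tracepoints
    sudo perf trace -s -p <PID> -- sleep 30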
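(And a sketch of a per-host worker check along the lines Halfak describes at 14:50, assuming the default celery@<hostname> node naming and that `inspect ping` exits non-zero when no node replies.)

    # Ping only the worker pool on this machine; -d limits the ping to one node
    celery -A ores_celery inspect ping -d "celery@$(hostname)" --timeout 5 \
      || echo "CRITICAL: celery worker pool on $(hostname) not responding"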
[17:42:45] just noticed something in db-eqiad.php
[17:42:48] # This key must exist for the master switch script to work, which means comment and uncomment
[17:42:50] # the individual shards, but leave the 'readOnlyBySection' => [ ], alone.
[17:42:54] where does this script live?
it surely needs updates post-dbctl
[17:43:03] mmh I will file a task
[17:44:35] cdanis: that was the switchdc repo that became the basis for writing spicerack; it now lives as cookbooks in the cookbooks repo
[17:44:44] but
[17:44:58] it doesn't use readOnlyBySection anymore IIRC, and uses the global RO instead
[17:46:03] serviceops, Operations, conftool: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (CDanis)
[17:46:10] ahah
[17:46:20] >While I'm mildly surprised that the DC switch script cares about per-section readonly status -- I had assumed it would use the other mwconfig conftool objects WMFMasterDatacenter and the per-DC ReadOnly -- if the switch script does need to edit readOnlyBySection it needs to be updated to invoke dbctl.
[17:46:29] beaten by volans again
[17:46:41] lol, it's not a race :D
[17:46:48] side rant, we never update our documentation here
[17:47:01] true that, and guilty as charged
[17:47:49] serviceops, Operations, conftool: update master DC switch script for a post-dbctl world - https://phabricator.wikimedia.org/T231035 (CDanis) Open→Invalid `17:44:35 cdanis: that was the switchdc repo that became the basis for writing spicerack; it now lives as cookbooks in the cookbooks repo...
[17:50:08] also haha
[17:50:10] {"ReadOnly": {"val": "You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes."}, "tags": "scope=codfw"}
[17:50:36] I guess technically the migration to active-active is a form of maintenance 🤔
[17:54:07] the reason is a CLI param for the script
[17:54:30] so it could be overridden :D
[17:55:06] I have to confess I don't understand how the two different flavors of readonly messages interact
[17:55:10] (per-section vs per-DC)
[17:56:53] per-DC wins
[17:57:13] per-section is used when we do a master failover on a specific section
[17:58:24] for the first switchdc we did, people were not sure about the global RO, so we modified all the per-section ones :)
[17:58:34] because that was "tested" and we knew it would work
[17:58:39] haha
[17:58:47] are we going to do another one of those btw
[17:59:04] (do we ever do any other disaster preparedness exercises?)
[17:59:08] in theory yes, unless the a/a work catches up before that :D
[17:59:48] that's a much longer discussion ;) switchdc and disaster recovery are quite different in scope and execution
[18:32:02] <_joe_> cdanis: yeah that comment is clearly outdated
[18:32:14] <_joe_> we use the global RO from etcd during switchovers
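(For reference, a sketch of the two read-only flavors discussed above; the exact dbctl/confctl argument spellings are assumptions to check against the tools' docs, not authoritative.)

    # Per-section read-only, the modern replacement for editing readOnlyBySection
    # in db-eqiad.php by hand; used during a master failover on one section
    sudo dbctl --scope eqiad section s1 ro "Maintenance: s1 master switchover"
    sudo dbctl config commit -m "Set s1 read-only for master switchover"
    # ...and back to read-write afterwards
    sudo dbctl --scope eqiad section s1 rw
    sudo dbctl config commit -m "Set s1 back to read-write"
    # Per-DC read-only (which wins over per-section) lives in the mwconfig
    # conftool objects that the switchover cookbooks flip, e.g.:
    confctl --object-type mwconfig select 'name=ReadOnly,scope=codfw' get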