[05:13:55] <_joe_> hnowlan: sorry I was gone. And, well, LOL.
[05:30:53] <_joe_> so, the ratelimiter in restbase, instead of using poolcounter like everything else, had to reinvent the wheel yet another time and use a DHT based on the kademlia protocol
[05:33:33] _joe_: with source hosted on gwicke github account iirc :]
[05:38:51] anyway that was an interesting use case. There is a node module to be reusable by various services
[05:39:00] but yeah it is essentially an alternative to poolcounter
[05:42:36] <_joe_> hashar: interesting? I think it's the epitome of all the flaws of our SOA
[05:42:52] <_joe_> including the not-invented-by-me attitude that person had
[05:42:59] I disagree
[05:43:01] ;D
[05:43:32] anyway it is used by the service runner for limiting various things
[05:43:43] <_joe_> but basically it's a protocol misused for something we already had a solution for, more fragile, less resilient, less performant, and hard to adapt to a cloud-native environment
[05:43:51] <_joe_> it never worked
[05:43:55] <_joe_> also :)
[05:44:06] I guess they had shortcomings with Poolcounter and wanted something a little bit more resilient than the central poolcounter daemon
[05:44:11] <_joe_> hashar: like what, specifically? because trust me it doesn't work on kubernetes
[05:44:28] then really, I don't know the context and those folks are gone now
[05:44:33] <_joe_> it's decidedly less resilient than poolcounter
[05:44:41] <_joe_> I was in the discussions, so I have context
[05:44:45] ahhh
[05:44:47] cool :]
[05:44:52] then
[05:44:55] <_joe_> poolcounter isn't good because it's poolcounter :P
[05:45:04] <_joe_> that was basically the argument
[05:45:19] one can imagine adding a poolcounter node module and moving service runner to use that
[05:45:49] <_joe_> we had a discussion with Pchelolo on this, given the kademlia library doesn't work well on k8s
[05:46:02] <_joe_> (it needs a fixed list of IPs to connect to)
[05:46:24] <_joe_> and I think we're going to do rate-limiting one level above, in the TLS termination layer
[05:47:05] depends on their use cases for rate limiting
[05:47:40] but yeah doing it at the infra level rather than the app level lets SRE easily tune it as needed
[05:48:14] <_joe_> I think there are use-cases for both, but overall I haven't seen any outage due to poolcounter itself since I've been here
[05:48:19] assuming those rate limits do pass through the TLS layer, which might not be the case if one rate-limits stuff to the backend for example
[05:48:29] <_joe_> the only time we had one was when someone turned off both poolcounter instances
[05:48:39] hmm
[05:48:52] <_joe_> we're moving everything to talk via TLS, service-to-service
[05:49:21] is that what "envoy" is used for?
[05:49:48] or is that a feature from moving to k8s?
[05:50:10] <_joe_> yes that's what envoy is doing
[05:50:35] <_joe_> basically it's a layer we use to abstract a few things from the application
[05:51:04] <_joe_> we want encryption, persistent connections, rate-limiting, tracing, telemetry, and failure-management using circuit-breaking
[05:51:17] <_joe_> all things envoy does and does well at a sub-millisecond cost on each request
[05:53:07] ahh this way you have the same system all over the place and applications do not even need to care about using TLS
[05:59:05] kids awake time for breakfast!
[07:59:28] reimaging all new ganeti hosts that are not in production yet, because they are supposed to have RAID5 but really had RAID1 because of a bug in the partman recipe, which is now fixed.
[08:01:29] we want that unblocked because currently eqiad ganeti is out of space and some VM requests are stalled. though some are one-offs which can live in codfw just fine and we still have space there
[08:01:55] for example the experiment with DNS-over-HTTPS. i will create that for sukhe now
[08:02:56] actually.. before i can reimage I need to fix BIOS settings because remote IPMI isn't enabled
[08:04:09] ideally we could have dcops run the ipmi-config diff to check even before handing them over
[08:05:23] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1010.eqiad.wmnet ` The log can be found in `/var/lo...
[08:18:51] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1011.eqiad.wmnet ` The log can be found in `/var/lo...
[08:20:02] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1012.eqiad.wmnet ` The log can be found in `/var/lo...
[08:25:24] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1010.eqiad.wmnet'] ` and were **ALL** successful.
[08:28:30] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @RobH Remote IPMI was disabled on these hosts which popped up when i tried to run the reimage cookbook (to change software RAID level from 1 to 5) and it...
[08:30:17] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1013.eqiad.wmnet ` The log can be found in `/var/lo...
[08:34:29] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1011.eqiad.wmnet'] ` Of which those **FAILED**: ` ['ganeti1011.eqiad.wmnet'] `
[08:37:57] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1014.eqiad.wmnet ` The log can be found in `/var/lo...
[08:41:45] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1012.eqiad.wmnet'] ` and were **ALL** successful.
[08:43:53] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1015.eqiad.wmnet ` The log can be found in `/var/lo...
[08:46:20] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1016.eqiad.wmnet ` The log can be found in `/var/lo...
[08:51:23] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1013.eqiad.wmnet'] ` and were **ALL** successful.
[08:51:41] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1017.eqiad.wmnet ` The log can be found in `/var/lo...
[08:59:05] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1018.eqiad.wmnet ` The log can be found in `/var/lo...
[08:59:08] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1014.eqiad.wmnet'] ` and were **ALL** successful.
[09:04:26] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @RobH @cmjohnson I noticed by chance there are more ganeti machines beyond ganeti1018. ganeti1019-ganeti1022 are in netbox but i don't see a racking tick...
[09:06:17] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1015.eqiad.wmnet'] ` and were **ALL** successful.
[09:06:46] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1016.eqiad.wmnet'] ` and were **ALL** successful.
[09:12:08] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1017.eqiad.wmnet'] ` and were **ALL** successful.
[09:21:18] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1018.eqiad.wmnet'] ` and were **ALL** successful.
[09:29:55] serviceops, MediaWiki-General, Operations, Patch-For-Review, Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (Joe) Status update: we've deployed envoy on all mediawiki servers with the exception of: - jobrunners (where we still...
[09:32:17] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @akosiaris All of these hosts have RAID5 now: ` ===== NODE GROUP ===== (10) ganeti[1009-...
[09:33:08] serviceops, Operations, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) a:Dzahn→akosiaris Handing back over for the next "init" command steps you have mentioned are needed next.
[09:49:23] <_joe_> hnowlan: didn't we have a task about moving away from checking /rpc/RunJobs from the jobrunners? I thought we did that
[10:06:34] yeah there's a CR open for it
[10:07:36] _joe_: I can merge that now if it's still desired. The idea was to replace it with a standard check for mediawiki afair, but we can't have that until we harmonise the apache configs etc and we decided that that was too risky post-covid
[10:08:05] <_joe_> yeah and btw, I'm switching those servers to envoy right now.
[10:09:24] aha, cool. So I'm okay to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592631/ ?
[11:26:27] <_joe_> hnowlan: at your convenience
[11:34:59] serviceops, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 (hnowlan) Open→Resolved
[11:35:33] serviceops, WMF-JobQueue, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review: Jobrunner monitoring still calles /rpc/runJobs.php - https://phabricator.wikimedia.org/T243096 (hnowlan) Check has been removed - other monitoring will be added as part of T246389
[11:39:18] _joe_: if you're happy with the purged solution, I might go ahead and remove changeprop from scb?
[12:23:13] <_joe_> hnowlan: sure, I'm happy to review patches :)
[12:23:58] _joe_: cool! I've added you as reviewer on https://gerrit.wikimedia.org/r/#/c/597258/
[14:40:56] <_joe_> shdubsh: thanks for the patch, I'll look in a few minutes
[16:49:39] serviceops, Operations, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm) TLS enabled mathoid is currently deployed in staging and codfw k8s clusters but not in eqiad. CPU throttling has increased a lot (due...
[16:57:37] serviceops, Operations, ops-codfw: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul)
[23:55:03] serviceops, Arc-Lamp, Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (ori)