[08:49:06] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10jijiki) I will agree with the poolcounter solution :)
[09:09:41] _joe_, effie - I am building prometheus-memcached-exporter for buster, but reprepro doesn't like it, since an identical package (in sha terms) with the same version is already in stretch-wikimedia
[09:09:58] <_joe_> reprepro copy
[09:10:06] <_joe_> also how is that possible
[09:10:47] copy is not super good since it is golang and needs to be compiled
[09:11:01] <_joe_> actually golang is statically linked
[09:11:03] golang just creates a static ELF binary
[09:11:06] <_joe_> so it should work
[09:11:41] but it still makes sense to rebuild, e.g. to use new features from current golang; for that, simply change the version to +deb10u1 when you rebuild
[09:11:54] that part I know, but is it ok to build it on, say, sid and then use it on buster? ok == works fine, but I thought that using the build tools available for a distro is better
[09:12:01] but maybe I am missing something as usual
[09:12:15] does it build on buster or does it need sid?
[09:12:25] it builds for buster yes
[09:12:56] then I'd simply change the version to 0.4.1+git20181010.2fa99eb-1+deb10u1 and rebuild with DIST=buster on boron
[09:13:10] what I wanted to ask is if I can use the +deb10u1 trick or if service ops prefers to have separate branches etc. for the package
[09:18:06] <_joe_> when I do a simple rebuild of a package I already have for stretch, I add +buster0
[09:18:18] <_joe_> in a few months I'll switch to do +stretch0
[09:18:39] all right, so it is fine for me to do this with the package, good :)
[09:18:59] (I am testing memcached on buster in deployment-prep)
[09:23:17] it's better to follow the +debXuY scheme: +stretch0 sorts higher than +buster0, so in case of upgrades (which we typically avoid in favour of reimages, but they do happen from time to time) the old build would be kept around
[09:25:00] <_joe_> do we have a standard?
[09:25:03] <_joe_> maybe we should
[09:25:32] <_joe_> but I mean maybe we should have a package building pipeline that doesn't involve so much human intervention and ssh
[09:26:25] 10serviceops, 10Operations, 10Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (10Joe) 05Open→03Resolved All servers in production are upgraded.
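
For reference, a minimal sketch of the version-ordering point made above, checked with dpkg's own comparator. The 0.4.1+git20181010.2fa99eb-1 base version comes from the conversation; the suffixed variants, the distribution name and the changelog command are illustrative assumptions, and the actual build invocation on the build host may differ.

    # Codename suffixes compare alphabetically, so a +stretch0 rebuild outranks a +buster0 one:
    dpkg --compare-versions 0.4.1+git20181010.2fa99eb-1+stretch0 gt 0.4.1+git20181010.2fa99eb-1+buster0 \
        && echo "+stretch0 sorts higher than +buster0"

    # The +debXuY scheme sorts with the distro number, so the newer distro wins as intended:
    dpkg --compare-versions 0.4.1+git20181010.2fa99eb-1+deb10u1 gt 0.4.1+git20181010.2fa99eb-1+deb9u1 \
        && echo "+deb10u1 sorts higher than +deb9u1"

    # Bump the changelog before rebuilding (run inside the unpacked package source tree;
    # the buster-wikimedia distribution name is an assumption):
    dch -v 0.4.1+git20181010.2fa99eb-1+deb10u1 -D buster-wikimedia "No-change rebuild for buster"
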
[09:26:27] 10serviceops, 10Operations, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10Joe)
[10:37:09] * liw supports the idea of building and publishing .deb packages fully automatically, from every commit, with releases from signed tags
[10:38:47] * liw is willing to help with that, having done it before
[10:38:55] really interesting: https://github.com/facebook/mcrouter/wiki/Shadowing-setup
[10:39:50] 10serviceops, 10Operations, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Interesting reading: https://github.com/facebook/mcrouter/wiki/Shadowing-setup
[10:51:58] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Reminder: ` # TODO The IPv6 IP should be converted into a DNS AAAA resolve once we # enabled the DNS record on the director `
[11:25:02] 10serviceops, 10Operations, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (10Joe) 05Open→03Resolved All stretch+ servers in production have been updated to the newer version. Jessie hosts should go away soon.
[11:57:58] 10serviceops, 10Operations: Deploy wikidiff2 v1.9.0 - https://phabricator.wikimedia.org/T234175 (10jijiki)
[12:00:38] 10serviceops, 10Operations, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[13:29:53] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) Sorry I missed that, thanks for pinging me on T234900. >>! In T229209#5565968, @jcrespo wrote: > @akosiaris We have reached an impass. We...
[14:18:20] 10serviceops, 10Operations, 10Release-Engineering-Team, 10Scap, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10thcipriani) >>! In T235338#5569953, @Reedy wrote: > Current implementation: > > `lang=html > Currently active MediaWiki...
[14:53:14] 10serviceops, 10Operations, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn)
[15:01:50] 10serviceops, 10Operations, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[15:02:06] 10serviceops, 10Operations, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn)
[15:09:43] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) The `kad` library that the DHT rate limiter is based on was forked. Since it worked OK, the...
[15:26:28] 10serviceops, 10Operations, 10Release-Engineering-Team, 10Scap, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10Krinkle) I thought maybe it was user-permission or working-directory related. But, looks like not.. As www-data and from a d...
[15:37:44] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10LGoto)
[15:43:20] <_joe_> subbu: I think our plan was to first get everything working on those two hosts, with your confirmation, then move to everything
[15:43:47] _joe_, sounds good.
[15:44:04] <_joe_> subbu: you can already run your tests against wtp1025 btw
[15:44:30] will do. that logstash patch would be good to get in so the logs go to a different channel and don't clog the production mediawiki logs.
[15:45:52] <_joe_> oh ok
[15:45:57] <_joe_> I will take a look in a few
[15:46:06] subbu: the 2 servers are now pooled in conftool and i just ran scap pull
[15:46:20] <_joe_> btw we removed hhvm so you can start working on merging your code into mediawiki
[15:46:20] now merging the change to add them to mediawiki-installation to get scap deploys
[15:46:35] ok. ty. i'll run some perf tests there later today / tomorrow.
[15:46:44] <_joe_> great
[15:46:48] cool
[15:47:03] <_joe_> ask mutante to depool them before you run tests
[15:47:22] will do.
[15:47:24] <_joe_> so that they're not affected by actual traffic, and vice-versa
[15:47:39] <_joe_> oh and run the tests in eqiad on parsoid-php, mediawiki is still active/passive
[15:48:36] oh i see .... so all the reparse traffic will have to run on the eqiad cluster then, unlike parsoid/js where it runs on codfw.
[15:49:20] correct?
[15:50:26] which is fine since the eqiad cluster has < 1% cpu usage, but confirming that understanding.
[15:51:40] <_joe_> yes correct
[15:52:59] in that case .. so, parsoid/js reparse will run in codfw and parsoid/php reparse will run in eqiad ... and live traffic for both will run in eqiad (once we start directing live traffic to parsoid/php in a few weeks).
[15:53:53] that makes it simpler wrt loads then .. since my original understanding was that reparse traffic from both parsoid versions would have to share the same cluster (codfw).
[15:54:13] oops, deployment in 6 minutes.
running puppet on scap proxies to get parsoid-php added in time
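
As an aside on the depool-before-testing step _joe_ mentioned at 15:47, a hedged sketch of what that could look like with conftool and scap; wtp1025 is the example host from the conversation, but the selector syntax and the exact commands run by mutante are assumptions, not a record of what actually happened.

    # Take the test host out of rotation so perf tests and live traffic don't mix
    # (assumed conftool selector syntax):
    sudo confctl select 'name=wtp1025.eqiad.wmnet' set/pooled=no

    # On the host itself, make sure the MediaWiki code is current before testing:
    scap pull

    # ...run the parsoid-php perf tests against the host...

    # Repool it once testing is done:
    sudo confctl select 'name=wtp1025.eqiad.wmnet' set/pooled=yes
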
[15:56:57] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Epic, and 2 others: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:57:42] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Epic, 10Product-Infrastructure-Team-Backlog (Kanban): Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:58:08] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Epic: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:59:02] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Epic: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[16:44:27] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) I have discussed with alex a plan, there is a preliminary, but timid suggestion of steps on the design (more like diary) document. For now I...
[17:19:26] <_joe_> subbu: your logstash patch is live btw
[17:19:30] <_joe_> you should test it
[17:19:51] <_joe_> I am going afk now, will be back later
[17:19:52] ok .. probably later this aft or tomorrow. ty.
[17:20:02] will work with mutante for that.
[18:37:11] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10mobrovac) So here are some options that we could consider. === Kademlia / DHT As stated above (and i...
[20:16:32] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans)
[20:17:11] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 2 others: Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (10Eevans)
[21:17:50] hrmm: Warning FailedScheduling 14s (x6 over 4m34s) default-scheduler 0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
[21:17:55] that looks bad
[21:18:43] attempting a new deployment (codfw), echostore, and everything is stuck in Pending status
[21:22:19] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 2 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) I'm unable to deploy to codfw; I'm seeing the following: ` $ kubectl get events LAST SEEN TYPE REASON KIND...
[21:24:24] <_joe_> urandom: I think you chose the wrong node selector
[21:24:30] <_joe_> you shouldn't have one actually
[21:24:37] oh.
[21:24:39] ok
[21:24:52] <_joe_> it's way too late for me to check what you did or fix it unless it's breaking production
[21:25:24] well...
there is no production, yet
[21:25:42] so if it's not hurting to leave it in this state, I'm OK
[21:26:10] <_joe_> no I mean
[21:26:22] <_joe_> if it's creating issues for other k8s applications
[21:26:43] <_joe_> what is happening is you tried to deploy on the same nodes used by sessionstore
[21:26:48] <_joe_> if I had to guess
[21:27:03] <_joe_> and kubernetes cannot allocate all the resources you're requesting
[21:29:43] <_joe_> now I think you will have to run helmfile delete to remove the attempted deployment
[21:29:54] <_joe_> sessionstore is still not used in production, correct?
[21:30:14] no, it's not
[21:30:32] I gather then it's the nodeAffinity section that is the problem, I shouldn't have that?
[21:30:47] I can delete though
[21:31:25] {{done}}, actually
[21:31:35] <_joe_> yes
[21:31:54] <_joe_> now you tried to deploy in codfw right?
[21:32:22] I did, and I just did a delete there
[21:33:08] was it causing problems?
[21:33:48] <_joe_> no but deleting the release was probably the easiest way to move forward
[21:34:14] <_joe_> if you remove the node affinity you should be able to schedule echostore
[21:34:20] k
[21:34:32] I'll try that, if it doesn't work I'll delete and bag it until tomorrow
[21:34:36] <_joe_> ofc it still won't be reachable via the LVS IP
[21:34:59] k
[21:34:59] <_joe_> but you can curl the kubernetes nodes on port 8082
[21:35:05] right
[22:00:06] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) From a conversation w/ @Joe on IRC, it seems the `nodeAffinity` section (copypasta from the sessionstore deployment) was likely c...
[23:02:34] 10serviceops, 10Arc-Lamp, 10Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (10Krinkle) p:05Triage→03High
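
To close the loop on the echostore scheduling conversation above (21:17-21:35), a rough sketch of the debugging and workaround steps that were discussed. Only the NodePort 8082 and the helmfile delete come from the conversation; the pod label, node name, and health path are assumptions.

    # Ask the scheduler why the pods are stuck in Pending; the events name the culprit
    # (Insufficient cpu / node selector mismatch):
    kubectl get events --sort-by=.lastTimestamp
    kubectl describe pods -l app=kask        # assumption: echostore pods carry an app=kask label

    # If the copy-pasted nodeAffinity stanza is the culprit, drop it from the chart values,
    # remove the stuck release, and redeploy:
    helmfile delete                          # "destroy" in newer helmfile versions

    # Until the LVS IP exists, test against a kubernetes node directly on the NodePort:
    curl -i http://kubernetes2001.codfw.wmnet:8082/healthz   # hypothetical node name and path
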