[08:49:06] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10jijiki) I will agree with the poolcounter solution :)
[09:09:41] _joe_, effie - I am building prometheus-memcached-exporter for buster, but reprepro doesn't like it, since an identical package (in sha terms) with the same version is already in stretch-wikimedia
[09:09:58] <_joe_> reprepro copy
[09:10:06] <_joe_> also how is that possible
[09:10:47] copy is not super good since it is golang and needs to be compiled
[09:11:01] <_joe_> actually golang is statically linked
[09:11:03] golang just creates a static ELF binary
[09:11:06] <_joe_> so it should work
[09:11:41] but it still makes sense to rebuild, e.g. to use new features from current golang; for that, simply change the version to +deb10u1 when you rebuild
[09:11:54] that part I know, but is it ok to build it on, say, sid and then use it on buster? ok == works fine, but I thought that using the build tools available for a distro is better
[09:12:01] but maybe I am missing something as usual
[09:12:15] does it build on buster or does it need sid?
[09:12:25] it builds for buster yes
[09:12:56] then I'd simply change the version to 0.4.1+git20181010.2fa99eb-1+deb10u1 and rebuild with DIST=buster on boron
[09:13:10] what I wanted to ask is if I can use the +deb10u1 trick or if service ops prefers to have separate branches etc. for the package
[09:18:06] <_joe_> when I do a simple rebuild of a package I already have for stretch, I add +buster0
[09:18:18] <_joe_> in a few months I'll switch to do +stretch0
[09:18:39] all right, so it is fine for me to do this with the package, good :)
[09:18:59] (I am testing memcached on buster in deployment-prep)
[09:23:17] it's better to follow the +debXuY scheme: +stretch0 sorts higher than +buster0, so in case of upgrades (which we typically avoid in favour of reimages, but they do happen from time to time) the old build would be kept around
[09:25:00] <_joe_> do we have a standard?
[09:25:03] <_joe_> maybe we should
[09:25:32] <_joe_> but I mean maybe we should have a package building pipeline that doesn't involve so much human intervention and ssh
[09:26:25] 10serviceops, 10Operations, 10Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (10Joe) 05Open→03Resolved All servers in production are upgraded.
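
For reference, a minimal sketch of the version-ordering point made above, checked with dpkg's own comparator. The 0.4.1+git20181010.2fa99eb-1 base version comes from the conversation; the suffixed variants, the distribution name and the changelog command are illustrative assumptions, and the actual build invocation on the build host may differ.

    # Codename suffixes compare alphabetically, so a +stretch0 rebuild outranks a +buster0 one:
    dpkg --compare-versions 0.4.1+git20181010.2fa99eb-1+stretch0 gt 0.4.1+git20181010.2fa99eb-1+buster0 \
        && echo "+stretch0 sorts higher than +buster0"

    # The +debXuY scheme sorts with the distro number, so the newer distro wins as intended:
    dpkg --compare-versions 0.4.1+git20181010.2fa99eb-1+deb10u1 gt 0.4.1+git20181010.2fa99eb-1+deb9u1 \
        && echo "+deb10u1 sorts higher than +deb9u1"

    # Bump the changelog before rebuilding (run inside the unpacked package source tree;
    # the buster-wikimedia distribution name is an assumption):
    dch -v 0.4.1+git20181010.2fa99eb-1+deb10u1 -D buster-wikimedia "No-change rebuild for buster"
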
[09:26:27] 10serviceops, 10Operations, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10Joe)
[10:37:09] * liw supports the idea of building and publishing .deb packages fully automatically, from every commit, with releases from signed tags
[10:38:47] * liw is willing to help with that, having done it before
[10:38:55] really interesting: https://github.com/facebook/mcrouter/wiki/Shadowing-setup
[10:39:50] 10serviceops, 10Operations, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Interesting reading: https://github.com/facebook/mcrouter/wiki/Shadowing-setup
[10:51:58] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Reminder: ` # TODO The IPv6 IP should be converted into a DNS AAAA resolve once we # enabled the DNS record on the director `
[11:25:02] 10serviceops, 10Operations, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (10Joe) 05Open→03Resolved All stretch+ servers in production have been updated to the newer version. Jessie hosts should go away soon.
[11:57:58] 10serviceops, 10Operations: Deploy wikidiff2 v1.9.0 - https://phabricator.wikimedia.org/T234175 (10jijiki)
[12:00:38] 10serviceops, 10Operations, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[13:29:53] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) Sorry I missed that, thanks for pinging me on T234900. >>! In T229209#5565968, @jcrespo wrote: > @akosiaris We have reached an impass. We...
[14:18:20] 10serviceops, 10Operations, 10Release-Engineering-Team, 10Scap, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10thcipriani) >>! In T235338#5569953, @Reedy wrote: > Current implementation: > > `lang=html > Currently active MediaWiki...
[14:53:14] 10serviceops, 10Operations, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn)
[15:01:50] 10serviceops, 10Operations, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[15:02:06] 10serviceops, 10Operations, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn)
[15:09:43] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) The `kad` library that the DHT rate limiter is based on was forked. Since it worked OK, the...
[15:26:28] 10serviceops, 10Operations, 10Release-Engineering-Team, 10Scap, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10Krinkle) I thought maybe it was user-permission or working-directory related. But, looks like not.. As www-data and from a d...
[15:37:44] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Patch-For-Review, 10Product-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (10LGoto)
[15:43:20] <_joe_> subbu: I think our plan was to first get everything working on those two hosts, with your confirmation, then move to everything
[15:43:47] _joe_, sounds good.
[15:44:04] <_joe_> subbu: you can already run your tests against wtp1025 btw
[15:44:30] will do. that logstash patch would be good to get in so the logs go to a different channel and don't clog the production mediawiki logs.
[15:45:52] <_joe_> oh ok
[15:45:57] <_joe_> I will take a look in a few
[15:46:06] subbu: the 2 servers are now pooled in conftool and i just ran scap pull
[15:46:20] <_joe_> btw we removed hhvm so you can start working on merging your code into mediawiki
[15:46:20] now merging the change to add them to mediawiki-installation to get scap deploys
[15:46:35] ok. ty. i'll run some perf tests there later today / tomorrow.
[15:46:44] <_joe_> great
[15:46:48] cool
[15:47:03] <_joe_> ask mutante to depool them before you run tests
[15:47:22] will do.
[15:47:24] <_joe_> so that they're not affected by actual traffic, and vice-versa
[15:47:39] <_joe_> oh and run the tests in eqiad on parsoid-php, mediawiki is still active/passive
[15:48:36] oh i see .... so all the reparse traffic will have to run on the eqiad cluster then, unlike parsoid/js where it runs on codfw.
[15:49:20] correct?
[15:50:26] which is fine since the eqiad cluster has < 1% cpu usage, but confirming that understanding.
[15:51:40] <_joe_> yes correct
[15:52:59] in that case .. so, parsoid/js reparse will run in codfw and parsoid/php reparse will run in eqiad ... and live traffic for both will run in eqiad (once we start directing live traffic to parsoid/php in a few weeks).
[15:53:53] that makes it simpler wrt loads then .. since my original understanding was that reparse traffic from both parsoid versions would have to share the same cluster (codfw).
[15:54:13] oops, deployment in 6 minutes.
running puppet on scap proxies to get parsoid-php added in time
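
As an aside on the depool-before-testing step _joe_ mentioned at 15:47, a hedged sketch of what that could look like with conftool and scap; wtp1025 is the example host from the conversation, but the selector syntax and the exact commands run by mutante are assumptions, not a record of what actually happened.

    # Take the test host out of rotation so perf tests and live traffic don't mix
    # (assumed conftool selector syntax):
    sudo confctl select 'name=wtp1025.eqiad.wmnet' set/pooled=no

    # On the host itself, make sure the MediaWiki code is current before testing:
    scap pull

    # ...run the parsoid-php perf tests against the host...

    # Repool it once testing is done:
    sudo confctl select 'name=wtp1025.eqiad.wmnet' set/pooled=yes
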
[15:56:57] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Epic, and 2 others: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:57:42] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Epic, 10Product-Infrastructure-Team-Backlog (Kanban): Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:58:08] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Epic: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[15:59:02] 10serviceops, 10Mobile-Content-Service, 10Page Content Service, 10Product-Infrastructure-Team-Backlog, 10Epic: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 (10Mholloway)
[16:44:27] 10serviceops, 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) I have discussed with alex a plan, there is a preliminary, but timid suggestion of steps on the design (more like diary) document. For now I...
[17:19:26] <_joe_> subbu: your logstash patch is live btw
[17:19:30] <_joe_> you should test it
[17:19:51] <_joe_> I am going afk now, will be back later
[17:19:52] ok .. probably later this aft or tomorrow. ty.
[17:20:02] will work with mutante for that.
[18:37:11] 10serviceops, 10Operations, 10service-runner, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10mobrovac) So here are some options that we could consider. === Kademlia / DHT As stated above (and i...
[20:16:32] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans)
[20:17:11] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 2 others: Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (10Eevans)
[21:17:50] hrmm: Warning FailedScheduling 14s (x6 over 4m34s) default-scheduler 0/6 nodes are available: 2 Insufficient cpu, 4 node(s) didn't match node selector.
[21:17:55] that looks bad
[21:18:43] attempting a new deployment (codfw), echostore, and everything is stuck in Pending status
[21:22:19] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 2 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) I'm unable to deploy to codfw; I'm seeing the following: ` $ kubectl get events LAST SEEN TYPE REASON KIND...
[21:24:24] <_joe_> urandom: I think you chose the wrong node selector
[21:24:30] <_joe_> you shouldn't have one actually
[21:24:37] oh.
[21:24:39] ok
[21:24:52] <_joe_> it's way too late for me to check what you did or fix it unless it's breaking production
[21:25:24] well...
there is no production, yet
[21:25:42] so if it's not hurting to leave it in this state, I'm OK
[21:26:10] <_joe_> no I mean
[21:26:22] <_joe_> if it's creating issues for other k8s applications
[21:26:43] <_joe_> what is happening is you tried to deploy on the same nodes used by sessionstore
[21:26:48] <_joe_> if I had to guess
[21:27:03] <_joe_> and kubernetes cannot allocate all the resources you're requesting
[21:29:43] <_joe_> now I think you will have to run helmfile delete to remove the attempted deployment
[21:29:54] <_joe_> sessionstore is still not used in production, correct?
[21:30:14] no, it's not
[21:30:32] I gather then it's the nodeAffinity section that is the problem, I shouldn't have that?
[21:30:47] I can delete though
[21:31:25] {{done}}, actually
[21:31:35] <_joe_> yes
[21:31:54] <_joe_> now you tried to deploy in codfw right?
[21:32:22] I did, and I just did a delete there
[21:33:08] was it causing problems?
[21:33:48] <_joe_> no but deleting the release was probably the easiest way to move forward
[21:34:14] <_joe_> if you remove the node affinity you should be able to schedule echostore
[21:34:20] k
[21:34:32] I'll try that, if it doesn't work I'll delete and bag it until tomorrow
[21:34:36] <_joe_> ofc it still won't be reachable via the LVS IP
[21:34:59] k
[21:34:59] <_joe_> but you can curl the kubernetes nodes on port 8082
[21:35:05] right
[22:00:06] 10serviceops, 10Growth-Team, 10Notifications, 10Operations, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) From a conversation w/ @Joe on IRC, it seems the `nodeAffinity` section (copypasta from the sessionstore deployment) was likely c...
[23:02:34] 10serviceops, 10Arc-Lamp, 10Performance-Team: Resolve arclamp disk exhaustion problem (Oct 2019) - https://phabricator.wikimedia.org/T235455 (10Krinkle) p:05Triage→03High
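
To close the loop on the echostore scheduling conversation above (21:17-21:35), a rough sketch of the debugging and workaround steps that were discussed. Only the NodePort 8082 and the helmfile delete come from the conversation; the pod label, node name, and health path are assumptions.

    # Ask the scheduler why the pods are stuck in Pending; the events name the culprit
    # (Insufficient cpu / node selector mismatch):
    kubectl get events --sort-by=.lastTimestamp
    kubectl describe pods -l app=kask        # assumption: echostore pods carry an app=kask label

    # If the copy-pasted nodeAffinity stanza is the culprit, drop it from the chart values,
    # remove the stuck release, and redeploy:
    helmfile delete                          # "destroy" in newer helmfile versions

    # Until the LVS IP exists, test against a kubernetes node directly on the NodePort:
    curl -i http://kubernetes2001.codfw.wmnet:8082/healthz   # hypothetical node name and path
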