[08:01:03] <_joe_> fsero: I'm changing the values of all the keys we export with mtail, so wait before you edit the graphs more
[08:01:33] Ack
[08:01:52] <_joe_> I changed the prefix to mediawiki_http, which is more fitting
[09:28:57] _joe_: fixed variables in grafana
[09:29:02] it was querying old names
[09:29:10] and im going to add some charts, feel free to burn them all
[09:29:26] <_joe_> fsero: ugh wait
[09:29:37] -me waiting
[09:29:39] <_joe_> I think I forgot to save the dashboard
[09:30:02] save it i can remake what i just did
[09:30:27] <_joe_> yeah I did
[09:30:32] <_joe_> yeah wait a sec pls
[09:31:46] <_joe_> ok I think you have to fix something for the instance variable
[09:34:28] lemme fix it
[09:34:44] <_joe_> oh yeah they need all to be fixed
[09:36:30] done
[09:38:05] <_joe_> it's interesting to note how on the API cluster hhvm has a better p95 and p75 than php-fpm, while on the appserver cluster it's clearly the other way around
[09:38:12] <_joe_> php-fpm is always marginally faster
[09:46:14] i like the dashboard a lot now :)
[09:46:17] _joe_: curious https://grafana.wikimedia.org/d/RIA1lzDZk/xxx-joe-appserver?orgId=1&panelId=22&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=mw1221%3A3903&var-method=GET&var-code=200&from=1564562765879&to=1564566365879
[09:47:31] <_joe_> that 304 are the highest ranked doesn't surprise me
[09:48:39] <_joe_> yeah it's getting to where I wanted it to be
[09:48:47] <_joe_> we shall have such a dashboard for all services
[09:49:05] <_joe_> we do for most of the ones on k8s
[16:13:39] <_joe_> mutante: I plan to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/525584 tomorrow
[16:13:46] <_joe_> and then test it on mw1270
[16:13:57] <_joe_> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526720
[16:40:28] _joe_: ok, sounds good! i was already contemplating whether i should. also i amended to https://gerrit.wikimedia.org/r/c/operations/puppet/+/526289 and it compiles on scandium now after some fixes
[16:42:13] and yea. it does include the common role in the role. though i still think it was discouraged (in the past)
[16:44:46] made https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups to replace old docs talking about "DSH group files" when they were still actual files
[16:44:50] after https://phabricator.wikimedia.org/T227547
[17:12:45] <_joe_> mutante: well written, I think it's very clear
[17:14:45] :)
[18:22:42] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway)
[18:30:13] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) These worker deaths are being caused by the additional load generated by pregenerating respons...
[18:37:23] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Pchelolo) > Aside from that, I would have expected that the pregeneration jobs would be finishing up by n...
[18:47:30] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) So one issue right off the bat is that we're generating `/feed/onthisday/all` in the service i...
[18:49:03] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) p:Normal→High
[18:51:12] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Pchelolo) >>! In T229286#5381330, @Mholloway wrote: > So one issue right off the bat is that we're genera...
[18:54:18] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) Weird. Why are we still getting internal requests for `/feed/onthisday/all`, then?
[19:01:20] serviceops, Mobile-Content-Service, Page Content Service, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Pchelolo) >>! In T229286#5381396, @Mholloway wrote: > Weird. Why are we still getting internal requests...
[22:16:17] there is a "Google Kubernetes Engine Plugin 0.6.3" plugin for Jenkins. "allows you to publish deployments built within Jenkins to your Kubernetes clusters running within GKE". maybe that can be configured to your own cluster