[09:13:13] so... back to metrics...
[09:13:30] I think I like best the one we discussed about defining SLOs for each service on the pipeline
[09:14:05] "Percentage of services in the Deployment Pipeline with SLOs defined together with the service owner"
[09:14:09] something like that?
[09:20:56] <_joe_> I would s/Deployment Pipeline//, but I understand that makes it a tad too generic
[09:21:04] <_joe_> so yeah, in absence of anything better
[09:21:21] <_joe_> we also need to provide an SLO for kubernetes
[09:21:45] <_joe_> people can't expect their service to be available more than - say - we guarantee kube-proxy to be up
[09:22:01] <_joe_> overall I think it works
[09:23:53] well we can do that but don't need to add it to the annual plan metrics
[09:24:31] and indeed I think that "services on the deployment pipeline" at least makes it well defined which services are in scope and which are not
[09:24:34] which is much much harder outside of it
[09:24:46] <_joe_> yes
[09:24:48] what do we think would be a reasonable target for a) end of next fiscal year, b) in 3-5 years?
[09:24:53] i'd say b) is close to 100% :P
[09:24:57] <_joe_> b is 100%
[09:25:01] <_joe_> well
[09:25:10] <_joe_> mediawiki is tricky
[09:25:19] <_joe_> a - uhm
[09:25:27] <_joe_> assuming new services will have one
[09:25:40] <_joe_> and that we will move most old one
[09:25:42] yeah we should probably make that a requirement ;)
[09:25:42] <_joe_> *ones
[09:25:47] <_joe_> it is
[09:25:52] <_joe_> approved rfc :P
[09:25:55] hah
[09:25:57] good
[09:26:04] <_joe_> and we require it
[09:26:40] <_joe_> so I'd wait for akosiaris and fsero for their opinion, but i guess 40% is reasonable?
[09:26:54] well
[09:27:00] how many on the pipeline have it today?
[09:27:08] <_joe_> 1
[09:27:15] and how many services on the pipeline?
[09:27:16] <_joe_> and 2 more wil come soon
[09:27:25] <_joe_> good q
[09:27:42] 40% seems rather low if we're requiring it for anything new going on there anyway
[09:27:45] <_joe_> I think 5 but lemme check
[09:28:04] <_joe_> but we will be moving all the old services there this year
[09:28:11] <_joe_> so if we count them all
[09:28:20] <_joe_> lemme do the count
[09:30:32] <_joe_> I count 11 + the migrated ones
[09:30:56] <_joe_> so we have 1 out of 16 rn
[09:33:30] moving all the old services was the plan for this fiscal year (ending soon) right hehe
[09:33:36] but yeah we're a bit behind on that one
[09:43:37] serviceops, Operations: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (jijiki) @Dzahn THANK YOU! 😍
[09:45:08] serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) We have upgraded php7 on beta, so now it looks like async jobs are running. We will leave it as is until n...
[09:46:47] serviceops, MediaWiki-History-and-Diffs, MediaWiki-Parser, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) - https://phabricator.wikimedia.org/T216664 (Joe)
[09:49:30] * akosiaris reading backlog
[09:50:18] b) is definitely not 100%
[09:50:38] never have any metric/SLO at 100%. It's simply unattenable
[09:51:36] you could go around and redefine stuff to exclude stuff and get 100%, but that's an painful exercise
[09:52:46] 40% sounds rather ok to me btw. given the current status quo
[09:53:03] but I have a question. Weren't we saying yesterday that pipeline stuff is not core work?
[10:00:12] akosiaris: well, correct
[10:00:23] but where does the pipeline program end and where does core work begin
[10:00:25] it's a bit vague
[10:00:43] can we at least do 50%?
[10:00:52] 40% feels really underwhelming :P
[10:00:58] sure, I don't see why not
[10:01:07] i had 60% in my mind
[10:01:15] but i don't know how attainable it all is, i'll rely on you for it
[10:01:42] i'll put it in for now, we can still change it
[10:02:09] about 8% more difficult than 50%, so.. 1 more service?
[10:02:32] heh
[10:02:45] see the doc for what I put in
[10:02:52] YIPPI!!!! I got kask running WITH cassandra in minikube!!!!! All in one go with just a helm install (and some patience)
[10:02:58] urandom: ^
[10:03:17] * akosiaris dancing
[10:03:38] I am gonna call it a success and proceed on termbox
[10:07:21] "Percentage of services in the Deployment Pipeline having SLOs defined and agreed upon together with their service owner"
[10:07:32] i hope the "agreed upon with" was implied
[10:07:35] but maybe we should specify it? :)
[10:08:00] define what reaching an agreement is?
[10:08:29] I see a rabbithole over there Alice
[10:08:30] mutually signing off?
[10:08:44] signing what?
[10:08:48] nothing
[10:08:52] just an email or whatever
[10:09:11] ah, so just s/agreed upon/mutually signed off/ ?
[10:09:15] i think SLOs are not worth much if agreed on between service owner and us, right?
[10:09:34] if not
[10:09:42] yeah definitely
[10:09:51] the entire idea is that's it's binding for both sides
[10:10:41] then I think "agreed upon together with" works
[10:11:05] let's get it in writing somewhere, and we are good
[10:16:51] serviceops, Wikimedia-Site-requests, Patch-For-Review, Performance-Team (Radar): Enlarging the default thumb size on Dutch Wikipedia - https://phabricator.wikimedia.org/T215106 (Ciell) Replacement with {{tl|largethumb}} should be done now, according to https://nl.wikipedia.org/wiki/Wikipedia:Verz...
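For the target discussion above (roughly 16 services on the pipeline, 1 with an SLO today, candidate targets of 40%, 50% and 60%), here is a minimal sketch of what each percentage means in whole services. The total of 16 comes from the 09:30 estimate and may change as services migrate; the function name `services_needed` is hypothetical.

```python
def services_needed(total: int, target_pct: int) -> int:
    """Smallest number of services with an SLO that reaches target_pct of total."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total * target_pct // 100)


TOTAL_SERVICES = 16  # "1 out of 16 rn" (09:30:56); an estimate, not a fixed number
WITH_SLO_TODAY = 1   # per the 09:27:08 answer

for pct in (40, 50, 60):
    needed = services_needed(TOTAL_SERVICES, pct)
    extra = needed - WITH_SLO_TODAY
    print(f"{pct}% -> {needed} services with SLOs ({extra} more than today)")
```

Under that assumed total, 40% means 7 services, 50% means 8, and 60% means 10.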
[10:20:27] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (akosiaris) @Clarakosi @Eevans. I 've updated the chart to also conditionally install a minimal cassandra for use in m...
[10:36:19] serviceops, Operations: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (jbond) p:Triage→Normal
[10:48:30] serviceops, Operations, Patch-For-Review, User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[10:48:39] serviceops, Operations, Traffic, PHP 7.2 support, and 2 others: Improve Pybal's url checks - https://phabricator.wikimedia.org/T222705 (Joe) Open→Resolved
[12:56:07] akosiaris: nice!
[13:16:30] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) @WMDE-leszek. Yes I did. Using https://locust.io/, wrote P8511 and benchmarked the service locally on my minikub...
[13:27:14] 40% is reasonable for existing services new services should define slos before going through the pipeline
[13:27:23] Sorry late to the party
[13:43:08] akosiaris: Would you like us to review https://gerrit.wikimedia.org/r/c/wikibase/termbox/+/509391? FYI it's currently marked WIP but that is the default for the repo :)
[13:56:50] serviceops, Operations, Release Pipeline, Release-Engineering-Team, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) @Tarrow , @WMDE-leszek I 've noticed 3 things while working on the above * The service seems to be configurable t...
[13:57:03] tarrow: ah I had not noticed that. Yes please!
[13:57:19] awesome :)
[14:00:01] akosiaris: our main question about the x-amples thing is how does it impact us coupling the x-amples to wikidata.org. e.g. I can imagine us having (at the WMF) a version of the service pointing at test.wikidata.org and (outside WMF) people pointing it at their own wikibases
[14:00:37] in that has the x-ample request would in all likelihood be expected to fail
[14:01:35] yes, it would indeed
[14:02:22] maybe we could if guard it and only enabled it on some config param
[14:02:23] We debated it for a while and that is why we only left in _info as monitored
[14:02:57] well, /termbox could be malfunctioning whereas /_info works fine
[14:03:13] Right: so ship different openapi depending on the environment?
[14:03:30] yeah, that's the only sane way out I can think of
[14:04:22] cool, sounds find to me. I'll open a ticket for us about how to do that then
[14:04:39] is there some hint the service runnign platform gives us that we are in "real production"
[14:05:04] or should we just look for when we're pointed at 'wikidata.org' as a magic special case
[14:05:10] we can pass an ENV var if that helps
[14:05:36] and it can be a whatever key/value pair you want
[14:05:56] like OPENAPI="the real production"
[14:06:01] ;-)
[14:06:26] we can also set it in config.yaml if that is better. Both are ok options
[14:07:35] I suspect we would be best to have it set as an ENV; since I would think config.yaml would be shared by our many other instances :)
[14:18:41] akosiaris: On the note of a "deploying to the pipeline" training session I don't suppose you or any service pipeline enlightened people are coming to the Hackathon next week?
[14:23:15] tarrow: I think some people from release engineering are going to be there. I won't. SREs generally did not get to go to the hackathon this year unfortunately
[14:24:22] tarrow: releng is also gonna be giving some talks/demos of the pipeline as far as I know. So you should be covered on that front
[14:25:24] Right, sad that SRE didn't get to send anyone. I'll try to bribe some relengers to teach us there though
[14:27:35] ottomata: eventgate GC quantiles look very suspiciously flat. https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?refresh=1m&panelId=63&fullscreen&orgId=1&from=now-24h&to=now
[14:28:07] akosiaris: that's due to T220709
[14:28:13] https://phabricator.wikimedia.org/T220709
[14:28:30] want me to upgrade then?
[14:28:39] not sure that it is really related, but easy enough to verify
[14:28:47] sure! i thiink it will only affect those graphs, which are busted anyway
[14:28:56] yeah I think the same
[14:29:00] * akosiaris doing so now
[14:31:52] <_joe_> ottomata: you can safely use stretch with role::beta::docker_services
[14:31:56] <_joe_> just FYI
[14:32:12] _joe_: great, saw that, i just made a new instance to do so
[14:33:06] <_joe_> <3
[14:33:18] <_joe_> I need to move over citoid as well
[14:33:20] <_joe_> and zotero
[14:34:10] _joe_: what should profile::docker::engine::version be set to?
[14:34:14] ottomata: done
[14:34:26] thanks akosiaris ok let's see if metrics change...
[14:34:30] opening prometheus...
[14:34:41] no change up to now ...
[14:35:05] <_joe_> ottomata: same value
[14:35:07] something else is up
[14:35:09] <_joe_> yes ~jessie
[14:35:23] <_joe_> blame akosiaris and reprepro copy for that :D
[14:36:13] ottomata: buckets: [5e-4, 1e-3, 5e-3, 10e-3, 15e-3, 30e-3, 50e-3]
[14:36:17] soooooooo
[14:36:23] those buckets are way too smal
[14:36:28] that's the issue
[14:37:03] the metrics says seconds, but it's μseconds in reality
[14:37:22] ok, I 'll fix this on Monday. got RL catching up to me right now
[14:37:33] well... fix.... deploy a bandaid that is
[14:37:34] did we also need to fix the scaling in the exporter config?
[14:37:56] _joe_:
[14:37:56] 1.12.6-0~debian-jessie
[14:37:56] ?
[14:38:05] <_joe_> yes
[14:38:07] ok
[14:38:51] ottomata: yeah, it's the mess with suffixes I wrote about in https://phabricator.wikimedia.org/T222795
[14:38:58] really really confusing
[14:39:42] ya
[14:39:42] ns suffixed as ms, that get divided to give you μs, that then were due to all this mess wrong set as s
[14:39:49] hehheh
[14:53:57] seriously, no winners in that story
[15:34:08] serviceops, Operations: Separate Wikitech cronjobs from production - https://phabricator.wikimedia.org/T222900 (Krinkle) Open→Resolved Appears to be resolved. Re-open if I misunderstood :)
[15:34:10] serviceops, Operations, cloud-services-team, Core Platform Team Backlog (Watching / External), and 3 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (Krinkle)
[15:39:00] serviceops, Operations: SRE FY2019 Q4 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Reedy)
[15:39:05] serviceops, MediaWiki-History-and-Diffs, MediaWiki-Parser, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) - https://phabricator.wikimedia.org/T216664 (Reedy)...
[15:39:51] <_joe_> we solved 2 blockers for the php7 transition
[15:39:58] <_joe_> 1 fundamental blocker remaining
[15:40:02] <_joe_> (cssjanus)
[15:41:51] the explodng dom one is gone?
[15:45:14] <_joe_> yrd
[15:45:23] <_joe_> err blame my cat
[15:45:25] <_joe_> yes
[15:45:42] <_joe_> she came to tell me it's time to stop working and go feed her, I gues
[15:46:18] your cat is right :-)
[15:46:29] also \o/ for the blockers!
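The bucket/unit mismatch discussed at 14:36-14:39 above (buckets defined in seconds, values actually reported in μs) can be illustrated with a small sketch. The bucket bounds are the ones quoted at 14:36:13; the sample GC pause values and the helper `bucket_counts` are hypothetical, and the counts are shown per bucket rather than cumulatively for readability. This only illustrates the symptom, not the fix that was eventually deployed.

```python
from bisect import bisect_left

# Upper bounds in seconds, as quoted in the log at 14:36:13.
BUCKETS_SECONDS = [5e-4, 1e-3, 5e-3, 10e-3, 15e-3, 30e-3, 50e-3]


def bucket_counts(observations, buckets):
    """Count observations per bucket (upper bound inclusive); last slot is +Inf."""
    counts = [0] * (len(buckets) + 1)
    for value in observations:
        # Index of the first bucket whose upper bound is >= value.
        counts[bisect_left(buckets, value)] += 1
    return counts


# Hypothetical GC pause durations, reported in microseconds.
gc_pauses_us = [800, 2_500, 12_000, 40_000]

# Fed in unscaled, every value is far above the largest bound (0.05).
raw = bucket_counts(gc_pauses_us, BUCKETS_SECONDS)

# Converted to seconds first, the same values spread across the buckets.
scaled = bucket_counts([v / 1_000_000 for v in gc_pauses_us], BUCKETS_SECONDS)

print("unscaled:", raw)
print("scaled:  ", scaled)
```

With unscaled values every observation ends up in the +Inf slot, so any quantile estimated from the buckets sits at the top bound, which would match the "suspiciously flat" graphs. Either the values need scaling to seconds or the bucket bounds need to be expressed in the unit actually reported.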
[15:56:12] serviceops, Analytics, Analytics-Kanban, EventBus, Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (Ottomata)
[15:58:12] serviceops, Analytics, Analytics-Kanban, EventBus, Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (Ottomata) Hm, question. Currently mediawiki-config ProductionServices.php has: 'eventgate-analytics' => 'http...
[17:17:30] serviceops, MediaWiki-History-and-Diffs, MediaWiki-Parser, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) - https://phabricator.wikimedia.org/T216664 (cscott)...
[18:23:16] serviceops, MediaWiki-History-and-Diffs, MediaWiki-Parser, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: MWException when viewing or comparing certain pages with Preprocessor_DOM (PHP7 beta feature) - https://phabricator.wikimedia.org/T216664 (Umherirr...
[21:52:08] serviceops, Gerrit, Operations, ops-eqiad, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) created S4 procurement ticket for this at T222984