[02:05:15] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (Paladox)
[02:21:54] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (Paladox) p:Unbreak!→High @thcipriani restarted gerrit. Keeping this task open for now.
[02:30:09] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (CDanis) maybe this will be illuminating for someone -- it is stack traces from the gerrit jvm process at the time it was guzzling CPU {P8070}
[02:36:34] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (Paladox) I spoke with upstream, who said another user had reported that (it's a locking issue); they tried to fix it with another library but that didn’t work.
[02:41:01] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (thcipriani) I noticed that we've been having high cpu usage at about this time every day, unsure if this is some cleanup or indexing that is run on a schedule. I captured a few things prior to resta...
[06:05:22] serviceops, Librarization: Travis tests for mediawiki-libs-etcd broken - https://phabricator.wikimedia.org/T215864 (Jdforrester-WMF)
[06:10:01] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (Smalyshev) I get the idea of server-side HTML rendering to avoid delays. But I am kinda questioning whether the advantage of splitting code...
[06:13:55] serviceops, Librarization: Travis tests for mediawiki-libs-etcd broken - https://phabricator.wikimedia.org/T215864 (Reedy)
[09:58:01] serviceops, Operations, Thumbor, ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) Resolved→Open ` [Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged [Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR [...
[09:58:20] serviceops, Operations, Thumbor, ops-eqiad: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (jijiki) a:jijiki→RobH
[10:20:53] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (jijiki) p:Triage→Normal
[10:21:49] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (jijiki)
[10:21:58] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), User-jijiki: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (jijiki)
[11:10:32] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[11:10:37] serviceops, Operations, Thumbor, Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki) Open→Resolved
[11:31:19] serviceops, Operations, Thumbor, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki) p:Triage→Normal
[11:32:04] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[11:32:07] serviceops, Operations, Thumbor, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[11:46:30] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[13:52:21] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (Gilles)
[13:53:37] serviceops, Thumbor, User-jijiki: First page of a specific PDF files on Commons does not render a preview - https://phabricator.wikimedia.org/T213771 (Gilles) Open→Resolved a:Gilles Seems to work now, probably thanks to the Ghostscript update.
[13:53:40] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (Gilles)
[13:56:16] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (Gilles)
[14:00:08] serviceops, Operations, Thumbor, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (Gilles)
[14:36:38] o/ woohoo :)
[14:51:07] serviceops, Citoid, Operations, Kubernetes, Wikimedia-Incident: Zotero service crashes and pages multiple times. - https://phabricator.wikimedia.org/T213693 (fsero) Open→Resolved a:fsero After latest deployments of zotero this has been fixed
[14:52:07] serviceops, Prod-Kubernetes, User-fsero: Kubernetes clusters roadmap - https://phabricator.wikimedia.org/T212123 (fsero)
[14:53:40] serviceops, Operations, vm-requests, Patch-For-Review, User-fsero: eqiad: 1-2 VM requests for docker-registry-beta.wikimedia.org - https://phabricator.wikimedia.org/T212212 (fsero) Open→Resolved a:fsero→None vms are already assigned and running.
[15:11:22] akosiaris: i'm looking for docs on how to get eventgate into /srv/scap-helm on deploy1001
[15:11:26] looking here https://wikitech.wikimedia.org/wiki/Kubernetes/Helm
[15:14:21] ottomata: it's the values.yaml file
[15:14:43] oh it's just manually placed there?
[15:14:55] yes. it's the thing we will be migrating to a git repo and use helmfile for
[15:15:00] ah
[15:15:06] and the eventgate-analytics chart?
[15:15:14] AH
[15:15:22] helm is set up already
[15:15:49] ya, but the chart needs to be checked out here somewhere
[15:15:54] do I git pull in /srv/deployment-charts?
[15:15:54] nope
[15:15:58] no?
[15:16:00] it does not
[15:16:13] 5 mins after the change was merged the repo was updated on deploy1001
[15:16:26] well something more cause you know, puppet runs and all
[15:17:07] the /srv/deployment-charts part is there to help with debugging
[15:17:11] but it's not strictly required
[15:17:56] (i need to build a new image anyway first...)
[15:17:59] ok but...
[15:18:02] akosiaris@deploy1001:~$ scap-helm eventgate search eventgate
[15:18:02] ### cluster eqiad
[15:18:02] NAME CHART VERSION APP VERSION DESCRIPTION
[15:18:02] stable/eventgate-analytics 0.0.1 eventgate-analytics receives JSON events over HTTP, valid...
[15:18:05] see ?
[15:18:09] scap-helm eventgate-analytics status
[15:18:14] ...
[15:18:21] AH!
[15:18:23] cool
[15:18:36] ok so status/list is for after a release is deployed
[15:18:38] ah we need to create the namespace
[15:18:44] lemme do that now
[15:18:49] ok...
[15:24:23] akosiaris: do you know if docker image build jobs are triggered if i just push directly to gerrit repo? or do I need to actually merge through gerrit ui?
[15:25:03] ah NM i found it!
[15:25:03] https://integration.wikimedia.org/ci/job/trigger-service-pipeline-test-and-publish/29/
[15:25:13] the answer: push directly works fine.
[15:25:54] why do you push directly though?
[15:26:03] akosiaris: it's just for a tag
[15:26:16] we set it up so that a tag is what triggers the build
[15:26:26] ah ok
[15:32:10] hm so, how do these values files in /srv/scap-helm get applied?
[15:32:29] is it automatic? is there a dir/values-file-name.yaml convention i need to follow?
[15:32:50] or, can I do a single dir for all the eventgate deployments, with different values files for each?
[15:32:54] or should I have a new dir for each one?
[15:32:56] e.g.
[15:33:27] eventgate-analytics/values.yaml
[15:33:27] or
[15:33:27] eventgate/{analytics-values.yaml,main-values.yaml}
[15:35:11] no it's not automatic
[15:35:27] but have a look at zotero and mathoid in there
[15:35:34] should help to understand the convention
[15:35:51] mathoid is the same regardless of DC so it has 1 file for production and 1 for staging
[15:36:07] zotero is bound to DC so there's 1 file per DC and 1 for staging
[15:36:20] so, if we have deployments (e.g. eventgate-analytics, eventgate-main, etc.)
[15:36:29] should I do a directory for each?
[15:36:33] don't look that far in the future
[15:36:40] as I said it's going to be deprecated
[15:36:47] i think yes...because we will probably have multiple values files for each of those...one for staging one for prod?
[15:36:48] way before you do other deployments
[15:37:06] akosiaris: if all goes well with this one we might deploy -main early next quarter
[15:37:23] yeah, if all goes well we will have ditched that directory early next month :-)
[15:37:26] actually, we don't need a staging one...the staging one is the same as prod
[15:37:27] ok cool
[15:37:31] i'll make one dir for now then
[15:46:22] akosiaris: do I need to add an entry to profile::kubernetes::deployment_server::services in hire?
[15:46:23] hiera?
[15:48:57] didn't I already do that in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490078 ?
[15:49:35] Oh great! didn't see that sorry. was looking in my local puppet
[15:49:55] hadn't pulled
[15:50:54] akosiaris: how can I find the image version? My tag is v1.0.0-rc0
[15:51:07] i guess it will just be that?
[15:52:21] also, if you could check: /srv/scap-helm/eventgate/*
[15:52:22] ottomata: ah, that's one of the difficult parts currently. We were discussing it with releng during the all hands. The best way right now seems to be delving deep down jenkins logs
[15:52:43] ya i'm poking around in /ci/ ...
[15:52:50] https://integration.wikimedia.org/ci/job/service-pipeline-test-and-publish/26/console
[15:52:59] I only found it faster cause I knew where to look
[15:53:04] anyway there are plans to fix that
[15:53:12] at the end you will find the tag names
[15:53:29] and indeed it's docker-registry.discovery.wmnet/wikimedia/eventgate-ci:v1.0.0-rc0
[15:53:41] nice that someone does semver :-)
[15:53:44] 2019-02-12-152010-production
[15:53:44] ?
[15:53:50] oh great
[15:54:01] same images, multiple tags
[15:54:07] ok cool. i think it makes multiple tags then for it..?
[15:54:08] oh great
[15:54:10] ok cool.
[15:55:36] ok, the namespace creations and tokens are being rolled out as we speak
[15:55:42] quick q
[15:55:50] outgoing connects... to kafka?
[15:55:54] is there anything else I am missing?
[15:56:06] right now that's it. we'll have an http schema registry eventually
[15:56:17] logstash
[15:56:17] ?
[15:56:25] logstash is already allowed
[15:56:27] don't fully grok how all the logging/monitoring works
[15:56:28] ok cool
[15:56:39] I am trying to think what's knew in all of this
[15:56:44] aah.. I guess gerrit too?
[15:56:47] for the init container?
[15:56:55] s/knew/new/ ofc
[15:56:59] yes
[15:57:14] cool, cause we need to allow these too
[15:57:18] gerrit for now (will remove that before prod deployment)
[15:58:04] scap-helm eventgate-analytics list now nicely returns an empty output
[15:58:18] nice!
[15:59:25] so, how about the codfw kafkas?
[15:59:29] should I add them too?
[15:59:44] we will be having a switchover in 6 months or so probably, hence me asking
[16:00:09] so likely the -analytics service will produce to the jumbo cluster only
[16:00:21] which doesn't exist in codfw
[16:00:43] when we deploy the -main (name tbd) service, then ya we'll need all the -main kafkas in both DCs
[16:00:54] well, the eqiad deployment will produce to main-eqiad, and vice versa for codfw
[16:07:46] so akosiaris
[16:09:12] ottomata: give me 5 mins to finish this
[16:09:15] trying to piece together an install command
[16:09:16] OH ok sorry
[16:09:18] great i'm in standup anyway
[16:09:19] thanks
[16:09:36] serviceops, Gerrit, Operations: Gerrit loads very slowly - https://phabricator.wikimedia.org/T215855 (thcipriani) Cleaner threaddump output I grabbed last night and forgot to paste: {P8073}
[16:17:34] serviceops, Operations, Core Platform Team (Session Management Service (CDP2)), Patch-For-Review, and 2 others: Create puppet role for session storage service - https://phabricator.wikimedia.org/T215883 (Eevans)
[16:29:28] ottomata: ok I think we are all set
[16:30:34] akosiaris: staging is only in eqiad?
[16:30:57] i have two different files right now, but the only difference between them is the datacenter value setting, which is used for topic prefixing
[16:31:01] i could have a single file and just use --set
[16:31:07] when installing?
[16:31:34] yes staging is only in eqiad
[16:31:42] ok
[16:31:45] yes you could
[16:31:54] but that would not map nicely to the git repo concept
[16:31:55] release names are usually
[16:31:59] 'staging', 'production' etc.?
[16:32:02] aye
[16:32:02] exactly
[16:32:04] ok
[16:32:55] marielle has some personal notes on https://wikitech.wikimedia.org/w/index.php?title=User:Mvolz/Deploying_Zotero
[16:32:59] akosiaris: i think https://wikitech.wikimedia.org/wiki/Kubernetes/Helm needs a section for deploying a new service for the first time :)
[16:33:04] oh coool looking
[16:33:17] if it helps. Part of the goal this q is to gather all that up and make it a nice and useful portal
[16:33:20] ah CLUSTER=
[16:33:20] k
[16:33:22] nice
[16:33:47] what's the stable/zotero part?
[16:33:54] the chart name
[16:34:03] stable/eventgate-analytics in your case
[16:34:03] hm, so i'm eventgate-analytics
[16:34:04] ?
[16:34:06] oh with stable
[16:34:07] ok
[16:34:15] yeah stable is the repo
[16:34:26] we are overriding the default one since it seems we can get rid of it
[16:34:32] can't*
[16:36:01] ottomata: once you think the thing is working fine, I guess I should start the LVS part
[16:36:12] eventgate-analytics.svc.{eqiad,codfw}.wmnet I guess ?
[16:37:04] oh.
[16:37:14] hm
[16:37:44] yes let's go with that
[16:38:30] also, this is strictly internal right? no end user reaching out to it
[16:39:55] for now yes
[16:40:12] fsero: you rule!
[16:40:13] we haven't yet decided if the 'public' analytics endpoint will be the same service
[16:40:17] i think it might be...
[16:40:22] but, we aren't exposing it at all yet
[16:40:24] the point in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490083/ is great!
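Editor's sketch on the 16:30-16:32 exchange about a single values file plus --set versus one file per environment: in both approaches an environment-specific setting is layered over shared settings. The Python below only illustrates that layering with a recursive dictionary merge. It is not Helm's implementation, and the key names and the port number are hypothetical, since the log never shows the contents of the values files; only the idea that the files differ in a datacenter value comes from the chat.

```python
from copy import deepcopy


def merge_values(base, override):
    """Return a new dict with `override` layered on top of `base`.

    Nested dicts are merged key by key; any other value in `override`
    replaces the one in `base`. Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical shared settings; the real values files are not shown in the log.
shared = {"main_app": {"port": 8192, "datacenter": "eqiad"}}

# Hypothetical staging override: only the datacenter value differs.
staging = merge_values(shared, {"main_app": {"datacenter": "staging"}})
```

Keeping one file per environment, as suggested in the chat, stores the override in version control instead of on the command line.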
[16:40:26] probably not for a quarter or two
[16:40:30] I've missed that
[16:40:38] ottomata: ok
[16:43:23] serviceops, Operations, Wikidata, Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (WMDE-leszek) @Smalyshev I believe the approach we are suggesting really makes a difference when thinking beyond just rendering a template f...
[16:43:54] so something like...
[16:44:03] CLUSTER=staging scap-helm eventgate-analytics install -n staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics
[16:44:15] yup
[16:44:19] sounds about right
[16:44:56] hm, i'm going to make a simple template change... the use of 'datacenter' i think is wrong here, i should call this just 'topic_prefix' directly. i'm using 'staging' in this case as that value...
[16:46:08] akosiaris: thanks :) it's easier to see these things if you are not focused on trying to make things work :D that's the beauty of reviews :-).
[16:46:09] ottomata: you can always check whatever is going to be sent with helm template
[16:46:26] fsero: oh cool! ok will try that first
[16:46:34] CLUSTER=staging scap-helm eventgate-analytics template -n staging -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics
[16:46:38] fsero: helm template or scap-helm template
[16:46:38] ah ok
[16:46:39] cool.
[16:46:41] danke
[16:46:44] should work or log into operations
[16:46:45] :P
[16:47:22] haha
[17:13:10] <_joe_> there's a lot of manual work on our side still when a new service is added. As in, a series of commits to puppet
[17:13:29] <_joe_> I don't think it's a deficiency per-se, but I hate it
[17:13:39] <_joe_> I don't have a solution for it right now
[17:42:42] I don't like it either, first step for fixing or making it a little bit better is to reduce the number of steps if possible
[17:53:37] <_joe_> I'm not even sure that's the way to do it
[17:54:05] <_joe_> we should try to remove the whole thing from puppet's control, or centralize a lot of the control and config in a specific place
[17:54:38] <_joe_> I mean to fix it probably you need to take a different route than to reduce the manual steps
[18:28:27] I don't have a complete and formed opinion, my gut feeling is that puppet should only do what it is meant to do, config management for servers, configuration management for services and deployments should be done elsewhere, but the 'elsewhere' i know doesn't apply to our current infrastructure. And in any case, I do think that reducing the number of steps is always a good thing to do :). We should start a working document about this
[18:29:53] akosiaris: fsero, checking in (after done with meetings, etc.)
[18:30:02] can i try a helm install to staging today?
[18:30:06] i see https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490083/ still needs merged...
[18:30:42] i was about to say goodbye for today, do you think it can wait til tomorrow?
[18:30:56] that way we can merge that patch and test it (with zotero)
[18:32:44] ya can do!
[18:32:53] i'll ping yall when i get on in my morn
[18:32:54] thanks!
[18:34:11] great :) see you tomorrow
[18:37:29] serviceops, MediaWiki-Cache, Operations, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (jijiki) @EvanProdromou After some digging in mc20* re...
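Editor's sketch: the install command at 16:44:03 and the template command at 16:46:34 share one shape, so the small Python helper below assembles that command line as a string, which makes it easy to preview before typing it on the deployment host. It runs nothing and knows nothing about scap-helm beyond the two invocations quoted in the log. The function name and parameter names are hypothetical, and treating the -n argument as the release name is an assumption drawn from the chat about release names, not from scap-helm documentation.

```python
import shlex


def scap_helm_command(cluster, service, action, release, values_file, chart):
    """Build the scap-helm command line quoted in the log as a single string.

    Only the argument order seen in the log is reproduced; nothing is executed.
    """
    args = ["scap-helm", service, action, "-n", release, "-f", values_file, chart]
    quoted = " ".join(shlex.quote(arg) for arg in args)
    return f"CLUSTER={shlex.quote(cluster)} {quoted}"


# The values below are copied from the 16:44:03 message.
install = scap_helm_command(
    cluster="staging",
    service="eventgate-analytics",
    action="install",
    release="staging",
    values_file="eventgate-analytics-staging-values.yaml",
    chart="stable/eventgate-analytics",
)
print(install)
```

Passing action="template" instead yields the dry-run style command suggested at 16:46:34.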