[07:11:01] serviceops, Operations, Performance-Team (Radar), User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (elukey) We deployed all the changes for T225642, so async settings for codfw replication was not the culprit. After T225059 we have per-shard...
[07:35:48] serviceops, Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (MoritzMuehlenhoff) The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes,...
[08:11:51] serviceops, Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (elukey) >>! In T151304#5391310, @MoritzMuehlenhoff wrote: > The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer kee...
[09:12:32] https://gerrit.wikimedia.org/r/528078 need revieweres cc akosiaris
[09:13:46] <_joe_> fsero: I was looking at it
[09:14:00] <_joe_> and my first doubt was looking at calico-policy-controller's chart
[09:14:05] <_joe_> why 1 replica?
[09:15:04] cause it's 1 component looking reading the kubernetes API and syncing to the etcd calico
[09:15:23] <_joe_> so more than one would be dangerous
[09:15:26] yup
[09:15:36] because calico controller in that version also keeps the state in memory
[09:15:37] <_joe_> OTOH, if it goes down it's big trouble
[09:15:47] _joe_: no it is not
[09:15:54] cause the rules stay around
[09:15:56] its big trouble for new pods
[09:16:03] not even that
[09:16:16] <_joe_> it's trouble just if you add new nodeports?
[09:16:21] the pods matching stuff is done by felix
[09:16:31] it's big trouble if you change the ports a pod is exposing
[09:16:44] <_joe_> right
[09:17:02] * akosiaris looking at the change
[09:17:21] mmm im not so sure at that last thing..
ive experience if policy controller is down new pods will not get any networking
[09:17:36] but is easy to test
[09:17:52] * fsero playing chaos monkey in staging
[09:17:53] could be I am wrong ofc
[09:18:02] <_joe_> fsero: one q - it seems to me that helmfile will have duplicated declarations for everything?
[09:18:12] <_joe_> like helmfile.d/admin/codfw
[09:18:22] <_joe_> contains the chart directly
[09:18:33] <_joe_> so will do hemlfile.d/admin/eqiad ?
[09:18:36] <_joe_> the same chart
[09:18:43] <_joe_> or will some symlinking work?
[09:19:16] _joe_: the templating capability could be extorted to get something more dry. embedded charts could be added to our helm repo, and yep i think some symlinks will work
[09:19:24] lemme try
[09:19:42] <_joe_> just so that the charts are common and just the values are set per-cluster
[09:19:51] yep
[09:25:28] for the record akosiaris is right
[09:25:44] however obviously in the event of a node reboot or felix reboot things will change :)
[09:26:00] oh that's true indeed
[09:42:44] _joe_: no symlinks were required, now using only one copy
[09:43:04] <_joe_> fsero: great :)
[09:55:53] serviceops, Operations, HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki)
[09:56:51] serviceops, Operations, HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki)
[09:56:54] serviceops, Operations: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki)
[10:01:40] serviceops, Operations, HHVM: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki)
[10:02:52] serviceops, Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (jijiki) @elukey @MoritzMuehlenhoff I have added tmpreapers removal as an actionable in our HHVM removal process (T229792), shall we mark this as resolved or invalid?
[10:11:16] very nice from mw1348
[10:11:17] /bin/sh: 1: /usr/local/bin/hhvm-needs-restart: not found
[10:11:18] :)
[10:15:17] serviceops, Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (MoritzMuehlenhoff) Toolforge/Toollabs also uses tmpreaper (but not the puppetised version with the tmpreaper Puppet class). I'm adding @Andrew and @aborrero for comments whether we should keep it open f...
[10:42:37] question if i do sudo confctl --object-type=discovery select 'dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw' set/pooled=no will i depool all services from codfw cluster?
[10:42:50] or will i need something else for not getting a page?
[12:53:14] serviceops, Operations, Release Pipeline, Goal, Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (akosiaris)
[12:53:20] serviceops, Operations, Release Pipeline, CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (akosiaris)
[12:57:17] serviceops, Operations, Release Pipeline, CPT Initiatives (RESTBase Split (CDP2)), and 3 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (akosiaris) restrouter was temporarily deployed in the staging cluster today. Deployment wa...
[12:59:43] _joe_: what is /etc/conftool-state/mediawiki.yaml used for on puppetmasters?
[13:00:18] <_joe_> cdanis: I think just to figure out what datacenter is the master for mediawiki
[13:00:33] by what? maintenance scripts?
[13:01:06] <_joe_> by the mediawiki::state puppet function!
[13:01:38] ah
[13:01:43] <_joe_> git grep mediawiki::state confirms it's only used to retrieve that information within puppet
[13:01:53] ok, cool
[13:02:25] <_joe_> this is used for all things that are managed by puppet but need to fetch information from etcd. In this specific case, it's only the primary dc
[13:02:43] <_joe_> we try to minimize the amount of things that can use it
[13:02:48] yeah, makes sense
[13:07:06] do you have any advice on an easy way to iterate on confd templates?
[13:09:18] <_joe_> how to test them?
[13:09:21] yes
[13:09:41] let's assume I know little, and have no deep burning desire to learn the intricacies of text/template
[13:09:45] <_joe_> interesting question. I do use an etcd instance and confd itself locally
[13:09:58] reasonable enough
[13:10:01] <_joe_> also, we should upgrade confd across the fleet
[13:10:17] <_joe_> cdanis: if you throw puppet in the mix, it makes things even more interesting!
[13:10:21] yes
[13:10:26] I am not looking forward to that either
[13:11:56] <_joe_> now I'm curious - what do you want to do?
[13:12:14] https://phabricator.wikimedia.org/T229631
[13:12:23] <_joe_> (we need to update confd btw)
[13:12:23] _joe_: since alex will be off soonish
[13:12:36] _joe_: should we sync about restbase tomorrow
[13:12:51] with akosiaris
[13:12:57] <_joe_> sure
[13:12:58] (per will's email)
[13:13:08] just want to dump some dbctl stuff, likely just the live config as seen by mediawiki, on config-master
[13:13:23] and figure out what can or can't be done, me myself I would really appreciate a general overview
[13:13:25] k
[13:13:38] fine by me
[13:13:55] tx both
[13:13:59] <_joe_> what email by will?
[13:14:02] <_joe_> I got none?
[13:14:17] alex added us both in the loop
[13:14:45] <_joe_> oh that thread, I see no new emails though
[13:14:57] no there are no new emails
[13:15:06] _joe_: I was off most of last week
[13:15:23] I am catching up a bit
[13:15:37] the main reason I brought this up is that I am off in 2 days and if we want/need to do anything in the next few weeks, it's probably prudent I am around as I can save us some time
[13:15:54] <_joe_> sure, hence I said it's good to meet
[13:15:58] we can ofc decide we can't do anything up to August 26th
[13:16:02] which is fine as well
[13:16:06] <_joe_> we'll see
[13:16:24] I will leave around that time
[13:16:33] so if I am do work a bit on this
[13:16:40] it will be while alex is on holiday
[13:26:13] <_joe_> jijiki: I think you should focus on php7
[13:26:22] <_joe_> that goal is a blocker for a bunch of things
[13:27:52] <_joe_> jijiki: we moved a second api server to php7 this week right?
[13:28:01] last week
[13:28:15] <_joe_> yeah I mean the one just ended.
[13:28:20] yes
[13:38:20] hopes that sudo confctl --object-type=discovery select 'dnsdisc=sessionstore|citoid|cxserver|eventgate-analytics|eventgate-main|termbox|blubberoid|mathoid|zotero,name=codfw' set/pooled=no will i depool all services from codfw cluster? doesnt blow up anything
[13:39:17] hm
[13:39:59] eventgate should be fine i think!
[13:45:02] <_joe_> fsero: why not
[13:45:14] <_joe_> select 'name=codfw' ?
[13:45:16] <_joe_> :P
[13:45:35] <_joe_> fsero: also remember traffic from varnish is not affected by the discovery system
[13:47:18] <_joe_> cdanis: circling back to confd - I think the easiest way would be to just fetch the data from confd and dump it as json
[13:47:33] from etcd you mean?
[13:47:40] <_joe_> yep
[13:47:59] <_joe_> and then read the json in php and render whatever representation people like
[13:48:25] serviceops, Mobile-Content-Service, Page Content Service, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 (Mholloway) a:Mholloway
[13:48:28] <_joe_> we should also upgrade confd :P
[13:48:55] is it an obnoxiously bad idea to have noc.wm.o/db.php do a curl against etcd as part of its execution?
[13:49:18] if it's not cached I'd avoid it
[13:49:19] <_joe_> just in one way
[13:49:53] and if it is, as you can bypass the cache easily ;)
[13:50:25] s/and/also/
[13:51:27] serviceops, Machine vision, Operations, Reading-Infrastructure-Team-Backlog, and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (Mholloway)
[13:51:36] <_joe_> cdanis: in theory, given the value of they key you're interested in is already a json, you could just dump it straight to disk
[13:52:18] _joe_: yeah
[13:52:37] is confd already running on the noc host?
[13:56:33] serviceops, Parsoid-PHP, CPT Initiatives (Parsoid REST API in PHP (CDP2)): Deploy Parsoid-PHP with Mediawiki to scandium for RT and performance testing - https://phabricator.wikimedia.org/T228069 (ssastry)
[13:56:34] <_joe_> cdanis: don't remember
[13:56:39] <_joe_> cdanis: or, you could
[13:56:43] <_joe_> curl https://conf1004.eqiad.wmnet:4001/v2/keys/conftool/v1/mediawiki-config/eqiad/dbconfig | jq .node.value | sed -e 's/\\//'g -e 's/^"//' -e 's/"$//' | jq . > db.json
[13:56:46] <_joe_> just sayin
[13:56:49] lol
[13:58:58] <_joe_> there must be a more elegant way to de-escape the value
[14:00:18] what is the escape like? is "jq -r" appropriate?
[14:00:27] <_joe_> maybe
[14:01:34] <_joe_> curl https://conf1004.eqiad.wmnet:4001/v2/keys/conftool/v1/mediawiki-config/eqiad/dbconfig | jq -r .node.value does the trick indeed
[14:01:44] <_joe_> ok cdanis go with a systemd timer :P
[14:02:17] sold
[14:10:12] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki i need this serveur power down thanks
[14:10:44] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (jijiki) @papaul Server is depooled, ping me when do pool it back, many thanks !
[14:13:51] <_joe_> fsero, akosiaris are you rebuilding codfw from helmfile?
[14:13:57] <_joe_> did you downtime all services?
[14:13:59] <_joe_> :P
[14:14:07] i am
[14:14:09] and no
[14:14:11] lemme do it
[14:14:14] serviceops, CPT Initiatives (Session Management Service (CDP2)), Core Platform Team Workboards (Green), User-Clarakosi, User-Eevans: Package table_properties utility for Debian - https://phabricator.wikimedia.org/T226551 (holger.knust)
[14:14:15] i didnt do anything yet
[14:17:30] <_joe_> fsero: let's also check the traffic layer
[14:17:43] <_joe_> akosiaris: can you help him? I need to step afk for ~ 15 minutes
[14:18:20] still didnt do anything
[14:18:26] but downtimed the lvs checks
[14:19:45] serviceops, Operations, PHP 7.2 support: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 (jijiki) For connection pooling purposes, when we want to access `search.svc.eqiad.wmnet` from php-fpm, we are doing so via nginx. This nginx is installed on each mw* server listening on localho...
[14:24:41] _joe_: akosiaris ping me when you are around to proceed, if not i will abort and do it tomorrow
[14:29:04] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) Removing HHVM and any leftovers are now part of T229792, we mark this as resolved 💃
[14:29:20] serviceops, Operations, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, and 3 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki) Open→Resolved a:jijiki
[14:29:24] serviceops, Operations: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (jijiki)
[15:20:10] fsero: I am around now
[15:20:30] could you please do a +1
[15:20:32] https://gerrit.wikimedia.org/r/528164
[15:20:38] or no
[15:24:21] done
[15:25:41] ill start with no visible services from varnish like zotero and termbox
[15:55:37] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki please repool the server when you have a minute. We will have to order a new Storage battery for the server since all the decom HP servers are GEN8 and this one is a GEN9 so diffe...
[15:57:03] <_joe_> jijiki: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/528176/ mind taking a look?
[15:58:49] if you find 5', can you explain me the details a bit behind that ?
[16:10:17] serviceops, Operations, ops-codfw: (OoW) restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (Papaul) @jijiki I made a procurement task the the storage battery at T229847
[16:50:40] serviceops, Cloud-Services, Security: Address kubernetes CVE-2019-11247 and CVE-2019-11249 - https://phabricator.wikimedia.org/T229856 (akosiaris)
[23:39:52] _joe_: I have 'improved' on your oneliner: curl https://$(dig +short _etcd._tcp.eqiad.wmnet srv | head -n1 | awk '{print $4 ":" $3}')/v2/keys/conftool/v1/mediawiki-config/eqiad/dbconfig | jq -r .node.value | jq -r .val
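
Editor's note on the one-liners above: both `jq -r .node.value` and the later `| jq -r .val` work because the etcd v2 response stores the key's value as a string that itself contains JSON, so it has to be decoded twice. The following is only a minimal Python sketch of that idea, not what anyone in the channel actually ran. The function names, the sample SRV line and the sample payload are hypothetical; only the `.node.value` and `.val` field names and the key path come from the log.

```python
import json


def srv_line_to_hostport(srv_line):
    """Turn one line of `dig +short ... srv` output into "host:port".

    dig prints "priority weight port target", so field 3 is the port and
    field 4 is the target, matching awk '{print $4 ":" $3}' in the log.
    Unlike the awk version, this strips the trailing dot from the target.
    """
    fields = srv_line.split()
    port, target = fields[2], fields[3]
    return target.rstrip(".") + ":" + port


def extract_val(etcd_response_text):
    """Decode an etcd v2 GET response whose node value is a JSON string."""
    outer = json.loads(etcd_response_text)   # the etcd v2 envelope
    raw_value = outer["node"]["value"]       # still a string at this point
    inner = json.loads(raw_value)            # second decode, like `jq -r`
    return inner["val"]                      # same field as `jq -r .val`


# Hypothetical data, made up for illustration only.
sample_srv_line = "0 1 4001 conf1004.eqiad.wmnet."
sample_inner = {"val": {"example_section": {"example_host": 1}}}
sample_response = json.dumps({
    "action": "get",
    "node": {
        "key": "/conftool/v1/mediawiki-config/eqiad/dbconfig",
        "value": json.dumps(sample_inner),
    },
})

if __name__ == "__main__":
    print(srv_line_to_hostport(sample_srv_line))
    print(json.dumps(extract_val(sample_response), indent=2))
```

Decoding with a JSON parser twice avoids the `sed` de-escaping from the first attempt, which would also strip backslashes that legitimately belong to the inner document.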