[11:49:55] serviceops, MediaWiki-General, Operations, MW-1.34-notes (1.34.0-wmf.16; 2019-07-30), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (matej_suchanek) Where is this now? Did the maintenance script act...
[12:54:28] _joe_: sgtm, we need to do it before the service switchover today, right? I'll start a patch but then we should start getting it rolled out
[12:54:30] volans: ^ fyi
[12:54:49] <_joe_> rzl: all done already :)
[12:54:54] oh brilliant
[12:55:02] thank you!
[12:55:05] <_joe_> we also fixed the switchdc cookbook for services
[12:55:24] <_joe_> see #-operations
[12:55:41] and increased the retry params so now the dnsdisc stuff doesn't fail
[12:55:43] <_joe_> so regarding the check failure, there must be something with the TTL changes and gdnsd
[12:55:56] we're not sure yet why it is so slow
[12:56:02] it takes 12~18s
[12:56:04] <_joe_> because the actual up/down commands seem to be caught up immediately
[12:56:14] <_joe_> volans: more like 9-15
[12:56:21] <_joe_> the first "retry" is immediate
[12:56:31] right
[12:56:47] in one case went up to 6 retries
[12:56:51] <_joe_> so, my experience even right now is
[12:56:56] so 15s yea
[12:57:00] <_joe_> - ttl takes between 4 to 6 retries
[12:57:07] <_joe_> - record changes at most one
[12:57:32] <_joe_> which means the problem is not anymore confd losing updates
[12:57:46] <_joe_> but rather gdnsd having some kind of cache I guess
[12:57:56] <_joe_> or something along those lines
[12:58:09] <_joe_> the good news is, the record changes during read-only shall be fast
[12:59:50] that is weird but definitely good
[13:00:02] and yeah agree it does sound cache-flavored
[13:01:43] _joe_: two more questions queued up for you actually, one small and one larger
[13:01:56] the first one is: should we be moving thumbor along with the mediawiki services?
[13:01:57] <_joe_> rzl: shoot
[13:02:10] <_joe_> rzl: no, thumbor gets only called by swift
[13:02:16] <_joe_> that remains a/a AIUI
[13:02:23] it's A/p currently, is why I ask
[13:02:32] <_joe_> what is a/p?
[13:02:37] <_joe_> thumbor?
[13:02:41] yeah
[13:02:46] <_joe_> uhhh
[13:02:48] at least according to my notes from yesterday, I'll double check
[13:02:52] <_joe_> no idea why, frankly
[13:04:47] rzl@cumin1001:~$ confctl --quiet --object-type discovery select dnsdisc=thumbor get
[13:04:47] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=thumbor"}
[13:04:47] {"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=thumbor"}
[13:05:05] <_joe_> so... I don't think we declare it, actually
[13:05:10] <_joe_> in discovery
[13:05:32] <_joe_> yep, we don't
[13:05:37] <_joe_> it's unused it seems
[13:05:51] huh, okay
[13:05:58] <_joe_> that's a relic from when we didn't have swift a/a I guess?
[13:07:48] okay, won't worry about it for now then
[13:07:57] we should clean up but that can wait
[13:08:22] <_joe_> what's the larger one?
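The confctl output pasted above is one JSON object per line, so the active/passive state of a discovery record can be read off mechanically. The following is only a sketch of that idea: the helper names `pooled_by_site` and `is_active_active` are hypothetical and not part of conftool, and the input is simply the two lines copied from the log.

```python
import json

# The two JSON lines below are copied verbatim from the confctl output in the log.
CONFCTL_OUTPUT = """\
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=thumbor"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=thumbor"}
"""


def pooled_by_site(output):
    """Map each datacenter name to its 'pooled' flag (hypothetical helper)."""
    state = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            # "tags" identifies the record; every other key is a datacenter.
            if key == "tags":
                continue
            state[key] = value["pooled"]
    return state


def is_active_active(state):
    """True only if more than one site is listed and all of them are pooled."""
    return len(state) > 1 and all(state.values())


state = pooled_by_site(CONFCTL_OUTPUT)
print(state)
print("active/active:", is_active_active(state))
```

For the thumbor record above this reports codfw as depooled and eqiad as pooled, which matches the "A/p" reading in the conversation.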
[13:08:25] heh, maybe I'll start the switchover IR and list it as our first action item :D
[13:08:34] why not :)
[13:09:08] the larger one is, I realized we never actually listed out which of the active/active services we're depooling from eqiad today, or talked about whether any of them have to happen in order
[13:09:13] here's the list of everything currently a/a:
[13:09:23] apertium, blubberoid, citoid, cxserver, echostore, eventgate-analytics, eventgate-analytics-external, eventgate-logging-external, eventgate-main, eventstreams, graphoid, helm-charts, kartotherian, mathoid, mobileapps, ores, parsoid, proton, recommendation-api, restbase, schema, search, sessionstore, termbox, thanos-query, thanos-swift, wdqs, wdqs-internal, wikifeeds, zotero
[13:09:28] <_joe_> nothing needs to happen in order
[13:10:15] <_joe_> but there is a list of services I excluded
[13:10:17] <_joe_> https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/623330/2/cookbooks/sre/switchdc/services/__init__.py
[13:10:28] <_joe_> each one with its justification
[13:10:35] <_joe_> everything else will be migrated
[13:10:42] oh! well. good! :)
[13:10:53] let me catch up with you, then
[13:11:41] <_joe_> so my proposal is to reduce ttl, switch, wait ~ 1 hour, restore ttl
[13:11:55] <_joe_> the main thing is seeing if we create any issue to mediawiki
[13:12:08] <_joe_> it should not, given how much stuff we moved to envoy as a proxy
[13:12:14] nod
[13:12:17] <_joe_> but this is still the first switchover with php7 :)
[13:12:26] this patch looks good, thanks for doing that
[13:12:32] <_joe_> (that's the biggest unknown for both today and tomorrow)
[13:12:43] love to wake up on Monday and find everything I was planning to do is already done
[13:12:58] and yeah, agree with leaving the ttl
[13:13:43] <_joe_> without envoy, we would've been forced to migrate some services in lockstep with mediawiki
[13:23:44] what do you want/need me to do during today's switch?
[13:24:45] _joe_: regarding helm-charts in cookbooks/sre/switchdc/services/__init__.py: What do you think is needed in addition there? I would say that dnsdisc.depool() is sufficient
[13:25:32] <_joe_> jayme: to special-case something in puppet where the config file gets generated
[13:25:38] <_joe_> and I didn't have time today
[13:25:43] volans: I think nothing as long as automation works as expected :D
[13:26:37] rzl: ack, I'll hide and spy on the switch then :)
[13:33:15] _joe_: hm. Not clear to me what that might be
[13:33:37] <_joe_> jayme: you have a hostname rather than a fqdn in the monitoring stanzas
[13:33:41] <_joe_> which is expected
[13:33:51] <_joe_> but I have to figure out a correct way to handle that case
[13:34:02] <_joe_> I mean we can switch that with confctl :)
[13:35:09] Ah, "the config file" in that context is sre.switchdc.services.yaml?
[13:35:15] <_joe_> yes
[13:36:11] oookay. I assumed you were talking about a chartmuseum config file
[13:38:05] * volans quick break, bbiab
[14:31:33] possibly dumb k8s question - I have 2 ports in a pod that offer prometheus metrics at /metrics. One is the statsd exporter and one is the envoy admin interface (very similar to the one in the TLS template). I'd like them to both be scraped - can I reuse the envoyproxy.io/scrape annotation used in the TLS sidecar and still get the statsd port scraped at the same time?
[14:32:38] unfortunately not
[14:33:29] at least not without additional (non standard) config for prometheus service discovery
[14:37:48] Oh...I misread. Sorry. Ignore everything I said. You could use the "prometheus.io/scrape:" annotation for statsd_exporter and leave envoyproxy.io/scrape to envoy
[14:41:46] jayme: ah, great! thanks
[14:49:57] <_joe_> isn't our team meeting canceled today?
[14:52:26] I don't know, I didn't get any email cancelling it
[14:52:42] I can be the one that pops in and tells people it's off, if you like.
[14:53:53] no, i think we can have it for everybody that is not directly involved and get some quick updates out.
[14:54:06] but there is nothing urgent
[14:54:34] <_joe_> rzl: I think the situation is under control
[15:50:22] For some reason I thought it was a holiday today, but was wrong :)
[15:53:17] next week
[15:53:45] ack, yes
[15:55:57] mutante: regarding the non-LVS gdns discovery thing: Take a look at what akosiari_s did in https://gerrit.wikimedia.org/r/c/operations/puppet/+/609403 (apart from messing up the hostnames :D)
[16:00:32] jayme: oh, so that means i would have to add these things to the service catalog.. aha. i will look at that, thank you
[16:01:37] mutante: yeah. But you won't have to set up full LVS support and you can nicely pool/depool with confctl then
[16:03:03] cool!
[16:53:46] serviceops, Operations: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (jijiki) Open→Resolved We can close this for now
[16:57:07] serviceops, Icinga, Operations: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (Dzahn) Declined→Open
[19:18:04] serviceops, Operations, Traffic, conftool, Patch-For-Review: confd's watch functionality appears to be partially broken when interacting with etcd 3.x - https://phabricator.wikimedia.org/T260889 (Volans) After today's failure of the `check_ttl` step in the switchdc of the services, I had a ch...
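The retry behaviour discussed earlier in the log (an immediate first check, then several retries before the new TTL becomes visible) can be illustrated with a generic polling loop. This is a sketch under stated assumptions, not the cookbook's actual `check_ttl` code: `wait_for_ttl`, `make_fake_resolver`, the 3-second delay and the target TTL of 10 are all hypothetical, and the resolver is faked so the example runs without DNS.

```python
import time


def wait_for_ttl(get_ttl, expected, tries=10, delay=3.0, sleep=time.sleep):
    """Poll get_ttl() until it returns `expected`.

    Returns the number of retries that were needed (0 means the first,
    immediate check already succeeded). Raises TimeoutError after `tries`
    checks. All names and defaults here are hypothetical.
    """
    for attempt in range(tries):
        if get_ttl() == expected:
            return attempt
        if attempt < tries - 1:
            sleep(delay)
    raise TimeoutError(f"TTL did not reach {expected} after {tries} checks")


def make_fake_resolver(stale_polls, old_ttl, new_ttl):
    """Return a callable that answers old_ttl for `stale_polls` calls, then new_ttl."""
    calls = {"n": 0}

    def get_ttl():
        calls["n"] += 1
        return old_ttl if calls["n"] <= stale_polls else new_ttl

    return get_ttl


# Simulate a resolver that keeps serving the old TTL (300) for four polls,
# as if it were answering from a cache. No real sleeping in the demo.
resolver = make_fake_resolver(stale_polls=4, old_ttl=300, new_ttl=10)
retries = wait_for_ttl(resolver, expected=10, sleep=lambda seconds: None)
print("retries needed:", retries)
```

Passing `sleep` as a parameter keeps the loop testable; with a real delay of 3 seconds, four to six retries would land in roughly the 9-15 second range mentioned in the conversation.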