[00:16:06] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul)
[06:42:35] 10serviceops, 10MediaWiki-General, 10Operations, 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) I'll take that as a reminder to check the MediaWiki version i...
[07:37:11] Pchelolo: alright, topic configured on deployment-cache-text06. Now I'm not sure which brokers to use for deployment-prep PURGEs though!
[08:08:58] 10serviceops, 10Operations, 10decommission, 10ops-codfw, 10Patch-For-Review: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Dzahn) @Papaul done! Do you still have anything to do here on your side?
[09:02:38] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 (10JMeybohm)
[10:51:43] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn) @Papaul I uploaded a new change to add mgmt and production IPs for mw2335-mw2339 (C3). Does it look good to you?
[10:59:58] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 (10JMeybohm)
[11:52:28] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Joe) I thought about this and I think we will **need** to direct people who would set the CP session to the main datacenter anyways:...
[13:12:44] ema: deployment-kafka-main-1.deployment-prep.eqiad.wmflabs
[13:12:51] and 2
[13:14:32] Pchelolo: you are a scholar and a gentleman
[13:15:51] it's a pleasure doing business with you sir
[13:20:21] now I just have to understand what modules/role/lib/puppet/parser/functions/kafka_config.rb is doing and that's it
[13:20:34] <_joe_> I'm that gentle with people only if I'm planning to murder them
[13:20:34] ETA: September
[13:20:38] <_joe_> and see? ema revealed his true plot
[13:21:11] <_joe_> ema: that function is so simple it explains itself
[13:23:27] so far I'm not sure by what sort of magic, but the list of kafka servers for deployment-cache-text06 at the end of the day is deployment-cache-text06.deployment-prep.eqiad.wmflabs:9093
[13:23:35] which is sweet, but wrong
[13:24:30] I hoped to find a hiera attribute I could set to deployment-kafka-main-1.deployment-prep.eqiad.wmflabs but clearly that was overly optimistic
[13:28:16] ema: isn't it deployment-logstash03 instead of 06 ?
[13:28:16] 413 profile::rsyslog::kafka_shipper::kafka_brokers:
[13:28:16] 414 - 'deployment-logstash03.deployment-prep.eqiad.wmflabs:9093'
[13:28:18] hieradata/cloud/eqiad1/deployment-prep/common.yaml
[13:30:36] _joe_: So, you saw I have all the patches to deploy kafka purges to prod everywhere. I'm thinking of doing it all in today's window at 18 UTC. Any blockers on your side?
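A quick way to double-check which brokers and topics are actually reachable on the deployment-prep Kafka cluster is kafkacat's metadata mode. Just a sketch: the broker hostname is the one given above, and the plaintext port 9092 is an assumption.

  # list brokers, topics and partitions known to the deployment-prep main cluster
  kafkacat -L -b deployment-kafka-main-1.deployment-prep.eqiad.wmflabs:9092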
[13:31:44] I would really love it if you or some other SRE were around, so if it's bad timing, maybe let's schedule a different window
[13:33:59] mutante: that's something unfortunately unrelated, the list of brokers I need comes from a bunch of function calls rather than straight from hiera :)
[13:35:20] ema: gotcha
[13:52:56] <_joe_> Pchelolo: I will be in a meeting
[13:54:32] _joe_: how about 17:15 UTC?
[13:55:09] I can make my own window :)
[13:59:06] <_joe_> Pchelolo: that would work too
[13:59:13] <_joe_> or earlier if you prefer
[13:59:57] 16 UTC
[14:00:01] <_joe_> like, get your coffee and let's go :)
[14:00:11] oh, the train is happening now
[14:00:29] <_joe_> ah well :P
[14:00:42] <_joe_> but yes 16 UTC gives me 30 mins before going into another meeting
[14:00:42] right after the train - 15 UTC. let's do it
[14:00:47] <_joe_> and I can just follow around
[14:01:19] oki! I'll put it on the calendar
[14:42:28] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle)
[14:44:15] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle)
[14:51:40] <_joe_> wkandek: oh I forgot to add you? :P Sorry
[14:57:57] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) ` [edit interfaces interface-range vlan-private1-c-codfw] member xe-7/0/3 { ... } + member ge-3/0/3; + member ge-3/0/4; +...
[14:59:12] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul)
[15:04:17] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2335.codfw.wmnet ` The log can be...
[15:07:52] <_joe_> Pchelolo: I'm here btw
[15:14:29] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle) p:05Triage→03Medium
[15:17:03] _joe_: ok. I'm here too. I got confused with times
[15:17:53] I'm gonna do a no-op preparation one first, will ping you when I am doing real changes
[15:18:13] <_joe_> ack :)
[15:20:11] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2336.codfw.wmnet ` The log can be...
[15:26:51] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2335.codfw.wmnet'] ` and were **ALL** successful.
[15:30:09] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2337.codfw.wmnet ` The log can be...
[15:33:26] _joe_: the patch to send purges to kafka for everything is on mwdebug1002
[15:33:43] I've edited some pages and the purges are indeed being sent
[15:33:53] <_joe_> good
[15:33:58] so I'm ready to pull the trigger for prod
[15:34:04] <_joe_> so it's sending both to kafka and htcp right?
[15:34:09] yes
[15:34:13] <_joe_> I'd say it's a go
[15:34:27] ok. going.
[15:34:30] so this means we're gonna receive 2x PURGEs, correct?
[15:34:42] <_joe_> yes
[15:34:46] k, watching
[15:35:46] <_joe_> ema: we should see the eqiad.resource-purge count soar
[15:36:01] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms: Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle) Notes from todays meeting: * Data in ChronologyProtector needs to be strongly persisted but only for fixed short duration o...
[15:36:17] excellent, I see codfw.resource-purge around 200 ops and eqiad around just a few for now
[15:36:40] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1 for the lurkers
[15:36:48] <_joe_> codfw. is sent by restbase-async, so expected
[15:36:52] synced
[15:37:20] 10serviceops, 10Core Platform Team, 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (MediaWiki-MultiDC): Determine multi-dc strategy for ChronologyProtector - https://phabricator.wikimedia.org/T254634 (10Krinkle)
[15:37:51] going up
[15:37:55] <_joe_> it's going up yes
[15:38:40] <_joe_> it gets updated every minute, so we will need a few to see the effect completely
[15:38:49] right
[15:39:22] the change in bytes received is already very visible: https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=35&fullscreen&orgId=1&from=now-3h&to=now
[15:39:39] <_joe_> yes because I guess the messages from mediawiki are less sparse
[15:39:53] <_joe_> but the traffic seems reasonable anyways
[15:41:52] <_joe_> ok I see 1:1 purges over kafka and htcp
[15:42:03] <_joe_> lemme just take a peek with kafkacat
[15:42:47] I was thinking that we could add a request header to the HTTP PURGEs to distinguish between HTCP-initiated vs Kafka
[15:42:56] <_joe_> I see a lot of tags":["mediawiki"]
[15:43:09] <_joe_> ema: no need, we'll turn off htcp soon enough
[15:43:38] if we had something to analyse in detail that is :)
[15:43:38] eventgate is feeling fine
[15:44:01] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2336.codfw.wmnet'] ` and were **ALL** successful.
[15:44:02] <_joe_> ema: you can stalk the purges with (for example in codfw) kafkacat -b 10.192.0.17:9092 -t eqiad.resource-purge -C -o -1
[15:44:54] <_joe_> Pchelolo: is it expected that none of the mediawiki-generated purges has root_event?
[15:45:16] _joe_: yeah, we don't propagate that info well enough
[15:45:23] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2338.codfw.wmnet ` The log can be...
[15:46:05] that's why I want to do a refactoring of how this is done in MW
[15:46:24] <_joe_> sure, gotcha
[15:47:01] so, next step would be to disable htcp
[15:47:07] do you think we're ready for it?
[15:49:23] wait a sec
[15:49:42] what's producing the purges for cache_upload?
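One way to check whether upload URLs are already on the Kafka stream, building on the kafkacat command quoted above: tail the topic and pull out any upload.wikimedia.org URIs. The broker address is the codfw one from the log; having jq on the host is an assumption, a plain grep works just as well.

  # tail eqiad.resource-purge and print only upload.wikimedia.org URIs
  kafkacat -b 10.192.0.17:9092 -t eqiad.resource-purge -C -o -1 2>/dev/null \
    | jq -r 'select(.meta.uri | test("upload\\.wikimedia\\.org")) | .meta.uri'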
[15:50:09] <_joe_> interesting question
[15:50:10] I'm asking because in upload we're getting only HTCP purges and I want to make sure we don't disable those
[15:50:19] ema: hm..
[15:50:35] <_joe_> why are we only getting htcp purges on upload?
[15:50:46] <_joe_> because we're not listening to kafka, right?
[15:51:20] right
[15:51:32] <_joe_> lemme see if any url has upload.wikimedia.org in the purges on kafka
[15:51:41] and maps!
[15:51:56] <_joe_> {"$schema":"/resource_change/1.0.0","meta":{"uri":"https://upload.wikimedia.org/wikipedia/commons/5/5e/Vorderes_Kontrollschild_Liechtenstein.jpg","request_id":"ef7e4e96-4903-414f-94a6-51a268984680","id":"9020b8ec-2cf9-4a65-ace9-94d87d738a3c","dt":"2020-06-10T15:51:46Z","domain":"commons.wikimedia.org","stream":"resource-purge"},"tags":["mediawiki"]}
[15:51:59] <_joe_> yes we do
[15:52:00] yeah, we are producing upload purges
[15:52:12] <_joe_> ema: so we need to start listening to kafka I guess
[15:52:23] <_joe_> we're dropping purges based on a regex, right?
[15:52:57] yeah, but we aren't dropping anything on cache_text based on regexes
[15:53:05] and yet I don't see purges for upload.wm.org
[15:53:15] <_joe_> ema: uhm I do see them with kafkacat
[15:54:04] yes, me too, the previous instance of not seeing them was PEBKAC
[15:54:16] I see them like this FTR:
[15:54:17] varnishncsa -n frontend -q 'ReqMethod eq "PURGE" and ReqHeader:Host eq "upload.wikimedia.org"'
[15:54:45] <_joe_> so we should probably exclude them on the text cluster, but that's a second-order optimization
[15:54:50] ok so we need to consume from kafka on upload too, the regex is already in place
[15:54:51] <_joe_> we already received them
[15:54:55] <_joe_> yes
[15:55:11] 10serviceops, 10Operations, 10Performance-Team, 10Patch-For-Review: Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (10elukey) Side note: if not already done, I'd double check how the WarmUp route behaves when the local memca...
[15:55:13] <_joe_> Pchelolo: so wait for us to add kafka listeners on upload I guess
[15:55:19] sure thing
[15:55:37] <_joe_> ema: I have a meeting in 30', can you take care of it?
[15:55:40] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2337.codfw.wmnet'] ` and were **ALL** successful.
[15:55:52] <_joe_> it's two lines of hiera IIRC
[15:55:58] _joe_: sure thing, doing
[15:56:02] <_joe_> <3
[15:56:17] ok. ping me when we're ready to disable htcp
[15:56:26] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2339.codfw.wmnet ` The log can be...
[15:57:02] do we have a task number handy?
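For a rough sanity check on a cache host, the varnishncsa one-liner above can be extended to count PURGEs per Host header over a sample of requests; the sample size and the format string here are arbitrary choices, not taken from the log.

  # count PURGE requests per Host header over a sample of 10000 requests
  sudo varnishncsa -n frontend -q 'ReqMethod eq "PURGE"' -F '%{Host}i' \
    | head -n 10000 | sort | uniq -c | sort -rn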
[15:57:39] <_joe_> the usual one I guess
[15:57:41] T133821 I guess
[15:57:45] <_joe_> yes
[15:57:52] ema: T250781
[15:59:55] https://gerrit.wikimedia.org/r/604430
[16:00:48] _joe_, Pchelolo: ^
[16:00:50] running pcc now
[16:01:52] this looks good to me: https://puppet-compiler.wmflabs.org/compiler1002/23147/cp3051.esams.wmnet/index.html
[16:02:25] <_joe_> ema: certainly looks the part
[16:02:56] merging then
[16:03:35] now let's see if filtering by regex actually works :D
[16:05:10] it's nice when I run puppet and things don't blow up in my face. Puppet run ok on cp3051
[16:05:32] restarting purged on 3051
[16:07:05] <_joe_> ema: regex filtering is needed for htcpd as well
[16:07:16] <_joe_> given you listen to the same multicast addresses
[16:07:25] <_joe_> but maybe it's broken for kafka
[16:07:46] looks like everything works
[16:08:08] trying to ensure we're getting all purges twice now
[16:08:51] 10serviceops, 10Operations, 10decommission, 10ops-codfw: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) @Dzahn yes i have to setup all the decom servers to offline
[16:09:04] indeed we are, all purges coming in twice
[16:09:12] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2338.codfw.wmnet'] ` and were **ALL** successful.
[16:09:30] was anything else listening to the same multicast? maps?
[16:09:41] ah bravo
[16:09:43] in prod we have all HTCP going to the same address
[16:09:44] <_joe_> Pchelolo: that's the upload caches
[16:10:01] <_joe_> but possibly maps is sending its purges
[16:10:14] I haven't checked if we're getting maps.wm.org purges but I guess those aren't that frequent?
[16:10:16] <_joe_> so that will need to be converted to use kafka too, damn :P
[16:10:26] <_joe_> ema: possibly nonexistent
[16:10:47] <_joe_> so yeah anyways if they're not produced by mw, they will still go to htcpd
[16:10:58] <_joe_> and we will see that in the metrics
[16:11:12] <_joe_> we're not turning off listening on multicast from purged for now
[16:11:17] excellent
[16:11:26] I'll restart purged everywhere then
[16:11:34] <_joe_> cool
[16:11:48] <_joe_> Pchelolo: once ema is done, we can disable producing htcp from mediawiki
[16:11:58] <_joe_> did you two ever think we'd see this day?
[16:12:14] hehe, _joe_ it's not entirely done yet :)
[16:12:24] there's still wikitech, private wikis, loginwiki..
[16:12:43] some long tail that needed me to make a new feature in EventBus
[16:13:27] <_joe_> wait, private wikis should not be cached at the cdn
[16:13:53] that's right, they're not
[16:13:55] oh right
[16:14:08] ok, then it's wikitech and loginwiki
[16:14:21] and beta cluster
[16:14:31] <_joe_> lol, beta
[16:14:41] I have the feeling beta cluster will take longer than mw
[16:14:57] <_joe_> ema: that's not a feeling, it's a reasonable evaluation
[16:15:07] <_joe_> ema: it will surely feel like it took longer
[16:15:30] ema: btw, I guess in beta the upload cache should also be converted to kafka..
[16:15:57] <_joe_> that too
[16:16:29] _joe_: should we make purged autorestart on config changes?
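For context, "restart purged everywhere" at this point means a manual rolling restart from a cumin host; something along these lines would do it, although the A:cp host alias and the batch/sleep values are assumptions rather than what was actually run.

  # rolling restart of purged on the cache hosts, a few at a time with a pause between batches
  sudo cumin -b 4 -s 30 'A:cp' 'systemctl restart purged'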
[16:16:44] via puppet, I mean
[16:16:57] <_joe_> ema: I'm neutral
[16:17:17] in case of potentially dangerous changes we can disable puppet and only apply on one host
[16:17:37] and otherwise we save a manual systemctl restart
[16:17:59] <_joe_> yeah again, I hope purged will live a placid life and we'll restart it only for removing multicast support at some point :P
[16:18:23] <_joe_> but yes, we might come back to this and start dropping old purges, or do fancy stuff to ensure causality between caching layers
[16:19:08] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2339.codfw.wmnet'] ` and were **ALL** successful.
[16:19:25] it's also nice that when a change only applies to, say, upload, we don't have to think about it and limit restarting the service to the upload nodes only
[16:20:28] anyhow, on cp3051 we're now getting hundreds of purge messages via kafka per second: https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=37&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_upload&var-instance=cp3051&from=now-1h&to=now
[16:20:48] and still just a few HTTP purges thanks to our advanced regex filtering: https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=4&fullscreen&orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=cache_upload&var-instance=cp3051&from=now-1h&to=now
[16:24:01] all cache_upload nodes are now getting kafka purges. Give me two more minutes for a final round of checks
[16:26:06] Pchelolo: I think we're ready to go
[16:26:14] okey!
[16:28:53] htcp disabled on mwdebug1002
[16:28:59] lemme test editing a page
[16:29:03] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul)
[16:29:23] and it worked
[16:29:27] nice!
[16:29:41] _joe_: ema - deploying everywhere!
[16:30:11] <_joe_> great
[16:30:25] 10serviceops, 10Operations, 10ops-codfw, 10Patch-For-Review: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) 05Open→03Resolved @Dzahn the 5 servers in C3 are ready for services
[16:31:39] done. And some page editing shows it works
[16:31:48] \o/
[16:32:20] * Pchelolo looking at the graphs
[16:32:21] now we should be seeing the number of HTTP purges == Kafka purges
[16:36:15] Pchelolo: htcp packets received are going down as expected
[16:36:47] ~1K purge requests per second, ~1K purges read from kafka, ...
[16:36:52] I'd say we're looking good
[16:37:11] <_joe_> \o/
[16:37:54] oh.. this is nice. small step for man, but a giant leap for humanity
[16:40:40] nice indeed
[16:41:15] I've gotta go afk now but reachable by phone in the unlikely case I'm needed
[16:41:18] see you folks!
[16:41:49] have a nice evening ema
[16:41:54] thank you for your help
[17:25:00] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle) >>! {8a5401cc5feb5e417419d4df622aa86cfd9728ea} > Revert...
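Relating to the verification above: besides editing pages via mwdebug1002, the same path can be exercised by requesting an explicit purge through the MediaWiki API and watching for the matching event on the stream. A minimal sketch, assuming action=purge on an ordinary page produces a resource-purge event the way the test edits did; the page title is a placeholder and the broker address is the codfw one quoted earlier.

  # in one terminal: watch the stream for the page about to be purged
  kafkacat -b 10.192.0.17:9092 -t eqiad.resource-purge -C -o end 2>/dev/null | grep --line-buffered 'Sandbox'
  # in another: ask MediaWiki to purge that page (anonymous purges have to be POSTed)
  curl -s -X POST 'https://en.wikipedia.org/w/api.php' --data 'action=purge&titles=Sandbox&format=json'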
[17:25:12] 10serviceops, 10MediaWiki-Cache, 10Performance-Team, 10Sustainability (Incident Prevention): Let WANObjectCache store "sister keys" on the same backend as the main value key - https://phabricator.wikimedia.org/T252564 (10Krinkle)
[18:03:09] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle)
[18:03:11] 10serviceops, 10Wikimedia-production-error: Spike of fatal error "Cannot declare class Wikimedia\MWConfig" on mw1379 (2020-06-01) - https://phabricator.wikimedia.org/T254209 (10Krinkle)
[18:03:18] 10serviceops, 10Performance-Team (Radar): Avoid php-opcache corruption in WMF production - https://phabricator.wikimedia.org/T253673 (10Krinkle)
[18:03:20] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle)
[18:15:34] 10serviceops, 10Operations, 10Wikimedia-production-error: PHP7 corruption: Method call executed on unrelated object (also: Call to undefined method) - https://phabricator.wikimedia.org/T245183 (10Krinkle) > `name=Error message > Fatal error: > Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because...
[19:45:41] 10serviceops, 10Operations, 10decommission, 10ops-codfw: codfw: decom at least 15 appservers in codfw rack C3 to make room for new servers - https://phabricator.wikimedia.org/T247018 (10Papaul) 05Open→03Resolved Complete
[20:06:51] <_joe_> Pchelolo: something is sending htcp packets to the caches https://grafana.wikimedia.org/d/RvscY1CZk/purged?panelId=27&fullscreen&orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp1075
[20:07:43] <_joe_> not worrisome, but I'm left wondering what that could be
[20:26:57] there are still some long-tail wikis that do htcp
[20:27:09] for different reasons
[20:31:34] I will get rid of these after the next train, I need one patch to land there first
[20:54:44] <_joe_> oh right
[20:54:47] <_joe_> I forgot
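If it ever becomes interesting to pin down what is still sending HTCP before the long-tail wikis are converted, a short capture on a cache host would show the source addresses. UDP port 4827 is the standard HTCP port; running this as root on a cp host is assumed.

  # sample a handful of HTCP packets and show where they come from
  sudo tcpdump -n -c 20 'udp port 4827'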