[00:18:20] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) mw2350 through mw2376 are all pooled in production and set to "Active" in netbox now. [00:18:46] 10serviceops, 10Operations: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) [00:19:16] 27 servers pooled in codfw [00:19:39] 15 old ones will be removed soon [00:19:59] then remaining 15 can be added [01:42:57] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Krinkle) [08:12:01] 10serviceops, 10Operations, 10Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (10Joe) [08:16:37] 10serviceops, 10MediaWiki-General, 10Operations, 10Service-Architecture: Create a grafana dashboard to monitor services proxied via envoy - https://phabricator.wikimedia.org/T247388 (10Joe) [08:19:26] 10serviceops, 10MediaWiki-General, 10Operations, 10Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (10Joe) [08:21:38] 10serviceops, 10MediaWiki-General, 10Operations, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) >>! In T244843#5957717, @Ottomata wrote: > @Joe @akosiaris all deploym... [13:10:42] 10serviceops, 10Android-app-Bugs, 10Operations, 10Traffic, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (10JoeWalsh) a:03JoeWalsh [13:30:40] akosiaris: thanks for evenstreams stuff [13:30:41] interesting. [13:30:57] it still seems to be doing that periodic cpu & mem usage stuff [13:31:28] i never ran benchmarks for more than around 5 minutes at once [13:31:35] so i never saw thihs hourly periodic stuff [13:31:48] ottomata: yes, so I noticed that the CPU thing happens right at the peak of the memory usage [13:32:11] interestingly Garbge collections don't seem to add up that much [13:32:17] at least assuming the graphs are correct [13:32:22] that's also when I got throttled when benchmarking, any time i started bumping into mem limits [13:32:30] i got cpu throttled into oblivion [13:33:02] are there service or pod restarts happening when this happens? [13:33:04] checking... 
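The observation above that garbage collection "doesn't seem to add up that much, assuming the graphs are correct" can be cross-checked from inside the service itself. Below is a minimal Node.js sketch (TypeScript; not part of eventstreams, and the JSON field names are made up for illustration) that logs GC pauses via perf_hooks and samples heap vs. RSS, which helps separate a JS heap leak from native memory growth (e.g. Kafka client buffers):

    import { PerformanceObserver } from 'perf_hooks';

    // Log every GC pause; if the Grafana graphs are right, these should stay small
    // even while container memory keeps climbing.
    const gcObserver = new PerformanceObserver((list) => {
      for (const entry of list.getEntries()) {
        console.log(JSON.stringify({ type: 'gc', durationMs: entry.duration, startMs: entry.startTime }));
      }
    });
    gcObserver.observe({ entryTypes: ['gc'] });

    // Sample heap vs. RSS once a minute: heapUsed growing points at a JS-level leak,
    // RSS growing while heapUsed stays flat points at allocations outside the V8 heap.
    setInterval(() => {
      const m = process.memoryUsage();
      console.log(JSON.stringify({ type: 'mem', rss: m.rss, heapUsed: m.heapUsed, external: m.external }));
    }, 60_000);

If heapUsed climbs toward the 1500M limit, the leak is in JS objects; if RSS climbs while heapUsed stays flat, the growth is outside the V8 heap and the GC graphs will indeed look unremarkable.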
[13:33:09] no there aren't [13:33:13] ok [13:33:21] there's already an annotation that would show that [13:33:39] so, current memory limit for eventstreams is at 1500M [13:33:55] I guess we should bump that and tend to the general limit as well while at it [13:34:09] but there are clear signs of a memory leak however [13:34:19] this https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s?orgId=1&from=1583925135822&to=1583933610464&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventstreams&fullscreen&panelId=25 [13:34:28] shouts memory leak in all possible ways [13:35:38] yeah [13:35:38] 10serviceops, 10Operations, 10Service-Architecture: Monitor envoy status where it's installed - https://phabricator.wikimedia.org/T247387 (10Joe) p:05Triage→03High [13:37:11] now that I look at it, even if we bump the limits, all we 'll do is decrease the frequency of the event, not really solve it [13:37:19] yeah i think you are right [13:37:36] (i am very bad at kibana...yarghhh) [13:37:46] join the club [13:37:50] i'm doing [13:37:50] kubernetes.labels.app:"eventstreams" [13:37:55] in discover [13:37:58] I honestly think I don't understand it at times... [13:37:59] but getting tons of results [13:38:15] not related [13:38:19] OH [13:38:21] NO QUOTES! [13:39:59] ok, akosiaris [13:40:00] https://logstash.wikimedia.org/goto/ed35d2e6fe041876f9ff5aeec195ff79 [13:40:03] i think this is aproblem [13:40:16] that is way too many connection eveents [13:41:18] that's level 30 btw, there is also a level 50 (warn?) message every now and then [13:41:23] lemme find it again [13:41:58] ah this one [13:41:58] Cannot write events after closing the response! [13:41:59] ? [13:42:14] yes [13:42:15] i think that can happen normallly if the client closes the response unexpectedly [13:42:35] but I think it kills the worker after that [13:42:41] but it does happen pretty frequehntly [13:42:43] OH? really? [13:42:46] that is not normal then. [13:43:00] processImmediate (timers.js:658:5)"},"msg":"Cannot write events after closing the response!","time":"2020-03-11T13:39:06.955Z","v":0} [13:43:00] {"name":"eventstreams","hostname":"eventstreams-production-5f94567bc-lf9zr","pid":1,"level":50,"message":"worker stopped sending heartbeats, killing.","worker_pid":4938,"levelPath":"error/service-runner/master","msg":"worker stopped sending heartbeats, killing.","time":"2020-03-11T13:41:02.516Z","v":0} [13:43:12] that ^ ? [13:43:27] I cut out the actual exception from nodejs [13:43:36] but you can find the entirety of it in kibana [13:44:39] hm [13:44:39] so that "Creating new KafkaSSE instance 861fcc50-7ed1-40e8-b098-a8bd68700260."? Probably is right after the worker deaths? [13:44:50] haven't correlated that yet. It might not add up [13:45:44] would make sense, usually SSE clients reconnect automatically [13:45:46] hmm yeah they are way less in number, 15 vs 200 per timeperiod in kibana. it might be unrelated [13:46:00] it still shouldn't kill thhe worker thuogh [13:46:01] am looking [13:46:21] actually 15 vs 800 now that I look at it again [13:46:54] it also seems to be finishing normally and disconnecting [13:47:31] at least trying to figure out the order of actions from that log, that's what I get. 
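On the "level 30 ... level 50 (warn?)" question raised above: service-runner logs through bunyan, whose numeric levels are fixed, so 50 is an error and 40 would be the warning level. A small reference sketch, assuming bunyan's standard numbering:

    // Bunyan's standard numeric levels, as seen in the "level" field of the
    // logstash documents quoted above.
    const BUNYAN_LEVELS: Record<number, string> = {
      10: 'trace',
      20: 'debug',
      30: 'info',   // e.g. "Creating new KafkaSSE instance ..."
      40: 'warn',
      50: 'error',  // e.g. "Cannot write events after closing the response!" and the heartbeat kill
      60: 'fatal',
    };

    console.log(BUNYAN_LEVELS[30], BUNYAN_LEVELS[50]); // "info error"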
[13:47:41] yeah, there is an async consume loop that happens, that error is thrown when the callback fires after the connection ahs been closed [13:47:56] which is why it is a warning [13:55:19] heh actually I think level 50 is an error [13:55:32] not sure [13:55:57] all of this might also very well be fully unrelated to the memory leak as well [13:56:09] yeah it might be [13:56:23] so that error comes from the SSEResponse lib which we appropriated from elsewhere [13:56:35] there are 3 places in the lib that return that exact error message! [13:56:41] lol [13:56:44] and the stack does not indicate which one because promises. [13:57:05] i'm going to differentiate the error messages, but ya probably not related [14:08:57] heh actuallyl, the code i'm looking at mar ko wrote... hmmmm heheh [14:34:25] yeah huhh actually looking at https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams-k8s again [14:34:29] its obiously bad [14:34:41] 700 reqps ? [14:34:42] no wawy [14:34:49] it should be close to 1 or 0 [14:35:31] that means that for some reason connections are just constantly being closed and opened [14:37:30] akosiaris: i know you are busy with eg TLS so respond when you have time [14:37:31] but [14:37:45] where does this one come from? [14:37:46] https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.03.11/too_many_requests?id=AXDJ0obih3Uj6x1zNyN_&_g=h@44136fa [14:37:54] Your HTTP client is likely opening too many concurrent connections. [14:38:05] Unable to completely restore the URL, be sure to use the share functionality. [14:46:22] hm i did but ok [14:46:27] oh nknow that was the single doc url [14:47:23] akosiaris: [14:47:23] 429: too_many_requests [14:47:24] oops [14:47:25] ha [14:47:28] https://logstash.wikimedia.org/goto/a2748eca3c273a598129d1c4012954e6 [14:48:52] ottomata: so looking at one at random, it's pod eventstreams-production-5f94567bc-jczkx in eqiad that produced it [14:49:37] akosiaris: looking at all of tem, it looks spread across all pods [14:49:39] hosted on kubernetes1002, and the client seems to be envoy [14:49:50] ah ok, so envoy is throttling the requests? [14:49:51] request.headers.user-agent Go-http-client/2.0 [14:50:03] interesting! didn'it knowwe had that capability [14:50:18] ah wait, that might be an assumption on my part [14:50:26] hm they aren't all Go-http-client tho [14:50:48] that [14:50:49] and [14:50:49] Jersey/2.28 (HttpUrlConnection 1.8.0_222) [14:50:55] and okhttp/3.8.1 [14:51:05] yeah, I take back what I said [14:51:07] https://logstash.wikimedia.org/goto/f6807124c1267c8796c5d63316a8e8fa [14:52:53] so many versions of Jersey.. [14:53:11] I also see 241 instead of 222 [14:53:53] where is the envoyproxy image from? operations-debs-envoyproxy? [14:54:11] yup. but those are the actual UA clients use it seems [14:54:30] envoy passes them as is [14:55:01] ottomata: eventstreams has internal ratelimiting? [14:55:12] it has some per client ip, but now that envoy has it maybe we don't need it [14:55:18] each worker [14:55:23] keeps track of conns per client ip [14:55:36] oh I never said it has that capability. Better refer to _joe_ about that [14:55:40] but we aren't seeing logs about conns being denied because of that [14:56:06] it does! i think. https://www.envoyproxy.io/learn/backpressure [14:56:26] I honestly have no idea. Haven't found the time to dive deep into it yet [14:56:43] OH [14:56:46] maybe this is from eventstreams. [14:56:51] but the message is different. 
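One way to make the three identical "Cannot write events after closing the response!" call sites distinguishable in logstash, and to handle the late consume-loop callback explicitly, is sketched below. The function name, the callSite label, and the guard are illustrative only, not the actual KafkaSSE/SSEResponse API:

    import { ServerResponse } from 'http';

    // Illustrative sketch only -- not the real KafkaSSE/SSEResponse code.
    // Each call site passes its own label so the logged message identifies
    // which of the three write paths fired after the client went away.
    function writeEvent(res: ServerResponse, payload: string, callSite: string): void {
      if (res.writableEnded || res.destroyed) {
        // The async consume loop's callback can fire after the connection has
        // been closed; say where, instead of one shared anonymous message.
        throw new Error(`Cannot write events after closing the response! (${callSite})`);
      }
      res.write(`data: ${payload}\n\n`);
    }

    // e.g. writeEvent(res, JSON.stringify(event), 'kafka-consume-callback');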
[14:56:52] hmm [14:57:02] oh oh oh [14:57:05] it is from eventstreams [14:57:07] sorry. [14:57:17] interesting... why though? [14:57:34] what do you use for that ratelimiting code? X-Client-IP ? [14:57:46] or XFF ? [14:58:26] the https://phabricator.wikimedia.org/T196553 [14:58:33] X-Client-IP [14:58:41] good [15:00:24] akosiaris: is there any ip affinity stuff going on here? [15:00:25] _joe_: just to make sure, X-Client-IP is being passed through envoy as is, right? [15:00:55] the path being caching-layer => sidecar envoy => eventstreams [15:01:02] ottomata: what do you mean ? [15:01:50] <_joe_> akosiaris: not sure [15:02:01] <_joe_> akosiaris: there are various related settings [15:02:16] ok, /me will tcpdump then [15:02:33] like are the same pods always getting connections from the same client IPs? [15:02:44] i doubt it, right? load balancing is random or round robin or something? [15:02:46] <_joe_> ottomata: def no [15:02:51] <_joe_> yes [15:02:53] network client IPs (as in cp boxes) ? no [15:02:58] <_joe_> ottomata: are you seeing issues? [15:03:00] we can do that if you want, but not right now [15:03:04] no no [15:03:05] don't want [15:03:07] and it's statistically random [15:03:13] good [15:03:47] _joe_: trying to figure out why there are so many disconnect+connects now that eventstreams is in k8s [15:04:17] akosiaris: there might be a mem leak indeed, but I suspect there are just so many reconnects, each of which is a new kafka client allocated, that GC can't keep up [15:05:23] <_joe_> maybe we didn't change the timeout in envoy and it's timing out clients? [15:05:35] <_joe_> ottomata: can you try to connect with a client and see if you get a disconnect? [15:06:34] ya... [15:07:37] akosiaris@kubernetes1001:~$ sudo nsenter -t 9101 -n tcpdump -s0 -Ani any port 8092 |grep -i X-Client-IP [15:07:38] I see them as expected [15:08:04] a lot of wmcs ofc, but overall I see what I would expect to see [15:08:56] ottomata: note that it's lowercase per tcpdump. x-client-ip. I highly doubt it matters (in fact nodejs defaults to that lowercase form) just noting it [15:09:11] express, not nodejs IIRC [15:09:12] ya [15:09:45] ottomata: how is the ratelimiter shared across the nodes? [15:09:53] >>>>>>>>>>GCCCCCCCCCCjjjjjjjjjj… [15:09:58] I mean, is it per process? or does it have some shared state?
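On "is it per process?": the state really is per worker, nothing shared across workers or pods, as confirmed just below. A simplified express middleware sketch of that shape (not the actual eventstreams code; the header name, the limit of 1 per worker, and the 429 body mirror what is quoted above):

    import { Request, Response, NextFunction } from 'express';

    // Simplified sketch, not the actual eventstreams implementation: state is a
    // plain Map in this process, so every worker (and every pod) counts on its own.
    const connectionsPerIp = new Map<string, number>();
    const CLIENT_IP_CONNECTION_LIMIT = 1; // per worker

    export function limitByClientIp(req: Request, res: Response, next: NextFunction): void {
      // express lowercases header names, hence 'x-client-ip'.
      const clientIp = (req.headers['x-client-ip'] as string | undefined) ?? req.ip ?? 'unknown';
      const current = connectionsPerIp.get(clientIp) ?? 0;
      if (current >= CLIENT_IP_CONNECTION_LIMIT) {
        res.status(429).json({
          type: 'too_many_requests',
          detail: 'Your HTTP client is likely opening too many concurrent connections.',
        });
        return;
      }
      connectionsPerIp.set(clientIp, current + 1);
      res.on('close', () => {
        const n = connectionsPerIp.get(clientIp) ?? 1;
        if (n <= 1) {
          connectionsPerIp.delete(clientIp);
        } else {
          connectionsPerIp.set(clientIp, n - 1);
        }
      });
      next();
    }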
[15:10:07] chaomodus: ? [15:10:11] sorry [15:10:16] akosiaris: it isn't at all [15:10:16] Cat on the keyboard. [15:10:18] cat jumped on here when i looked away for a mo [15:10:23] hahahha [15:10:25] ahahahaha [15:10:26] Called it. [15:10:56] ok, so ratelimit state stays local to the pod [15:11:01] ok this is weird then... [15:11:26] well this also could be unrelated, maybe it is just a symptom of the disconnects [15:11:32] ottomata: ah, my tcpdump shows some IPs doing this multiple times [15:11:39] yeah [15:11:47] and given client_ip_connection_limit: 1 [15:11:58] that would trigger it. but maybe they are just trying to reconnect? [15:11:58] per worker [15:12:38] oh heh ' # In prod we run 100 worker replicas.' not accurate anymore [15:12:46] hmmm right, many fewer workers now [15:12:54] only 8 in eqiad [15:13:12] let's bump that up [15:13:49] i'm also going to rebuild with some more informative error messages around those sse closed error/warnings [15:13:52] maybe we'll learn something [15:15:33] 10serviceops, 10Core Platform Team, 10MediaWiki-Cache, 10Operations: WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow - https://phabricator.wikimedia.org/T244877 (10aaron) >>! In T244877#5873326, @Ladsgroup wrote: > Apparently this is a built-in and not-properly docume... [19:03:05] 10serviceops, 10Operations: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Dzahn) [19:19:23] who do i ask about how parsoid-php.discovery.wmnet works?
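A back-of-the-envelope check of why going from ~100 workers to 8 matters: with client_ip_connection_limit: 1 per worker and statistically random load balancing, a client opening a few concurrent streams is much more likely to land twice on the same worker, get a 429, and reconnect, which looks exactly like the connect/disconnect churn discussed above. A quick simulation sketch (the 4 streams per client is an assumed number, purely illustrative):

    // Toy model: with a per-worker limit of 1 and random load balancing, how often
    // does a client opening N concurrent streams hit a worker that already holds one?
    function rejectionRate(workers: number, streamsPerClient: number, trials = 100_000): number {
      let rejected = 0;
      for (let t = 0; t < trials; t++) {
        const busy = new Set<number>();
        for (let s = 0; s < streamsPerClient; s++) {
          const w = Math.floor(Math.random() * workers);
          if (busy.has(w)) { rejected++; } else { busy.add(w); }
        }
      }
      return rejected / (trials * streamsPerClient);
    }

    console.log('8 workers  :', rejectionRate(8, 4).toFixed(3));   // noticeably above zero
    console.log('100 workers:', rejectionRate(100, 4).toFixed(3)); // close to zero

In this toy model roughly one request in six is rejected with 8 workers, versus under 2% with 100, so the worker-count drop alone can plausibly account for a lot of the 429s.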
[19:20:52] in particular, both beta and production seem to use this same hostname for parsoid -- is there some magic that makes this resolve differently in labs? [19:23:05] cscott: no magic for that. in labs it would have to use another name added separately in labs DNS. [19:23:23] but the parameter parsoid_uri can be changed in Hiera [19:23:31] hieradata/role/common/restbase/production.yaml:profile::restbase::parsoid_uri: "https://parsoid-php.discovery.wmnet/w/rest.php" [19:23:43] that is prod [19:23:46] hieradata/role/common/restbase/dev_cluster.yaml:profile::restbase::parsoid_uri: "https://parsoid-php.discovery.wmnet/w/rest.php" [19:23:51] and that should be scandium [19:25:00] so in deployment-prep it could be set differently, either in Horizon web UI or with a change in the puppet repo in hieradata/cloud/eqiad1/deployment-prep [19:25:43] that's all the occurences of the discovery.wmnet name that i see in puppet repo at least [19:28:34] it could (also) be already set to something in deployment-prep Hiera in Horizon based on prefix or hostname of the instances that have the restbase profile on them.. if any [19:34:37] mutante: wait, i was assuming dev_cluster means beta/labs [19:36:00] cscott: ah. it's neither beta/labs nor scandium. it's the "cassandra/restbase dev cluster" in production [19:36:08] as opposed to "production/production" [19:36:27] they are called restbase-dev100[4-6].eqiad.wmnet [19:36:31] hieradata/role/common/restbase/dev_cluster.yaml [19:36:37] yes, that [19:36:42] but they live in wmnet, not .wmflabs? [19:36:47] correct [19:36:53] they live in production site.pp [19:36:56] and .wmnet [19:37:21] deployment-prep/beta is all separate [19:37:44] what is http://deployment-restbase02.deployment-prep.eqiad.wmflabs:7231 then? [19:38:02] that's "beta" in cloud [19:38:15] i think that's the machine i need to (re)configure to fix T246833 [19:38:34] where is it getting its setup from? [19:38:41] yes, that sounds like it [19:38:54] there are multiple ways [19:39:33] either it's in the production puppet repo under hieradata/cloud/eqiad/deployment-prep/common.yaml or ./hosts/ [19:39:41] or it's in Horizon Web UI Hiera .. [19:39:53] and there either under instance name or under project name or under prefix name [19:40:04] i'm guessing the latter, since i think i've looked at every single match for 'parsoid' in codesearch [19:40:24] that's why on that other change i said that thing about "let's first do Horizon and then copy it back" [19:40:41] because web UI is quicker to test it but in the long run it's less confusing to have it in the repo [19:41:27] if it's done by hostname then we have the problem again next time we recreate it, fwiw [19:41:40] so better by prefix in this case [19:41:51] deployment-restbase* [19:42:38] parsoid10 also shows up in hieradata/cloud/eqiad1/devtools/common.yaml [19:42:56] perhaps should be changed there as well? but i don't know what that does [19:44:33] cscott: nah, that's unrelated [19:44:46] devtools is the name of the cloud VPS project [19:44:57] since that is not deployment-prep it's not "beta" [19:45:53] the reason it shows up there is because deployment-mediawiki-parsoid10 is a mediawiki server and in devtools there is an deployment_server and those have scap and scap has lists of mw-installations [19:46:10] not related to beta or that ticket at all [19:46:14] well, parsoid11 is a mediawiki server as well, isn't it? [19:46:34] it's related to T246854 [19:47:24] ah, yea, true. 
it should probably be updated there if we have a new mediawiki server but it is in itself just for deploying inside its own project [19:48:07] there would not be deploying from devtools to beta [19:48:23] so it's more like random data to ensure there is no puppet failure..copy/paste [19:50:30] i guess the best fix is to have just empty scap::dsh::groups in that other project [19:50:54] i'll do that to point out it is not beta-related [19:59:55] mutante: how do i find out parsoid11's IP address? [19:59:59] cf https://gerrit.wikimedia.org/r/579018 [20:00:53] cscott: "host deployment-parsoid11.deployment-prep.eqiad.wmflabs" on any other instance in cloud [20:01:03] or ssh to it and "ip a s" [20:01:54] ah, i think i'll need that same setting but for production soon. to add parse2* in codfw [20:02:03] the SubmitterWhitelist thing [20:02:59] i could have sworn i tried the 'ssh in and run host' thing and it didn't work. i wonder what i was mistyping. [20:03:21] oh: [20:03:22] hmm. sometimes there are issues with DNS in cloud [20:03:24] $ host deployment-parsoid10.deployment-prep.eqiad.wmflabs [20:03:24] Host deployment-parsoid10.deployment-prep.eqiad.wmflabs not found: 3(NXDOMAIN) [20:03:39] but -parsoid11 works [20:04:03] and parsoid10 shows up on https://openstack-browser.toolforge.org/project/deployment-prep [20:04:09] i'm afraid it might be intermittent DNS issues in cloud [20:04:19] there was something going on the other day [20:04:56] ah [20:04:59] $ host deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs [20:04:59] deployment-mediawiki-parsoid10.deployment-prep.eqiad.wmflabs has address 172.16.0.141 [20:05:08] ah, ok [20:05:11] parsoid10 has an extra -mediawiki- in its dns name [20:05:29] yea, good question which one is better. since there is the "prefix puppet" thing [20:05:37] it can mean that Hiera stuff is applied to only one of them [20:05:58] there's also a parsoid-php-beta.wmflabs.org alias, which is currently pointing to parsoid10 [20:06:11] i couldn't find anything using that alias, though. at least not in codesearch. [20:06:17] that one you should be able to edit in the browser [20:06:25] in Horizon [20:06:33] it's a web UI thing [20:07:25] at least admins of the deployment-prep project should be able to add and delete DNS names [20:07:42] cscott: i cleaned this up. don't worry about the "devtools" part https://gerrit.wikimedia.org/r/c/operations/puppet/+/579040 [20:21:45] cscott: +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/579035 should i merge or wait ? [20:24:06] yes, merge please [20:24:53] i can't see where profile::restbase::parsoid_uri is listed in https://horizon.wikimedia.org/project/instances/63ae9220-64ec-440c-b46b-52c8fc1c44c2/ [20:25:17] i was trying to figure out if i could test in Horizon first, but doing it directly in puppet works too [20:25:44] you should be able to test it in Horizon first, but it was probably simply not in there [20:26:24] but it's also pretty confusing because there is "project level", "prefix level" and "hostname level" in different places [20:27:09] using Horizon and repo have their own pros and cons. [20:27:22] merged. now it should just take a few minutes until it shows up on the beta master [20:27:24] hey, things seem to work now in beta [20:27:28] nice! 
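Since the two beta parsoid instances turned out to differ only by a "-mediawiki-" infix in their DNS names, a small lookup helper saves the guesswork and makes intermittent cloud DNS failures easy to spot by rerunning it. Hypothetical sketch (lookupBetaHost is not an existing tool):

    import { promises as dns } from 'dns';

    // Try the instance name both with and without the "-mediawiki-" infix,
    // since the deployment-prep parsoid instances use different naming schemes.
    async function lookupBetaHost(instance: string): Promise<void> {
      const candidates = [
        `deployment-${instance}.deployment-prep.eqiad.wmflabs`,
        `deployment-mediawiki-${instance}.deployment-prep.eqiad.wmflabs`,
      ];
      for (const name of candidates) {
        try {
          const addrs = await dns.resolve4(name);
          console.log(`${name} -> ${addrs.join(', ')}`);
        } catch (err) {
          console.log(`${name}: ${(err as NodeJS.ErrnoException).code}`); // e.g. ENOTFOUND
        }
      }
    }

    lookupBetaHost('parsoid11').catch(console.error);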
[20:27:44] at least https://en.wikipedia.beta.wmflabs.org/api/rest_v1/#/Page%20content/get_page_html__title_ isn't giving me hyperswitch errors any more [20:28:20] great [20:28:51] still getting them on de.wikipedia.beta.wmflabs.org though [20:28:59] you think it might take a few minutes? [20:29:27] in theory if restbase02 has the right config i don't know why it would work on enbeta and not debeta [20:29:39] hmm, not if the only difference is the language name within beta, no [20:29:49] yea, agree, i dont see that [20:30:18] after all you changed deployment-prep/common.yaml so that applies to all instances in the project [20:30:21] as well [20:31:31] unless difference instances host "en" and "de" and don't use the same roles or ..caching [20:33:39] $ curl -X GET "https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Main_Page?redirect=false" -H 'accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"' [20:33:41] works [20:33:56] $ curl -X GET "https://de.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Main_Page?redirect=false" -H 'accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"' [20:33:58] doesn't [20:34:31] and https://de.wikipedia.beta.wmflabs.org/wiki/Main_Page exists [20:35:53] uhm yea. that's the same IP address that points to [20:36:52] huh, this doesn't work either: $ curl -X GET "https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Winamac" -H 'accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"' [20:37:04] maybe it;s just that enwiki:Main_Page got cached somehow [20:37:18] ok, back to trying to diagnose [21:31:20] mutante: /etc/restbase/*.yaml all still have parsoid10 as parsoid_uri [21:31:55] on deployment-restbase02 [21:32:05] "The last Puppet run was at Wed Feb 19 09:18:39 UTC 2020 (30967 minutes ago)." [21:32:10] ^ that seems to be the problem? [21:32:18] Pchelolo: ^ [21:32:20] aww.. yea. then puppet is broken there for some other reason [21:32:32] so new changes dont get applied [21:33:55] how do we fix that [21:35:26] ^ James_F just so we're on the same page [21:35:30] run puppet agent -tv and see what error it is [21:35:51] this is common and can have many different reasons [21:36:03] https://phabricator.wikimedia.org/search/query/QmWMjkBznfIH/#R (i hope this URL works) [21:40:39] mutante: It does. Also, :-( [21:43:41] James_F: maybe something do with your envoy::ensure from yesterday [21:43:45] Info: Loading facts [21:43:45] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::envoy::ensure' (file: /etc/puppet/modules/profile/manifests/envoy.pp, line: 5) on node deployment-restbase02.deployment-prep.eqiad.wmflabs [21:43:45] Warning: Not using cache on failed catalog [21:43:46] Error: Could not retrieve catalog; skipping run [21:43:46] cscott@deployment-restbase02:/etc/restbase$ [21:44:05] (T247467) [21:44:22] cscott: Oh, yes, I only fixed that for parsoid11. [21:44:28] James_F: there should be shinken alerts about this somewhere [21:44:33] Someone needs to tell me how to fix it for the whole of beta cluster. [21:45:29] James_F: you have a fix per hostname? then copy the same thing over to "project puppet" tab [21:45:51] Oh, yes, that works, but how do I fix it in git? :-) [21:45:55] or we add it in the repo in common.yaml that cscott just used [21:46:05] OK. 
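Returning to the en vs. de comparison at 20:33 above: the same check can be scripted so it is easy to rerun after a config change or cache purge. A sketch assuming Node 18+ for the global fetch; whether the beta edge actually returns an x-cache header is an assumption here, made to probe the "maybe enwiki:Main_Page got cached" theory:

    // Mirrors the two curl calls above; prints status plus a cache-related header
    // (the header name is an assumption about what the beta edge returns).
    async function compareBetaRestHtml(): Promise<void> {
      const targets: Array<[string, string]> = [
        ['en', 'https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Main_Page?redirect=false'],
        ['de', 'https://de.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Main_Page?redirect=false'],
      ];
      for (const [wiki, url] of targets) {
        const res = await fetch(url, {
          headers: {
            accept: 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"',
          },
        });
        console.log(wiki, res.status, res.headers.get('x-cache') ?? '(no x-cache header)');
      }
    }

    compareBetaRestHtml().catch(console.error);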
[21:46:19] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/579035/ for reference [21:46:24] James_F: /puppet/hieradata/cloud/eqiad1/deployment-prep/common.yaml [21:46:36] Sounds good. [21:53:06] in another matter: codfw appservers. all servers that could be added have been added as appservers/API. now i wanted to decom 15 old ones to unblock racking of the remaining 15 new ones. except i see now that the affected rack C3 is half jobrunners. [21:53:31] checking if "the oldest ones" is the same as "only jobrunners" [21:55:12] James_F: are you working on that patch? [22:04:30] James_F, mutante: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/579070 [22:05:42] cscott: Sorry, I wasn't. Thanks. [22:08:27] i am merging that, but we can't know for sure if other instances use this [22:10:01] actually, please hold. currently blocked from merging on puppetmaster [22:10:31] ideally check the shinken alerts about puppet runs on beta [22:22:08] cscott: James' change is now merged [22:25:34] puppet is still failing on restbase02 for me, but it's probably my fault [22:26:37] $ puppet agent -tv [22:26:37] Info: Using configured environment 'production' [22:26:38] Info: Retrieving pluginfacts [22:26:39] Info: Retrieving plugin [22:26:40] Info: Loading facts [22:26:42] Info: Caching catalog for deployment-restbase02.deployment-prep.eqiad.wmflabs [22:26:44] Error: Failed to apply catalog: Parameter user failed on Exec[verify-envoy-config]: Only root can execute commands as other users at /etc/puppet/modules/envoyproxy/manifests/init.pp:88 [22:26:47] cscott@deployment-restbase02:~$ sudo puppet agent -tv [22:26:49] Warning: Unable to fetch my node definition, but the agent run will continue: [22:26:51] Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for deployment-restbase02.deployment-prep.eqiad.wmflabs: Failed to find facts from PuppetDB at puppet:8140: Failed to execute '/pdb/query/v4/nodes/deployment-restbase02.deployment-prep.eqiad.wmflabs/facts' on at least 1 of the following 'server_urls': https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs [22:26:53] Info: Retrieving pluginfacts [22:26:55] Info: Retrieving plugin [22:26:58] Info: Loading facts [22:27:00] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=19e0362414c0c1493db8e04908d36fee35aff424&version=5&certname=deployment-restbase02.deployment-prep.eqiad.wmflabs&command=replace_facts&producer-timestamp=2020-03-11T22:26:26.389Z' on at least 1 of the following 'server_urls': https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs [22:27:02] Warning: Not using cache on failed catalog [22:27:04] Error: Could not retrieve catalog; skipping run [22:27:06] cscott@deployment-restbase02:~$ [22:27:13] am i Doing It Wrong? [22:34:33] (does https://puppetboard.wikimedia.org/node/deployment-restbase02.deployment-prep.eqiad.wmflabs work for you all? it seems to reject my LDAP login) [22:35:09] cscott: puppetboard is only for prod hosts [22:35:35] the login works but "What you were looking for could not be found in PuppetDB." 
[22:36:03] also it requires ops at this time [22:36:05] Require ldap-group cn=ops [22:36:31] and WMCS doesn't have a centralized puppetdb [22:36:49] deployment-prep might have it I don't recall [22:36:54] we do [22:38:49] though looks like maybe something's up with it [22:39:16] yeah it got OOM killed again [22:40:24] anyway, puppet is still b0rked on deployment-restbase02, because (it seems) something's wrong with https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs [22:41:08] yes [22:41:12] that's what I said [22:41:30] wondering if I can make that puppetdb instance larger [22:45:32] Krenair: I'm at the end of my day, could you comment on T247467 re next steps? [22:49:53] done [22:50:19] thanks