[07:14:45] trying to enable role::labs::mediawiki_vagrant on stretch I'm getting lxc-net[20057]: iptables v1.6.0: can't initialize iptables table `nat': Table does not exist
[07:16:18] dmesg mentions nf_conntrack: Unknown symbol __siphash_aligned
[07:17:17] dcausse: afaik labs does not support IPv6
[07:17:35] wait, ignore that comment :)
[07:17:53] unrelated
[07:18:39] iirc I tried with buster and I could enable the role but then vagrant up was complaining about obsolete/unsupported lxc settings
[07:37:40] rebooting the instance did the trick
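The failing `nat` table, the nf_conntrack "Unknown symbol" message, and the fact that a reboot cleared everything are consistent with the instance running an older kernel than the modules installed on disk (for example after a kernel package upgrade). A minimal diagnostic sketch under that assumption, not a confirmed diagnosis from the log:

```
# Hedged sketch: compare the running kernel with the module trees on disk
# (Debian stretch layout assumed).
uname -r
ls /lib/modules/
# On a mismatched system, loading the module that backs the nat table fails
# until the instance is rebooted into the matching kernel:
sudo modprobe iptable_nat
sudo iptables -t nat -L -n
```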

[14:45:18] !log openstack update dns-servers on CODFW subnet `cloudinstancesb-v4`
[14:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Openstack/SAL
[15:08:36] !log toolsbeta `sudo truncate -s 0 /var/log/exim4/paniclog` on toolsbeta-{sgewebgrid-{lighttpd,generic}-0901,sgecron-01}.toolsbeta.eqiad.wmflabs because of email spam
[15:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[17:06:34] godog: I just saw the ticket closures for upgrading grafana. thank you :)
[17:29:38] bd808: heh, had to go the JFDI/bold route
[18:13:00] !log tools Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
[18:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:55:19] bd808: I assume that whatever issue is on the docker registry won't affect tools in any way?
[19:57:16] quick question. I have a server running on an instance (test-endpoint-01 in discovery-stats project), listening on port 8000. `curl localhost:8000/view` outputs `hello world`. In security groups I have "ALLOW IPv4 8000/tcp from 0.0.0.0/0" and I have a web proxy configured with the correct backend port. But https://pai-test.wmflabs.org/view still gives me 502 Bad Gateway. What am I missing here?
[19:57:36] Hey folks. I'm starting to be suspicious that the hosts I'm running some VMs on are being weird. I'm certainly not sure. Is there a way to find out what host a VM is running on?
[19:57:52] Oh! Just found it in horizon
[19:58:02] nevermind. It's in the instance details
[20:01:23] (By correct backend port I mean 8000)
[20:14:33] halfak: Heh i was just about to answer you :P
[20:19:30] OK I think I might have narrowed this down some more. It looks like we're failing to consistently send stats to labs graphite.
[20:19:52] We're using host: labmon1001.eqiad.wmnet port: 8125
[20:19:54] Zppix: starting new things on the Toolforge Kubernetes backend probably was messed up for a short time. Should be ok now.
[20:20:10] It seems to be not responding.
[20:20:15] bd808: ack thanks
[20:20:43] bearloga: hmmm... does your curl test work using the instance ip rather than localhost?
[20:20:46] https://wikitech.wikimedia.org/wiki/Labmon1001
[20:20:54] Is this thing doing OK?
[20:21:48] halfak: what's a metric path you are trying to write to?
[20:22:00] A bunch but they all start with "ores"
[20:23:19] bd808: should it be able to? `curl http://172.16.7.127:8000/view` on the instance gives connection refused.
[20:23:42] halfak: I can confirm that carbon-cache & carbon-relay are running on labmon1001.eqiad.wmnet. I'll look a bit deeper in a minute
[20:23:53] kk thank you bd808
[20:24:06] bearloga: that makes me think that your service is bound to 127.0.0.1 and not the routable ip of the instance
[20:24:09] I'm digging into our uwsgi stats pusher as that might be the problem.
[20:24:40] I see "stats-push=statsd:labmon1001.eqiad.wmnet:8125,ores.ores-web-01.uwsgi" in the config so that looks alright.
[20:25:20] bd808: ohhhh, thank you! I'll look into that
[20:26:08] It looks like we are sending stats, but we're getting a ton of periodic "[2019-09-05T06:25:49] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]"
[20:26:33] halfak: I'm not seeing any issues on the server, while I don't see any ores.ores-web-01 key names I do see a bunch of ores.ores100* https://graphite.wikimedia.org/S/g
[20:26:33] bearloga: depending on what you are running and why you may find it easiest to add an nginx config on that instance that reverse proxies from the routable ip to the local service
[20:26:53] jeh, check graphite-labs
[20:27:38] https://graphite-labs.wikimedia.org/
[20:27:40] jeh: /srv/carbon/whisper/ores/ores-web-01 on labmon1001 is the storage tree
[20:27:44] bd808: I specified the routable ip as the host and it works now :D thank you so much!!! and thank you for the nginx config suggestion, that sounds like a much better way to do it
[20:28:33] bearloga: no problem. it turns out I have debugged these things a time or two before so I have a mental checklist of "could break here" points :)
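A minimal sketch of the reverse-proxy approach suggested above, assuming nginx is installed on test-endpoint-01 and the application stays bound to 127.0.0.1:8000; the vhost path is illustrative, and 172.16.7.127 is the instance address quoted earlier in the thread:

```
# Hedged sketch: drop a reverse-proxy vhost on the instance (path illustrative).
sudo tee /etc/nginx/sites-enabled/pai-test > /dev/null <<'EOF'
server {
    # Listen only on the routable address so this does not clash with the
    # application already bound to 127.0.0.1:8000.
    listen 172.16.7.127:8000;
    server_name _;
    location / {
        # Forward to the loopback-only service.
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```

As noted above, binding the service directly to the routable address also works; the proxy variant just lets the app stay on loopback.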
[20:28:45] halfak bd808: gotcha, thanks for the right endpoints
[20:29:04] :)
[20:31:08] halfak: I'm seeing metrics like ores.ores-web-01.uwsgi.worker.9.respawns.count that look to be active
[20:31:27] Indeed. This looks intermittent. I'm also seeing some activity.
[20:31:36] Looks like maybe 90% of the metrics go through.
[20:31:40] Maybe more.
[20:32:33] ok. let me see if I can find carbon complaining about being resource constrained... that's a thing that has happened in the past with graphite/carbon for sure
[20:32:53] thanks
[20:33:10] in the meantime, I'm going to try to figure out how to make sending metrics work async!
[20:39:51] carbon on labmon1001 is a complicated tree of 8 storage backends, 1 input that hashes to one of the 8 storage backends, and 1 output that hashes to the 8 storage backends. Lots of moving parts there.
[20:41:37] I just think of it like a magical pile of metrics that I can just assume to be online and working ;)
[20:42:14] log files that are using date stamps not in ISO 8601 format... ugh
[20:43:21] carbon's log files seem to use dd/mm/yyyy dates just to make me grumpy
[20:45:08] brb
[20:45:55] halfak: `grep ores **/creates.log|grep -E ':05/09/2019'|wc -l` shows 5484 new metrics in your namespace today...
[20:47:10] I think I remember that globally our carbon config rate limits creation of new metrics... but let me see if I can find config to confirm that
[20:49:14] 5373 of them are for deployment-ores, 108 for ores-web-01
[20:49:25] halfak: "max_creates_per_minute = 100" in /etc/carbon/carbon.conf could be making new metrics you are trying to create not show up immediately when you start sending.
[20:50:01] that's basically a throttle on consuming new disk space/IOPS in our graphite setup
[20:50:51] * bd808 remembers that we really need to stop collecting duplicate system metrics in carbon and stick with prometheus
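For reference, a sketch of how the create throttle and today's new whisper files could be checked on labmon1001. The config file and storage tree are the ones quoted above; the exact option casing and the find invocation are assumptions about the deployed setup:

```
# Confirm the new-metric creation throttle quoted above:
grep -i max_creates_per_minute /etc/carbon/carbon.conf
# Count whisper files created today under the ores storage tree
# (/srv/carbon/whisper/ores/... was given as the tree for this namespace):
find /srv/carbon/whisper/ores -name '*.wsp' -newermt "$(date +%F)" | wc -l
```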
[21:01:02] back
[21:01:04] * halfak reads
[21:02:11] bd808, when you say "new metric" are you referring to a new name for a metric? Or a new metric increment/timing for a previous name?
[21:02:18] !log tools Enabled Puppet on tools-docker-registry-03 and forced puppet run (T232135)
[21:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:02:23] T232135: Figure out right way to add CORS header to docker-registry.tools.wmflabs.org - https://phabricator.wikimedia.org/T232135
[21:03:22] This user is now online in #wikimedia-tech. I'll let you know when they show some activity (talk, etc.)
[21:03:22] @notify halfak A new "path" which ultimately means a new RRD storage file
[21:03:22] Weird. didn't try to jump channels there.
[21:04:01] bd808, yeah. That new path -- we shouldn't be doing that often. Can you tell me what new paths were recently created?
[21:04:14] We should have a relatively stable set of metric names/paths
[21:04:31] When we deploy new models, that should add a small set -- maybe once per month.
[21:07:45] here's a list of the creates seen today for ores-web-01 https://phabricator.wikimedia.org/P9044
[21:07:48] halfak: 3582 of them belong to the "deployment-ores01" instance name
[21:08:39] jeh, this looks pretty sane. I can cut those numeric sub-paths though. We don't make much use of those.
[21:08:43] * halfak digs into that.
[21:09:21] bd808, deployment-ores01 is weird. I don't know what to say about that. Our beta cluster shouldn't really be being used much.
[21:09:28] *beta cluster instance.
[21:09:42] something "woke it up" I'd say
[21:10:06] like the carbon exporter for system stats got enabled or fixed today
[21:10:25] Must be.
[21:15:14] bd808, could this explain why we're having a bunch of metrics-sending fail?
[21:15:54] what's failing to send exactly? and how do you know it is failing to send?
[21:16:34] maybe even before that... is this a thing you have known to work before? or a new thing that is not doing what you expect?
[21:17:40] T232164
[21:17:41] T232164: ORES is creating a lot of metrics. This is due to revid count-based metrics - https://phabricator.wikimedia.org/T232164
[21:17:52] We can clean it up either way.
[21:25:17] bd808, it was working before with no changes. Now we periodically get the following error from uwsgi in the logs and some requests to uwsgi hang for large amounts of time (10s of seconds) while others finish in fractions of a second.
[21:25:31] statsd_send_metric()/sendto(): Resource temporarily unavailable [plugins/stats_pusher_statsd/plugin.c line 40]
[21:25:53] There are clusters of these in our logs.
[21:26:10] "Resource temporarily unavailable" for a UDP packet send. Hmmm. Resource exhaustion on the sender side?
[21:26:42] or is this going to an aggregation service in front of carbon?
[21:27:17] or... is it a dns resolve that is actually failing occasionally?
[21:27:23] This is the config line for uwsgi: stats-push=statsd:labmon1001.eqiad.wmnet:8125,ores.ores-web-01.uwsgi
[21:27:28] All good questions.
[21:27:51] statsd is an aggregator in front of carbon
[21:27:58] so we should be looking a layer up
[21:28:13] and there is a dns resolve in there too for good measure
[21:29:14] I wonder if I could upgrade my version of stats_pusher_statsd
[21:34:17] https://github.com/wikimedia/puppet/blob/f23ef87705613fa6d40f0b28e491949494387963/modules/ores/manifests/web.pp#L47
[21:34:30] Looks like we're getting it via puppet. Or at least configuring it with puppet.
[21:37:29] Yeah. Seems like this is all managed by puppet so it's nothing I've changed.
[21:40:12] jeh: if you have a few minutes, can you poke around on labmon1001 and see if you can find the logs for the statsite process? That's the port 8125 listener that halfak's code is trying to talk to
[21:40:34] Thanks :)
[21:40:53] bd808: statsite looks OK, checked with `journalctl -u statsite@8125.service`
[21:48:01] fwiw, the last time we got these errors with uwsgi sending stats, it was due to some missing configuration. See T218567 and T189605
[21:48:01] T189605: Beta cluster ORES is emitting statsd errors - https://phabricator.wikimedia.org/T189605
[21:48:02] T218567: Fix statsd config handing in ores - https://phabricator.wikimedia.org/T218567
[21:48:26] I think we were sending statsd to localhost and it was just not responding to us.
[22:00:34] I've added some new notes to https://phabricator.wikimedia.org/T231222 representing what I think I know.
[22:00:49] jeh, would you add anything you think is relevant too?
[22:01:08] I'm also curious if you have any suggestions for where I could look next to identify the source of this issue.
[22:02:24] sure, I'll add updates as I come across things.
[22:03:19] thank you. I'm going to head to my last meeting of the day and then head out. Thanks for your help today. I'll be looking at this with fresh eyes tomorrow.
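As a closing aside, one low-tech way to exercise the statsd path discussed above is to hand-send a metric to the statsite listener and watch for it on both ends. The test metric name below is made up; the host, port, and service unit are the ones quoted in the log:

```
# From the client instance: push a throwaway counter over UDP
# (the metric name is purely for testing).
echo "ores.ores-web-01.uwsgi.debug.test:1|c" | nc -u -w1 labmon1001.eqiad.wmnet 8125

# On labmon1001: check the listener and watch packets arrive.
journalctl -u statsite@8125.service --since '10 min ago'
sudo tcpdump -ni any udp port 8125 -c 20
```

"Resource temporarily unavailable" (EAGAIN) on a UDP sendto() points at the sending side, which fits the sender-side suspicion raised above rather than a problem with statsite itself.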