[00:10:41] Not sure if there is someone available to answer that, but I'm getting an "Unconfigured Domain" error when accessing
[00:10:41] https://recommendation-api-beta.wmflabs.org/www.wikidata.org/v1/description/translation/from/en/to/zh
[00:11:11] It is supposed to return the same as https://app-editor-tasks.wmflabs.org/www.wikidata.org/v1/description/translation/from/en/to/zh
[00:11:17] Both have the same source code
[00:15:29] mateusbs17: All I can tell you about that URL is that it points to deployment-sca02.deployment-prep.eqiad.wmflabs on the backend. I'm not sure who "owns" the configuration of the app running there.
[00:17:20] I reached this point too, but thought it could be some misconfiguration because of the message: "This domain points to a Wikimedia Foundation server, but is not configured on this server."
[00:17:47] mateusbs17: mobrovac made the last change to the Puppet profile that manages the recommendation_api service. He may or may not know anything about how it is configured in the beta cluster.
[00:18:34] afaik mateusbs17 you can't access wikidata from the beta labs instance
[00:19:29] That makes sense, I need to adjust for the x-amples of the suggested edit endpoints
[00:19:33] Thanks mobrovac and bd808
[00:19:52] yw
[00:19:56] ah for the restbase test mateusbs17?
[00:20:31] Yep, after adding x-monitor: true to https://github.com/wikimedia/restbase/pull/1153 the tests are failing
[00:21:04] And it seems that is related to the recommendations host pointing to https://recommendation-api-beta
[00:21:54] yes it is
[11:07:01] Hi guys, what's wrong with database replicas? Quarry and Toolforge seem to time out even for simple queries like counting how many pages in a certain namespace
[11:07:44] Queries that took 15 minutes a year ago now take over 2 hours on Forge (and 30 minutes timeout on Quarry)
[11:31:09] Oh, I see, the default analytics is slow as s**t, because using --cluster=web the same takes 2 minutes.
But why is it and why is the slow analytics the default on Toolforge now?
[11:35:08] Or why is analytics so slow?
[11:35:24] Overload?
[11:35:31] Do we have any solution to overload?
[11:36:13] Because Quarry and Toolforge replicas are unusable these days (unless you know the --cluster=web trick)
[11:37:09] bd808, ^^ maybe you can help with Dvorapa's issue ^^
[11:52:50] It looks like I'm not the only one experiencing the issue, see https://www.mediawiki.org/wiki/Topic:V09mzu8z59fsf1tg
[12:57:06] Dvorapa: i believe one of the replicas died or something, and it needs to be brought back into sync now, which is a very expensive and long running operation.
[13:02:52] Dvorapa: also in general, many tables are incredibly stretched in performance already. So even when little things change, the impact can be quite dramatic.
[13:13:39] Dvorapa, there's been a combination of factors on the replicas. There has been a lot of work going on there, which has meant that the replicas are being depooled one at a time and compressed so they don't run out of space.
[13:13:51] Let me check the status on some of that...
[13:15:41] They may all be repooled today
[13:22:47] but I can't get a git fetch to finish this morning. That's fun
[13:24:17] Yes, all are repooled at the moemnt
[13:24:19] *moment
[13:35:23] bstorm_: What does that mean?
[13:35:57] That means, for a while, we were sending a lot of traffic to fewer servers.
[13:36:19] All servers are operational right now, so I am going to keep an eye on them to see if things level out
[13:37:01] Schema changes appear to have increased load, and the analytics replica is only one server right now, while web is two. That's the reverse of the past. We may change that back if the conditions that required us to make it that way are over (for some reason).
[13:37:37] Basically, you are right, things are slower. Some of it is maintenance that has been on-going, and some of it we will keep trying to figure out.
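Editor's note: the analytics/web split discussed above can be sketched as a small helper that picks a replica host per cluster. This is only an illustration. The function name `replica_host` and the hostname pattern `<wiki>.<cluster>.db.svc.eqiad.wmflabs` are assumptions and are not stated anywhere in this log; the query is the "count pages in a namespace" example mentioned earlier, written against the MediaWiki `page` table.

```python
# Hypothetical helper: choose between the "analytics" and "web" replica clusters.
# ASSUMPTION: the hostname pattern below is illustrative and not taken from the log.
VALID_CLUSTERS = ("analytics", "web")


def replica_host(wiki, cluster="analytics"):
    """Return the assumed replica hostname for a wiki database and cluster."""
    if cluster not in VALID_CLUSTERS:
        raise ValueError(f"unknown cluster {cluster!r}, expected one of {VALID_CLUSTERS}")
    return f"{wiki}.{cluster}.db.svc.eqiad.wmflabs"


# The simple query from the discussion: count pages in one namespace.
# The %s placeholder is meant to be filled in by a DB driver, not by string formatting.
COUNT_PAGES_SQL = "SELECT COUNT(*) FROM page WHERE page_namespace = %s"


if __name__ == "__main__":
    for cluster in VALID_CLUSTERS:
        print(cluster, "->", replica_host("enwiki", cluster))
```

Keeping the cluster as an explicit argument makes it easy to switch a slow query from the default to the web replicas, which is what the `--cluster=web` trick does on the command line.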
[13:41:39] So if I understand it correctly, those queries that took 1 minute in the past and now take 1 minute on web cluster and 2 hours on analytics cluster (which is default) will still be slow, but may speed up after the maintenance is done, right? Shouldn't we inform Quarry users, that even the simple queries may time out during the maintenance (like some sort of banner)?
[13:42:56] tools don't have banner provisions, as they are all websites maintained by different sets of volunteers
[13:43:47] Of course. Who's the maintainer of Quarry?
[13:44:17] Anyway we should also inform people using analytics cluster (like in the sql command intro)
[13:44:21] Dvorapa: The maintenance appears to be done for now. Some of the traffic should rebalance now. However, I still see things badly lagging in places.
[13:44:29] tools.wmflabs.org has a searchable index
[13:45:57] The analytics server is probably working hard to catch up its lag right now as well: https://tools.wmflabs.org/replag/
[13:46:02] actually quarry isn't a tool of course, just in wmflabs..
[13:47:07] hmm, no idea who maintains it..
[13:47:43] Yeah, load is quite high on the analytics replica right now
[13:50:44] Checking on some of that
[14:05:09] It really might be just that it has to catch up a LOT since the maintenance. The web replicas are already caught up.
[14:05:28] I'm not seeing anything else jumping out at me.
[14:06:07] Quarry is one I could switch, but I'm still looking for where the code is. I think someone else on my team might have messed with that...
[14:27:07] Thank you guys for the investigation, hopefully everything will be back to normal soon after a month of struggling
[14:37:18] !log quarry changed to web replica for database queries and restarted celery workers
[14:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[14:43:19] Dvorapa: can you test something on quarry if you have anything you might normally run?
[14:43:36] I want to make sure it's up and running correctly. It *should* be pointed at the web replicas now
[14:47:12] https://www.irccloud.com/pastebin/SXrJ1kbF/
[14:47:45] So my apologies to those who were running queries while I switched. It would have disconnected you.
[14:48:01] But that will make it faster over the weekend while the analytics replica plays catch-up
[14:48:50] Thanks zhuyifei1999_ for helping me figure out where things were in quarry :)
[17:36:55] !log tools.wikibase-databridge-storybook ln -s www/static/ public_html && webservice start # make storybook available under non-static domain too
[17:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibase-databridge-storybook/SAL
[18:03:37] Dvorapa, thedj: Framawiki and I maintain quarry. the problem is, I lack time right now and I also lack understanding on how the replicas work. idk about Framawiki
[18:06:57] o/
[18:08:23] IMO that would be great to have more info about temporary issues regarding replicas
[18:30:03] Quarry is partially down https://quarry.wmflabs.org/query/runs/all. I don't have ssh access near me. Can you take a look zhuyifei1999_ ?
[18:30:19] wut
[18:30:20] ok
[18:33:53] https://www.irccloud.com/pastebin/A41To2k4/
[18:34:52] bad cache data zhuyifei1999_? Anything I can do to help debug?
[18:35:12] I think one of the table rows is messed up
[18:35:25] that is used to store misc query run data
[18:37:18] I'm just gonna try: except: with json loading
[18:38:08] Looks like bad entry after forced halt of workers/celery
[18:40:44] framawiki: https://gerrit.wikimedia.org/r/#/c/analytics/quarry/web/+/519679/1/quarry/web/models/queryrun.py
[18:40:50] looks good to you?
[18:41:59] bd808: I think we have it. thanks
[18:43:48] Yeay, +2 over irc
[18:44:00] Thanks zhuyifei1999_!
[18:45:56] !log quarry Deployed 2f7ee60 to quarry-web-01
[18:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[18:46:13] !log tools.historicmaps Migrated webservice type from php5.6 to php7.2
[18:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.historicmaps/SAL
[19:10:16] !log toolsbeta T215531 removed toolsbeta-arturo-k8s-master-2/3 and added toolsbeta-test-k8s-master-1 for testing kubeadm
[19:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[19:10:18] T215531: Deploy upgraded Kubernetes to toolsbeta - https://phabricator.wikimedia.org/T215531