[05:21:22] Yes, hello.
[05:21:23] > WARNING:root:setting '-l release=' directly is deprecated and may be removed in the future
[05:21:55] That is really not a very helpful warning message. Wasn't it a whole to-do to get everyone to specify the release?
[05:21:59] What is supposed to be done now?
[05:23:33] The mailing list is cloud, not cloud-l. I found it. I don't see a mention of jsub changes off-hand.
[09:59:28] Ursula: hello, your comments are out of context for me. could you share what this is about?
[10:01:47] gtirloni, presumably he is referring to jsub in the tools project
[10:01:58] specifying the distribution the job should run under
[10:02:19] something is complaining that setting "-l release=" is deprecated? don't know why it would be deprecated
[10:03:52] it's a noop now
[10:03:53] https://gerrit.wikimedia.org/r/c/labs/toollabs/+/475103
[10:05:02] so it's really just a warning.. we don't support multiple releases anyway
[10:05:55] the warning message says everything. users are encouraged to stop using that option.
[10:07:43] gtirloni, aren't we going to want to support multiple releases in future though?
[10:07:59] hopefully not?
[10:08:37] I think the reason b.storm made this change is that various things were stuck on trusty and it was blocking the migration to stretch for no good reason
[10:08:43] but she should be able to clarify
[10:08:48] so when we go from stretch to buster, all jobs will suddenly transition?
[10:09:07] shouldn't you just complain if release is set to trusty/jessie?
[10:09:38] *I think* they mostly will work.. whereas if we enforce the release option, they certainly won't... but I could be mistaken
[10:10:18] presumably the problem right now is people specifying -l release=trusty?
[10:31:03] I believe so
[10:32:55] well
[10:33:31] instead of making it difficult to provide multiple distributions in future, I would suggest just complaining if trusty is set or if it's defaulting to trusty or something
[10:33:54] will ask bstorm about it later
[12:56:20] andrewbogott: online?
[12:58:46] doctaxon: probably not yet, may I help you?
[12:59:50] no, but it's okay if I contact him within the next 2 hours
[13:01:04] ok
[13:49:49] gtirloni: Sorry, the context:
[13:49:59] > Cron jsub -l release=trusty -once -j y -mem 280m -N general.blankpages -o ~/var/log -quiet ~/bin/dbreps -r general.blankpages -s enwiki
[13:50:14] > WARNING:root:setting '-l release=' directly is deprecated and may be removed in the future
[13:50:21] I'm getting e-mails like that every time I run a cronjob.
[13:50:29] Which is a little bothersome. :P
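For anyone getting the same e-mails: the fix suggested later in this log is simply to drop the deprecated flag from the crontab entry. A minimal sketch, reusing the cron line quoted above; the schedule field is invented here for illustration, since cron mail does not include it:

    # Old entry (triggers the deprecation warning on every run):
    #   0 * * * * jsub -l release=trusty -once -j y -mem 280m -N general.blankpages -o ~/var/log -quiet ~/bin/dbreps -r general.blankpages -s enwiki
    # New entry -- identical, minus the now-no-op '-l release=' option:
    0 * * * * jsub -once -j y -mem 280m -N general.blankpages -o ~/var/log -quiet ~/bin/dbreps -r general.blankpages -s enwiki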
[14:15:35] SMalyshev: I note that today is the 'wikidata-query' project's day for migration to the new region. Can I just delete some of those VMs or are you still testing?
[14:15:50] (and of course I still owe you a response on that ticket)
[14:17:42] doctaxon: what's up?
[14:19:50] Ursula: could you open a phabricator task for that issue?
[14:28:48] godog: hm… it looks like several of the VMs in the 'swift' project are built off of an image tagged 'experimental'
[14:29:04] which means that image is no longer available, which makes it hard (maybe impossible?) to move those VMs :(
[14:29:19] is it possible/easy to just rebuild them afresh in the new region, or should I try to invent a way to move them?
[14:29:52] andrewbogott: today you want to migrate dwl
[14:30:18] yep, I have that down as starting in about three hours.
[14:31:06] can you do it so that there is a break of at least half an hour between the VM taxonbot (first to migrate) and the VM taxonbota?
[14:31:09] andrewbogott: do you have the names of the affected VMs handy? might be easier to just delete them
[14:31:27] godog: swift-stretch-ms-be02, swift-stretch-ms-be01, swift-stretch-ms-fe01
[14:31:50] doctaxon: yes, that's what my notes say
[14:32:03] andrewbogott: yeah, deleting is fine, I'll recreate them as needed
[14:32:10] godog: thanks, and sorry!
[14:32:54] np, hope not a whole lot of other vms are like that
[14:33:23] !log swift deleting VMs swift-stretch-ms-be02, swift-stretch-ms-be01, swift-stretch-ms-fe01
[14:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Swift/SAL
[14:33:27] andrewbogott: I won't be at the keyboard when you do the job. I hope that there will be no problems
[14:34:05] doctaxon: does that mean that it doesn't matter when I do it, as long as I wait 30 mins between?
[14:35:35] yes, start with VM taxonbot as we have scheduled, and once it's running again wait 30 minutes, and then start with VM taxonbota
[14:40:42] andrewbogott: and you can migrate any other VMs as you want
[14:41:31] ok
[15:30:41] Ursula: remove that '-l release=trusty' part
[17:14:31] !log dwl migrating taxonbot to eqiad1-r
[17:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dwl/SAL
[17:16:28] SMalyshev: what can you tell me about wdqs-test.wikidata-query.eqiad.wmflabs? It's about a year old and uses a no-longer-existing flavor, as best I can tell. Is it still used for anything?
[18:24:39] !log dwl migrating taxonbota to eqiad1-r
[18:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dwl/SAL
[18:54:18] !log dwl moving dwl to eqiad1-r
[18:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Dwl/SAL
[21:00:01] addshore: Now that things are in the new region, I can do https://phabricator.wikimedia.org/T199532 whenever. Each VM will experience downtime while it moves… do you care when I move things?
[22:11:10] andrewbogott: are you still online
[22:11:20] I am! What's up?
[22:11:32] I cannot ssh to dwl.taxonbot
[22:12:13] what's the complete fqdn you're using?
[22:12:20] ECDSA host key for taxonbot has changed and you have requested strict checking. Host key verification failed.
[22:12:44] Oh, that's a side-effect of the move; you can just clear that key and re-accept
[22:13:00] how do I clear the key?
[22:13:13] the error message tells you the file and line with the key
[22:13:15] or, it should
[22:13:34] oh, wait
[22:14:24] thank you very much
[22:14:35] working now?
[22:16:13] andrewbogott: all okay
[22:16:19] thank you
[22:16:28] great. You're welcome.
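For reference, the stale key can be cleared with ssh-keygen rather than by hand-editing known_hosts. A minimal sketch; the full FQDN below is an assumption based on the project and VM names above, and your ssh config may use a shorter alias:

    # Drop the old host key recorded for the moved VM
    ssh-keygen -R taxonbot
    ssh-keygen -R taxonbot.dwl.eqiad.wmflabs   # if you connect by full name (assumed FQDN)
    # Reconnect and accept the new key when prompted
    ssh taxonbot.dwl.eqiad.wmflabs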
[22:17:58] I'm getting errors on replicas about punjabiwikimedia_p not being a valid wiki
[22:18:15] it's apparently exposed as being a wiki on s3 according to the wiki list, but it has no tables
[22:18:37] (the meta_p.wiki list, that is; I'm not using prod's list)
[22:19:08] Current user: u2008@10.64.37.14
[22:19:08] Connection: aawiki.analytics.db.svc.eqiad.wmflabs via TCP/IP
[22:19:08] Table 'punjabiwikimedia_p.revision_userindex' doesn't exist
[22:19:40] Sounds like someone hasn't run maintain-replicas
[22:19:52] but has run maintain-meta_p?
[22:20:11] bstorm_ maybe? ^
[22:20:17] or possibly gtirloni
[22:20:20] Yeah, this is bad. It breaks cross-wiki tools that auto-discover all wikis.
[22:20:32] short-circuits the GUC query
[22:20:52] should probably just make those maintain- scripts run on cron or something
[22:21:48] I can try to fix this as soon as I find the docs about how to do it :)
[22:21:53] punjabiwikimedia is still in process
[22:22:00] for now it would be best to omit it from meta_p
[22:22:07] It shouldn't be in meta_p
[22:22:20] It's in DNS because that one captures whatever is there
[22:22:26] Even if it is still in process
[22:22:38] IIRC that script doesn't offer many options
[22:23:05] If I recall correctly anyway. I'm going off the ones I've seen coming through
[22:23:09] I'll take a look of course :)
[22:23:22] yeah, the *.db alias and empty db can exist, that's fine.
[22:23:23] * andrewbogott defers to bstorm_
[22:23:26] Nobody has run meta_p on anything that hasn't been created, to my knowledge
[22:23:43] That's the last step in all that whatnot
[22:24:14] we could make the meta_p script ignore wikis which do not have a view DB set up?
[22:24:21] T207584
[22:24:22] | dbname           | lang    | name                | family    | url                           | size | slice     | is_closed | has_echo | has_flaggedrevs | has_visualeditor | has_wikidata | is_sensitive |
[22:24:22] | punjabiwikimedia | punjabi | Punjabi Wikimedians | wikimedia | https://punjabi.wikimedia.org | 1    | s3.labsdb | 0         | 1        | 0               | 1                | 0            | 0            |
[22:24:22] T207584: Prepare and check storage layer for punjabiwikimedia - https://phabricator.wikimedia.org/T207584
[22:24:30] Yeah, I have not touched it.
[22:24:40] I don't know why that would be in there
[22:25:12] so, short-term fix:
[22:25:26] someone runs delete from meta_p.wiki where dbname = 'punjabiwikimedia'; ?
[22:25:54] I'm concerned about where that would come from, though. My understanding is that meta_p is a manual process
[22:26:00] That ticket hasn't reached cloud services yet
[22:26:45] Unless it isn't and all our docs are wrong :) I haven't hacked on that script yet.
[22:27:42] Looking around a bit
[22:27:44] maybe another wiki creation happened for which maintain-meta_p was run?
[22:28:01] It's supposed to accept a database as an argument
[22:28:13] maintain-meta_p accepts a database as an argument?
[22:28:29] That's what the docs say.
[22:28:35] I haven't looked at it in a while
[22:28:47] However it was run by some other folks just the other day
[22:28:51] Lemme check some things
[22:28:53] wasn't that script supposed to truncate the table and regenerate it?
[22:29:55] It accepts databases as args, yes
[22:29:57] I'm reading it
[22:30:06] But that said...
[22:30:11] What was done here...
[22:35:21] So yeah, we have always added one db at a time since I've been doing replicas to meta_p. There may have been some confusion with someone else running it, though. If I remove the record, will that break anything on your end, Krinkle?
[22:35:58] No, that'd be fine.
[22:36:22] Most cross-wiki tools query meta_p on every web request to get all wikis, and then query them all based on some heuristic.
[22:36:33] so this is currently causing s3 to be unavailable for GUC.
[22:36:45] because it has a "bad wiki" according to meta_p
[22:37:19] GUC uses a single union to query all wikis, and the presence of this wiki breaks the query for the other wikis as a result.
[22:37:20] I'm concerned about how this got here in case there's anything else there.
[22:37:29] However, I think that's the only wiki stuck in limbo
[22:37:47] Yeah, definitely want to trace where it came from
[22:41:36] Ok. I have removed that one from all three replicas
[22:42:27] I figure someone misunderstood a step in there. I just want to make sure they know how to do it right so nobody makes this mistake again.
[22:42:39] I'll check the doc as well, but I'm pretty sure that's correct
[22:43:01] Yep. Doc's right
[22:43:17] Krinkle: let me know if there are any others that pop up?
[22:44:36] will do. Might also be something we could monitor, potentially. E.g. select from meta_p and then run some basic query like SELECT 1 FROM %dbname%_p.revision_userindex or some other common table/view against each wiki. Possibly in a big UNION for 1 RTT.
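A rough sketch of the check Krinkle proposes, assuming the replica host alias from the error output above is reachable with working credentials (e.g. a replica.my.cnf); revision_userindex is used as the probe view only because that is what failed here. A real implementation would batch the probes into one big UNION, as suggested, rather than paying one round trip per wiki as this loop does:

    #!/bin/bash
    # For every wiki listed in meta_p.wiki, verify that its _p views exist.
    # Host, authentication, and probe view are assumptions, not a tested tool.
    HOST=aawiki.analytics.db.svc.eqiad.wmflabs
    for db in $(mysql -h "$HOST" -BN -e 'SELECT dbname FROM meta_p.wiki'); do
        if ! mysql -h "$HOST" -BN -e "SELECT 1 FROM ${db}_p.revision_userindex LIMIT 1" >/dev/null 2>&1; then
            echo "meta_p lists ${db} but ${db}_p views are missing"
        fi
    done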
[22:46:06] bstorm_: btw, thanks again for all the join entanglement fixes and improvements. GUC is snappy again :)
[22:46:24] :) Great!
[22:46:32] https://tools.wmflabs.org/guc/?user=Krinkle is able to get responses in <3s most times, which is pretty good.
[22:46:41] add &debug=true to see all the queries it's doing.
[22:48:31] Basically, what has to happen for meta_p to get confused is that the wiki has to exist and be all but ready for handoff to cloud, and then someone has to either run all-databases or do something really weird. It's a manual process. Seems kind of silly to monitor a manual process that has tons of automated checks already. The reason it is manual is that all three servers should be prepped first.
[22:48:58] So there's a couple of new folks working on these. One of them did it... and I'll talk to both 😉
[22:49:24] bstorm_: I ran the meta_p script with the --all-databases arg today after seeing the ticket about one db being missing
[22:49:36] so that mystery is solved
[22:49:43] Doh! It wasn't the new ones!
[22:49:45] LOL
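For the record, the distinction that bit here, as far as this conversation establishes it: --all-databases is quoted from the chat, while the per-wiki invocation is an assumption based on "it accepts databases as args" (check the script's --help for the exact spelling):

    # Documented workflow: add only the wiki that is actually ready
    maintain-meta_p punjabiwikimedia
    # What was run instead: regenerates rows for everything the script can see,
    # including wikis whose replica views don't exist yet (hence the stray row)
    maintain-meta_p --all-databases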
[22:49:47] Is there a reason we don't run it on cron?
[22:50:05] Well yeah. We manually create the views
[22:50:07] Krenair: well, I guess because it can make crappy db state?
[22:50:13] And indexes
[22:50:19] manually?
[22:50:22] Those would need to be fully automated first
[22:50:24] I'd really like this to all be more automatic though
[22:50:29] you manually run the maintain-replicas script, right?
[22:50:33] Yes
[22:50:35] okay
[22:50:41] is there a reason we don't run maintain-replicas on cron?
[22:50:52] Yes
[22:50:55] pretty much everything about the wiki replicas is manual today
[22:51:02] The wiki becomes available to the script before it is ready
[22:51:12] Long story in the toolchain
[22:51:41] But basically, there are config files all over the servers that are maintained in dba-land. Then there's our config.
[22:51:52] The first set of files are updated by other processes
[22:52:27] They determine if things are private, etc. So we wait to pull the trigger after the DBAs say the first scrubbing is ready
[22:52:43] And once we make sure any changes needed to our config are in place.
[22:53:56] So we'd need a way to make sure that the sanitarium level was ready before maintain-views runs... and so forth down the line
[22:55:30] can't we implement all the security logic applied at their level in views?
[22:55:33] Where I think we could improve things is combining the scripts. There is still concern over manual checking, though. If something goes wrong in the process, cleaning up is extremely hard.
[22:56:09] That's not as secure, and it is also more likely to be a performance hit
[22:56:43] We've had problems as is in that area
[22:57:02] so
[22:57:11] we have to ensure sanitarium is run automatically and then fire maintain-replicas
[22:57:23] and then fire off maintain-meta_p
[22:57:25] all in that order
[22:57:30] There's a new process in the works in the end.
[22:57:31] Nope
[22:57:34] This locks the tables
[22:57:37] You'd block users
[22:57:54] sanitarium locks the tables?
[22:57:56] I have to depool the servers to run maintain-views and maintain-indexes
[22:58:05] Running DDL commands does
[22:58:15] to create views?
[22:58:17] yes
[22:58:19] And the indexes
[22:58:47] A new wiki is no issue
[22:58:51] There's no lock contention
[22:59:01] When I have to rebuild a view, there will be. I do that often.
[22:59:15] the locking is ok for brand new wikis, but not when we are updating a view across live databases
[22:59:17] it sounds like this stuff is built atop a carefully stacked tower of manual human intervention and really has very little real automation :(
[22:59:26] yup
[22:59:34] By design... for better or worse.
[22:59:51] Nobody wants a mistake in this area :)
[23:00:22] But! There is work in the future to reconstruct sanitization, etc. Just don't expect it anytime soon.
[23:00:46] New wiki creation has more potential for automation than the view/index management processes as well.
[23:01:32] We have at least 4x the number of user-facing features that we can really focus on supporting. So major overhauls happen at a rate of about one thing a year
[23:01:44] I could easily see a mode where some of this could run in a different, single-purpose form for new wikis. For changes and edits to the structure of tables, it couldn't really happen.
[23:03:11] In a systemd timer for the monitoring rather than a cron. Or a daemon like we did for some DB activities. It is a tricky thing to automate DDL with no human intervention, though. I'd rather put it in a CI/CD cycle than just full auto.
[23:04:36] yeah, CI/CD or maybe the fancy new spicerack automation stuff
[23:05:14] :)
[23:05:19] something that gives a staged command and control that we can repeat more easily than "open N ssh sessions and cut-n-paste these commands"
[23:06:37] the me from past jobs would have built multi-stage Jenkins automation for a whole lot of things we handle as ad hoc scripts
[23:06:39] It still feels to me like something that should be possible to automate :/
[23:07:07] automation of almost anything is possible; the question is whether it is worth the effort
[23:07:24] see also: https://xkcd.com/1205/
[23:08:31] hehe
[23:08:38] new wiki setup is in the "monthly" how-often column, I would guess
[23:09:01] yep
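To make the "systemd timer for the monitoring" idea concrete: it could be tried without hand-writing unit files by using systemd-run's transient timers. A sketch, assuming the hypothetical check script from earlier in this log is installed at the path shown:

    # Run the (hypothetical) meta_p consistency check daily as a transient timer
    sudo systemd-run --unit=check-meta-p-views \
        --on-calendar='*-*-* 06:00:00' \
        /usr/local/sbin/check-meta-p-views
    # Inspect the timer and the last run's output
    systemctl list-timers 'check-meta-p-views*'
    journalctl -u check-meta-p-views.service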