[06:29:13] Hi, mvolz
[07:01:30] can someone reset my toolforge account
[07:01:59] my icloud keychain has some unique password and my laptop is dead :/
[07:02:32] qedk: you can't use Special:ResetPassword?
[07:02:50] @Majavah on toolforge?
[07:03:32] qedk: did you ever login to wikitech.wikimedia.org with that account?
[07:04:14] I do have a Wikitech account, didn't use it
[07:04:24] let me try, thanks @Majavah
[07:04:37] wikitech uses the same accounts as toolsadmin.wikimedia.org
[07:04:43] ah
[07:04:44] so try requesting a password reset from there
[07:05:01] the toolforge login page didn't have a reset link so i thought it's disallowed
[07:05:12] yeah, it's a little complicated
[07:06:58] ayy it worked, thanks!
[07:07:12] so messy having to reset passwords and ssh keys
[07:08:07] that sounds not fun
[07:08:54] most of the painful part of moving to windows is over (i hope)
[07:19:20] [[Tech]]; 41.214.85.38; /* 781282188 */ new section; https://meta.wikimedia.org/w/index.php?diff=20240609&oldid=20227191&rcid=15812558
[07:23:51] [[Tech]]; WikiBayer; Reverted changes by [[Special:Contributions/41.214.85.38|41.214.85.38]] ([[User talk:41.214.85.38|talk]]) to last version by Tegel; https://meta.wikimedia.org/w/index.php?diff=20240612&oldid=20240609&rcid=15812561
[11:42:04] [[Tech]]; Atsipas; /* 2004 Audi A4 1.8T Cabriolet */ new section; https://meta.wikimedia.org/w/index.php?diff=20241343&oldid=20240687&rcid=15813743
[11:52:34] [[Tech]]; Nemo bis; Undo revision 20241343 by [[Special:Contributions/Atsipas|Atsipas]] ([[User talk:Atsipas|talk]]); https://meta.wikimedia.org/w/index.php?diff=20241356&oldid=20241343&rcid=15813765
[11:53:04] These car advertisers seem a bit too emboldened after https://www.theguardian.com/media/2020/jul/01/france-bans-dutch-bike-tv-ad-for-creating-climate-of-fear
[11:57:22] Hi. I had requested a wgCopyUploadsDomains addition a month ago. Can anyone process, or at least comment on, a couple of those requests?
[11:58:54] acagastya: you have task numbers?
[11:58:59] can't look without them
[11:59:22] https://phabricator.wikimedia.org/T254342
[12:01:34] that request had fallen down the list, so probably no one saw it when looking for new requests; I dragged it up to the top of the list so it can be seen
[12:02:01] that looks sensible to me, I'll take a look at it before Monday (which is the earliest it can be deployed anyway)
[12:02:14] Okay. Thank you.
[12:02:24] There was another one which I had filed minutes ago.
[12:02:42] arxiv.org?
[12:03:09] RhinosF1 has edited it (the same user who did the first request).
[12:03:14] Yes, arxiv.org.
[12:03:44] Thanks for handling the nature.com one, Majavah.
[12:04:05] if Samuel doesn't process that one before I look at nature I'll do that in the same patch
[12:05:32] * acagastya does not know who Samuel is.
[12:05:44] But in any case, if these two are added, it would make it much easier for me to upload those papers.
[12:06:04] Samuel = RhinosF1
[12:06:16] Oh, okay.
[12:06:56] Nemo_bis told me that Commons ignores copyfraud claims and uploads the files anyway.
[12:07:20] Would it be sensible for me to request the addition of another domain which contains loads of PD material?
[12:07:37] ...as long as someone actually plans to use it
[12:08:11] Letters of former US presidents -- I don't doubt they are going to be useful.
[12:08:43] I'm not sure if anyone would want to add a domain which has copyfraud
[12:09:39] Have a look at Shapell.org -- they collect many PD letters, make faithful copies of PD material, and make them available on the internet.
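
For illustration, once a source domain is on the $wgCopyUploadsDomains whitelist, Commons can fetch a file server-side from a URL, which is what makes bot-driven imports like the one acagastya describes practical. The following is only a rough sketch using pywikibot; the file title, arXiv URL, and description text are placeholders, and parameter names can differ between pywikibot versions.

    # Sketch of upload-by-URL with pywikibot, assuming the source domain is
    # already listed in Commons' $wgCopyUploadsDomains whitelist.
    # The title, URL, and wikitext below are placeholders, not real files.
    import pywikibot

    site = pywikibot.Site('commons', 'commons')  # needs a configured, logged-in account
    page = pywikibot.FilePage(site, 'File:Example paper (arXiv).pdf')

    site.upload(
        page,
        source_url='https://arxiv.org/pdf/XXXX.XXXXX.pdf',  # placeholder source URL
        comment='Importing an openly licensed paper via URL upload',
        text='== {{int:filedesc}} ==\n{{Information|...}}',  # file description wikitext
        ignore_warnings=False,
    )
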
[12:10:17] Hey, all. An ingestion job I run regularly has been failing for a bit with the error PHP fatal error:
[12:10:20] Allowed memory size of 698351616 bytes exhausted (tried to allocate 933888 bytes)
[12:10:57] Is this related to https://phabricator.wikimedia.org/T256459 ?
[12:10:59] acagastya: if the first sentence in their TOU says "don't copy this, all rights reserved", I don't think we would add that to the list
[12:11:13] Majavah: wrong
[12:11:28] mathemancer: ingestion where?
[12:11:30] how so?
[12:11:34] A request producing this error is https://commons.wikimedia.org/w/api.php?action=query&generator=allimages&gaisort=timestamp&gaidir=newer&gailimit=500&prop=imageinfo|globalusage&iiprop=url|user|dimensions|extmetadata&gulimit=500&gunamespace=0&format=json&gaistart=1593648000&gaiend=1593734400
[12:11:47] I'm ingesting data from WMC via the API
[12:12:02] Well, we are going to upload all the PD material from their site, Majavah. (If what Nemo_bis says is correct.)
[12:12:13] mathemancer: have you already considered reducing the limit or the number of metadata fields? that can be a lot of stuff
[12:13:15] Nemo_bis: It was working until 22 June
[12:13:22] mathemancer: I think it's more likely that you're running into https://phabricator.wikimedia.org/T201205
[12:13:25] Would a memory limit have changed recently?
[12:14:32] Users are now uploading millions of DjVu files, so a request of 500 rows like that might be several GB
[12:14:39] Nemo_bis: I'm aware of that bug (T201205)
[12:14:39] T201205: Bad metadata for a single file errors out the complete imageinfo prop request - https://phabricator.wikimedia.org/T201205
[12:15:10] Nemo_bis: the error behavior is different for the current problem
[12:15:18] (Or not, I'm looking at the options again)
[12:15:19] Ok
[12:15:25] I was wondering about the captcha that WMF projects use -- do we have statistics on that?
[12:15:40] acagastya: no need, we already know that it's a plague
[12:15:56] How so, Nemo_bis?
[12:15:59] But yes, there are or used to be some dashboards based on the error logs
[12:16:04] Nemo_bis: The bug you mentioned returned a JSON response with an error, but the new one doesn't return JSON at all; it returns a page with the error on it (in HTML)
[12:18:03] Majavah: if arxiv.org is whitelisted, will I be able to upload files via their URIs using pywikibot?
[12:18:07] Nemo_bis: It seems cutting down to only 100 results per request avoids the error.
[12:18:24] Nemo_bis: Thanks for the hint. Do you think I should file a bug report?
[12:18:32] acagastya: I think that's how the whitelist works
[12:18:51] Huh?
[12:19:27] I used to upload via [[com:Special:Upload]], pasting the URI.
[12:19:40] mathemancer: self-answer, works up to gailimit=470, which takes ~12 seconds
[12:20:05] If I can do that using pywikibot, I might as well request it, and then create a bot that handles those.
[12:20:10] mathemancer: you can certainly report that you'd like a limit of 500 to work and it no longer does, but I wouldn't be surprised if it's declined
[12:20:29] Nemo_bis: Fair enough. Thanks!
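
For reference, the workaround mathemancer settled on (smaller batches, following the API's continuation tokens) looks roughly like the sketch below. It is an illustrative outline rather than the actual ingestion code; the batch size, time window, and user agent string are placeholders.

    # Sketch: page through allimages with a reduced gailimit, following the
    # API's 'continue' tokens instead of asking for 500 rows at once.
    # Batch size, time window, and user agent are illustrative placeholders.
    import requests

    API = 'https://commons.wikimedia.org/w/api.php'
    session = requests.Session()
    session.headers['User-Agent'] = 'example-ingester/0.1 (contact@example.org)'

    params = {
        'action': 'query',
        'format': 'json',
        'generator': 'allimages',
        'gaisort': 'timestamp',
        'gaidir': 'newer',
        'gaistart': '1593648000',
        'gaiend': '1593734400',
        'gailimit': 100,                  # smaller batches avoid the memory errors
        'prop': 'imageinfo|globalusage',
        'iiprop': 'url|user|dimensions|extmetadata',
        'gulimit': 100,
        'gunamespace': 0,
    }

    while True:
        data = session.get(API, params=params).json()
        for page in data.get('query', {}).get('pages', {}).values():
            pass  # hand each page's imageinfo/globalusage to the ingestion pipeline
        if 'continue' not in data:
            break
        params.update(data['continue'])   # resume from where the last batch stopped
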
[12:20:37] It's useful to know such things, but I think our stance is generally that using smaller batches is not a bad thing :)
[12:21:14] Unless of course you plan to do things like asking for 30M titles from the English Wikipedia, in which case maybe it's better to have fewer requests
[12:21:57] Nemo_bis: The problem is, I'm trying to gather metadata about all the images on WMC, and the data dumps don't seem to have the global usage info
[12:22:34] Nemo_bis: So, I'm trying to use the API, but I think it's not really set up for wholesale data ingestion like this.
[12:25:44] mathemancer: yes, copying an entire database table like that is often a bit overkill for the API
[12:26:01] Nemo_bis: re captcha --- how so?
[12:26:26] acagastya: have you tried searching mediawiki.org?
[12:26:33] mathemancer: but why is https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/commonswiki/20200620/commonswiki-20200620-globalimagelinks.sql.gz not enough?
[12:27:00] (Am I mixing up things again?)
[12:27:15] Yes, Nemo_bis, I have.
[12:27:40] acagastya: well there are all sorts of discussions of how our CAPTCHAs are terrifyingly bad, e.g. https://www.mediawiki.org/wiki/CAPTCHA
[12:28:02] Nemo_bis: Let me look back through why we went the way we did (it's been a few months)
[12:28:08] There's also a recent page https://www.mediawiki.org/wiki/Core_Platform_Team/Initiative/Captcha/Initiative_Description but no idea about the status of this proposal
[12:28:26] I was wondering if it is possible to make use of the captcha to train some models to help with image classification on Commons, and also help with OCR on Wikisource.
[12:28:57] acagastya: yes, it's a long-standing proposal. Not that easy to implement (especially now that DjVu is dead-ish)
[12:30:34] If there is a way to extract words from Wikisource material and provide those as the captcha.
[12:30:52] Or give a bunch of photos and ask if there is a cycle, or a blue door, in it.
[12:31:23] It was you, Nemo_bis, who responded to the "scope" concern of Vermont on #wikimedia-commons, right?
[12:31:44] Nemo_bis: I was wrong in my earlier message. As I look back through my old notes, the actual issue was that we couldn't seem to retrieve the license and attribution info for files, since it's in the extmetadata field.
[12:32:20] Nemo_bis: Someone on this very IRC channel told me that retrieving that info via DB dump was 'hopeless'
[12:33:50] mathemancer: ah yes, actually I now remember the discussion. :)
[12:34:48] mathemancer: so you use the globalusage to connect the license and attribution to the pages where the files are used?
[12:35:09] Nemo_bis: No, that was a red herring
[12:35:27] Ah ok.
[12:35:59] I would think the first step is to avoid retrieving data for 60M files if you only need a fraction of these. (If)
[12:36:04] Nemo_bis: I'm using globalusage just to get how many places an image is used in. We're using this as sort of an 'authority' metric for ranking those images
[12:36:24] And what do you do with the lower "ranked" images?
[12:36:44] We catalog them in our DB
[12:37:04] So you do need all the 60M files after all?
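
Since the license and attribution details mathemancer mentions sit inside the extmetadata block of the imageinfo response, pulling them out of each page looks roughly like the sketch below. The field names (LicenseShortName, LicenseUrl, Artist, Credit) are standard extmetadata keys, but the helper itself and its fallbacks are illustrative and not taken from the CC Catalog codebase.

    # Sketch: pull license and attribution info out of one page's imageinfo
    # response, where it sits under the 'extmetadata' key. The helper name
    # and fallback values are illustrative, not part of any real codebase.
    def extract_license_info(page):
        info = page.get('imageinfo', [{}])[0]
        meta = info.get('extmetadata', {})

        def field(name):
            # extmetadata values are wrapped as {'value': ..., 'source': ...}
            return meta.get(name, {}).get('value')

        return {
            'url': info.get('url'),
            'license': field('LicenseShortName'),  # e.g. 'CC BY-SA 4.0'
            'license_url': field('LicenseUrl'),
            'creator': field('Artist'),            # often an HTML snippet
            'credit': field('Credit'),
        }
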
[12:37:07] Nemo_bis: I'm not sure if you recall, but I'm working on ccsearch.creativecommons.org
[12:37:16] mathemancer: thanks, I had forgotten :D
[12:37:37] Nemo_bis: At the cataloging level, our mandate is to 'Catalog all CC-licensed content' that we can find
[12:37:41] Ok, so the issue is that you could import the XML dump and run such an API locally, but there would be a delay
[12:38:01] Although maybe you could do it for the initial ingest
[12:38:14] Nemo_bis: We've actually thought about trying to run it on local infrastructure, yes
[12:38:33] I think what you should do now is to reduce the size of the batches and increase the parallelism slowly.
[12:39:43] Nemo_bis: I'll try that for the time being
[12:39:54] You can ask ops whether there is a place you can monitor to see if you're overloading whichever servers are serving those requests, or at least monitor any performance degradation
[12:40:11] Just don't go "oh let's try 100 parallel threads to see if I can bring Wikipedia down" ;)
[12:40:15] Nemo_bis: How do I ask ops?
[12:40:36] mathemancer: asking here is fine now that we have a clear thing to ask :D
[12:40:38] I'm definitely trying to avoid being the guy who takes down WMC for a day or anything
[12:42:06] Usually it's fine to just be reasonable and have a descriptive user agent; then if there are issues (which is very rare) you get contacted
[12:42:43] I do have that. It's f'CC-Catalog/0.1 (https://creativecommons.org; {CONTACT_EMAIL})'
[12:43:03] Nemo_bis: Where {CONTACT_EMAIL} is a real email that I read
[12:47:37] mathemancer: so that should be fine, but let's see if someone contradicts me. What concurrency do you think you'd need?
[12:49:00] Currently, we're using a concurrency of 4, and that allowed us enough throughput to keep things in sync (with a max data staleness of ~6 months)
[12:49:34] With the reduced gulp size, maybe turning up to concurrency 5 or 6 would be necessary. I'd have to do some empirical testing.
[12:51:08] mathemancer: creating a task for a heads-up/feature request is more than OK
[12:51:27] if people are not aware of needs, it is not possible to attend to them
[12:52:14] I wonder if there will be structured data Commons dumps of image metadata in the future
[12:52:52] jynus: I hope so.
[12:52:53] actually they exist
[12:53:02] but the content is not yet there
[12:53:54] https://dumps.wikimedia.org/other/wikibase/commonswiki/
[12:55:14] however I think those will be very useful for search, as "depicts" will only be on SDC
[12:59:12] I think there is ongoing work on improving the API for external reusers but I cannot find the concrete proposal
[13:08:49] That's https://www.mediawiki.org/wiki/User:RBrounley_(WMF)
[18:34:58] does anyone have a spare minute or two to review this? https://phabricator.wikimedia.org/T256572
[18:35:26] would be really appreciated
[18:39:12] tufor: I pinged Proc to remind them to deploy their patch
[18:39:24] it can't be deployed anyway before Monday
[18:39:35] Majavah: thanks
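
The advice above (modest, slowly increased parallelism plus a descriptive user agent) could be wired up roughly as in the sketch below. The worker count, batch list, fetch helper, and contact address are assumptions for illustration, not the CC Catalog's actual implementation.

    # Sketch: cap parallel API requests with a thread pool and identify the
    # client with a descriptive User-Agent, per the discussion above.
    # Worker count, batch list, and the fetch helper are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    HEADERS = {
        # Placeholder contact address; the real job uses a monitored mailbox.
        'User-Agent': 'CC-Catalog/0.1 (https://creativecommons.org; contact@example.org)',
    }

    def fetch_batch(params):
        # One bounded API request; real code would add retries and error handling.
        resp = requests.get('https://commons.wikimedia.org/w/api.php',
                            params=params, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()

    batches = []  # per-time-window query parameter dicts, built elsewhere

    # Start at a modest concurrency (4 here) and raise it gradually while
    # watching for errors or slowdowns, rather than jumping to many threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for result in pool.map(fetch_batch, batches):
            pass  # hand each result to the cataloging step
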