[06:29:13] <gry>	 Hi, mvolz
[07:01:30] <qedk>	 can someone reset my toolforge account
[07:01:59] <qedk>	 my icloud keychain has some unique password and my laptop is dead :/
[07:02:32] <Majavah>	 qedk: you can't use Special:ResetPassword?
[07:02:50] <qedk>	 @Majavah on toolforge?
[07:03:32] <Majavah>	 qedk: did you ever login to wikitech.wikimedia.org with that account?
[07:04:14] <qedk>	 I do have a Wikitech account, didn't use it
[07:04:24] <qedk>	 let me try, thanks @Majavah 
[07:04:37] <Majavah>	 wikitech uses same accounts as toolsadmin.wikimedia.org
[07:04:43] <qedk>	 ah
[07:04:44] <Majavah>	 so try requesting a password reset from there
[07:05:01] <qedk>	 the toolforge login page didn't have a reset link so i thought it's disallowed
[07:05:12] <Majavah>	 yeah, it's a little complicated
[07:06:58] <qedk>	 ayy it worked, thanks!
[07:07:12] <qedk>	 so messy having to reset passwords and ssh keys 
[07:08:07] <Majavah>	 that sounds not fun
[07:08:54] <qedk>	 most of the painful part of moving to windows is over (i hope)
[07:19:20] <snitch>	 [[Tech]]; 41.214.85.38; /* 781282188 */ new section; https://meta.wikimedia.org/w/index.php?diff=20240609&oldid=20227191&rcid=15812558
[07:23:51] <snitch>	 [[Tech]]; WikiBayer; Reverted changes by [[Special:Contributions/41.214.85.38|41.214.85.38]] ([[User talk:41.214.85.38|talk]]) to last version by Tegel; https://meta.wikimedia.org/w/index.php?diff=20240612&oldid=20240609&rcid=15812561
[11:42:04] <snitch>	 [[Tech]]; Atsipas; /* 2004 Audi A4 1.8T Cabriolet */ new section; https://meta.wikimedia.org/w/index.php?diff=20241343&oldid=20240687&rcid=15813743
[11:52:34] <snitch>	 [[Tech]]; Nemo bis; Undo revision 20241343 by [[Special:Contributions/Atsipas|Atsipas]] ([[User talk:Atsipas|talk]]); https://meta.wikimedia.org/w/index.php?diff=20241356&oldid=20241343&rcid=15813765
[11:53:04] <Nemo_bis>	 These car advertisers seem a bit too emboldened after https://www.theguardian.com/media/2020/jul/01/france-bans-dutch-bike-tv-ad-for-creating-climate-of-fear
[11:57:22] <acagastya>	 Hi.  I had requested a wgCopyUploadsDomains addition a month ago.  Can anyone process, or at least comment on a couple of those requests?
[11:58:54] <Majavah>	 acagastya: you have task numbers?
[11:58:59] <Majavah>	 can't look without them
[11:59:22] <acagastya>	 https://phabricator.wikimedia.org/T254342
[12:01:34] <Majavah>	 that request had fallen down the list so no-one probably saw it when looking for new requests, I dragged it up to top of the list so it can be seen
[12:02:01] <Majavah>	 that looks sensible to me, I'll take a look at it before monday (which is the earliest it can be deployed anyways)
[12:02:14] <acagastya>	 Okay.  Thank you.
[12:02:24] <acagastya>	 There was another one which I had filed minutes ago.
[12:02:42] <Majavah>	 arxiv.org?
[12:03:09] <acagastya>	 RhinosF1 has edited it, (the same user who did the first request).
[12:03:14] <acagastya>	 Yes, arxiv.org.
[12:03:44] <acagastya>	 Thanks for handling the nature.com, Majavah.
[12:04:05] <Majavah>	 if Samuel doesn't process that one before I look at nature I'll do that in the same patch
[12:05:32] * acagastya does not know who Samuel is.
[12:05:44] <acagastya>	 But in any case, if these two are added, it would make it much easier for me to upload those papers.
[12:06:04] <Majavah>	 Samuel = RhinosF1
[12:06:16] <acagastya>	 Oh, okay.
[12:06:56] <acagastya>	 Nemo_bis told me that Commons ignores copyfrauds and uploads the files anyway.
[12:07:20] <acagastya>	 Would it be sensible for me to request a addition of another domain which contains loads of PD material?
[12:07:37] <Nemo_bis>	 ...as long as someone actually plans to use it
[12:08:11] <acagastya>	 Letters of former US presidents -- I don't doubt they are going to be useful.
[12:08:43] <Majavah>	 I'm not sure if anyone would want to add a domain which has copyfraud
[12:09:39] <acagastya>	 Have a look at Shapell.org -- they collect many PD letters, make a faithful copy of PD material, and make it available on the internet.
[12:10:17] <mathemancer>	 Hey, all.  An ingestion job I run regularly has been failing for a bit with the error PHP fatal error:
[12:10:20] <mathemancer>	 Allowed memory size of 698351616 bytes exhausted (tried to allocate 933888 bytes)
[12:10:57] <mathemancer>	 Is this related to https://phabricator.wikimedia.org/T256459 ?
[12:10:59] <Majavah>	 acagastya: if the first sentence in their TOU says "don't copy this, all rights reserved", I don't think we would add that to the list
[12:11:13] <Nemo_bis>	 Majavah: wrong
[12:11:28] <Nemo_bis>	 mathemancer: ingestion where?
[12:11:30] <Majavah>	 how so?
[12:11:34] <mathemancer>	 A request producing this error is https://commons.wikimedia.org/w/api.php?action=query&generator=allimages&gaisort=timestamp&gaidir=newer&gailimit=500&prop=imageinfo|globalusage&iiprop=url|user|dimensions|extmetadata&gulimit=500&gunamespace=0&format=json&gaistart=1593648000&gaiend=1593734400
[12:11:47] <mathemancer>	 I'm ingesting data from WMC via the api
[12:12:02] <acagastya>	 Well, we are going to upload all the PD material from their site, Majavah.  (If what Nemo_bis says is correct.)
[12:12:13] <Nemo_bis>	 mathemancer: have you already considered reducing the limit or the amount of metadata fields? that can be a lot of stuff
[12:13:15] <mathemancer>	 Nemo_bis: It was working until 22 June
[12:13:22] <Nemo_bis>	 mathemancer: I think it's more likely that you're running into https://phabricator.wikimedia.org/T201205
[12:13:25] <mathemancer>	 Would a memory limit have changed recently?
[12:14:32] <Nemo_bis>	 Users are now uploading millions of DjVu files, so a request of 500 rows like that might be several GB
[12:14:39] <mathemancer>	 Nemo_bis: I'm aware of that bug (T201205)
[12:14:39] <stashbot>	 T201205: Bad metadata for a single file errors out the complete imageinfo prop request - https://phabricator.wikimedia.org/T201205
[12:15:10] <mathemancer>	 Nemo_bis: the error behavior is different for the current problem
[12:15:18] <Nemo_bis>	 (Or not, I'm looking at the options again)
[12:15:19] <Nemo_bis>	 Ok
[12:15:25] <acagastya>	 I was wondering about the captcha that WMF projects use -- do we have a statistics on that?
[12:15:40] <Nemo_bis>	 acagastya: no need, we already know that it's a plague
[12:15:56] <acagastya>	 How so, Nemo_bis?
[12:15:59] <Nemo_bis>	 But yes, there are or used to be some dashboards based on the error logs
[12:16:04] <mathemancer>	 Nemo_bis: The bug you mentioned returned a JSON with an error, but the new one doesn't return a JSON at all, but rather a page with the error on it (in HTML)
[12:18:03] <acagastya>	 Majavah: if arxiv.org is whitelisted, will I be able to upload files via their URIs using pywikibot?
[12:18:07] <mathemancer>	 Nemo_bis: It seems cutting down to only 100 results per request avoids the error.
[12:18:24] <mathemancer>	 Nemo_bis: Thanks for the hint.  Do you think I should file a bug report/
[12:18:32] <Majavah>	 acagastya: I think that's how the whitelist works
[12:18:51] <acagastya>	 Huh?
[12:19:27] <acagastya>	 I used to upload via [[com:Special:Upload]], pasting the URI.
[12:19:40] <Nemo_bis>	 mathemancer: self-answer, works up to gailimit=470, which takes ~12 seconds
[12:20:05] <acagastya>	 If I can do that using piwikibot, I might as well request, and then create a bot that handles those.
[12:20:10] <Nemo_bis>	 mathemancer: you can certainly report that you'd like a limit of 500 to work and it no longer does, but I wouldn't be surprised if it's declined
[12:20:29] <mathemancer>	 Nemo_bis: Fair enough.  Thanks!
[12:20:37] <Nemo_bis>	 It's useful to know such things, but I think our stance is generally that using smaller batches is not a bad thing :) 
[12:21:14] <Nemo_bis>	 Unless of course you plan to do things like asking 30M titles from the English Wikipedia, in which case maybe it's better to have less requests
[12:21:57] <mathemancer>	 Nemo_bis: The problem is, I'm trying to gather metadata about all the images on WMC, and the data dumps don't seem to have the global usage info
[12:22:34] <mathemancer>	 Nemo_bis: So, I'm trying to use the API, but I think it's not really set up for wholesale data ingestion like this.
[12:25:44] <Nemo_bis>	 mathemancer: yes, copying an entire database table like that is often a bit overkill for the API
[12:26:01] <acagastya>	 Nemo_bis: re captcha --- how so?
[12:26:26] <Nemo_bis>	 acagastya: have you tried searching mediawiki.org?
[12:26:33] <Nemo_bis>	 mathemancer: but why is https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/commonswiki/20200620/commonswiki-20200620-globalimagelinks.sql.gz not enough?
[12:27:00] <Nemo_bis>	 (Am I mixing up things again?)
[12:27:15] <acagastya>	 Yes, Nemo_bis, I have.
[12:27:40] <Nemo_bis>	 acagastya: well there's all sorts of discussions of how our CAPTCHA are terrifyingly bad, e.g. https://www.mediawiki.org/wiki/CAPTCHA
[12:28:02] <mathemancer>	 Nemo_bis: Let me look back through why we went the way we did (It's been a few months)
[12:28:08] <Nemo_bis>	 There's also a recent page https://www.mediawiki.org/wiki/Core_Platform_Team/Initiative/Captcha/Initiative_Description but no idea about the status of this proposal
[12:28:26] <acagastya>	 I was wondering if it is possible to make use of captcha to train some models to help image classification on commons, and also help with ORC on Wikisource.
[12:28:57] <Nemo_bis>	 acagastya: yes, it's a long-standing proposal. Not that easy to implement (especially now that DjVu is dead-ish)
[12:30:34] <acagastya>	 If there is a way to extract words from Wikisource material and provide that as captcha.
[12:30:52] <acagastya>	 Or give a bunch of photos and as if there is a cycle, or a blue door in it.
[12:31:23] <acagastya>	 It was you, Nemo_bis, who responded to "scope" concern of Vermont on #wikimedia-commons, right?
[12:31:44] <mathemancer>	 Nemo_bis: I was wrong in my earlier message.  As I look back through my old notes, the actual issue was that we couldn't seem to retrieve the license and attribution info for files, since it's in the extmetadata field.
[12:32:20] <mathemancer>	 Nemo_bis: Someone on this very IRC channel told me that retrieving that info via DB dump was 'hopeless'
[12:33:50] <Nemo_bis>	 mathemancer: ah yes, actually I now remember the discussion. :) 
[12:34:48] <Nemo_bis>	 mathemancer: so you use the globalusage to connect the license and attribution to the pages where the files are used?
[12:35:09] <mathemancer>	 Nemo_bis: No, that was a red herring
[12:35:27] <Nemo_bis>	 Ah ok.
[12:35:59] <Nemo_bis>	 I would think the first step is to avoid retrieving data for 60M files if you only need a fraction of these. (If)
[12:36:04] <mathemancer>	 Nemo_bis: I'm using globalusage just to get how many places an image is used.  We're using this as sort of an 'authority' metric for ranking those images
[12:36:24] <Nemo_bis>	 And what you do with the lower "ranked" images=
[12:36:44] <mathemancer>	 We catalog them in our DB
[12:37:04] <Nemo_bis>	 So you do need all the 60M files after all?
[12:37:07] <mathemancer>	 Nemo_bis: I'm not sure if you recall, but I'm working on ccsearch.creativecommons.org
[12:37:16] <Nemo_bis>	 mathemancer: thanks, I had forgotten :D
[12:37:37] <mathemancer>	 Nemo_bis: At the cataloging level, our mandate is to 'Catalog all CC Licensed content' that we can find
[12:37:41] <Nemo_bis>	 Ok, so the issue is that you could import the XML dump and run such API locally, but there would be a delay
[12:38:01] <Nemo_bis>	 Although maybe you could do it for the initial ingest
[12:38:14] <mathemancer>	 Nemo_bis: We've actually thought about trying to run it on local infrastructure, yes
[12:38:33] <Nemo_bis>	 I think what you should do now is to reduce the size of the batches and increase the parallelism slowly.
[12:39:43] <mathemancer>	 Nemo_bis: I'll try that for the time being
[12:39:54] <Nemo_bis>	 You can ask ops whether there is a place you can monitor to see if you're overloading whichever servers are serving those requests, or at least monitor any performance degradation
[12:40:11] <Nemo_bis>	 Just don't go "oh let's try 100 parallel threads to see if I bring Wikipedia down" ;)
[12:40:15] <mathemancer>	 Nemo_bis: How do I ask ops?
[12:40:36] <Nemo_bis>	 mathemancer: here it's fine now that we have a clear thing to ask :D 
[12:40:38] <mathemancer>	 I'm definitely trying to avoid being the guy who takes down WMC for a day or anything
[12:42:06] <Nemo_bis>	 Usually it's fine to just be reasonable and have a descriptive user agent, then if there are issues (which is very rare) you get contacted
[12:42:43] <mathemancer>	 I do have that.  It's     f'CC-Catalog/0.1 (https://creativecommons.org; {CONTACT_EMAIL})'
[12:43:03] <mathemancer>	 Nemo_bis: Where {CONTACT_EMAIL} is a real email that I read
[12:47:37] <Nemo_bis>	 mathemancer: so that should be fine but let's see if someone contradicts me. What concurrency do you think you'd need?
[12:49:00] <mathemancer>	 Currently, we're using a concurrency of 4, and that allowed us enough throughput to keep things in sync (with a max data staleness of ~6 months)
[12:49:34] <mathemancer>	 With the reduced gulp size, maybe turning up to concurrency 5 or 6 would be necessary.  I'd have to do some empirical testing.
[12:51:08] <jynus>	 mathemancer: creating a task for a heads up/feature request is more than ok
[12:51:27] <jynus>	 if people are not aware of needs, it is not possible to attend those
[12:52:14] <jynus>	 I wonder if there will be structured data commons dumps of image metadata in the future
[12:52:52] <mathemancer>	 jynus: I hope so.  
[12:52:53] <jynus>	 actually they exist
[12:53:02] <jynus>	 but the content is not yet there
[12:53:54] <jynus>	 https://dumps.wikimedia.org/other/wikibase/commonswiki/
[12:55:14] <jynus>	 however I think those will be very useful for search, as "depicts" will only be on sdc
[12:59:12] <jynus>	 I think there is ongoing work for improving the api for external reusers but I cannot find the concrete proposal
[13:08:49] <Nemo_bis>	 That's https://www.mediawiki.org/wiki/User:RBrounley_(WMF) 
[18:34:58] <tufor>	 anyone has a spare minute or two to review this? https://phabricator.wikimedia.org/T256572
[18:35:26] <tufor>	 would be really appreciated
[18:39:12] <Majavah>	 tufor: I pinged Proc to to remind them to deploy their patch
[18:39:24] <Majavah>	 it can't be deployed anyways before Monday
[18:39:35] <tufor>	 Majavah: thanks