[06:51:24] I will let you know when I see infobliss around here
[06:51:24] @notify infobliss
[07:25:46] hi
[07:32:55] * infobliss is online now
[07:38:06] zhuyifei1999_ : have you got a chance to look at the code?
[09:31:24] I will let you know when I see infobliss and I will deliver that message to them
[09:31:24] @notify infobliss oops sorry I was afk
[11:10:00] I will let you know when I see basvb around here
[11:10:00] @notify basvb
[14:24:03] hi
[14:24:23] I have been working mostly at night these days
[14:24:46] Had to go somewhere in the daytime.
[14:24:54] after lunch
[14:26:08] sorry that you were looking for me and I wasn't here.
[14:26:45] zhuyifei1999_: Are you online now?
[14:33:13] This user is now online in #wikimedia-operations. I'll let you know when they show some activity (talk, etc.)
[14:33:13] @notify zhuyifei1999_ sorry I was away
[14:35:09] hi sorry I was away
[14:35:41] :)
[14:35:43] so I basically got rid of that hardcoded dict and did some rewrites
[14:35:45] a sec
[14:36:13] https://www.irccloud.com/pastebin/1Z1JgYRx/
[14:37:38] https://www.irccloud.com/pastebin/ypETXQVS/
[14:37:38] side-by-side diff
[14:38:42] sorry I think I need to reboot, something is wrong with my internet connection
[14:38:58] ok
[14:40:43] so now the dict is gone and also you are simply using re.
[14:41:04] no need for the photographer names to be in a separate file.
[14:41:13] back
[14:41:15] yeah
[14:41:28] I see the dict serves these purposes:
[14:42:06] 1. identify whether they are anefo, and remove "/ anefo" tags
[14:42:32] 2. remove […]
[14:42:47] 3. swap surname and given name
[14:43:11] so I did basically that
[14:43:54] this is very elegant thanks :)
[14:43:55] there are three mismatches though
[14:44:00] ('Jac. de Nijs', True) ('Jack de Nijs', True)
[14:44:00] ('Rob C. Croes', True) ('Rob Croes', True)
[14:44:00] ('J.D. Noske', True) ('Daan Noske', True)
[14:44:17] the first two already got category redirects
[14:44:33] the last one, I created it
[14:44:52] https://commons.wikimedia.org/wiki/Category:Photographs_by_J.D._Noske
[14:45:45] but why are these mismatches?
[14:46:02] https://commons.wikimedia.org/wiki/Category:J.D._Noske
[14:46:14] a.k.a. Daan Noske
[14:46:24] just two ways of calling the person
[14:47:26] you also had quite a lot of zero-width invisible characters in the dict that I removed
[14:47:44] they are the red dots in https://www.irccloud.com/pastebin/ypETXQVS/
[14:48:08] other things I changed:
[14:48:17] pep8-ify-ed the imports
[14:48:37] https://www.python.org/dev/peps/pep-0008/#imports
[14:48:57] removed that blank line after class NationaalArchief(GenericGLAM):
[14:49:09] and def fill_template(self, url):
[14:49:25] got rid of hasPhotographerInDict
[14:49:40] got rid of categories
[14:50:11] added some blank lines for spacing between different logic
[14:50:39] and removed unnecessary ones (eg within if.. elif.. else)
[14:51:15] made that TODO a comment instead of docstring, which doesn't make sense
[14:51:53] .. and other random changes
[14:52:09] All seems great except removing category from fill_template().
[14:52:23] why?
[14:52:52] this was there to get the categories manually inserted by the user in a html form.
[14:53:22] that's redundant to user editing the final generated wikitext
[14:54:13] so there won't be any option to add category in the html form at all?
[14:54:40] honestly, what do you think the html form should ask for?
[14:55:05] the glamname, url, categories.
[14:55:46] v2c asks for: source, conversion parameters (keep audio and/or video, format), destination (filename, format), and file description page
[14:56:16] what is file description page?
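The three purposes of the old dict listed above (detect and strip the "/ anefo" tag, remove bracketed text, swap surname and given name) could be sketched with `re` roughly as follows. This is not the code from the pastebin; the raw input format "Surname, Given / Anefo", the reading of item 2 as "drop anything in square brackets", and the names `normalize_creator`, `ANEFO_TAG` and `BRACKETED` are all assumptions made for illustration.

```python
import re

# Assumption: the source marks Anefo photographers with a trailing "/ Anefo".
ANEFO_TAG = re.compile(r"\s*/\s*anefo\s*$", re.IGNORECASE)
# Assumption: item 2 refers to annotations in square brackets.
BRACKETED = re.compile(r"\s*\[[^\]]*\]")


def normalize_creator(raw):
    """Return (display_name, is_anefo) for a raw creator string."""
    is_anefo = bool(ANEFO_TAG.search(raw))
    name = ANEFO_TAG.sub("", raw)           # 1. strip the "/ anefo" tag
    name = BRACKETED.sub("", name).strip()  # 2. drop bracketed text
    if "," in name:                         # 3. "Surname, Given" -> "Given Surname"
        surname, given = (part.strip() for part in name.split(",", 1))
        name = "{} {}".format(given, surname)
    return name, is_anefo
```

Names that are not in "Surname, Given" form pass through unchanged, which is why photographers that were never in the dict would also be handled.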
[14:56:22] wikitext
[14:56:50] it's prefilled by the data extracted from the source, and users shall be able to do anything with it
[14:56:59] whether it's adding or removing categories
[14:57:09] let me check other tools
[14:57:24] ok
[14:57:46] https://tools.wmflabs.org/url2commons/index.html even worse
[14:58:45] it asks for the file source url, target filename, and... asks you to fill an empty file description page
[14:59:00] true
[14:59:11] let me try flickr one https://tools.wmflabs.org/flickr2commons/#interface_language=en
[15:00:20] this one is more complex
[15:00:43] asks for the source, and extracts everything to show on the form
[15:01:00] ok
[15:01:37] actually asking the glamname is also redundant if we already know the url
[15:01:41] yes
[15:02:37] it's as simple as: ask for either glam + id or full url, give and ask if the user is satisfied with the generated description, upload
[15:02:53] *and allow users to modify it
[15:03:28] sok right
[15:03:34] ok
[15:03:59] makes sense
[15:04:17] oh btw, docstrings should be in double quotes per https://www.python.org/dev/peps/pep-0257/ pep 257
[15:05:18] ok I will do that
[15:05:37] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L27 <= add a \
[15:05:49] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L80 <= same
[15:06:01] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L81 <= {{{{
[15:06:15] each {{ will be converted to { in .format()
[15:06:47] alright I see
[15:06:54] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L104 where's the template closing }} wikitext (which needs }}}})?
[15:07:06] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L111 what is this?
[15:07:53] the license does not belong to any {{Artwork}}, {{Information}}, {{Photograph}}, etc.
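A minimal sketch of the brace escaping and template layout discussed above. In `str.format()` a doubled brace produces one literal brace, so the wikitext `{{Photograph` has to be written as `{{{{Photograph` and the closing `}}` as `}}}}`. The field names, the `build_wikitext` helper and the sample values are hypothetical and not taken from infobox_templates.py; the `\` after the opening triple quote is assumed to be the backslash suggested above (it suppresses the leading newline). The license and categories are appended after the infobox is closed, not inside it.

```python
# Hypothetical infobox template; {{{{ and }}}} become {{ and }} after .format().
INFOBOX = """\
{{{{Photograph
 |photographer = {photographer}
 |date = {date}
 |source = {source}
}}}}"""


def build_wikitext(fields, license_tag, categories):
    """Fill the infobox, then append license and categories outside of it."""
    lines = [INFOBOX.format(**fields), "", license_tag]
    lines.extend("[[Category:{}]]".format(name) for name in categories)
    return "\n".join(lines)


# Sample values are placeholders, not real metadata.
text = build_wikitext(
    {"photographer": "Rob Croes", "date": "1980-01-01",
     "source": "http://example.org/1"},
    "{{Cc-zero}}",
    ["Photographs by Rob Croes"],
)
print(text)
```

Because the license tag is passed in as a plain string and never goes through `.format()`, its braces need no escaping.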
[15:08:09] this I had by mistake added the category and license info inside the template
[15:08:32] k
[15:08:44] but it will be part of wikitext to be shown to the user
[15:10:25] https://github.com/infobliss/sibutest2/blob/master/OOP/GenericGLAM.py#L6 what is self.template for?
[15:10:48] think about this: what should each instance represent?
[15:12:00] This was there because I thought we might initialise with a given template
[15:12:15] as in?
[15:13:00] I mean when/where/why it would be used that way?
[15:13:35] once we got to know the right template for the given image do we make an instance of our class using that template?
[15:14:31] well, you already used the instance to figure out the template
[15:14:41] def choose_correct_template(url):
[15:15:28] yeah right. I was considering the possibility that we figured it out even before
[15:15:29] so instance-ization happened before you figure out the template
[15:15:39] eg?
[15:16:38] may be if we ask the user explicitly
[15:16:50] hmm
[15:17:09] sure, but that might make your code real complicated
[15:17:30] to adapt to the various templates
[15:17:31] since I don't know how we would choose the right template from the url
[15:17:51] unless you can make a consistent api
[15:19:27] do you know how one can figure out the right template for a given image?
[15:19:48] the glam api may provide what the image is
[15:20:06] or yeah you can ask the user for it
[15:20:49] so unless the glam api provides it we can't help asking the user.
[15:20:49] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L111 many of the parameters are similar
[15:21:23] well, autodetect without glam providing it is out of my knowledge
[15:21:50] might need AI
[15:22:06] or maybe basvb has some ideas
[15:22:47] By AI do you mean image processing?
[15:23:01] for categorization
[15:23:09] no, artificial intelligence
[15:23:27] yeah I know
[15:23:44] to distinguish between images of different types based on statistical learning
[15:24:05] yeah that's what I was referring to
[15:24:11] k
[15:24:40] well I think you said that you would advise on the OOP design too.
[15:24:55] but I will be going for dinner and come back
[15:25:16] say in 15 minutes
[15:25:23] the OOP question is still 23:10 think about this: what should each instance represent?
[15:25:25] k
[15:25:49] ok
[15:36:43] hi
[15:37:01] basvb: check the logs :P
[15:37:18] will do
[15:37:40] I can see whether I've some time this week, but am away on vacation and the internet is also a little off and on
[15:37:46] I replaced the creator mapping with regex substitutions
[15:37:47] ok
[15:38:06] so right now there is a problem of selecting which template
[15:38:23] which infobox template?
[15:38:29] yeah
[15:38:45] depends on the collection of the glam
[15:39:04] for pure photo's -> photograph
[15:39:06] I guess that part is for you :)
[15:39:12] for photos of artworks -> art photo
[15:39:37] well, never hard code unless user themselves can configure the options
[15:40:26] i.e. which collection is pure photo and which are photos aof artworks
[15:40:29] *of
[15:40:45] yes that's mapping based configuration
[15:41:01] some collections will have both
[15:41:10] well, think of the long-term maintainability...
[15:42:36] I think I'm not really following what you're pointing at
[15:43:00] I mean glams will potentially have more and more collections
[15:43:21] what should we do if we see a new collection?
[15:47:25] I think this differs a lot from glam to glam?
[15:47:39] most glams will have similar collections
[15:47:44] hmm
[15:47:53] so often it will be possible to use 1 mapping for multiple collections
[15:48:12] they will have the same names?
[15:48:13] sometimes it will be cleaner/needed to have multiple mappings for different collections
[15:48:59] in nationaal archief there is the ANEFO collection, but also some ww1 + ww2 collections
[15:49:06] all of those are images
[15:49:19] but the ww1+ww2 have english descriptions, where anefo does not
[15:49:46] you mean so all of them are photographs of non-artworks?
[15:49:51] yes
[15:49:55] ok
[15:49:56] the photo itself is the object
[15:50:25] the majority of stuufs on gallica are artworks iirc
[15:50:29] *stuffs
[15:50:45] but there are exceptions and idek if they have collections
[15:50:50] an earlier upload I did (RCE) has mainly photos, but also some collections of photo's of objects within the archive (where art photo would suit better)
[15:50:51] (might need to ask yann)
[15:51:26] so there will always go some work (and thinking) into a mapping for a specific glam, and we'll always get weird edge cases
[15:51:50] therefore it's good to scale up the standard stuff, but allow for exceptions within mappings
[15:53:49] hi
[15:53:53] also what do you think about this workflow (similar to v2c's): ask the user for uri or glamname+imageid, generate the file description page and ask the user to edit, then upload with the provided description
[15:53:57] I am done with dinner
[15:54:11] user shall edit the page raw
[15:54:30] I mean raw wikitext
[15:55:42] there's one difficulty in switching between infobox templates manually: the set parameters are quite different
[15:55:47] *the set of
[15:56:19] I think it would be a good option for the end user to edit the file descriptions
[15:56:39] but maybe not always (as in they could set it as an option)
[15:56:43] raw or with a very complicated form?
[15:56:47] hmm
[15:57:08] infobliss: how are your javascript skills?
[15:57:16] if user wishes we also like to reduce the number of steps
[15:57:34] zhuyifei1999_: they'll have to work on wikicode based on the template, so I'd say raw
[15:57:44] I did some minor work with javascript
[15:57:53] shouldn't be a problem
[15:57:56] I mean for some glams the dates might be formatted etc.,
[15:58:02] we can parse dates
[15:58:34] as for which template to select: if for one glam multiple infobox templates are needed it's best to do it at the step where you decide on categories as well
[15:58:41] we ask there: is this a photo or art photo
[15:58:50] hmm
[15:59:10] preferably we try to determine it based on collection or other metadata
[16:00:27] 1. ask the user about the glam+id/url 2. try to guess art photo, or regular photo, then return to the user
[16:00:44] 3. fill the template with information, and return to user
[16:00:46] 4. upload
[16:02:04] if we want to have dynamic switching between editing raw wikitext and complex forms, I highly suggest to do the switching purely with javascript
[16:02:35] so work from the user does not get lost for no good reason
[16:04:09] what are 'complex forms'?
[16:04:46] a sec
[16:05:15] true zhuyifei of not getting the users work lost, but it might result in quite a bit of overhead
[16:05:22] the switching back and forth
[16:06:20] still reading through the logs
[16:06:31] https://usercontent.irccloud-cdn.com/file/Eau26inn/Screenshot%20from%202017-07-10%2000-06-11.png
[16:06:53] like a lot of files, photographer, date, blah blah blah
[16:06:53] the photographers were indeed hard coded as a quick solution, all of the code base really was just a quick 1 day hack to get this to work
[16:07:26] so having a moderately large form
[16:08:04] yep, but do we have to maintain such a form for each glam/collection?
[16:08:40] no
[16:08:56] * zhuyifei1999_ hates maintenance burden
[16:09:55] infobliss: how do you feel about https://github.com/toollabs/video2commons/blob/master/video2commons/frontend/static/video2commons.js ?
[16:10:22] I'm less good in js than python
[16:11:13] on first sight it looks pretty large and complex :P
[16:11:16] I'd estimate the amount of js used in switching from the form and back is about a quarter of that js
[16:11:58] ok
[16:12:18] yeah, that "large and complex" happens when you don't have a proper loading module to separate the script into different files
[16:12:36] that script is old-school
[16:13:09] ok
[16:13:25] as far as I am aware, the current front-end js development is insanity
[16:13:47] https://hackernoon.com/javascript-vs-python-in-2017-d31efbb641b4
[16:13:56] don't fall into that rabbit hole :P
[16:14:30] https://hackernoon.com/how-it-feels-to-learn-javascript-in-2016-d3a717dd577f
[16:14:38] hahaha
[16:14:52] its says "So…Are You Saying JavaScript is Dead?"
[16:15:06] *it
[16:15:51] well, it's some recommended reading so you know why never fall into that rabbit hole :P
[16:16:03] so the dictionary of photographers to regex is a nice improvement
[16:16:12] but the downside is, argh one-file-js
[16:16:18] basvb: thx
[16:16:36] there were some edge cases but I see you try to catch those (the known ones) commons side
[16:16:43] yeah
[16:16:52] 2 of 3 are already caught
[16:17:03] the good thing here is that there are dozens of photographers not in the dict, which will work now as well
[16:17:05] or wait
[16:17:20] maybe that was the point, have to look at those
[16:17:21] that
[16:17:30] ?
[16:18:16] only those with a category on commons were in the dict I think
[16:18:45] well, new categories can be created by other users
[16:18:49] yup
[16:19:09] https://commons.wikimedia.org/wiki/Special:WantedCategories exists for a reason
[16:19:13] but not every photographer with a one of photo should per se have their own (uncreated) category
[16:19:34] that's why back then I decided on a dict as an easy solution on short term
[16:20:07] the 80/20 rule applied here very well, 80% of the photos were taken by 20% (or maybe even 5%) of the photographers
[16:20:14] it's doesn't really hard anything to leave a redlink category imo
[16:20:26] hard=hurt?
[16:20:34] yeah
[16:20:50] I think I was saying harm
[16:21:08] muscle memory lol
[16:21:18] well I agree it's no solution to do this dict style, so regex is a good improvement
[16:21:26] I always type "saw" "was"
[16:21:32] if red-cats is an issue we could end up doing category checking
[16:21:56] something I would like to see a function for within the framework anyway (if it's not too complicated)
[16:21:57] I think it's fine, unless someone complains
[16:22:06] they do :)
[16:22:17] I remember the 500k monuments upload
[16:22:28] and the thousands of categories which were dumped on me
[16:22:37] lol
[16:22:41] we wanted to red cat them and replace them for suitable cats
[16:22:54] somebody hated red cats and created loads of ugly categories
[16:23:03] Category:RCE Suggested: Watertoren
[16:23:23] where RCE suggested was part of the category
[16:23:29] he went on to create thousands of those
[16:23:48] and then said to us: your problem now I deal with the red cats now they are blue
[16:24:11] I've fixed hundreds if not thousands over the years
[16:24:12] hahaha
[16:24:32] the philosophy of wikipedia that stuff will get done someday by somebody doesn't always work
[16:24:43] and I hate to leave a mess behind
[16:24:55] talking about that, I remembered the ex-admin Foroa
[16:25:00] https://commons.wikimedia.org/wiki/Category:RCE_suggested_categories
[16:25:06] might have been foroa
[16:25:14] they always sort categories
[16:25:39] https://commons.wikimedia.org/wiki/Category:Rijksmonumenten_places_to_be_classified
[16:25:43] that one was even worse
[16:25:52] we created lot of red cat place categories
[16:25:59] I think a few hundred to thousands
[16:26:02] I remember this category
[16:26:10] he just dumped them all in this maintenance category
[16:26:13] I think I had a bot task somewhat related
[16:26:22] instead of leaving them red or finding out which region they are in
[16:26:29] so I've been fixing those over the years
[16:27:11] well big upload you can't avoid to leave behind some mess
[16:27:24] we had a very nice system where you could type in the monument ID
[16:27:35] oh, the reason why a lot of people hare fae and russavia
[16:27:41] I think there are over 100.000 images identified by commons users
[16:28:02] one of the biggest structural images clean up (in categorisation) has been happening here
[16:28:13] rudolphous and bardenoki are doing amazing work
[16:28:13] wow
[16:28:22] they categorised them per monument
[16:28:25] ok I know nether ^
[16:28:28] and identified tens of thousands
[16:28:30] *neither
[16:28:47] they are the type of user you never hear, but who are doing the work in the back ground
[16:28:55] yeah
[16:28:57] I think rudolphous is even a mod
[16:29:06] to clean up his own errors
[16:29:06] you mean admin?
[16:29:09] yep
[16:30:13] I mist admit that I'm the lazy type. make automation to do more with less work :)
[16:30:17] *must
[16:30:45] yep, well we started with auto + semi-auto
[16:30:53] but in the end there is just a huge amount of hand work
[16:31:02] sometimes https://www.xkcd.com/1319/ applies tjough
[16:31:05] *though
[16:31:24] lol
[16:31:27] the dutch monument collections on commons and nlwiki are very nice connected and very complete.
[16:31:35] we were taking pictures of monuments
[16:31:44] and somebody went into the tourist info
[16:31:52] to ask whether they knew a certain monument
[16:32:02] and they said: they have no picture/info on it yep
[16:32:12] who is they? we asked
[16:32:19] turns out they were looking at wikipedia
[16:32:28] the list we were taking the pictures for
[16:32:33] lol
[16:33:12] the same way: how to download from google arts project? search commons
[16:33:25] yes automation and the perfection of solutions, my daily work, we talk about how to automate things for months
[16:33:43] which complicated algorithm will work best
[16:33:52] and then in the end the simple frequency table is optimal
[16:34:15] hmm
[16:34:27] what's that automating?
[16:34:57] email routing
[16:35:05] but with imperfect data
[16:35:12] hmm
[16:35:15] as in no routing data (labels)
[16:35:27] but that's bureaucracy at its best
[16:35:29] o.O
[16:36:24] for the real innovative stuff I should just do some nice wikiprojects + hackathons
[16:37:08] ok I'm through the logs
[16:37:57] https://github.com/infobliss/sibutest2/blob/master/OOP/NationaalArchiefGLAM.py#L150 why does it say dict
[16:38:00] and the rest diction
[16:38:29] infobliss: ^
[16:38:30] sorry typo
[16:38:51] diction is a very bad name anyhow imo
[16:38:57] true
[16:39:03] try to go with real meaningful names
[16:39:21] if I look at it I should know directly what it means
[16:39:48] infobox_template_parameters -> but maybe that is too long zhuyifei1999_/
[16:39:51] ?
[16:40:08] same for right_template
[16:40:16] infobox_params
[16:40:27] template_type for right_template?
[16:40:32] right_template is bad. define 'right'
[16:40:34] or rather infobox_type
[16:40:39] yeah
[16:40:53] infoxos is clearer
[16:40:59] *infobox
[16:41:13] ok
[16:42:55] I'm off to dinner in a bit, in the following days I'll try to continue on the other mapping
[16:43:05] but can't ensure that'll work with my connection here
[16:43:15] any questions for now?
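Using the `infobox_type` name agreed on above, the collection-based choice of infobox discussed earlier in the log (a per-GLAM mapping, with the user as fallback for unknown collections) might look like the sketch below. The mapping contents, the keys and the `choose_infobox_type` function are hypothetical; only the idea of a configurable mapping with exceptions comes from the discussion.

```python
# Hypothetical, user-editable configuration: GLAM -> collection -> infobox type.
COLLECTION_INFOBOX_TYPES = {
    "nationaal_archief": {"ANEFO": "Photograph"},
    "rce": {"photos": "Photograph", "objects": "Art Photo"},
}


def choose_infobox_type(glam, collection, user_choice=None):
    """Return the infobox type, or None when the user has to be asked.

    An explicit user choice always wins; otherwise the mapping is consulted.
    A collection that is not in the mapping yields None, so a new collection
    leads to a question for the user and not to a silent guess.
    """
    if user_choice:
        return user_choice
    return COLLECTION_INFOBOX_TYPES.get(glam, {}).get(collection)


infobox_type = choose_infobox_type("nationaal_archief", "ANEFO")
```

Returning `None` for an unmapped collection keeps the mapping small: standard collections are handled automatically, and edge cases fall through to the step where the user is asked whether the image is a photo or an art photo.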
[16:43:48] @Basvb so with the present approach I think scalability will be ensured.
[16:44:05] https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py please look at the differences between the templates
[16:44:29] yes, we are more and more getting into more scalable and generic code
[16:44:47] next steps are: getting a good work flow
[16:45:00] both for end users as well as for adding a new template
[16:45:06] I'll probably be away from july 14 to 19
[16:45:09] + documentation on how to add a new template
[16:45:16] I'll take over at the 14th then ;)
[16:45:23] k
[16:45:27] ok
[16:45:52] basically I can't guarantee that I'm available
[16:46:01] I think the proposed work flow for end users by zhuyifei, where the end user can edit (have the last say) is a good one
[16:46:43] please look at the differences between the templates <= do you mean the difference between the fields in them?
[16:46:47] try to integrate the fix of the photographer dict, it's a good step into better code
[16:46:58] sure
[16:47:01] I mean that you have license, and categories as part of the infobox template
[16:47:09] but those are for every infobox template the same
[16:47:31] oh that was pointed out before by zhuyifei1999_
[16:47:36] I will remove them
[16:47:37] so there is a template for the infobox templates (art photo, photograph)
[16:47:42] and below that there will always be the same
[16:48:02] you can make a template for that as well, but it's just a few standard things
[16:48:22] but should work the same across the different infoboxes
[16:48:54] + the closing }} are at the wrong point
[16:49:03] yeah ik
[16:49:09] Will change
[16:49:12] they should be at https://github.com/infobliss/sibutest2/blob/master/OOP/infobox_templates.py#L104
[16:50:58] You may check the modified code sometime later
[16:51:14] please upload the code frequently
[16:51:20] after I incorporate the required changes in sometime
[16:51:23] so we can always look at the version of the last hours
[16:51:43] and try to catch the small errors with good testing
[16:52:09] I am not sure but I was not able to get the previous versions of the code in github
[16:52:10] so we can focus our critique on the more complicated errors, then you will learn more ;)
[16:52:35] sure I will
[16:52:50] Does github save the latest version of code only?
[16:53:11] * zhuyifei1999_ is back
[16:53:12] no, all versions
[16:53:25] git is a version control system
[16:53:27] but I'm no git expert, get tangled up in it also some times
[16:53:35] so it has to store all versions
[16:53:58] how to see them?
[16:54:10]