[00:26:06] does anyone here use AWB? [00:26:34] try #autowikibrowser [00:26:40] legoktm: I'm in there but not very active [00:27:12] legoktm: do you use it? [00:27:25] no [00:27:40] but you'd probably get a better response if you asked what your question actually is [00:28:04] Well I'm using AWB on a Mac using CrossOver and it's working fine [00:28:06] but the problem is: [00:28:55] I'm fixing typos using regex, I'm using a database dump. When it finds an error it stops processing the list of articles, and I'd like the list of articles to run in the background [00:29:08] so I can go AFK and come back and not have to wait 1 min for it to find a typo [00:29:14] Don't start the list maker via make list [00:29:22] *db scanner [00:29:28] I don't know how to do that [00:29:41] Give me more than 6 seconds to respond? [00:29:49] Oh sorry.. [00:31:08] Pressing stop in the main AWB window (not the scanner) should stop it processing [00:31:22] so, is https://commons.wikimedia.org/wiki/User:Odder/Sandbox a bugable offence, or just me being silly? [00:31:32] Reedy: But I don't want it to stop processing [00:31:40] I want it to keep processing even when it finds an error [00:31:51] parser functions behaving differently on translation pages than they do on normal ones [00:32:07] Options -> Use pre parse mode? [00:32:18] Reedy: I don't see that [00:32:34] oh found it [00:32:54] Reedy: What's that do? [00:32:57] Reedy: okay, when do I set that? [00:33:48] where* [00:33:58] twkozlowski: I guess it has to do with formatnum using a different language so #expr is parsing it wrong [00:34:06] twkozlowski: What? [00:34:22] Newyorkadam: Why not try it? 
[00:34:27] Reedy: I did [00:34:30] I just wanna know what it does [00:34:55] pre-parses articles [00:34:57] the name is in the title [00:35:14] Reedy: oh, you were talking to Newyorkadam :) [00:35:26] Reedy: I don't get how some people make changes on AWB like boom boom boom 30 changes every minute [00:35:31] and for me it's one a minute [00:35:37] oh they're using different plugins than I.. [00:35:48] Reedy: I shamelessly assumed you found the answer to my problem :) [00:36:51] Auto save [00:36:56] Better skipping [00:37:00] Faster connection/computer [00:37:01] Reedy: A few days ago someone asked me if I could ask someone to run the January database dump against the TypoScan on AWB [00:37:03] How do I do that? [00:37:09] It'll take weeks [00:37:44] Reedy: Really? [00:37:47] Yes [00:37:56] Reedy: I see you're the project leader of TypoScan [00:38:01] "project leader" [00:38:10] Well "WikiProject Lead" [00:38:12] Is what it says.. [00:38:36] It's been inactive for nearly 2 years [00:39:43] Reedy: I'm running a typo fixing contest [00:39:47] starts in 4 days [00:39:56] No chance [00:41:05] Reedy: Even a small dump? [00:41:14] Small being what? [00:41:24] idk something that can be done in less than 4 days [00:41:58] You've then got the additional problem that said dump doesn't exist [00:43:10] Reedy: What? [00:43:14] I just downloaded a dump.. [00:43:26] What file? [00:43:58] enwiki-20140102-pages-meta-current1.xml-p000000010p000010000 [00:45:29] so it contains any pages with a page id of 10 to 10,000? [00:45:49] Reedy: I don't know? [00:46:10] At 48.8M it's not going to contain much [00:46:19] Reedy: I know, I downloaded that in a few minutes [00:46:26] I'm saying something that will take four days to do [00:46:30] Maybe even 10 is ok [00:46:32] Just something [00:46:55] Well, all of those articles aren't going to have typos [00:47:09] Some of those that do, won't actually have typos when someone tries to fix it [00:47:18] How many people are participating? 
[00:48:25] The reason I never bothered (well, I did, and got annoyed by it) processing another dump [00:48:39] Even on a hex core i7, 24GB ram and on an SSD, most of the machine was sat idle [00:48:53] Somewhat of a waste leaving the machine on to do very little [00:50:49] Reedy: You can use the rest of that to compile mw now! >.> [00:51:18] I don't think hiphop is that slow [00:51:39] Newyorkadam: For reference, against the last run, only 40% of pages at time of re-processing actually had typos that were fixed [00:52:09] Reedy: 9-10 people are participating so far [00:52:20] more coming probably, we have a signpost article about it [01:06:28] Reedy: When I'm on 'Random pages' and select 'Make list' it still filters my watchlist? [01:06:37] filter? [01:06:48] What do you mean? [01:06:57] What do you mean? [01:07:02] what's filter? [01:07:09] Oh yeah [01:07:13] Filtering out all talk pages [01:07:19] but i'm keeping all content pages [01:07:32] Oh I had to apply the filter [01:07:39] which I thought I did [01:08:04] I don't believe that it's gone through 100 pages without finding a single typo.. [01:08:51] It's quite possible [01:09:09] why is it finding so many pages I've edited.. [01:09:09] Newyorkadam: Open 100 browser tabs and try? [01:09:16] Gloria: What? [01:09:48] It lists so many articles I've edited.. [01:09:51] I thought you were sampling random pages. [01:10:03] Gloria: I am [01:10:08] but it's supposed to be random pages [01:10:10] not articles I've edited [01:12:23] Reedy: They've solved that dump-scanning problem, by the way. [01:12:34] You can do it in seconds now. [01:14:53] Reedy: ^ Does that mean it can be done? [01:18:22] Gloria: Can you help me with some AWB stuff please? [01:19:05] You can ask on [[m:Tech]] if you'd like. [01:19:14] I don't wanna wait :/ [01:19:36] Learn more. :-) [01:22:17] Newyorkadam: Ask in here. [01:22:24] Or #autowikibrowser. [01:22:27] And be patient.
[01:22:49] I'm trying to find everything that says 'in January 1' to 'on January 1' [01:22:57] In options I set the find and replace [01:22:59] right? [01:23:07] Okay. [01:23:08] And then? [01:23:15] Then what.. [01:23:20] How do I search for that in articles? [01:23:23] Do you see that in any article? [01:23:30] Gloria: I know they exist around Wikipedia [01:23:31] You can use regular search for this. [01:23:35] Find one. :-) [01:23:37] Gloria: But AWB makes it automated [01:23:46] this guy does it [01:23:48] https://en.wikipedia.org/wiki/User:John_of_Reading/Typo_fixing_with_AutoWikiBrowser#In_MONTH_DAY_.3E_On_MONTH_DAY [01:23:59] He found like 13,000 of 'em [01:24:07] I don't wanna do 13,000 manually.. [01:24:15] What about "in January 1990"? [01:24:22] Do you want to find those? [01:24:24] Gloria: Yeah I don't know how to ignore [01:24:25] No [01:24:29] But John of Reading did it somehow [01:24:40] You should learn about regular expressions. [01:24:40] that's the error that the search keeps getting [01:24:44] I know some regex [01:24:53] Okay. [01:25:00] So you'd probably want to use regular expressions here. [01:25:12] But how do I use regex to -ignore- certain things using AWB? [01:25:31] by writing a proper regex?:) [01:25:35] [^0-9] probably. But really \b. [01:25:39] Maybe. [01:25:46] You need to limit number matching, I guess. [01:25:51] I only know basic regex [01:26:12] This is basic regex. [01:26:58] https://en.wikipedia.org/w/index.php?title=Special:Search&search=%22in+January+1%22 [01:27:01] like I know things like [01:27:05] It looks like there are about 88 instances on Wikipedia. [01:27:09] "hel(lo)?" [01:27:15] would find hel and hello [01:27:16] You're going to scan a whole dump for 88 instances? :-) [01:27:25] Yes. [01:27:26] Gloria: that's JUST January 1st [01:27:30] think about January 2 [01:27:33] and January 3 [01:27:33] and so on [01:27:36] Ah, right. [01:27:37] then February.. [01:27:52] Heh. 
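The `[^0-9]`-vs-`\b` suggestion above can be sketched in Python. This is a minimal illustration (the sample strings are made up): a trailing word boundary keeps the match from landing inside a year like "in January 1990".

```python
import re

# "\b" at the end stops the day from matching inside a longer number,
# so "in January 1" is found but "in January 1990" is not.
pattern = re.compile(r"\bin January \d{1,2}\b")

print(bool(pattern.search("born in January 1 in Rome")))     # True
print(bool(pattern.search("born in January 1990 in Rome")))  # False
```

`[^0-9]` after the day would work too, but it fails at end-of-string and consumes the following character, which is why `\b` is the usual choice.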
[01:28:15] Someone else was asking me to scan a dump for them. [01:28:18] I could throw this in there. [01:28:25] Gloria: If I can find out how.. [01:28:32] Gloria: Can you also throw in TypoScan? [01:28:36] I'd just write a small script. [01:28:38] I'm running a typo-fixing contest in a few days [01:28:50] So I heard. [01:28:56] No, I'm not scanning for typos. [01:29:02] wait Gloria I'll find send you some false positives to watch out for in your script [01:29:04] k that's ok [01:29:13] want it in PM? [01:29:15] It's like 10 lines [01:29:20] Try a pastebin. [01:29:25] http://p.defau.lt [01:29:33] http://pastebin.com/bzPQ1f1d [01:29:51] http://pastebin.com/raw.php?i=bzPQ1f1d [01:30:01] If you insist on using pastebin.com. [01:30:25] oh sorry [01:30:30] I did it before you sent the message [01:30:57] Gloria: someone said it takes 4 days to scan. [01:31:42] What is time? [01:32:00] Newyorkadam: Do you think four days is a lot of time to scan a dump? [01:32:05] Gloria: Yes [01:32:14] Will you have scanned in a dump four days from now? :-) [01:32:24] Minus an in. [01:32:32] Typos, always killin' my game. [01:32:47] Gloria: No I'm just asking :p [01:33:03] You want to scan the English Wikipedia? [01:33:12] Do you have a dump? [01:33:29] Gloria: Not one specifically in mind [01:33:32] but yes en [01:33:38] Did you download it? [01:33:48] Gloria: Yes but a small version [01:33:50] only 40 mb [01:33:57] there's ones that're like 5 gbs [01:34:08] I think you downloaded a piece of a dump. [01:34:12] You should download the rest. [01:34:26] It's somewhere around 19G I think. [01:34:30] Gloria: I heard 40 GB [01:34:51] I think downloading alone could take you that long. [01:35:04] that long --> four days [01:35:16] Gloria: Would you mind sending me the regex so I can look at it :) ? [01:35:19] I wanna learn from it [01:35:22] when you're done [01:35:35] I don't think I want to match on the patterns you provided. [01:35:50] Did John of Reading publish his code? 
[01:36:00] Gloria: I don't see it anywhere [01:36:05] I don't know if he used regex.. [01:36:23] he said he used Google searches [01:36:38] enwiki-20140102-pages-articles-multistream.xml.bz2 is 10.5GB [01:36:46] What's multistream? [01:36:50] I'll see how much it extracts to... [01:37:06] I think it's done concurrently [01:37:09] I usually use the bigger one. [01:37:10] Then concatenated at the end [01:37:11] Because I want all pages. [01:37:50] > enwiki-20140102-pages-meta-current.xml.bz2 19.1 GB [01:37:53] That one. [01:37:57] It takes about 30 minutes to download. [01:38:07] To Labs, which probably already has a copy. [01:38:20] Gloria: If you tell me what to do with the regex I can probably do it [01:38:28] Wait [01:38:31] I think I can do it [01:38:53] Did you download the dump? [01:39:03] You can't scan a dump without a dump. [01:39:41] Seems to be about an hour to the office [01:41:07] Gloria: I DL'd a small dump [01:41:24] Gloria: Why doesn't this detect "on January 1" "on January 2" and "on January 3"? [01:41:27] "on January (1,2,3)?" [01:41:36] It only detects 'on January' [01:42:06] nvm got it [01:42:10] I need to replace the commas with | [01:43:11] Gloria: Here's all of January [01:43:14] on January (1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)? [01:43:16] on January \d <-- better [01:43:30] on January \d\d? is even better I guess [01:43:44] No [01:43:46] Oh, yes [01:43:55] \d wouldn't work ;) [01:45:23] oh \d\d? is up to 99 right? [01:45:24] that's cool [01:46:09] this is the final code I believe: [01:46:10] [1-2]?[0-9]|3[0-1] [01:46:10] on (January|February|March|April|May|June|July|August|September|October|November|December)? \d\d? [01:46:12] or January [123]?\d [01:46:30] still incorrect for some values [01:46:31] 31st February [01:46:37] (e.g. accepts 0, feb 30th etc.) [01:46:59] paravoid: But then we can find articles that have Feb 30 [01:47:05] paravoid: It also includes January 99 ...
[01:47:17] no it doesn't [01:47:28] paravoid: Why not? [01:48:45] \d\d? could include 99 [01:49:24] Gloria: So whadoo I do now? [01:52:33] right now I'm scanning for 'The the' (I think) [01:53:35] Reedy: Can you help me to make sure I scanned correctly? [01:55:44] or Gloria? [02:04:16] Newyorkadam: Sounds like you've got it under control. [02:04:28] Gloria: No it isn't working :/ [02:04:30] I don't know why [02:04:39] It's possible that it just found 0 errors but I doubt it [02:05:51] Newyorkadam: How big is the file you downloaded? [02:05:59] Gloria: Like 40 MB [02:06:04] I think a couple thousand articles [02:06:13] Probably the most scanned articles. [02:06:21] Given people like you. :-) [02:06:43] I have a find and replace for 'the the' [02:06:49] How's it going? [02:06:50] but it keeps detecting 'the' and removing it entirely.. [02:06:58] Ouch. [02:07:20] any ideas? [02:07:46] You're doing it wrong? [02:07:51] Well apparently [02:07:53] What are you doing specifically? [02:07:55] But I can't find out the errors [02:07:56] I can't see your screen. [02:08:00] I go to the 'Options' tab [02:08:03] I go to 'Find and replace [02:08:05] ' [02:08:05] You're replacing "the the" with ""? [02:08:11] I'm replacing 'the the' with 'the' [02:08:14] Well I'm supposed to be.. [02:08:20] Try a screenshot? :-) [02:08:56] in this screeny you can see that it deletes 'the' to nothing [02:09:04] and that I have 'the' replacing 'the the' [02:09:05] http://gyazo.com/809bbfb60960ee3cb5279a28f91996aa [02:09:55] Heh. [02:10:05] "the themes" is "the the" ;-) [02:10:06] Any idea? [02:10:10] oh.. [02:10:18] I love screenshots. [02:10:22] so I should add a space after the second 'the' [02:10:30] Or a word boundary. [02:10:31] Like \b. [02:10:35] But that's regex. [02:11:57] We should make this process a whole lot easier. [02:12:11] As I said, you can do this in a few seconds with the appropriate equipment. [02:12:20] Gloria: I don't know how.. [02:12:22] But we're not quite there yet. 
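As paravoid points out, the "final" pattern still accepts impossible dates like "on February 30" or "on January 99". One way to keep the simple regex and reject bad days afterwards is sketched below; the function name and day table are illustrative (February is allowed 29 to cover leap years), not anyone's actual script.

```python
import re

MONTH_DAYS = {
    "January": 31, "February": 29, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}

# Match "on <Month> <day>" with a 1-2 digit day, then validate the day
# against the month so "on February 30" and "on January 99" are dropped.
DATE_RE = re.compile(
    r"\bon (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) (\d\d?)\b")

def find_valid_dates(text):
    """Return (month, day) pairs that are plausible calendar dates."""
    hits = []
    for month, day in DATE_RE.findall(text):
        if 1 <= int(day) <= MONTH_DAYS[month]:
            hits.append((month, int(day)))
    return hits
```

Validating in code rather than in the pattern keeps the regex readable; encoding per-month day limits purely in regex alternations gets unwieldy fast.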
[02:12:38] afk for 1 min [02:12:54] I don't believe that it's skipped so many articles and hasn't found a single 'the the' [02:12:57] I guess that's good [02:38:01] Newyorkadam: if you want I can run a better scan faster than AWB can [02:39:44] Betacommand: how? [02:40:02] Newyorkadam: I have a python script [02:40:16] Betacommand: Can I use it ^.^? [02:40:21] Newyorkadam: It processes about 5k pages at once [02:40:28] wow [02:40:37] Why don't you publish it? [02:40:38] Newyorkadam: let me know what you want scanned for and I can create a report [02:41:03] "at once" [02:41:08] Newyorkadam: I havent gotten around to making it stand alone [02:41:17] Betacommand: Can you send it to me? [02:41:32] Newyorkadam: its not something that can be sent [02:41:49] Betacommand: oh [02:41:51] Newyorkadam: its tied to a lot of non-relevant code [02:42:07] Ive just been too lazy to make it stand alone [02:42:14] Betacommand: Well can you :D? [02:42:18] If you want [02:42:43] Betacommand: is it fully automatic or do you check each edit? [02:42:46] Newyorkadam: if you tell me what you want to look for I can just run a report for you until then [02:42:53] Newyorkadam: its a dump scanner [02:43:03] Betacommand: "the the" to "the" [02:43:14] Not capsensitive [02:43:17] it just finds the error [02:43:38] ok [02:44:15] Newyorkadam: give me a few minutes [02:44:22] Betacommand: k :) [02:44:29] there's a lot of false positives [02:44:44] Newyorkadam: there shouldnt be that many of them [02:45:02] Betacommand: yeah [02:45:05] but at least some [02:50:49] Newyorkadam: I tweaked the scan to only match \bthe the\b [02:51:42] Betacommand: What does that do? [02:51:44] the /b [02:51:47] \b [02:52:04] \b is a word boundary [02:52:40] Newyorkadam: http://tools.wmflabs.org/betacommand-dev/reports/db_scanner_the_the.log [02:52:48] the list is still updating [02:52:59] those are all what's been scanned, or articles with 'the the'? 
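As explained above, `\b` is a word boundary, so `\bthe the\b` no longer trips on "the themes" or "the then". A small illustrative Python version of the tweaked, case-insensitive scan (the helper name is made up; this is not Betacommand's code):

```python
import re

# \b keeps the match on whole words, so "the themes" is not flagged;
# re.IGNORECASE also catches "The the" and "the The".
THE_THE = re.compile(r"\bthe the\b", re.IGNORECASE)

def find_duplicate_the(text):
    """Return match spans for manual review -- auto-replacing is unsafe,
    e.g. the band name "The The" is a legitimate doubled word."""
    return [m.span() for m in THE_THE.finditer(text)]
```

This only reports candidates; as the later discussion makes clear, each one still needs a human check.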
[02:53:12] Newyorkadam: its not 100% current but it uses the last database dump [02:53:24] Newyorkadam: just the articles with the problem [02:53:30] I'll check one [02:53:56] wow it's right :D [02:53:58] Newyorkadam: for example http://en.wikipedia.org/w/index.php?title=Animal_%28disambiguation%29&diff=592883869&oldid=582620286 was already fixed [02:54:02] Is there a way to make it automated? [02:54:10] Newyorkadam: define automated? [02:54:17] make it auto-correct all of those [02:54:23] HELL NO [02:54:41] :( [02:54:44] What dump is that? [02:55:14] Newyorkadam: http://en.wikipedia.org/wiki/Wikipedia:Bots/Frequently_denied_bots#Fully_automatic_spell-checking_bots [02:55:38] Newyorkadam: its using the 20140102 dump [02:55:39] Betacommand: I don't mean a bot [02:55:42] it'd be like AWB [02:56:03] Newyorkadam: take that list save it to a file, and load it with AWB [02:56:31] I'm confused? [02:56:38] Do I save it as .xml [02:56:45] No [02:56:56] save the .log as a txt file [02:57:04] open AWB [02:57:44] tell me when it's done downloading [02:57:48] See the area where it says "Make list" ? [02:57:58] 1 sec [02:58:00] Newyorkadam: it will be several hours [02:58:00] is it down running? [02:58:02] oh [02:58:06] I'll just do what we have [02:58:32] yeah I see make list [02:58:42] Newyorkadam: its parsing 4.4 million pages [02:58:47] Betacommand: oh [02:58:53] Betacommand: Where do I select log file? [02:59:07] text file? 
[02:59:10] source > text file [02:59:30] then make list [02:59:46] select the file you saved, and have at it [03:00:26] oh it worked :O [03:00:27] thanks [03:00:54] Newyorkadam: if you go to http://tools.wmflabs.org/?status and look for 2349066 you can see if the bot is still parsing [03:03:11] Newyorkadam: the file that the script is using is 19.1GB [03:03:47] so that should give you an idea of how much data its parsing [03:03:52] and thats compressed [03:05:07] Newyorkadam: Ive got tools for just about anything, If you need something feel free to drop me a line [03:05:40] Betacommand: This is amazing thanks :D :D [03:06:04] Newyorkadam: the script is far far faster than any searching that AWB can do [03:06:21] yeah it's really nice :) [03:06:25] and its zero stress on the servers to find those [03:06:28] check out my contribs you'll see it's working well [03:07:00] Newyorkadam: the next dump should be sunday or so [03:07:17] we get about one a month [03:07:28] Betacommand: :) [03:08:06] Newyorkadam: if you want me to search for something else feel free to drop me an email or PM [03:08:16] Betacommand: k :) [03:09:05] I wish AWB were faster [03:09:06] but whatever [03:09:13] I could get done like 10 per minute then [03:09:18] instead I get three tops per minute [03:10:08] Betacommand: This is doing EVERY article? [03:10:17] Newyorkadam: yes [03:10:19] damn [03:10:33] Newyorkadam: its no big deal [03:10:48] Betacommand: It's AMAZING :) [03:11:23] Newyorkadam: this is just one of my many toys Ive got sitting around [03:11:32] Betacommand: What else :) [03:11:58] Newyorkadam: Ive got too many to list [03:12:07] Betacommand: Hit me up if you ever want/need anything done [03:12:14] Betacommand: Are they all written in python? [03:12:20] yes [03:12:23] k [03:12:26] i dunno python [03:12:37] Newyorkadam: Ive got one file thats 3.3k lines [03:12:43] damn [03:12:46] what's that?
[03:12:54] its a library file [03:13:07] for doing all kinds of things [03:13:25] Betacommand: Mention if I tell other people about your tools? [03:13:34] Because I know some people who would love it [03:13:40] Newyorkadam: they are not a secret [03:13:52] Betacommand: k [03:13:59] Newyorkadam: http://tools.wmflabs.org/betacommand-dev/ [03:14:24] yeah I saw that nice thanks [03:14:42] Ive got afd parsers, contrib tools, tools for listing every page you ever created [03:15:00] I can find all users who have edited any two given pages [03:15:22] Single IP Lookup http://tools.wmflabs.org/betacommand-dev/SIL.html [03:15:27] I saw that^ [03:15:32] I might stop going through every edit [03:15:56] since some minor tweaks I haven't seen any false positives [03:16:52] Newyorkadam: please check every edit, otherwise Ill have to send my minions on you and you will get blocked [03:17:02] fine [03:17:22] If I get through 1,000,000 edits without a single false positive [03:17:36] I can do it without checking [03:17:38] deal? ;) [03:18:17] No [03:18:34] Typo bots are never fully automatic [03:19:06] :p [03:19:15] there is always exceptions [03:19:19] one example is the The or The the > how do you determine the capitalization of the single the? [03:19:21] and you will piss someone off [03:19:24] Betacommand: Have you used AWB? [03:19:29] Newyorkadam: Yes [03:19:31] Betacommand: I've already fixed that [03:19:36] it doesn't make any errors about it [03:19:43] I have 'the the' [03:19:44] * Betacommand laughs [03:19:44] 'The the' [03:19:46] and 'the The' [03:19:50] all caps sensitive [03:20:02] Newyorkadam: if you are not careful you will make mistakes [03:20:10] Betacommand: Is there a way to just store your edits and don't make them live yet, and then at one time submit them all? 
[03:20:23] Because I don't wanna have to wait for every edit to submit [03:21:01] Newyorkadam: between myself and my bots Ive made about a million edits, Ive seen quite a few exceptions to any rule [03:21:18] Betacommand: Do you know the answer^? [03:21:47] Newyorkadam: you need to check, review and submit each edit [03:21:54] Betacommand: I am [03:22:02] I'm saying, can I keep my 'saved/submitted' edits live? [03:22:03] *not live [03:22:07] can I keep them stored [03:22:09] and submit them all at one time [03:22:10] No [03:22:12] k [03:22:19] that would cause several issues [03:22:31] Betacommand: And is there a way to make the list generate even when I'm reviewing an edit? [03:22:35] edit conflicts, watchlist flooding and others [03:22:51] Like after I submit an edit it takes ~4 seconds to process the next article and find 'the the' [03:22:51] what? [03:23:04] Is it possible to process edits while I'm reviewing another edit? [03:23:05] Not really [03:23:07] :/ k [03:23:50] now I'm getting towards 4 edits per min [03:23:52] getting better [03:24:06] Its one of the software [03:24:53] Reedy: what are your thoughts on adding preloading to AWB? [03:25:16] As in? [03:25:32] Preload cached articles before they're processed? [03:25:51] that and pre-parse them [03:26:25] It'd be nice [03:26:27] so while the user is previewing a diff the next 5-10 articles are being prepared [03:26:32] It won't be exactly trivial though [03:31:26] Betacommand: this is gonna take awhile [03:32:21] Newyorkadam: I know [03:32:45] Betacommand: Did you make sure to add a space after the second the?
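AWB itself is C# .NET, but the preloading idea discussed above (fetch and pre-parse the next 5-10 articles while the user is still previewing a diff) can be sketched with a background thread and a bounded queue. Everything here is illustrative: the function names, the `fetch` callback, and the buffer size of 5 are assumptions, not AWB's design.

```python
import threading
import queue

def prefetch_worker(titles, fetch, out_q):
    # Fetch upcoming articles in the background so the next diff is
    # ready as soon as the current edit is saved.
    for title in titles:
        out_q.put((title, fetch(title)))  # blocks while the buffer is full

def review_loop(titles, fetch, process, buffer_size=5):
    out_q = queue.Queue(maxsize=buffer_size)  # keep ~5 articles ready
    t = threading.Thread(target=prefetch_worker,
                         args=(titles, fetch, out_q), daemon=True)
    t.start()
    for _ in titles:
        title, text = out_q.get()  # usually already fetched and waiting
        process(title, text)
```

The bounded queue is the key design point: it overlaps network latency with review time without downloading the whole list up front.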
[03:32:52] to ignore things like 'the then' [03:33:11] because an article just got skipped and I'm not sure why [03:33:15] Newyorkadam: thats what the \b does [03:33:37] Newyorkadam: keep in mind the error may have already been fixed [03:33:51] the file is 26 days old [03:34:08] Betacommand: Oh maybe [03:34:15] I've gotten mostly positives though [03:34:19] 98% so far [03:34:33] Betacommand: I know you're against automatic spell-check [03:34:40] but if cluebotNG is 99% positive [03:34:45] and this system is also 99% positive [03:34:54] what's the logic behind Cluebot being automatic and this not? [03:35:11] If I could prove through 1,000 edits that this is 99% positive [03:35:34] Spell checking is far more error prone [03:35:44] Betacommand: If I can *prove* that this is 99% positive [03:36:02] I actually haven't found a single false positive [03:36:05] with cluebot the cost/benefit is there [03:36:24] with spell checking its not [03:36:31] of course there's a benefit [03:37:03] Betacommand: How about I save 10 at a time and go back and check all the diffs and revert any false positives? [03:37:05] MUCH more efficient [03:37:07] Newyorkadam: there is a cost vs benefit that needs reviewed [03:37:24] Newyorkadam: do you want blocked? [03:37:39] Betacommand: no.. [03:37:41] I'm not going to do it [03:37:45] I'm just asking [03:38:08] Newyorkadam: those rules are not mine, the wiki takes a very hard line in that matter [03:38:29] Betacommand: Please don't say that again you got me really scared for a second :( [03:38:42] Newyorkadam: those rules are not mine, the wiki takes a very hard line in that matter [03:38:58] You can still say it nicer :/ [03:39:06] Newyorkadam: admins are more than willing to block first and ask questions later in regards to spell checking bots [03:39:22] oh I found my first false positive [03:39:24] error in a file [03:39:27] I can't change that [03:39:44] Newyorkadam: what do you mean error in the file? 
[03:39:57] "|image=Eva the the piano.jpg" [03:40:26] and thats just one example of why we dont have spell checking bots [03:41:12] Betacommand: I know.. [03:41:27] Betacommand: We could ignore that false positive [03:41:33] anything in an image paramater [03:41:42] *parameter [03:42:15] lol @ my contribs [03:42:50] Betacommand: Are you a dev for AWB? [03:43:33] Newyorkadam: No [03:43:37] Betacommand: k [03:43:42] Newyorkadam: I do my programming in python [03:43:42] Betacommand: I have an idea for an option: [03:43:48] what's AWB written in? [03:44:00] C# .NET [03:44:00] My idea: Have an option to only show the single sentence that's being edited [03:44:04] not the entire paragraph [03:47:15] Betacommand: Thanks again for all your help :) [03:47:23] You helped Wikipedia a ton [03:47:35] Newyorkadam: if you need a new report just let me know [03:47:39] Betacommand: I might soon [03:47:47] After I spend hours going through the the [03:48:04] Betacommand: What percent done do you think the download is? [03:48:07] or scan or whatever? [03:54:09] Betacommand? [03:55:27] Heh. [03:55:33] Are there still that many "the the"s? [03:55:50] Gloria: TON [03:55:51] TONS [03:55:54] 10s of thousands [03:55:59] I can only do so many!! [03:56:31] Interesting. [03:57:22] I'm surprised there are still so many. [03:57:38] Gloria: wanna help? [03:57:39] Newyorkadam: Im not sure and I really cant check [03:57:50] Betacommand: Like approxmiately [03:57:53] how long will it take [03:57:56] Gloria: This is the current list [03:57:57] http://tools.wmflabs.org/betacommand-dev/reports/db_scanner_the_the.log [03:58:02] Yeah, I'm looking through it. [03:58:15] > He currently serves in leadership, as a member on the the State Senate Majority Conference, and is its longest serving member, with the most seniority in the State Senate. 
[03:58:19] Newyorkadam: it takes 4-6 hours to process the dump normally [03:58:25] wowww [03:58:39] Betacommand: We're not even a half hour done [03:58:42] Betacommand: You should publish your source code more often. [03:58:44] already 2300 articles [03:58:52] Newyorkadam: You should learn Python. [03:58:55] This is a very simple script. [03:58:57] my contribs wut wut [03:58:58] https://en.wikipedia.org/wiki/Special:Contributions/Newyorkadam [03:59:17] * Gloria digs up source... [03:59:23] Gloria: I need to get stand alone code [03:59:31] Betacommand: Copy and paste. :-) [03:59:37] You don't have to re-type it. [03:59:41] Gloria: that makes dev a bitch [04:00:06] this is new.. [04:00:07] * Gloria shrugs. [04:00:10] Newyorkadam: https://en.wikipedia.org/wiki/User_talk:MZMcBride#Wish_list [04:00:13] There's a script there. [04:00:15] having central libraries makes it far easier to code on the fly [04:00:22] I doubt you'd have much trouble reading through it. [04:00:25] And figuring out what it does. [04:00:30] "Some 'rough mixes' have been streamed at the The The web site" [04:00:33] The The is a band [04:00:41] thanks :) [04:00:42] :-) [04:00:50] That script is super-hackish. [04:00:59] Ideally you'd parse XML with a library. [04:01:15] bz2 is a reference to the compressed file format. [04:01:54] Gloria: Help!! [04:01:56] It's so much [04:02:20] Newyorkadam: I have enough edits for today. [04:02:30] Newyorkadam: just let me know what you want a report for and Ill create one [04:02:49] Betacommand: k :) [04:02:52] This'll keep me busy for awhile.. [04:03:16] I thought the "CheckWiki" project was doing this. [04:03:23] https://en.wikipedia.org/wiki/Wikipedia:CHECKWIKI [04:03:27] Gloria: parsing a 20G bz2 xml file isnt easily done correctly [04:03:46] Betacommand: The size really doesn't matter. 
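Gloria's "very simple script" point can be illustrated with a streaming scanner: read the `.xml.bz2` dump incrementally so neither the compressed nor the decompressed file ever has to fit in memory. This is a minimal sketch under stated assumptions, not Betacommand's or Gloria's actual code; in particular, the export schema namespace varies between dump versions.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Schema version varies by dump generation; 0.8 is an assumption here.
NS = "{http://www.mediawiki.org/xml/export-0.8/}"
PATTERN = re.compile(r"\bthe the\b", re.IGNORECASE)

def scan_dump(path):
    """Stream a pages .xml.bz2 dump and yield titles of pages whose
    wikitext matches PATTERN, without decompressing the file to disk."""
    with bz2.open(path, "rb") as f:
        title = None
        for _event, elem in ET.iterparse(f):
            if elem.tag == NS + "title":
                title = elem.text
            elif elem.tag == NS + "text":
                if elem.text and PATTERN.search(elem.text):
                    yield title
                elem.clear()  # free the revision text as we go
```

`bz2.open` decompresses lazily and `iterparse` emits elements as they close, which is why a 19 GB dump can be scanned in bounded memory; the several hours quoted in the log are mostly raw decompression and regex time.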
[04:03:49] Gloria: they have a general typo list [04:04:20] k thank god lol [04:04:28] I thought I forgot to mark my edits as minor [04:04:41] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Check_Wikipedia/List_of_errors [04:04:44] I don't see typos. [04:04:45] Weird. [04:05:04] They may be more focused on syntax problems. [04:07:19] Gloria: http://en.wikipedia.org/wiki/Wikipedia:TypoScan [04:16:18] I'm gonna post my current task under the AWB discussion thing [04:18:55] up to 2,600.. [04:19:48] Betacommand: You're killin' me [04:19:53] keeping' me busy for weeks [04:22:42] this is gonna take hoursss [04:23:01] waiiiit [04:23:05] I'm not looking out for The The :o [04:23:27] done [04:23:36] another image false positive :l [04:31:07] Betacommand: WOuld you use a search that might only bring up five or ten positives? [04:32:01] another false positive [04:32:04] url [04:34:06] g2g guys bai [04:34:12] I'll do tons moar work tomorrow!! [08:57:08] too many spammers in http://www.sub-bavaria.de/w/index.php?title=Spezial:Letzte_%C3%84nderungen&limit=500 and it's so difficult to get rid of them [09:01:03] (sry ori, my client had not scrolled down) [09:01:21] Jasper_Deng: :) [09:02:10] please support antispam measurements in http://www.sub-bavaria.de/w/index.php?title=Spezial:Letzte_%C3%84nderungen&limit=500 thank you http://sensiblochamaeleon.blogspot.de/2013/10/help-spam-alarm-bei-sub-bavaria.html [09:30:27] hmpf bavarians, no more than 5 min they wait for an answer [09:30:32] too much beer I bet [09:31:20] bar.wikipedia community to the rescue [09:32:19] "Hoamseitn":) [09:32:42] when a tireless spambot meets a drunk wiki, the drunk wiki is dead [09:34:23] !quip [09:38:42] aaargh why is blogger so slow [09:40:01] grr fatal [09:40:53] Ihr Kommentar wird nach der Freigabe sichtbar. 
[09:41:12] schade [09:41:58] will become visible after confirmation [09:43:26] things I learnt on blogger today: 1) it's down or horribly slow with its JS, 2) it produces mysterious error codes and sends you to support with links such as https://productforums.google.com/forum/?hl=it#!searchin/blogger/bX-s3c5ad , 3) if you're first class citizen i.e. logged in it asks street number captcha, if anon reCAPTCHA [09:47:29] labs down again? o.o [09:53:38] It's just you. http://wmflabs.org is up. [09:54:22] twkozlowskiforeveryoneorjustme [09:58:44] labs.wikimedia.org, wikitech.wikimedia.org, wmflabs.org .. :) wfm [10:09:00] I mean http://tools.wmflabs.org/ [10:09:06] oh finally it loaded [10:12:26] well tool labs isnt strictly labs, but a special project within labs [10:14:32] my question was a bit generic, granted [11:34:13] hi, anyone around? [11:34:17] Translation tool on meta seems to be broken a bit [11:34:21] https://meta.wikimedia.org/wiki/Talk:Fundraising/Translation#Tool_for_changing_languages_is_broken [11:40:05] Utar: have you hard refreshed your cache? [11:40:48] nevermind, it doesn't help [11:42:42] Utar: replied there, please file in bugzilla [11:43:15] all language-related things are under an ongoing earthquake, it will take a few weeks to be back to normality [11:44:38] Nemo_bis: thanks, I will write there [11:45:10] what is happening with language things these days? [11:54:54] is it caused by some MediaWiki change? [11:58:15] in a manner of speaking [11:58:57] ok, thanks for reply, gtg now [17:55:27] the abuse filter on enwiki is causing havoc [17:56:43] Betacommand: get enwiki editors to unbreak it? nothing we can do from here [18:00:38] well, it depends on what sort of havoc [18:01:04] did it deflag all sysops and abusefilter users? :) [18:09:33] Nemo_bis: causing db errors, and hitting almost 40% of all edits [18:09:53] Betacommand: you mean blocking 40% of edits?
[18:10:03] i think AF has a safeguard to turn itself off if something like that happens [18:10:12] should block itself at 5 % [18:10:15] or 20 for AFT [18:10:18] MatmaRex: it flags that many edits [18:10:38] See filter 554 [18:11:08] Betacommand: then disable the filter? i can't do that [18:11:29] MatmaRex: I cant either [18:11:40] Of the last 3,324 actions, 0 (0.00%) have reached the condition limit of 1,000, and 50 (1.50%) have matched one of the filters currently enabled. [18:11:50] https://en.wikipedia.org/wiki/Special:AbuseFilter/554 [18:11:52] Of the last 3,114 actions, this filter has matched 3,378 (108.48%). On average, its run time is 0 ms, and it consumes 0 conditions of the condition limit. [18:11:53] lol [18:12:13] Warning: This filter was automatically disabled as a safety measure. It reached the limit of matching more than 5.00% of actions. [18:12:17] so yeah, the safeguard works [18:12:27] nothing to see here, troutslap whoever did htis [18:12:51] it seems kww is really after a deflag :P [18:13:14] heh [18:14:55] * Nemo_bis sticks a false alarm hat on Betacommand's head [18:15:14] Nemo_bis: its not a false alarm [18:15:31] Nemo_bis: there are two sections at [[WP:VPT]] about the problems [18:16:27] Betacommand: the problems should cease now that the filter disabled itself [18:16:46] Betacommand: and you can prevent that from happening in the future by fixing that particular filter [18:17:03] abusefilter is a very powerful tool and people who don't know what they're doing should not be allowed to use it [18:17:14] MatmaRex: I cant do anything about it [18:17:18] but that is not a matter for this channel [18:17:40] https://en.wikipedia.org/w/index.php?title=Special:ListUsers&group=abusefilter [18:17:43] MatmaRex: often people who can fix it are lurking around [18:17:45] that's a LOT of people [18:17:49] whoa [18:17:52] seriously [18:18:06] pl.wp has like 5 filter operators and we do fine [18:18:34] MatmaRex: it's because of the private filters, they 
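The numbers in the log show the safeguard working: 50 matches out of the last 3,324 actions (1.50%) is under the 5% limit, while filter 554's 3,378 matches against 3,114 actions (108.48%) is far over it, so it disabled itself. A toy version of that check, loosely modelled on AbuseFilter's emergency disable (the real extension also weighs a minimum match count and an observation window, omitted here):

```python
def should_auto_disable(matches, actions, threshold=0.05):
    """Return True when a filter matched more than `threshold` of recent
    actions. Matches can exceed actions (ratios over 100%, like the
    108.48% in the log) when actions are counted and re-checked separately."""
    return actions > 0 and matches / actions > threshold

print(should_auto_disable(50, 3324))    # 1.50% of actions -> False
print(should_auto_disable(3378, 3114))  # 108.48% of actions -> True
```
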
add people to the group if they want to copy some of them [18:19:06] that's stupid [18:19:10] there's a user right for that [18:19:19] (for viewing filters but not editing them) [18:19:29] jesus, en.wp people are dumb [18:28:30] <^d> MatmaRex: s/en.wp// [18:28:36] <^d> No need for the qualifier :) [18:29:18] ^d: i dunno, the stupidest shit seems to come from that one wiki :> probably the sheer size is a factor, but still [18:33:02] hmmm that reminded me of requesting another feature for AbuseFilter: changes to filters should show up on the watchlist of users with permission to edit them [18:33:44] I like to review changes to filters to look for bad regexes or misplaced parentheses... [18:34:06] but this requires going to https://en.wikipedia.org/wiki/Special:AbuseFilter/history [18:34:13] or in other words abusefilter rules should be stored in wiki pages? [18:34:28] (in an ideal world where ACL wasn't a nightmare) [18:34:37] kind of :-) [18:34:44] what is ACL? [18:39:37] Access Control List [18:39:55] helderwiki: ask yuvipanda, he's The Magician [18:40:17] Nemo_bis: yeah, it's a bit sad that AF reimplements history and diffs [18:40:18] huh? [18:40:18] oh [18:40:20] THAT [18:40:40] twkozlowski: heh, sounds like a good idea. Eventually. Move it to Lua as well [18:40:41] yes, you think rewriting UploadCampaign will be enough? :-P [18:40:42] apparently private filters must be really really *really* private :) [18:41:04] https://bugzilla.wikimedia.org/show_bug.cgi?id=60588 [18:41:21] Lua: https://bugzilla.wikimedia.org/show_bug.cgi?id=47512 [21:21:08] hi [21:24:08] Betacommand: :) hi [21:24:20] Betacommand: Did you set it to not include talk pages? [21:25:53] Newyorkadam: it's mainspace only [21:25:58] Betacommand: k [21:26:08] Sorry I'm not sure, does that include templates and help pages?
[21:26:24] No [21:28:59] Betacommand: k thanks [21:29:04] I think it stopped at ~4,200 [21:29:19] ahh I wish everything loaded/saved faster :( [21:30:51] Newyorkadam: it's limited to namespace = 0 [21:30:58] What does that mean? [21:31:09] What is the pre-parse mode? [21:31:43] Newyorkadam: that's for finding articles with problems [21:32:15] Betacommand: Will it affect me? [21:32:26] Newyorkadam: No [21:32:30] k [21:32:46] Betacommand: Why do anti-vandal reverts load almost instantly and AWB takes four seconds? [21:33:18] Newyorkadam: AWB does quite a bit more than just showing a diff [21:33:30] Betacommand: I mean saving the edits [21:33:32] not showing them [21:33:55] reverts can sometimes take quite a while to save [21:35:54] Betacommand: That's the opposite of what I'm saying :p [21:35:58] reverts save instantly for me [21:36:01] but AWB edits don't [21:36:39] Because a revert is done completely serverside [21:37:01] Reedy: Is there anything I can do to make this load faster [21:37:01] ? [21:37:07] just saves or diffs loading [21:37:12] Get a faster computer/connection [21:37:25] Depending on what's actually the slow point [21:40:16] Reedy: I don't think that's it :/ [21:40:24] Reedy: Loading diffs and saving edits are slow [21:40:32] Which can be your connection [21:40:44] Whether latency or bandwidth [21:41:11] I really don't think it's that [21:41:18] I'll just tamper with stuff [21:41:24] is there a way to turn off things from loading? [21:41:29] If I'm not using them? [21:41:30] such as? [21:41:38] Like the alerts and multiple wiki-links [21:41:46] in the 'Start' option [21:42:22] Diffs are computed locally [21:42:31] Edits require pushing the entire article text to the wmf servers [21:42:48] Reedy: Yeah I know [21:42:55] is there a way to stop using unneeded tools? [21:43:11] Not those ones [21:43:18] They won't be slowing it down [21:43:24] The fact you're running it in a virtuali[sz]ed environment to begin with...
[21:44:01] that wouldn't be slowing down the network? [21:44:09] just the program [21:44:14] No [21:44:28] They don't do network requests [21:44:46] Turning off auto tag, general fixes and unicodify might save you a bit [21:45:59] Reedy: I want general fixes [21:46:03] what's auto tag and unicodify? [21:46:26] Auto tag is adding maintenance templates [21:46:50] I haven't seen any templates being added [21:46:54] are they shown in diffs? [21:46:57] Yes [21:47:42] hmm [21:48:11] it didn't noticeably speed up after I turned off those settings [21:48:30] Congrats [21:50:08] that's bad.. [21:58:37] Reedy: Is there a way for it to apply general fixes without it showing the diff for those fixes? [22:02:56] No [22:03:15] You shouldn't be saving articles if you don't know what has been changed [22:07:12] Reedy: I might just turn that setting off then [22:23:48] Newyorkadam: afaik, en.wiki bans genfix-only changes anyway [22:23:56] have you read the bot policies? [22:24:08] p858snake|l: nope [22:24:32] (hint, awb edits are counted as bot edits) [22:24:51] p858snake|l: ok? [22:24:55] I don't care about my edit count [23:29:33] Jamesofur: btw, can I have the ability to manage queues, etc. and otherwise admin the test RT instance? [23:54:06] RoanKattouw: Is there a VE IRC channel (for asking questions)? [23:54:17] kaldari: #mediawiki-visualeditor [23:54:29] thanks [23:54:49] * DanielK_WMDE waves [23:55:51] <^d> It's a wild DanielK_WMDE! [23:56:22] <^d> RoanKattouw: You need a #mediawiki-roan so I can always find you :) [23:56:34] isn't that called /query :p [23:56:39] <^d> /topic All Roan, all the time!
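To illustrate Reedy's earlier point that a revert is done completely server-side while an AWB save must push the entire article text to the servers, here is a sketch of the two MediaWiki action API requests involved. The `rollback` and `edit` actions are real api.php actions; the title, user, token, and article body are placeholder values, and authentication/error handling is omitted.

```python
# Sketch contrasting the two MediaWiki API requests discussed above.
# action=rollback only names the page and user -- the server fetches
# the prior revision itself -- while action=edit must upload the full
# new wikitext, so its payload grows with the article.

article_text = "x" * 50_000  # stand-in for a ~50 KB article body

rollback_request = {
    "action": "rollback",
    "title": "Some_article",   # placeholder title
    "user": "VandalName",      # placeholder user being reverted
    "token": "...",            # token elided
}

edit_request = {
    "action": "edit",
    "title": "Some_article",
    "text": article_text,      # the entire article travels over the wire
    "summary": "Typo fixing with AWB",
    "token": "...",
}

def payload_size(req: dict) -> int:
    """Rough size of the request body in characters."""
    return sum(len(str(v)) for v in req.values())

# The edit payload dwarfs the rollback payload:
assert payload_size(edit_request) > 100 * payload_size(rollback_request)
```

This is why a rollback returns almost instantly even on a slow uplink, while an AWB save is bounded by upload bandwidth and latency.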