[02:54:39] Hi all, I'm moving my wiki bots from toolserver to labs, and have a few questions. Where do you recommend locating the scripts, username and password files, and what permissions should be set to maintain security of the username/password?
[02:55:45] TheLetterE: 301 #wikimedia-labs
[02:56:31] jeremyb: thanks :)
[03:20:10] anyone going to LISA? this will be one of the talks: http://marc.merlins.org/linux/talks/ProdNG-LinuxCon2013/ProdNG.pdf
[03:25:43] Is someone here really knowledgeable about templates who may also have a little time to help me? I could do what I intend to do, but it will certainly take me a few days.
[03:27:10] Guess I better get started.
[03:36:36] My76Strat: you should tell people what your question is if you want an answer
[03:37:31] I understand, it's more than I could reduce to a single question.
[03:38:20] I guess I'll work the matter until each specific situation arises.
[03:39:01] Thanks however.
[05:34:22] ?
[05:35:26] bd808|BUFFER: ?? :)
[05:52:57] My76Strat: Asking a string of questions is allowed, but when you ask zero questions, it's impossible to answer them.
[05:58:50] I understand, I was premature in my original post. I need to do a lot of stuff to even know what to ask.
[12:02:57] @techs: hi, upload by URL (UploadWizard extension) does not work on Commons
[12:06:45] Any errors given?
[12:09:38] nope
[12:31:56] Reedy: should i open a ticket?
[12:32:40] Might as well
[15:02:31] apergos: hello
[15:02:38] hello
[15:02:49] and welcome back from vacation
[15:02:54] thanks
[15:03:06] i don't see parent here
[15:03:13] lemme see if he's gtalkable
[15:03:18] ok
[15:03:42] shows as there, pinged
[15:08:12] guess he's not going to answer... he can check the backread
[15:08:16] how are things going?
[15:09:42] i'm working on compression now
[15:10:17] right now, i'm trying different things and seeing what's best
[15:10:28] some preliminary results are at https://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/Compression
[15:10:35] oohhh
[15:11:05] separate means each rev text is separate?
[15:11:25] yes, that's the way it was until now
[15:11:32] yup
[15:12:00] i know you're not going to understand most of what those names mean
[15:12:18] you're taking notes for yourself, uh huh
[15:12:40] but if you take a couple minutes and scribble at the bottom a legend that would be good
[15:13:27] what are the 'groups per'?
[15:14:06] when you have chains of delta compressed revisions, and you break the chain after each n revisions
[15:14:25] (so that you don't have to read the whole chain to read the last revision)
[15:14:31] yup
[15:15:03] the groups per number could be tunable at first
[15:15:15] because there will be speed considerations as well
[15:16:46] i'm not sure that number will affect the speed much
[15:17:31] I wasn't thinking about speed of generation
[15:17:57] but of reading, though I guess...
[15:18:07] OK. Finally
[15:18:08] that someone wanting to read a specific revision might have to suck it up
[15:18:34] yeah; though currently, there is no way to do that
[15:18:35] as opposed to our use case which is 'read them all in serial order and dump them out in some other format'
[15:18:55] that's not for you to have to write (retrieving a specific revision)
[15:20:31] so I guess you will skip lzma altogether, given this chart?
[15:20:48] well, it's something i would like to do; but it's clear i won't get to that anytime soon (certainly not before GSoC ends)
[15:20:53] So I missed the beginning. What happened with LZMA?
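(Editor's note: a minimal Python sketch of the "groups per" idea discussed above, for readers following along. Revisions are stored as deltas against the previous revision, and the chain is restarted with a full copy every N revisions, so reading one revision never means replaying the whole history. This is not the project's actual code: GROUP_SIZE and the function names are illustrative, and difflib stands in for a real delta coder such as zdelta.)

import difflib

GROUP_SIZE = 10  # illustrative "groups per" value; the chat suggests making it tunable

def make_delta(base, text):
    """Encode text as copy/insert operations against base."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, base, text).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))               # reuse a slice of the base revision
        else:
            ops.append(("insert", text[j1:j2], None))  # new text carried inside the delta
    return ops

def apply_delta(base, ops):
    return "".join(base[a:b] if kind == "copy" else a for kind, a, b in ops)

def store(revisions):
    """Store every GROUP_SIZE-th revision in full, the rest as deltas against the previous one."""
    stored = []
    for i, text in enumerate(revisions):
        if i % GROUP_SIZE == 0:
            stored.append(("full", text))
        else:
            stored.append(("delta", make_delta(revisions[i - 1], text)))
    return stored

def read(stored, k):
    """Reconstruct revision k; walks back at most GROUP_SIZE - 1 deltas."""
    start = k - k % GROUP_SIZE          # nearest preceding full revision
    text = stored[start][1]
    for i in range(start + 1, k + 1):
        text = apply_delta(text, stored[i][1])
    return text

revisions = ["first draft", "first draft, expanded", "first draft, expanded and fixed"]
assert read(store(revisions), 2) == revisions[2]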
[15:21:05] https://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/Compression
[15:21:38] it's a half-cryptic table of the results of compression tests so far
[15:21:51] kk, finally got the page to load
[15:21:58] are you on by phone?
[15:22:08] Nah, desktop. But I have comcast.
[15:22:10] or are you at some conference with horrible wifi?
[15:22:13] geeeez
[15:22:23] Yeah this is my usual internet
[15:22:35] probably the feed going through the nsa filters, it happens
[15:22:36] ( :-P )
[15:22:42] Lol probably
[15:22:59] So it looks like almost any combination of zdelta yields the same size
[15:23:07] apergos: i'm not sure yet, if you look at the last line, grouping revisions and using LZMA on that might give better results
[15:23:10] And the only thing smaller is to LZMA the entire thing
[15:23:16] yeah I see that
[15:23:34] i want to try LZMA on groups of reasonable size and see how that fares
[15:23:44] sure
[15:24:08] svick: did you ever get a chance to email the papers' authors?
[15:24:20] and i also want to try current dumps, because that's likely going to require a different approach than history dumps
[15:24:30] heh yes it likely will
[15:25:15] parent5446: yeah, i did, he responded quickly, but didn't actually have much useful advice (and he hasn't done anything with delta compression for some years)
[15:25:38] :/ figures. It seems like if they had continued the research they might have been able to get that compression down further
[15:26:37] though the last papers focused on compressing groups of files by reordering them, which isn't very useful to us
[15:27:25] no, the less we read and shuffle around the better
[15:27:46] yeah
[15:28:49] also, apergos i didn't look at it exactly yet, but i think that out of those 4 MB, something like 0.5 to 1 MB is going to be metadata, so i also want to look at how to compress that
[15:29:01] that's rather a lot
[15:29:08] very interesting
[15:29:30] for example, if i have 4 bytes of parent id, i can most likely express the revision id using less than the full 4 bytes
[15:30:32] other possibilities are adding an index for users (so i don't have to repeat usernames) and compressing comments (though i don't know if that will help)
[15:30:54] some comments are going to be compressible (you'll gain a little)
[15:31:03] a lot of them, probably close to breaking even
[15:32:13] the page creation ones will compress down a little
[15:32:29] i could add a flag saying if the comment is compressed, so only compressible comments will be compressed
[15:32:47] well mmm meh
[15:32:56] if you already take the time to try to compress and then compare size
[15:33:04] might as well just compress them and be done with it
[15:33:34] maybe, i'll try and see
[15:33:34] Worst case scenario the compression does nothing.
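(Editor's note: a sketch of the compressed-comment flag being discussed, again in illustrative Python rather than the project's own code. Each comment is compressed, the compressed form is kept only if it is actually smaller, and the choice is recorded in a single flag bit. FLAG_COMMENT_COMPRESSED and the function names are made up for the example; zlib stands in for whatever codec the dump format would actually use.)

import zlib

FLAG_COMMENT_COMPRESSED = 0x01   # hypothetical bit; svick says one bit is free in the revision flags

def encode_comment(comment, flags):
    raw = comment.encode("utf-8")
    packed = zlib.compress(raw)
    if len(packed) < len(raw):              # worst case, compression grows short comments
        return packed, flags | FLAG_COMMENT_COMPRESSED
    return raw, flags                       # incompressible comments are stored as-is

def decode_comment(data, flags):
    if flags & FLAG_COMMENT_COMPRESSED:
        data = zlib.decompress(data)
    return data.decode("utf-8")

comment = "Reverted edits by Example (talk) to last revision by Admin"
data, flags = encode_comment(comment, 0)
assert decode_comment(data, flags) == comment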
[15:33:52] well worst case compression actually increases the size by a few bytes
[15:34:16] yeah, and that could make a difference for short comments
[15:34:19] :/ well as long as it's legible text and not random numbers, the chances of that are pretty low, although you're right
[15:34:28] and then you have a flag (another byte)
[15:34:45] anyways, more testing for you ;-)
[15:35:12] apergos: the flag is just a bit, i think i can find one free bit in the revision flags
[15:35:38] if you've got one free, go for it
[15:35:54] heck if you don't, try it anyways, might as well
[15:36:46] one idea: the page creation comments you mentioned could be compressed well by delta compressing against the text, but they probably don't happen often enough to make this worth it
[15:38:54] no, I wouldn't go that route
[15:41:09] also, out of those 100-200 bytes per revision i have now, 20 are the SHA1 of the revision text
[15:43:19] 20 times 500 million
[15:43:29] that'll help some
[15:44:53] on the other hand, some sort of checksum might be useful, though it probably doesn't have to be 20 bytes
[15:45:21] well the actual sha1 needs to be in the dump cause people use it and shouldn't have to compute it
[15:46:36] but couldn't my application compute it when creating the XML version?
[15:49:13] yes but then it would be slowww
[15:49:38] sha1 * 500 million?
[15:49:58] well you can do timing tests (or ask me to run them on large datasets)
[15:52:02] ok, i have no idea how fast SHA1 is, especially when compared with decompression
[15:54:44] it will vary according to text length too
[15:55:00] we've got some texts that are basically nothing (most aren't big, as you recall)
[15:55:20] but there are some in there, typically talk/etc, that are ginormous, up to the 10 MB max
[15:55:50] anyways I think that's a place where it's ok to just take the data the db gives us
[15:57:41] according to https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Long_pages, the longest page on enwiki is 3.5 MB, and even most pages on that list are < 1 MB
[15:58:21] yes but
[15:58:23] http://meta.wikimedia.org/wiki/Data_dumps/FAQ#Which_page_has_the_longest_text_in_the_en_wp_dump.3F
[15:58:32] Do pages that large even load?
[15:58:36] history will getcha every time
[15:58:54] there's all kinda weird crap people have shovelled in there
[15:59:12] ooh
[15:59:26] obviously those are the exception rather than the rule ;-)
[15:59:50] well, those pages will take a relatively long time to decompress too, so i think it makes sense to measure it relative to that
[16:00:02] but you're likely right that it would take too long
[16:00:16] so i won't do anything with that for now
[16:01:23] yep, time to tweak stuff later (post gsoc maybe)
[16:01:35] iffff you have a bit of time and interest
[16:02:05] yeah
[16:02:55] i can't think of anything else, what about you?
[16:03:32] nope, interested to see more numbers if/when there are some
[16:03:41] parent5446: ?
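(Editor's note: a rough, self-contained timing sketch for the SHA-1 question above: compare hashing revision texts against LZMA-decompressing them, since decompression has to happen anyway when writing the XML dump. The texts here are synthetic, so the output only hints at the ratio; real measurements would need real dump data, as apergos offers.)

import hashlib
import lzma
import time

# synthetic "revision texts" from a couple of hundred bytes up to a few hundred KB
texts = [("lorem ipsum " * (2 ** i)).encode("utf-8") for i in range(4, 16)]
compressed = [lzma.compress(t) for t in texts]

start = time.perf_counter()
for c in compressed:
    lzma.decompress(c)
decompress_time = time.perf_counter() - start

start = time.perf_counter()
for t in texts:
    hashlib.sha1(t).hexdigest()
sha1_time = time.perf_counter() - start

print(f"decompress: {decompress_time:.4f}s  sha1: {sha1_time:.4f}s "
      f"(sha1 is {sha1_time / decompress_time:.1%} of decompression time)")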
[16:09:09] I'm gonna take that as a "my internet connection is too crappy for questions" and say "see ya tomorrow" :-)
[16:09:32] huh, that 10 MB revision on Talk:Japan is vandalism repeating the same paragraph over and over
[16:10:06] see you both tomorrow
[16:49:16] hah, a 10 MB revision is nothing; the worst instance I found is https://archive.org/download/wiki-wikicafe.metacafe.com_en/User_talkMarkjp.7z , a talk page with about 40 GB of spam in total
[16:49:58] we have a 10 MB cap on revision lengths
[16:50:22] the person in question filled the buffer (prolly with more text than that)
[21:11:52] Krinkle|detached: poke
[22:25:16] What effect on performance does switching entirely to HTTPS have on Wikipedia?
[22:31:06] it is slower!
[22:31:15] especially in the global south!
[22:35:29] kmg90: it's slower, especially on high-latency connections
[22:36:25] ori-l: Is there a way to specify a default value for a column in a ContentHandler schema?
[23:44:02] Commons search has problems
[23:44:13] Time out
[23:44:37] Timed out with HTTP request, it says in red
[23:46:04] Romaine: can't repro
[23:46:12] <^d> Me either.
[23:46:35] still have it: use: https://commons.wikimedia.org/w/index.php?title=Special%3ASearch&profile=advanced&search=Upload+size&fulltext=Search&ns4=1&redirs=1&profile=advanced
[23:46:49] <^d> Worksforme.
[23:47:16] An error has occurred while searching: HTTP request timed out.
[23:47:23] I keep on getting that
[23:52:22] jeremyb / ^d: I keep getting that
[23:58:05] any idea why I get this?
[23:58:11] location dependent?
[23:58:20] ^d: he's presumably coming from europe so that's extra RTT?
[23:58:40] (wfm India)
[23:58:58] <^d> jeremyb: Shouldn't make much of a difference, Special:Search is going to hit the apaches anyway.
[23:58:58] http://status.wikimedia.org/ does say performance issues for commons, and "UNCACHED". I'm not sure how technical vs PR-speak that page is though.
[23:59:11] (and IANAD, at all.)
[23:59:57] ^d: in running our pywikibot tests over the past 2 weeks, we've noticed random HTTP request timeout errors when searching via the API.