[00:37:34] !log aaron synchronized php-1.19/includes/Block.php 'deployed r114672'
[00:37:35] Logged the message, Master
[02:16:32] !log LocalisationUpdate completed (1.19) at Tue Apr 3 02:16:32 UTC 2012
[02:16:35] Logged the message, Master
[05:19:36] underscor: pinggg
[05:21:18] apergos: pong
[05:21:22] :)
[05:21:34] * apergos puts some english on the ball and sends it back
[05:21:56] so what were you thinking about for images? I didn't get a clear idea from the emails
[05:22:16] * underscor does a fancy twirling maneuver to counteract the spin
[05:23:02] Okay, so. The archive would like to have images batched (by a certain length of time), so that there's a historical record
[05:23:31] Because one of the things that we dislike about the media is that once a piece of media is deleted, it's gone.
[05:23:45] Anyway, we were thinking a sort of hierarchy like:
[05:24:29] wikimedia->wiki(books|source|pedia|etc)->(langcode)->(batch period)
[05:24:50] Depending on the project/lang, the "batch period" may be a different length
[05:25:16] Like, the Andorran wiktionary media could probably be dumped once a year
[05:25:38] Whereas enwp would probably be weekly at the most infrequent
[05:25:50] Er, well, I guess new media goes to commons instead of enwp, right?
[05:25:57] Or do people still upload media to enwp?
[05:26:31] Anyway, you sorta get the idea. The constraints on our end are that we try to keep each "item" under 10,000 files and 10GB
[05:26:43] Neither are hard and fast, but it's our target
[05:26:46] ok so here's one of the things about deletion
[05:26:52] (which is why we were thinking about that way)
[05:26:58] Legality is a huge part, I'm sure
[05:26:59] many images are deleted for copyright reasons
[05:27:05] Yep.
[05:27:19] it's actually not ok to continue to mirror those forever
[05:27:20] We have terabytes of copyrighted content uploaded daily.
[05:27:48] The unofficial policy is "public until someone complains, then private once they do"
[05:27:54] uh huh
[05:28:12] The official one is that archive.org has no copyrighted content, anywhere.
[05:28:27] for us the policy has to be much more aggressive (and so editors try to be very vigilant, we have stringent policies about copyright and media on commons, and so on)
[05:28:31] Of course, when you grow by 20TB a day, that's very difficult.
[05:28:49] Yeah, it's totally understandable from your perspective. I don't fault you at all
[05:29:26] what we would ideally do (what needs to happen in the medium term with our image mirrors) is that we have them available for rsync and in tarballs per project
[05:29:45] (maybe the large projects have a few tarballs so they aren't ginormous)
[05:30:10] Yeah, that'd be good. 2TB size limit per tarball, so they'll fit on a single mount?
[05:30:19] there would be at a minimum two tarballs per project: images uploaded locally and images hosted on commons but used on the project
[05:30:29] That would make life easier, definitely.
[05:30:36] Are there images on commons that are used nowhere?
[05:30:41] along with that we would need xml files of the file descriptions in similar bundles, plus a readme file that describes the licensing scheme
[05:30:42] (assuming yes)
[05:31:04] probably much less than 2T
[05:31:11] we want normal humans to be able to download these
[05:31:23] xml could be scraped from api.php, but I'm assuming you don't want a couple hundred thousand requests thrown at your frontends
[05:31:27] s/scraped/retrieved/
[05:31:29] no no
[05:31:32] Yeah, true.
[05:31:55] I mean, I guess they could be packed in the same structure they are now (two md5 tiers)
[05:33:27] it's much easier for us to add that to our dump scripts or to stuff them into a similar process that runs regularly here
[05:33:28] Unless you think that's TOO fine grained
[05:33:28] regularly might be every two weeks, it might be more often, it might be less, these are things to work out
[05:33:28] ofc
[05:33:28] so answering a couple of your questions
[05:33:28] yes, people still upload some images to the local projects
[05:33:28] in some cases, they do because it's more convenient
[05:33:28] because they are used to it
[05:33:28] because there is a language barrier for them in uploading to commons and wading through all the licensing info
[05:33:28] etc
[05:33:39] Makes sense
[05:33:40] in other cases they do so because commons must host only free images
[05:33:50] btw, how large is the entire image backup?
[05:33:51] as in free, not as in beer ;-)
[05:33:56] Just so I can get an estimate
[05:33:58] so fair use images must go to the local projects
[05:34:04] 16-17T now
[05:34:16] Oh, I see. I didn't realize the different projects had different licensing restrictions
[05:34:17] expect growth of anything from 300gb to a T a month,
[05:34:25] Oh, only 17TB?
[05:34:30] Not bad at all!
[05:34:35] if we ever start getting tons of video we will be up *&^% creek without a paddle
[05:34:38] I was expecting at least a hundred
[05:34:53] yes but we don't have a few spare petabytes lying around like you guys ;-)
[05:34:56] hahaha
[05:35:08] http://home.us.archive.org/~edward/costco_drives/
[05:35:48] (due to rationing of bare internals, it was cheaper/easier to buy externals and cannibalize them)
[05:35:55] when I was at the archives a couple years ago they had some nice little tower that was a few peta, just sitting quietly in a corner with nice blinky lights
[05:36:08] I could imagine it in the corner of my living room... except for the power consumption of course
[05:36:10] There's now a room full of 750 or so empty casings
[05:36:18] Yeah, each rack draws 2kw o.O
[05:36:25] yeah, a bit prohibitive
[05:36:25] kwh*
[05:36:55] so going back to the list you want and the structure you want
[05:37:18] is the idea that each image would be listed separately?
[05:37:35] (this is on your end, I'm not talking about what we provide)
[05:37:52] Well, we'd like images to be directly accessible
[05:38:10] so someone could get the 5 images from their favorite 19th century painter
[05:38:10] But we have this cool magic now where you can append a slash to the end of a tar.gz or tar.bz2 and browse inside it
[05:38:16] and not have to deal with the rest
[05:38:20] ohhh
[05:38:23] ok very nice
[05:38:25] So it would be feasible to have sets of tar.gz or bz2
[05:38:39] can I ask how that works? do you have an index pre-built
[05:38:43] or do you uncompress on the fly?
[05:38:59] uncompress on the fly
[05:39:02] ok
[05:39:10] I can imagine that would get slow for large files
[05:39:14] The datanodes have something like 24 cores and 96gb of ram each
[05:39:22] uh huh
[05:39:23] So it's cheaper to just do it on the fly
[05:39:27] I don't have a targz handy, but here's an iso
[05:39:30] http://ia600601.us.archive.org/isoview.php?iso=/32/items/cdrom-pccollectorn05cd1/pccollectorn05cd1.iso&file=
[05:39:56] And you can deeplink directly into it
[05:39:58] right
[05:39:59] http://archive.org/download/cdrom-pccollectorn05cd1/pccollectorn05cd1.iso/IMAGES%2FDEMO2.BMP
[05:40:16] hm it did not like that link
[05:40:21] anyways I get the idea
[05:40:25] oh
[05:40:30] the %2F should be /
[05:40:33] Silly firefox
[05:40:35] heh
[05:40:42] Anyway, sorry. I digress.
[05:41:11] so then you would want tarballs for a given batch period, as you say
[05:41:16] yeah
[05:41:41] I was playing with parsing rsync --list-only output, but it would be much easier on your side to just run find with mtime filters
[05:41:48] where you mean that the latest version of the file was uploaded between X and Y
[05:41:55] Yep!
[05:42:10] to generate a list of filenames, which then I could feed to rsync --files-from=list.txt
[05:42:30] mtime will not of course get it
[05:42:50] so first off, many files I am sure no longer have their original ctime or mtimes
[05:43:03] we've moved data aroud several times
[05:43:07] *around
[05:43:17] Hmm
[05:43:22] the other fun thing is that soon-ish images will be served from swift
[05:43:36] so I'll be retrieving them with a completely different mechanism tbd
[05:43:42] (i.e. to be written)
[05:43:42] The upload dates are stored in the DB, right?
[05:43:52] (I have no idea how expensive that is)
[05:43:59] but that'd work
[05:44:00] yep, and that's what I'm going to need is to do some db queries and dump lists that way
[05:44:25] I'll already have to to some of that to generate lists of files housed on commons but used on the local project
[05:44:32] *have to do
[05:44:57] one thing you might consider is
[05:45:08] that we don't need to have the weekly or whatever interval all the way back to 2002
[05:45:11] swift is similar to mogilefs, right?
[05:45:16] Oh, yeah, definitely
[05:45:36] I mean, the interval size is not as important as targeting <100gb per interval
[05:45:41] so you could have "starter files" from 200.. 5? whatever
[05:45:51] something where it's not too ginormous
[05:45:59] yeah.
[05:46:00] then maybe monthly batches for awhile
[05:46:20] then when it gets closer to recent if those need to be biweekly divisions it could happen
[05:46:34] anyways, it seems that all you need is the list of upload dates
[05:46:51] yep
[05:46:52] um, we provide old versions of files when there have been changes
[05:46:58] what do you want to do about those?
[05:47:17] hmm.
[05:47:33] Well, do they have multiple DB entries, or how do they fit into the schema?
[05:47:50] they get moved into an 'archive' directory
[05:47:56] Like, would the list for "march 2005" include the old version and "february 2011" have the new version?
[05:47:58] and someone looking at the image history can see them
[05:48:08] How big is the archive directory?
[05:48:14] (if you know)
[05:48:15] each project has their own
[05:48:26] much smaller compared to the rest of the content
[05:48:47] these are not the same as deleted images
[05:48:52] once images are deleted we cannot give them to you
[05:49:01] Right. Deleted images are removed from the fs
[05:49:53] I mean, I guess you could have /wikimedia/wikipedia/en/archives/(all the batches here)
[05:50:06] and /wikimedia/wikipedia/en/(nonarchive batches here)
[05:50:30] But somehow you need to preserve the link between the current and the archived version
[05:50:37] Does the XML have that info?
[05:50:53] well deleted images are not removed from the fs
[05:50:59] but we cannot publish them to the outside world
[05:51:06] they are moved elsewhere
[05:51:27] oh, I didn't know you held onto them
[05:51:31] the xml dumps include file description pages of all old versions
[05:51:39] Regardless, they're effectively "gone" :)
[05:51:41] yes, because files can be undeleted, just like page content
[05:51:57] because content and images are deleted for a variety of reasons
[05:52:21] yeah, of course
[05:52:33] including license, "inappropriate for X project", targets someone personally, etc
[05:52:49] once in a while we have a rogue admin who deletes things
[05:52:49] or "illegal in the US"
[05:52:59] *cough* main page *cough*
[05:53:06] imagine if we couldn't retrieve it later
[05:53:24] That's happened?
[05:53:28] Dang.
[05:53:39] in very rare cases an image will be removed forever so it cannot be restored, but that's not the rule
[05:53:44] um yeah
[05:53:50] Bummer.
[05:53:55] until we made it impossible to delete the en wp main page
[05:54:07] those were the days :-D
[05:54:11] hahaha
[05:54:45] "ariel, $rogue-admin deleted the page again. can you roll it back?"
[05:54:47] impossible? you just haven't put the effort in apergos
[05:54:48] 8)
[05:55:05] so if you are serving the images in this way you'll need a copy of all the file description pages as well plus a 1 pager about the license info, a sort of readme
[05:55:30] p858snake|l: I'm trying to discourage people from working hard to find new ways, and you're not helping here :-P
[05:55:36] Which can vary by project?
[05:56:09] (as a side note, new goal: become wp admin, figure out way to remove main page AND insert backdoor that disallows recreating it)
[05:56:16] ;)
[05:56:40] Oh wow, main_page has a talk:
[05:56:49] I would've thought it'd be disabled or something
[05:56:51] no, fortunately admin as an editor != have shell on our cluster
[05:56:57] so that wasn't going to happen :-D
[05:57:53] so yes, some of our projects do not have the usual cc-by-sa (and gfdl/cc-by-sa for older content)
[05:58:17] wikinews for example has cc-by
[05:58:52] the simple thing there is to have one file that lists the licenses per project where they differ (or says "go look at X license link")
[05:58:58] I see.
[05:59:21] and people upload images to commons with a variety of different licenses
[05:59:23] Does that already exist somewhere? (Special: page on wmf?)
[05:59:30] not to my knowledge, sadly
[05:59:47] (if not, easy (if tedious) to manually generate)
[06:00:22] uh huh
[06:00:24] at least doable
[06:01:19] so ideally, someone downloading a list of images also gets the file description pages for each, along with one copy of the general license information
[06:02:30] They could be just next to each other in the tarball, possibly?
[06:02:32] that's a bit more of a PITA for you guys, as the file description info is XML in a bz2 or 7z file, without separate files for each
[06:02:40] Then they'll sort right near each other
[06:02:48] there was at one stage a project to move license data into db magic compared to plain wikitext on the page, I wonder what ever happened with that
[06:02:50] Just throw it into lxml and extract the bits for the file
[06:02:56] 8)
[06:03:05] I wonder how lxml likes multigb xml files
[06:03:05] haha
[06:03:28] so this might be another area where we tell you "here's the xml file with all file description pages as of X, have at"
[06:04:58] none of what I am saying stops you from starting to grab stuff from the rsync mirror we have now, if you want to get started on this
[06:05:24] I'm going to be writing scripts and shuffling things to get tarballs etc going so these other things will happen over time
[06:06:36] ok, excellent.
[06:06:42] https://secure.wikimedia.org/wikipedia/commons/w/index.php?title=File:Ariel_Glenn_-_IFLA_Pre-Satellite_Conference_2010_-_Managing_Quality.pdf&page=1
[06:06:45] That's cool!
[06:06:54] I didn't know it intelligently handled pdf files
[06:06:57] it was entertaining
[06:07:41] so bear in mind you're getting any kind of random thing in here, not just images
[06:09:10] apergos: apart from PDFs I don't think we allow anything other than images do we?
[06:09:11] yeah
[06:09:18] p858snake|l: video
[06:09:25] oh yeah, that
[06:09:49] plus the executables I want to distribute that I rename to .png
[06:09:54] ;D
[06:10:06] sound and video
[06:10:47] there are projects centered around audio recording of page content
[06:13:26] http://i.imgur.com/d0k7m.jpg took that last time I was visiting hq
[06:13:39] nice
[06:13:44] anyway, 's getting pretty late here, I should probably snooze
[06:13:49] why isn't it on commons? :-P
[06:13:49] (edt)
[06:14:02] wow it is late then
[06:14:03] It's WTFPL/2.0 :P
[06:14:16] I guess I could relicense it
[06:14:21] I guess you could :-D
[06:14:39] just make sure that the logo isn't going to be an issue
[06:14:46] It's part of an art installation
[06:14:46] http://stare.com/art/petabytes/index.html
[06:14:59] We each took a picture of the same exact thing
[06:15:27] But because of how the law works, whoever pushed the shutter is the owner and determines the license for that specific "color" of bits
[06:15:30] so next steps, I assume you'll want to start doing a first rsync? and I (weird that is really weird.)
[06:15:49] (http://ansuz.sooke.bc.ca/entry/23)
[06:16:06] It's sort of a thing to say how screwed up the current wording is
[06:16:07] :)
[06:16:07] and I will drop you a line when I have something else available (eg file description tarballs, or lists of images by upload date, or whatever)
[06:16:18] Anyway, yeah. Sounds good.
[06:16:26] !log deploying limited/split apache syslog (https://gerrit.wikimedia.org/r/#change,4149)
[06:16:30] Logged the message, Master
[06:16:59] Need to figure out a way to logically break up the sync, because the temp storage vms have 2tb disks
[06:17:01] ok. if you need anything from me in the meantime or have any questions/ideas, feel free to drop in
[06:17:08] ah so
[06:17:19] we have this layout with 256 "shards" sort of naturally
[06:17:30] Oh, really? :o
[06:17:40] based on the first digit and the first two digits of the hash of the image name
[06:17:59] oh, right. Forgot about that
[06:18:02] so maybe you want to grab commons/a/a0 or something
[06:18:11] Yeah
[06:18:15] those would be small pieces for you guys
[06:18:28] Grab shards until disk full, then move to the next disk
[06:18:35] the other projects will all be under size
[06:18:35] Excellent.
[06:19:04] for now....
[06:19:06] :-D
[06:19:42] ok, I gotta get going, I'm still in my pjs here
[06:19:50] have a good night!
[06:19:51] haha
[06:19:53] see ya!
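The shard layout and the "grab shards until disk full" plan discussed above can be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions: the log confirms two md5 tiers keyed on the first one and first two hex digits of the hash of the image name (paths like 3/36 and b/bc), but the function names `shard_path`, `all_shards` and `plan_disks`, the space-to-underscore normalisation, and the per-shard sizes are hypothetical and not taken from any script mentioned in the conversation.

```python
import hashlib
from itertools import product

HEX_DIGITS = "0123456789abcdef"


def shard_path(filename):
    """Return the two-tier shard (for example '3/36') for an image name."""
    # Assumption: the stored name uses underscores instead of spaces.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First tier: first hex digit. Second tier: first two hex digits.
    return f"{digest[0]}/{digest[:2]}"


def all_shards():
    """List every possible shard: 16 first digits x 16 second digits = 256."""
    return [f"{a}/{a}{b}" for a, b in product(HEX_DIGITS, repeat=2)]


def plan_disks(shard_sizes, disk_capacity):
    """Greedily assign shards to disks: fill one disk, then move to the next.

    shard_sizes maps a shard such as 'a/a0' to its size; the sizes are
    hypothetical inputs (they would have to be measured on the mirror).
    """
    disks = [[]]
    used = 0
    for shard in sorted(shard_sizes):
        size = shard_sizes[shard]
        if size > disk_capacity:
            raise ValueError(f"shard {shard} does not fit on a single disk")
        # Start a new disk when the current one would overflow.
        if used + size > disk_capacity and disks[-1]:
            disks.append([])
            used = 0
        disks[-1].append(shard)
        used += size
    return [disk for disk in disks if disk]
```

With sizes in the same unit as the capacity (say, GB against a 2000 GB disk), `plan_disks` returns one list of shard paths per disk, each of which could be turned into a separate rsync run. The greedy fill keeps shards in sorted order and does not try to pack disks optimally.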
[06:19:57] Thanks!
[06:33:43] Hi. I'm getting all sorts of weird errors trying to deal with an image deletion.
[06:33:50] > Error undeleting file: A non-identical file already exists at mwstore://local-backend/local-public/3/36/Ultimate_Spider-Man_-98.jpg.
[06:34:06] I also got weird errors when I deleted it (I'm clearing out copyvios)
[06:34:42] https://en.wikipedia.org/w/index.php?title=File:Ultimate_Spider-Man_-98.jpg&action=edit&redlink=1
[06:38:36] anyone?
[06:40:28] > A non-identical file already exists at mwstore://local-backend/local-public/archive/3/36/20120403062050!Ultimate_Spider-Man_-98.jpg.
[06:40:32] more errors :(
[06:44:45] apergos: Who would I ping about this?
[06:48:17] hmm
[06:48:27] this is the wrong time of day I guess
[06:48:46] aaron or ben are going to be the people you want, most likely
[06:49:06] but this is now past their bedtimes
[06:49:25] Okay, well I've taken a screenshot: http://i.imgur.com/j8t5C.jpg
[06:50:38] I tried to undelete everything (rather than partial undelete) and it didn't help.
[06:51:04] I would bugzilla it and let one of them know
[06:51:26] Or both.
[06:51:45] Though that's really Hexmode's job.
[06:51:56] So you could just e-mail him.
[06:52:13] I hate filling out bugzilla forms.
[06:52:15] But okay.
[06:52:26] I forgive you.
[06:53:25] the thing about bugzilla is that then there's a public record anyone can get to relatively quickly
[06:56:40] Component as "Deleting"?
[06:57:48] yes
[06:58:04] images and files
[06:58:39] ok then
[06:58:50] https://bugzilla.wikimedia.org/show_bug.cgi?id=35656
[06:59:20] gj
[06:59:43] Generally screenshots should be attached to the bug.
[06:59:58] So that they don't get lost if the external repo dies or goes away or whatever.
[07:00:01] But that's not a big deal.
[07:00:15] If someone really cares, they'll download and upload them.
[07:00:18] I just realized I had this tab open: http://i.imgur.com/TNfjv.jpg
[07:00:22] Should I include that link?
[07:00:36] errors I got when originally deleting
[07:00:45] Yes.
[07:00:46] sure
[07:01:24] More information is rarely problematic. Not having enough information is often problematic.
[07:02:34] k, added
[07:03:20] So you deleted the whole file?
[07:03:28] Rather than deleting specific old revisions?
[07:03:47] Old versions, I mean. They have separate (delete) links, right?
[07:04:04] I deleted the whole file
[07:04:15] Right. Why?
[07:04:19] in order to restore selectively
[07:04:26] Rather than deleting selectively?
[07:04:34] yes because there were fewer clicks that way
[07:04:38] Okay.
[07:04:48] It's still a bug you've found. I'm just curious about your behavior.
[07:05:44] People seem to have no issue removing vandalism from image histories, but they wouldn't do so (typically) for page histories.
[07:05:52] I wonder if it's due to the visual nature of image upload vandalism.
[07:05:59] oh this is no good.
[07:06:17] turning on the news and seeing oakland in the headlines cannot be any good whatsoever
[07:06:23] Perhaps image upload history should be banished to a history page.
[07:06:26] Or collapsed or something.
[08:21:56] hello
[08:22:06] http://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#Botched_file_move
[08:22:24] Oh :(
[08:22:24] a file disappeared
[08:22:43] Perhaps related to my bug! https://bugzilla.wikimedia.org/show_bug.cgi?id=35656
[08:23:36] joan!
[08:23:43] files are disappearing
[08:23:47] is the rapture happening?!
[08:27:26] anyhow, anyone that could be potentially helpful is asleep
[08:31:41] the image itself got moved
[08:31:44] http://upload.wikimedia.org/wikipedia/commons/b/bc/St_Boswells_station_geograph-2328602-by-Ben-Brooksbank.jpg
[08:31:55] the file description page may not have made it
[08:32:35] nothing in the logs, either
[15:15:46] !log reedy synchronized closed.dblist 'Bug 35581 - Closure of nz.wikimedia.org'
[15:15:48] Logged the message, Master
[15:16:19] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35581 - Closure of nz.wikimedia.org'
[15:16:21] Logged the message, Master
[15:23:20] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35603 - Enable Transwiki import on KN:WP'
[15:23:22] Logged the message, Master
[15:33:39] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35624 - Subject namespace for the Vietnamese Wikibooks'
[15:33:41] Logged the message, Master
[15:39:54] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35545 - Grant the abusefilter-log-detail right to patrollers on Commons'
[15:39:56] Logged the message, Master
[15:41:20] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35545 - Grant the abusefilter-log-detail right to patrollers on Commons'
[15:41:22] Logged the message, Master
[15:51:16] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35482 - Add Patroller & Autopatroller groups on ml.wikisource'
[15:51:17] Logged the message, Master
[15:53:26] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35482 - Add Patroller & Autopatroller groups on ml.wikisource'
[15:53:27] Logged the message, Master
[17:31:41] hmm, Reedy - there's a new user group on enwiki, "autochecked users" - but I can't find any bug reports for it, was it added recently?
[17:33:12] Thehelpfulone: I haven't (knowingly) added it
[17:33:32] ok, seems like there was a vp discussion about it http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=next&oldid=463474936#.22Autochecked_users.22
[17:34:00] ah, https://bugzilla.wikimedia.org/show_bug.cgi?id=32751
[17:34:10] was probably done with 1.19 just nobody noticed it until yesterday
[17:34:27] possibly if it's from FR
[17:36:05] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix commas in guwikisource namespaces'
[17:36:07] Logged the message, Master
[17:38:35] !log reedy synchronized wmf-config/InitialiseSettings.php 'Fix commas in guwikisource namespaces'
[17:38:37] Logged the message, Master
[17:52:09] !log reedy synchronized php-1.19/extensions/MobileFrontend/
[17:52:11] Logged the message, Master
[17:59:54] !log Synchronized payments cluster to r114642
[17:59:56] Logged the message, Master
[18:26:45] hi @ all
[18:27:26] i got a problem with wikipedia
[18:28:56] under mac and chrome if i dont use the 100 zoom level i got a blue background
[18:31:38] Stiffi: A screenshot could help, perhaps upload one to https://imgur.com/ or similar?
[18:32:11] http://imageshack.us/photo/my-images/99/screenshot20120403at813.png/
[18:35:49] okay the bug is known
[18:36:42] thanks and bye
[18:37:52] !log root synchronized wmf-config/mc.php
[18:37:54] Logged the message, Master
[20:54:34] root who? :)
[21:31:29] Tim-away / AaronSchulz: either of you around atm?
[21:32:01] Snowolf: http://bit.ly/HlPo27
[21:32:29] Logan_: uh?
[21:32:33] ATM.
[21:32:36] You're welcome.
[21:32:45] lol
[21:35:37] lol
[21:35:58] or robla
[21:43:24] Logan_ busy?
[21:43:34] No. What's up?
[21:43:41] rank insignia :)
[21:43:45] Eh. :-P
[21:43:51] Template:Ranks_and_insignia_of_NATO/Generic/Army
[21:43:57] http://en.wikipedia.org/wiki/Template:Ranks_and_insignia_of_NATO/Generic/Army
[21:43:58] Somebody else, please. :-P
[21:44:11] you are a party pooper.
[21:44:13] :p
[21:44:27] Snowolf is worse.
[21:44:56] Actually, Joan is worse.
[21:44:58] But I digress.
[22:06:50]