[08:30:18] Hello
[09:18:08] legoktm: download.kiwix.org uses MirrorBrain to automatically generate the torrent files and distribute load over the mirrors
[09:18:49] Even without running an actual bittorrent client or tracker, you can just create torrents which use webseeds and DHT
[09:19:32] Nemo_bis: that's what I did, just add webseeds and a few public tracker addresses
[09:20:17] https://phabricator.wikimedia.org/diffusion/2037/browse/master/run.py;ec1c8f926c750c0adf92a3dbc944d85fc6b68af6$42
[09:20:36] Yeah, now I saw
[09:20:54] I thought openbittorrent.com had died
[09:21:08] Nowadays many seem to recommend http://coppersurfer.tk/
[09:22:34] Can one just add all mirrors as webseeds and let some 404 (if they didn't sync yet or already deleted the dump as too old)?
[09:23:19] yeah, that should work
[09:25:38] http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/aawiki/ doesn't have the 2017 dump yet
[09:25:56] Yes, they seem to be a bit slower than your.org
[09:26:19] But definitely worth including, I managed to download at 100 MiB/s from them yesterday ^^
[09:29:58] ok, done https://phabricator.wikimedia.org/R2037:a20b54a371229895a274b28a864726569a1cecf3
[09:31:50] Now I'm curious how long it takes to hash all those files given the slow read :)
[09:34:02] aawiki has tiny dumps so the whole script was pretty fast
[09:34:02] real 0m1.791s
[09:34:02] user 0m0.129s
[09:34:03] sys 0m0.424s
[09:37:47] system CPU is 4 times as much as user CPU
[09:38:18] That will probably be a tiny bit better with bigger files, but rarely decent.
[09:40:44] alright, I'll run the other wikis tomorrow, and add the --piece-length=20 setting :) night!
[10:06:53] Nemo_bis AND Yvette: we provide bz2 because it's block oriented and we can recover from issues in the middle
[10:07:16] for small dumps we don't care, but for large ones the run takes a good chunk of time, so being able to pick up in the middle is a hard requirement
[10:08:08] thanks Nemo for the edits!
[10:17:57] apergos: isn't xz too? You can even concatenate
[10:18:20] But xz is a bit too obscure for most people and LZMA is mostly useful for full history dumps anyway
[10:20:08] yes, xz has some sort of built-in indexing of blocks, I haven't looked into it in depth yet
[10:20:20] that's for the dumps 2.0 rewrite, to see what sort of compression we want to end up with
[10:20:58] and probably split the files into many small pieces to be concatenated on demand for download, something like that
[10:40:46] apergos: with the current setup, how hard would it be to add MirrorBrain?
[10:41:09] It would be nice to send HTTP requests to the most local mirror. Some of them are amazingly fast
[10:41:27] with the current setup, all my non dumps 2.0 work is limited to one day a week, or I'll never get that done
[10:41:31] bearing that in mind...
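(Editor's sketch of the torrent-creation step discussed above around 09:19-09:40: hash the dump once, list a few public trackers, and add every mirror as a webseed, since clients simply skip webseeds that 404. This is not the actual run.py; it assumes the third-party torf library, and the tracker URLs, mirror URLs, and dump name are illustrative only.)

    # Sketch only -- assumes "pip install torf"; URLs and filename are examples.
    from torf import Torrent

    DUMP = "aawiki-20170201-pages-meta-history.xml.bz2"

    torrent = Torrent(
        path=DUMP,
        trackers=[
            "udp://tracker.openbittorrent.com:80/announce",
            "udp://tracker.coppersurfer.tk:6969/announce",
        ],
        # Every mirror can be listed; a webseed that 404s is simply ignored.
        webseeds=[
            "https://dumps.wikimedia.org/aawiki/20170201/" + DUMP,
            "http://dumps.wikimedia.your.org/aawiki/20170201/" + DUMP,
            "http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/aawiki/20170201/" + DUMP,
        ],
        piece_size=2**20,  # 1 MiB pieces, i.e. the "--piece-length=20" mentioned above
    )
    torrent.generate()               # hashes the file: the slow, I/O-bound part
    torrent.write(DUMP + ".torrent")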
[10:41:41] Good point :)
[10:41:55] I don't know how long it took for Kelson
[10:41:56] I'd rather have someone else take this on, and let me be reviewer / provide feedback / etc.
[10:42:08] and of course do the actual deployment
[10:43:15] so: investigate how MirrorBrain works, whether it makes sense when there are only a few mirrors, what config files it needs, how it decides which mirrors are faster/active/have current content, etc.
[10:43:53] if it looks like it meets our needs, open a phab task describing the results of that investigation, proposing that it be adopted
[10:44:05] figure out what the puppet config would look like
[10:44:55] (probably needs my help a bit for that part, at least for integrating with the existing setup)
[10:45:12] and get some preliminary patches in for us to look at
[10:50:54] Nemo_bis: if you haven't been able to tell from the above, I'm hoping you know someone(s) who are willing to take that on and run with it
[10:55:17] apergos: yeah, I got it ;)
[10:55:30] :-)
[10:55:58] Maybe I'll take a look at MirrorBrain and see if I can send a puppet patch
[11:04:48] Nemo_bis: cool! If you open a phab task, please add me as a subscriber
[11:05:03] you can put it in the uh "datasets general other" project I guess
[14:06:52] apergos: I still think we should divide the files by namespace.
[14:07:29] huh?
[14:07:37] I missed the context there...
[14:08:10] For like meta-pages-history or whatever, there's no way to just scan articles.
[14:08:16] You gotta scan everything and then filter.
[14:08:46] It would reduce the file sizes and scan times to separate things a bit more.
[14:10:16] I guess https://phabricator.wikimedia.org/T99483
[14:10:43] It just always feels so silly to me that we make people download/scan/load so many revisions that they probably don't care about.
[14:11:06] And/or https://phabricator.wikimedia.org/T20919
[14:12:45] https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Pages_with_the_most_revisions
[14:12:54] So many non-article revisions to wade through.
[14:13:43] Or even with current page history only: https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Page_count_by_namespace
[14:48:38] People should just talk less in ns>0!
[21:08:11] can content translation suggestions be turned off completely?
[21:08:25] I don't want to use them, ever, but they still occasionally show up and annoy me
[21:13:34] revi: suggestions where?
[21:14:33] Maybe you mean the popup upon editing a red link? There are others too.
[21:17:21] Nemo_bis: yeah, that popup
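(Editor's illustration of the "scan everything and then filter" point from the 14:08 exchange: because the history dumps are not split by namespace, a reader who only wants articles still has to decompress and parse every talk/meta page. A minimal streaming filter, using only the standard library and an illustrative filename, might look like this.)

    # Sketch only: stream a pages-meta-history dump and count how much of it is
    # main-namespace (<ns>0</ns>) versus everything else that still had to be
    # decompressed and parsed along the way. Filename is an example.
    import bz2
    from xml.etree import ElementTree as ET

    DUMP = "enwiki-latest-pages-meta-history.xml.bz2"

    def localname(tag):
        # Strip the "{http://www.mediawiki.org/xml/export-...}" prefix.
        return tag.rsplit("}", 1)[-1]

    articles = other = 0
    with bz2.open(DUMP, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if localname(elem.tag) != "page":
                continue
            ns = next((c.text for c in elem if localname(c.tag) == "ns"), None)
            if ns == "0":
                articles += 1   # what most re-users actually want
            else:
                other += 1      # talk/meta/etc. pages waded through anyway
            elem.clear()        # release children so huge dumps don't exhaust memory

    print(articles, "article pages,", other, "pages in other namespaces")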