[21:46:52] Hi all, I'm wondering about best practices for setting up a long-running capture of the recentchanges (and similar) event stream? Any chance the kafka queues are directly accessible?
[21:47:08] hiya
[21:47:26] turtles_at_work: not from external networks
[21:47:40] does eventstreams work for you?
[21:47:43] stream.wikimedia.org?
[21:47:59] https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams
[21:48:12] It does, but it is unstable - if I try to go back a few days I'm able to stream a few million records, then the consumer dies
[21:49:10] Unless there are caveats about using the "since" parameter
[21:49:45] hm
[21:49:52] the consumer dies?
[21:49:56] I even tried building in some smarts to keep track of the currently recovered timestamp and restart the stream from there
[21:50:00] Seems to, yes
[21:50:38] ya if you persist the id field you should be able to re-use it on reconnect as the last-event-id header
[21:50:42] and start from where you left off
[21:50:47] what client are you using?
[21:51:06] it might just not be able to consume fast enough? or puts too much in memory?
[21:51:15] or
[21:51:19] if you really need historical data
[21:51:23] The async sse library in python, and curl, writing out to bz2
[21:51:46] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[21:53:05] Ah cool, that historical data will be useful, thank you.
[21:53:50] turtles_at_work: the curl process dies in the same way as the python sse client?
[21:53:57] It does, yes
[21:54:19] how long does it usually take before that happens?
[21:54:40] An hour or two; looks like it gets about a whole day of data
[21:55:46] does the curl process just die, or just stop consuming data?
[21:56:21] Fully dies - I don't see any indication of errors, though, so it seems to be exiting gracefully
[21:57:52] ok, going to see if i can repro. i gotta run in just a bit, but if you still have problems, can you submit a phabricator ticket about this?
[21:58:37] Sure, will do. This is also when I'm attempting to consume the historical stream - I haven't verified if the same behavior occurs without specifying a start point
[21:58:45] Will give that a shot too. Thanks!
[21:59:01] oh right, start point
[21:59:15] how long ago do you start?
[21:59:24] 11/11/19
[21:59:26] a few days eh?
[21:59:26] ok
[21:59:27] will try
[21:59:28] that too
[21:59:36] thanks!
[22:00:25] hmm yeah i wonder if an eventstreams server process is dying if a client asks for a lot of data, and then disconnects your client
[22:00:28] it is possible.
[22:00:30] Oh, I do have an error!
[22:00:42] let me pastebin it
[22:01:05] https://pastebin.com/ZZY0xbPa
[22:03:28] hm, that's in bz2 though
[22:03:36] so that could just be because the stream ends
[22:03:43] turtles_at_work: i don't know how bz2 works, can it compress the stream on the fly?
[22:04:14] Yeah, pipe stdin to it with the -c flag
[22:04:19] ok
[22:04:23] i'm using CLI curl with > /dev/null
[22:04:33] but yeah, that error looks like the stream just disconnected.
[22:04:33] hm
[22:04:40] looks like this took 4.5 hours to halt
[22:04:52] but wait, why pipe this into bz2, don't you need to batch it?
[22:04:57] the stream never ends
[22:05:01] so where does your output go?
[22:06:09] planning to mirror into elasticsearch at some point - wanted to capture the backlog in one go so that I wouldn't need to grab it again. Figured compressed on disk would save me some room
[22:06:40] Live stream would be handled a little differently, this is all just experimentation/pre-alpha stuff :)
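A rough sketch of the resume-on-reconnect approach discussed above: persist each event's id field and replay it as the Last-Event-ID header after a disconnect, use the since parameter only for the first connection, and compress straight to disk with bz2. This uses plain requests rather than the async SSE client mentioned; the file names, start timestamp, and endpoint path are my assumptions for illustration, not something confirmed in the channel.

```python
import bz2
import time

import requests

STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchanges"
SINCE = "2019-11-11T00:00:00Z"        # historical start point; only used when no id is saved
LAST_ID_FILE = "last_event_id.txt"    # hypothetical path for the persisted id
OUTPUT_FILE = "recentchanges.json.bz2"


def load_last_id():
    """Return the previously persisted SSE id, or None on first run."""
    try:
        with open(LAST_ID_FILE) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def consume():
    while True:  # reconnect loop: the stream never ends on its own
        headers = {"Accept": "text/event-stream"}
        params = {}
        last_id = load_last_id()
        if last_id:
            headers["Last-Event-ID"] = last_id   # resume where we left off
        else:
            params["since"] = SINCE              # first run: start from a timestamp
        try:
            with requests.get(STREAM_URL, headers=headers, params=params,
                              stream=True, timeout=(10, 120)) as resp, \
                 bz2.open(OUTPUT_FILE, "at") as out:
                resp.raise_for_status()
                event_id, data_lines = None, []
                for line in resp.iter_lines(decode_unicode=True):
                    if line.startswith("id:"):
                        event_id = line[3:].strip()
                    elif line.startswith("data:"):
                        data_lines.append(line[5:].strip())
                    elif not line and data_lines:
                        # Blank line ends one SSE event: write it, then persist the id.
                        out.write("\n".join(data_lines) + "\n")
                        if event_id:
                            with open(LAST_ID_FILE, "w") as f:
                                f.write(event_id)
                        event_id, data_lines = None, []
        except requests.RequestException as exc:
            print(f"stream dropped ({exc}); reconnecting in 10s")
            time.sleep(10)


if __name__ == "__main__":
    consume()
```

Persisting the id only after the event has been written means a crash can't skip events; at worst a few get written twice on resume.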
[22:08:15] I'm wondering if maybe it's just a version conflict issue with requests; part of that error message was
[22:08:17] RequestsDependencyWarning: urllib3 (1.25.7) or chardet (3.0.4) doesn't match a supported version!
[22:09:39] Yeah, googling around, this is looking like a requests problem, not an event stream problem. A lot of the advice is to downgrade urllib3 to 1.21
[22:10:51] I'll give that a shot and try to strengthen the logic on recovery from stream failure
[22:11:07] hmm ok
[22:11:08] !
[23:44:35] Hi all, I was pointed towards https://dumps.wikimedia.org/other/mediawiki_history/readme.html - which looks like a fantastic resource! Only trouble so far is that I can't seem to find headers for the TSVs. The schema at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Schema doesn't seem to quite match up
[23:54:44] You might have better luck asking in #wikimedia-analytics
[23:54:52] Are they a lot different?
[23:55:37] You'd kinda expect a header file to be dumped every month
[23:58:11] Looking again, it looks like that schema is alright - had some PEBKAC in terms of columns, and the "snapshot" column isn't actually in the data :)
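For the header-less mediawiki_history TSVs, a minimal sketch that supplies column names from the wikitech schema page, leaving out the "snapshot" column since (per the above) it isn't in the data. The dump file name and the truncated column list are placeholders; copy the full, ordered field list from https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Schema.

```python
import bz2
import csv

DUMP_FILE = "mediawiki_history.enwiki.tsv.bz2"  # hypothetical file name

# Illustrative prefix of the schema only - replace with the full ordered
# field list from the wikitech page, minus the "snapshot" column.
COLUMNS = [
    "wiki_db",
    "event_entity",
    "event_type",
    "event_timestamp",
    # ... remaining fields, in schema order
]

with bz2.open(DUMP_FILE, "rt", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        record = dict(zip(COLUMNS, row))
        print(record["event_entity"], record["event_timestamp"])
```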