[01:45:32] <^Mike> I was just reading https://blog.wikimedia.org/2016/04/18/wikimedia-server-switch/ and I wondered what the limitation is in MediaWiki that requires a period of read-only traffic in order to failover between clusters
[01:51:17] <^Mike> Is this really a limitation in MediaWiki, or with the MySQL backend?
[01:51:44] ^Mike: we don't want to have a period of time in which there are two databases acting as masters, because when the two servers settle their accounts there is the possibility of irreconcilable conflicts
[01:51:49] we don't want to tell a user that an edit has been saved successfully and then have it disappear.
[01:53:11] so we have to have a brief period of time during which we do not process updates but during which updates which are already in-flight are allowed to complete successfully
[01:54:32] we only need to be in read-only mode very briefly
[01:55:25] but not all the steps that need to happen during a failover are automated, so there is a human speed / error factor
[01:55:43] Hi, could someone with knowledge of the Spanish view this proposal? [[https://es.wikipedia.org/wiki/Wikiproyecto_Discusi%C3%B3n:Ning%C3%BAn_municipio_espa%C3%B1ol_sin_fotograf%C3%ADa#Mapa_SVG_de_Espa.C3.B1a_con_municipios_etiquetados]]
[01:57:11] <^Mike> ori: cool, I figured it was master election
[01:57:37] no one?
[01:58:29] Miguu: try #wikipedia-es maybe?
[01:58:42] <^Mike> Is there any work planned to move towards a distributed data store?
[01:58:48] ^Mike: No.
[02:00:34] ^Mike: (Well, it's complicated. ;-))
[02:00:54] ori:
[02:00:54] It's just a technical matter, is not whether there is a technical channel in Spanish ?, PS: Use automatic translator so you can be misspelled
[02:01:58] Miguu: Re-creating maps.wikimedia.org inside an SVG file sounds like a bad idea to me?
[02:02:02] <^Mike> Sure, I imagine the assumption that your backend is MySQL is embedded pretty heavily in the MW code.
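The read-only window described at 01:51 to 01:55 (stop accepting updates, let in-flight updates finish, then switch masters) can be sketched roughly as below. This is only an illustration in Python, not how MediaWiki or the actual switchover tooling does it; `WriteGate`, `switch_master` and the `promote` callback are hypothetical names.

```python
import threading


class WriteGate:
    """Hypothetical gate that tracks in-flight writes and can refuse new ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._read_only = False
        self._in_flight = 0

    def begin_write(self):
        """Admit a write, or return False if the site is read-only."""
        with self._cond:
            if self._read_only:
                return False
            self._in_flight += 1
            return True

    def end_write(self):
        """Mark a previously admitted write as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def enter_read_only(self, timeout=None):
        """Stop admitting writes, then wait for in-flight ones to finish.

        Returns True once nothing is in flight, False if the wait timed out.
        """
        with self._cond:
            self._read_only = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)

    def leave_read_only(self):
        """Start admitting writes again."""
        with self._cond:
            self._read_only = False


def switch_master(gate, promote, timeout=None):
    """Drain writes on the old master, promote the new one, reopen writes."""
    if not gate.enter_read_only(timeout):
        gate.leave_read_only()
        raise TimeoutError("in-flight writes did not drain")
    try:
        promote()  # hypothetical callback: repoint writes at the new master
    finally:
        gate.leave_read_only()
```

The point of the sketch is that there is never a moment with two masters accepting writes: new writes are refused before the drain starts and are only admitted again after `promote` returns.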
[02:02:09] ^Mike: I guess it depends on your definition of "distributed data store". If MySQL in slave/master configuration qualifies, then we already are using a distributed data store. If, however
[02:02:25] ^Mike: And no, there's no technical support channel in Spanish, sorry.
[02:02:36] Bah.
[02:02:45] Miguu: And no, there's no technical support channel in Spanish, sorry.
[02:02:47] <^Mike> ori: nice try though :P
[02:03:14] ^Mike: Fundamentally, a system without a global version of the truth isn't a "wiki", it's a different thing.
[02:03:46] if you're asking whether we have considered weakening the consistency model to increase performance or improve fault-tolerance, the answer is mostly "no"
[02:04:03] This is a template that automatically color certain municipalities
[02:04:17] <^Mike> Sure, but you can have a global version of truth in a distributed system if you're willing to pay for it
[02:04:37] yeah, it's just really hard to get right, especially with a live, evolving system
[02:04:45] <^Mike> dunno what the open-source options are though
[02:04:46] we're not that smart :)
[02:04:51] * James_F grins.
[02:05:49] * ori read Google's Spanner paper and even understood (parts of) it!
[02:06:25] Consensus consistency? Hmm.
[02:07:52] ^Mike: it is true that we are massively invested in MySQL (well, MariaDB in our case, but you know). We're a pretty small org and we have a lot of operational experience in running MySQL. We just don't have enough spare cycles to migrate to something totally new
[02:08:24] ^Mike: That said, our ORM layer is pretty robust and theoretically we could port to a different RDBMS.
[02:08:56] but there is work that is currently ongoing that shares the same goal: namely, serving reads from a secondary datacenter
[02:09:14] There's SQLite and postgres support still in master. OracleDB and MSSQL used to work but got cut for lack of love.
[02:09:44] <^Mike> Cool. Thanks, folks.
[02:09:50] Any time. :-)
[02:10:16] <^Mike> although...
[02:10:58] <^Mike> I was also wondering a while back what WMF's production environment looks like. Do you use a cluster scheduler like Mesos or something? Or have you gone in the Chef/Puppet direction
[02:11:21] <^Mike> (or both, I suppose they're not mutually incompatible)
[02:12:59] we use puppet (our puppet repo is https://github.com/wikimedia/operations-puppet). the ops team recently decided to experiment with kubernetes by running a low-stakes production service on it
[02:14:14] * ^Mike thumbs up
[02:16:45] ^Mike: it sounds like you know a thing or two about this topic. If you have a lot of experience, your feedback could be really valuable. If you reach out (I'm ori@wikimedia.org) and let me know what your areas are I would reply with pointers to where the relevant conversations are taking place.
[02:18:09] gotta run, o/
[02:18:34] Bye ori.
[02:21:20] <^Mike> I do, but I doubt my employer be okay with me moonlighting
[02:30:12] Any Phabricator admins about?
[02:35:10] I'm having a multi-account issue
[02:36:29] Reducto, #wikimedia-devtools (which I see you just joined!) is best bet.
[02:36:44] Okay, thanks
[07:55:48] [[Tech]]; ArchiverBot; Bot: Archiving 1 thread (older than 30 days) to [[Tech/Archives/2016]].; https://meta.wikimedia.org/w/index.php?diff=15536845&oldid=15534246&rcid=7666927
[12:18:03] You've probably already seen this, but a friendly reminder just to be on the safe side: We won't be able to edit the wikis for about half an hour or so today because of some testing.
[12:18:03] This starts in about two hours. https://meta.wikimedia.org/wiki/Tech/Server_switch_2016
[14:32:44] 30 minutes since the switch. Any updates?
[14:36:32] we're nearly ready to go back read-write
[14:36:39] slightly delayed
[14:46:07] we're going read-write now
[14:46:25] Nice!
[14:46:30] Hurray!
[14:48:48] and we're back :)
[14:50:54] [VxZFugrAEEMAAAwQGjsAAABY] 2016-04-19 14:50:35: Fatal exception of type "JobQueueError"
[14:51:09] jobqueue isn't back yet
[14:51:11] Banners are now off, fwiw.
[14:51:12] kk
[14:51:25] (CN banners that is)
[14:52:27] and things might be bumpy
[14:53:17] Specia:RecentChanges is not refreshing (feedback from en.wp and es.wp
[14:53:18] )
[14:53:51] Trizek: what does that mean?
[14:54:01] it's out of date?
[14:54:09] no edit in two minutes
[14:54:30] Yes, out of date.
[14:54:40] [I know the switchover is happening, and this is probably related to that, and possibly expected] I'm getting [VxZGHArAEFcAAHJWFkQAAAAE] /w/index.php?title=Special:RevisionDelete&action=submit JobQueueError from line 200 of /srv/mediawiki/php-1.27.0-wmf.21/includes/jobqueue/JobQueueFederated.php: Could not insert job(s), 5 partitions tried.
[14:54:43] on testwiki
[14:54:54] these are all related
[14:54:55] yes, job queue is not back yet
[14:54:58] the job queue has not been restarted yet
[14:55:02] only two edits on RC on fr.wp for example https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Modifications_r%C3%A9centes
[14:55:08] Both are Log entries.
[14:55:12] Ok, thanks
[14:55:38] and we're delaying that process as some database servers are overloaded and need weights adjusted
[14:59:54] all rc logs dead guys..
[15:00:37] tech guys already know
[15:01:51] yeah just read up :/
[15:02:27] logs work though (move/block/reviewabusefilter)
[15:02:34] job queue runners are starting up
[15:03:49] mark: excellent. :)
[15:04:05] Should I re-enable the central notice banners?
[15:04:33] not yet..
[15:04:57] unless 'starting up' takes another 49 minutes :P
[15:05:27] Looks like lots of admins are holding their "Delete" trigger XD
[15:06:02] i am the one of them lol
[15:07:34] yeah you might as well reenable the CN banners
[15:07:42] because RC still not working isn't as planned
[15:08:16] without rcfeeds, all the patrol tools will not work
[15:08:48] it's possible there's a lot of vandalism done during this time, impossible to find...
[15:09:03] yes :(
[15:09:16] lol revdel works though
[15:09:28] Main wikis have RC tools based on IRC. And IRC is back iiuc.
[15:09:43] IRC only shows logs...
[15:09:48] now
[15:10:25] Hi there, can someone delete this copyvio : http://v2c3.wmflabs.org/example_of_bvs.webm
[15:10:36] and what's "v2c3.wmflabs.org" ?
[15:10:54] job queue / recent changes / rc feed should all be back
[15:11:06] and why people can upload copyrighted movies on wmflabs.org servers ?
[15:11:18] yeah, they are back now, great
[15:11:18] Thibaut120094, find the person running v2c3 and ask them to delete it
[15:12:46] ok so we can upload copyvios on wmflabs.org, good to know.
[15:13:27] no one said that
[15:13:40] Thibaut120094: no, you can't. please don't
[15:15:52] Thibaut120094: v2c3 is probably video2commons. Steinsplitter can probably delete that.
[15:16:02] thanks, I'll ask him.
[15:16:10] there's a list of tools at http://tools.wmflabs.org/
[15:16:30] yay rcfeeds si back
[15:16:44] will do
[15:16:45] Partially, all edits aren't turning up yet.
[15:16:47] i'm not sure where to find a list of domains for each tool… there should be one
[15:16:52] JohanJ, what?
[15:17:10] yeah, its really hard to figure which domain is which
[15:17:19] tested, works, its live..
[15:17:40] Krenair: I'm looking at the recent changes page at Swedish Wikipedia. My edits aren't turning up there, but some are.
[15:17:45] JohanJ, you mean the edits made while RC was broken are gone from it?
[15:17:45] Or were we talking about another feed?
[15:17:47] yes
[15:17:49] that's known
[15:18:01] Krenair: OK, good.
[15:18:04] if new edits are missing, that's another problem
[15:18:08] I got pinged ('Swedish')...
[15:18:47] Ned to turn off stalkwords...
[15:27:02] <_joe_> if you're missing your _new_ edits from now on, that's unexpected
[15:39:24] Thibaut120094, looks like it was dealt with in https://phabricator.wikimedia.org/T133010
[15:42:17] yep
[16:45:46] hi
[16:46:17] since the server change, the left column changed on fr.wikisource
[16:46:29] the left column?
[16:46:47] yes
[16:47:28] there were 2 parts, 1 for reading, 1 for editing
[16:48:03] now, there is only one, Outils (Tools)
[16:48:18] yannf: can you take a screenshot?
[16:48:47] now yes
[16:48:59] i see this: https://i.imgur.com/HmmzTJC.png
[16:49:49] it seems to match the sidebar definition in https://fr.wikisource.org/wiki/MediaWiki:Sidebar
[16:49:58] but perhaps something got cached wrong during the switchover.
[16:51:31] it changes when I purge the page
[16:57:22] MatmaRex, https://imgur.com/qC0IwIQ after purge
[16:59:21] MatmaRex, https://imgur.com/1u6iqSb before purge
[17:08:49] MatmaRex, ?
[17:09:49] yannf: sorry, i was away for a bit
[17:10:13] interesting, i can see the bad sidebar when logged out, but i see the right one when logged in
[17:10:27] let's just file a bug. i can't really help with this myself
[17:10:31] i'll file it
[17:11:34] ok thanks
[17:11:51] MatmaRex, does it change when you purge the page?
[17:12:43] yannf: not when i'm not logged in
[17:12:52] weird
[17:16:03] yannf: https://phabricator.wikimedia.org/T133069 please add any details you have
[17:22:57] MatmaRex, I purge the Mediawiki page, and it is OK now
[17:23:30] yannf: people are apparently seeing similar problems at wikimediafoundation.org, it's being investigated
[17:24:05] ok
[17:24:55] bblack, paravoid: I can reproduce https://phabricator.wikimedia.org/T133069#2218908 in multiple incognito browser sessions
[17:24:59] re: sidebar, I guess the null-edit hack is because of some issue with pc/mc?
[17:25:16] is the pc/mc issue really identified/fixed?
[17:25:56] <_joe_> bblack: i'm not sure there is one
[17:26:15] <_joe_> it might well be, though
[17:26:29] well, on multiple sites the sidebar reverted from what it should be to defaults, and now people are null-editing /wiki/MediaWiki:Sidebar and then it starts looking good again
[17:26:48] I don't see any explanation for why it broke or why that's enough to fix it, yet
[17:26:53] that sounds a lot like PC
[17:27:08] <_joe_> yes
[17:27:19] PC == parser cache
[17:27:27] * bd808 tries to de-jargon
[17:27:49] <_joe_> but it was replicated from eqiad, so...
[17:28:09] <_joe_> i can only guess that pc contained the wrong version and memcached had the right one
[17:28:45] <_joe_> we had a brief window where all databases were readonly, that might explain it
[17:32:06] I think the ones I'm seeing may have something to do with redirects
[17:32:49] In the chrome debug tools I see a 200 response for the redir page and no page load for the actual redir target page
[17:33:09] and different x-cache info if I hard reload
[17:33:54] <_joe_> bd808: that phase with tons of "content already sent" errors could explain that then?
[17:34:19] bd808: MediaWiki redirects are done client-side with the HTML5 History API
[17:34:35] they just change the page URL, don't reload anything
[17:35:23] MatmaRex: ok. that kind of makes sense.
[17:35:46] but the different x-cache looks fishy -- https://phabricator.wikimedia.org/T133069#2219003
[17:36:58] * bd808 tries a purge on https://wikimediafoundation.org/w/index.php?title=Questions_for_Wikimedia%3F&redirect=no
[17:37:25] taht seems to have fixed it
[17:38:14] ok yes, what's confusing me is the meaning of 'redirect'
[17:38:19] MW redirects are not HTTP redirects
[17:39:19] you haven't seen a variance in the sidebar emitted for two different fetches of HTTP://.../wiki/Home
[17:39:22] "titles that point to #REDIRECT" seem to be cached with the old sidebar
[17:39:41] bblack: try https://wikimediafoundation.org/wiki/Main_Page as an anon
[17:39:55] you've seen a variance in the sidebar between the real /wiki/Home and the one internally-redirected-to by /wiki/Main_Page
[17:40:03] yes
[17:40:21] which are different cache objects, so the X-Cache output is unsurprising
[17:41:17] sure. So I guess this is just another page which needs a manual purge after the PC cache issue was fixed
[17:41:21] I don't know much about sidebars. are they ever static html, or is that Sidebar.js nulledit supposed to fix them all directly?
[17:42:06] they are transcludes at parse time I think. which is why I have needed to ?action=purge things
[17:42:21] yeah but that's not gonna scale if we have a general issue here
[17:42:32] people going around doing ?action=purge on every single page because they all have sidebars
[17:42:58] (or every single MW-redirect?)
[17:44:07] Here's a non-MW-redirect that is still showing the stale sidebar -- https://wikimediafoundation.org/wiki/Financial_reports
[17:44:29] that's due to varnish caching
[17:44:35] right
[17:45:27] so do we understand the nature of the real underlying problem? are we sure that MW (well pc/mc?) are not outputting any more bad sidebars?
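The "different cache objects" point at 17:39 to 17:42 can be illustrated with a toy URL-keyed cache. This is a rough Python sketch and not Varnish's actual behaviour or configuration; `PageCache`, `REDIRECTS`, `state` and `render` are made-up names, and the sidebar values are placeholders.

```python
# Hypothetical internal redirect: /wiki/Main_Page serves the content of /wiki/Home.
REDIRECTS = {"/wiki/Main_Page": "/wiki/Home"}

# Hypothetical backend state: which sidebar the backend currently renders.
state = {"sidebar": "default"}


def render(url):
    """Render a page with whatever sidebar the backend has right now."""
    target = REDIRECTS.get(url, url)
    return f"[{state['sidebar']}] content of {target}"


class PageCache:
    """Toy URL-keyed cache standing in for an HTTP cache in front of the wiki."""

    def __init__(self, render_page):
        self._render = render_page
        self._objects = {}

    def get(self, url):
        """Return the cached page for this URL, rendering it on a miss."""
        if url not in self._objects:
            self._objects[url] = self._render(url)
        return self._objects[url]

    def purge(self, url):
        """Drop the cached object for exactly this URL."""
        self._objects.pop(url, None)
```

If both URLs are fetched while the backend emits the default sidebar, and only `/wiki/Home` is purged after the backend is fixed, `/wiki/Main_Page` keeps serving the old sidebar: same content, separate cache object. That is why purging page by page does not scale when every page embeds the affected fragment.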
[17:45:37] so we have 2 questions I think: why was the PC cache stale for the MediaWiki:Sidebar page; and how can we clear the cached versions from Varnish now that the PC is in a better state?
[17:45:43] do we have to purge every page on every wiki with non-default sidebars to fix the caching? does it affect more things than sidebars?
[17:46:41] since every page has a sidebar, my default assumption is that this has potentially infected many pages, and there's no easy way to selectively purge them, maybe a hard way via datestamp range
[17:47:14] but before going down that kind of road, we really need to know the real problem is fixed. lack of null-edit wasn't the problem, it was a hack for one specific case of the problem.
[17:48:51] The mechanics of finding parser cache pollution are beyond my MediaWiki knowledge
[17:49:07] yeah
[17:49:14] mine too by far
[17:49:58] anomie: ^ any ideas on how to hunt for things in PC that are "stale"?
[17:50:32] it seems unlikely there's anything eqiad-specific about sidebars in general
[17:50:40] I wonder if this is not a broader problem than sidebars
[17:51:00] sidebars would just be a very visible manifestation
[17:51:12] and yeah I would doubt that it was only that
[17:52:03] Other local message overrides might be something to look for
[17:52:04] objectstore?
[18:11:54] um, yeah, just purge the sidebar cache?
[18:12:04] oh, I'm reading old scrollback
[18:12:33] bd808: sidebar isn't in the parser cache, it's in the sidebar cache
[18:13:04] legoktm: ah!
[18:13:10] which is sotred where?
[18:13:14] *stored
[18:13:18] $cache = ObjectCache::getMainWANInstance();
[18:13:18] $sidebar = $cache->getWithSetCallback(
[18:13:18] $cache->makeKey( 'sidebar', $this->getLanguage()->getCode() ),
[18:13:39] see Skin::buildSidebar()
[18:13:50] and wancache is redis?
[18:14:08] I think so
[18:14:38] AaronSchulz and Krinkle should know what the WANCache backend is, they invented that system
[18:16:38] RoanKattouw: WANCache is a PHP layer on top of BagOStuff MainCache
[18:16:41] So memcached
[18:16:42] legoktm:
[18:16:52] ok then
[18:17:00] With prefixed keys and some extra behaviours for multi-DC support.
[18:17:17] memcached wouldn't be replicated cross-dc right?
[18:17:24] Nope, it isn't.
[18:17:36] And we wiped memcached in codfw before the database switch
[18:17:50] The open question at the moment is how we ended up with default sidebars in (some? many?) projects
[18:18:13] and whether that is now "fixed"
[18:18:13] yeah that
[18:18:19] message cache lookup failed during a request so it fellback to the default and that got cached?
[18:18:32] that's probably not a decent behavior to have
[18:19:01] would there be other cases like this, where non-default things got replaced by defaults due to temporary redis connectfail?
[18:19:44] we did have all the redis config messed up for much of the cut-over period
[18:20:08] well I assume the defaulting on "message cache lookup failed" means e.g. connectfail, not no-such-key
[18:21:07] if my guess about the message cache lookup failing is correct, then any client-side message that RL caches?
[18:21:46] yeah this seems like something fairly broad in scope
[18:22:18] it's not generally safe for code pulling stuff from redis to just replace it with defaults (which might give a very broken UX) because of connectfail. it has to actually fail.
[18:22:45] that's like saying "we can't connect to mysql to get the article text, so just output 200 OK and put lorem ipsum in the content's place"
[18:23:49] isn't a wrong sidebar a little better than the site being down?
[18:24:04] not really
[18:24:53] if we knew what all the other things were in the class of sidebar like things for this purpose, maybe. but if this is general to the logic for anything we store in wancache, probably not. who knows what kind of fallout randomly pulling in defaults in place of customization has?
[18:25:13] but more importantly, if we emit bad content and call it silently good with 200 OK, it goes into varnish and gets cached for everyone
[18:26:03] if these had emitted 500 errors due to lack of redis connectivity, we'd have had a brief 5xx spike, we would've found the redis problem faster than we did, and the caches would remain unpolluted
[18:26:15] the not-so-great pattern here is taking the results of a parsed message and caching it in some other store
[18:26:43] this probably happens all the time in small corner cases we don't notice. random 1/100000 redis connectfail -> cache obscure article with wrong sidebar output
[18:26:54] which reminds me, has anyone complained about titleblacklist or spamblacklist not working? those rely on message caching too, but they would have expired and repopulated properly by now
[18:28:07] I guess it's a philosophy thing, defensive coding vs fail-fast
[18:28:38] do you know if there are any wikis still with broken sidebars?
[18:28:41] the defensive angle is do the best you can to hide the internal errors and make things look ok-ish. the fail-fast angle is all deep errors should become obvious high-level errors
[18:28:45] * legoktm nods
[18:28:56] fr.wikisource.org presumably, it was mentioned in the ticket
[18:29:59] hm, its fixed now aside from cached pages
[18:30:54] from my reading of MessageCache, there would have had to be a memcached failure, and then a failure to get the text out of external store too
[18:36:18] morning legoktm
[18:36:55] o/
[18:39:05] addshore: I think we need to purge the sidebar cache across all wikis for completeness, and then purge varnish entries from that timespan? bblack?
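The defensive vs fail-fast distinction from 18:18 to 18:28 can be sketched against a get-with-set-callback style cache. This is a simplified Python illustration of the hypothesis discussed above, not MediaWiki's WANObjectCache or MessageCache code; `CallbackCache`, `BackendUnavailable` and both `build_sidebar_*` functions are hypothetical.

```python
class BackendUnavailable(Exception):
    """Hypothetical error: the source of truth could not be reached."""


class CallbackCache:
    """Tiny in-process stand-in for a get-with-set-callback cache."""

    def __init__(self):
        self._store = {}

    def get_with_set_callback(self, key, callback):
        """Return the cached value, or compute, store and return it on a miss."""
        if key in self._store:
            return self._store[key]
        value = callback()  # if this raises, nothing is stored
        self._store[key] = value
        return value


def build_sidebar_defensive(cache, load_custom, default):
    """Hide the failure: the fallback is produced inside the callback,
    so the default gets cached as if it were the real sidebar."""
    def compute():
        try:
            return load_custom()
        except BackendUnavailable:
            return default
    return cache.get_with_set_callback("sidebar", compute)


def build_sidebar_fail_fast(cache, load_custom):
    """Let the failure propagate: the caller can answer with an error
    response, and no wrong value is cached."""
    return cache.get_with_set_callback("sidebar", load_custom)
```

With the defensive variant, one failed lookup leaves the default in the cache after the backend has recovered; with the fail-fast variant the first request errors out and the next one caches the real value.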
[18:39:19] would be nice :)
[18:39:41] I'm trying to track down a timeframe and then list of edits on wikidata that didn't get dispatched and see if we can do something about that too
[18:42:28] <_joe_> legoktm: yeah we had an overload on external storage
[18:42:34] <_joe_> and we had purged memcached
[18:42:43] <_joe_> so my hypothesis would be
[18:42:59] <_joe_> we had a clean memcached because consistency, by choice
[18:43:08] <_joe_> that added load on external storage, probably
[18:43:27] <_joe_> making them unavailable for a small period of time right after switching over traffic
[18:43:56] <_joe_> not properly unavailable, but there were a ton of errors for sure
[18:44:57] ok, makes sense then
[18:49:07] <_joe_> should we add these reasonings to the ticket?
[18:52:01] <_joe_> I did
[18:52:42] thanks, I'm figuring out how to purge a memcache key across all wikis
[19:07:06] running script now
[19:34:58] legoktm: thanks for jumping in on that sidebar issue. You're a superstar :)
[19:35:27] :)
[22:30:13] tgr: what implications does AuthManager have for OAuth? If there's any testing to do around the OAuth login flow for mobile devices, I'm happy to help out.
[22:30:46] logging in via OAuth on mobile is a bad experience right now.
[23:45:22] ragesoss_: AuthManager really won't be touching OAuth
[23:45:45] bd808: thanks
[23:46:17] its more about Special:UserLogin and the AuthPlugins behind that
[23:47:35] bd808: Is it going to land for the 1.27.0 cut (i.e., in the next 13 days)?
[23:47:42] (I hope so.)
[23:47:50] if the security review gets done
[23:47:56] and if not we will backport it
[23:48:05] it has to be in 1.27
[23:48:21] ragesoss: more gory details at https://www.mediawiki.org/wiki/User:Anomie/SessionManager_and_AuthManager if you care :)
[23:48:46] * James_F nods.