[00:01:26] we'd better go to read-only mode [00:02:50] sync-file not working for me [00:03:00] Reedy: can you try it? [00:03:38] * Reedy tries [00:03:47] !log reedy synchronized wmf-config/db.php 's1/enwiki into readonly' [00:03:58] Logged the message, Master [00:06:17] !log on hume: stopped all populateRevisionSha1.php processes with kill -STOP [00:06:27] Logged the message, Master [00:10:04] Too much revision table locking? [00:11:39] they've all caught up now [00:12:10] !log tstarling synchronized wmf-config/db.php [00:12:15] !log switched enwiki back to r/w [00:12:19] Logged the message, Master [00:12:29] Logged the message, Master [00:13:09] I don't think locking was the problem, since write queries on slaves run in a single thread [00:13:13] just too many slow write queries [00:13:31] maybe popularRevisionSha1.php was the culprit, maybe not [00:13:37] I saw it in a couple of show processlists [00:16:20] db32 seems rather busy [00:16:29] just gone back to over a minute lagged [00:17:26] it was the last one to catch back up [00:21:47] there's no obvious problems with it [00:21:58] it's the snapshot host so you'd expect the I/O load to be a bit higher on it [00:23:48] seems to be mostly watchlist-related queries [00:24:07] hmm, no, not anymore [00:24:19] maybe that bit of the drive was temporarily slow [00:24:42] anyway it can just depool itself in the normal way [00:25:17] I may as well drop its read load to zero [00:25:42] o hai [00:25:58] !log tstarling synchronized wmf-config/db.php 'reduce db32 read load to zero due to persistent lag' [00:26:08] Logged the message, Master [00:47:13] !log on nfs1: trying leap second fix suggested at https://bugzilla.mozilla.org/show_bug.cgi?id=769972#c5 [00:47:23] Logged the message, Master [01:48:26] !log kill -CONT on populateRevisionSha1.php processes [01:48:37] Logged the message, Master [02:28:49] !log LocalisationUpdate completed (1.20wmf6) at Mon Jul 2 02:28:48 UTC 2012 [02:29:00] Logged the message, Master [02:53:51] !log LocalisationUpdate completed (1.20wmf5) at Mon Jul 2 02:53:51 UTC 2012 [02:54:01] Logged the message, Master [07:30:52] hi [07:35:40] hi hashar [07:36:27] good morning :-] [08:14:14] !log hashar synchronized wmf-config/InitialiseSettings.php 'Bug 37457 - fix import sources for viwikibooks' [08:14:24] Logged the message, Master [08:14:50] !log hashar: srv190 srv266 srv281 timeouts on sync-file [08:15:00] Logged the message, Master [09:08:53] gerrit is very slow [09:12:02] Very slow is an understatement [09:12:53] https://bugzilla.wikimedia.org/38115 [09:13:08] It's written in java [09:14:25] Database is the problem. [09:14:33] "Due to high database server lag, changes newer than 436 seconds may not appear in this list. 
" [09:15:29] Actually if Gerrit is java, there might be another issue other than the db [09:26:41] @replag [09:26:43] Nemo_bis: [s1] db36: 2s, db32: 2s, db59: 2s, db12: 504s [09:26:51] hmpf [09:43:23] !log on fenari: fixed leap second issue with the mozilla method [09:43:33] Logged the message, Master [09:43:53] Nikerabbit: If you are nice and ask TimStarling, He might look at the gerrit speed issue (unless you have already raised it in -operations) [09:47:25] !log fixing leap second issue on formey,grosley,hooper,sanger,sockpuppet [09:47:35] Logged the message, Master [09:47:54] thanks [09:51:10] there's some in eqiad as well [09:52:31] including manganese, the gerrit server [09:52:56] !log fixing leap second issue on aluminium,gallium,manganese [09:53:06] Logged the message, Master [09:56:56] !log fixing leap second issue on virt1,virt2,virt3,virt4,virt5 [09:57:06] Logged the message, Master [10:04:37] Nikerabbit: gerrit should be better now [10:05:13] \o/ [10:05:17] I heard about that, TimStarling. [10:05:22] Caused problems with lots of websites [10:06:52] so is anyone aware of/working on the lag on db12? [10:07:41] @info db12 [10:07:41] Nemo_bis: [db12: s1] 10.0.6.22 [10:07:51] @replag [10:07:52] Nemo_bis: [s1] db12: 658s [10:08:16] that much! [10:08:17] mm [10:08:35] +150s in 20 min [10:08:41] no, 40 [10:44:50] who needs to be pinged to fix the replag? [10:45:42] nobody [10:47:49] !log fixed leap second issue on bastion-restricted [10:48:00] Logged the message, Master [10:49:37] ops [10:49:40] *opps [10:49:50] bastion-restricted = labs [10:50:23] thanks to both of you [10:50:43] !log fixing leap second issue on bastion1 by rebooting it [10:50:53] Logged the message, Master [11:53:22] db12 is getting fairly lagged [11:54:20] yep [12:06:29] PopulateRevisionSha1::upgradeRow [12:06:37] probably it doesn't care about db12 lag :) [12:08:03] it has a wfWaitForSlaves() [12:23:44] !log hashar synchronized live-1.5/CREDITS [12:23:54] Logged the message, Master [12:41:45] has db12 lag already been reported? [12:45:14] MaxSem: yes [12:45:25] MaxSem: not saying anyone is working on it though :)) [12:54:48] @replag [12:54:50] Nemo_bis: [s1] db36: 1s, db32: 1s, db59: 1s, db60: 1s, db12: 1302s [12:55:11] still not completely tragic [12:59:40] db12 is acting like my computer ATM [13:00:13] is that good? [13:00:20] No [13:00:50] But my computer is Windows but db12 isn't [13:01:57] Think a reboot is in order! :| [14:21:39] Reedy: would you mind looking at https://gerrit.wikimedia.org/r/#/c/13316/ please ? Adds a static-master docroot, I need that for labs which runs mediawiki version 'master' :-) [14:29:21] Reedy: danke [14:29:41] now I am going to deploy https://gerrit.wikimedia.org/r/#/c/13888/ [14:35:16] anyone going to fix db12 ? [14:35:46] I am looking at it but I don't know what's wrong :-( [14:36:32] we could switch whatchlist to another db [14:36:40] or stop the dump process and make it uses another db [14:37:03] I don't think it's the dumps. there are 7 other dump processes doing things [14:37:06] (for other dbs) [14:37:17] and no reports of lags anywhere else [14:37:25] @replag [14:37:29] Betacommand: [s1] db36: 10s, db12: 1651s [14:37:32] maybe the dump command that strike db12 is malformed ? 
[14:37:40] i's just mysqldump [14:37:48] or that is just because its hitting the huuuuuuugggeee enwiki templatelinks table [14:37:52] no [14:38:00] because this table has been going about an hour [14:38:04] other things went before that [14:38:32] a lot of things are backed up with [14:38:41] "Waiting for the slave SQL thread to advance position" [14:39:37] obviously, since there is lag [14:40:34] we could make enwiki r/o again for a while [14:40:58] ugh [14:41:23] beyond that, it's moving rc, watchlists etc to a different slave [14:41:40] Or hang on for Asher [14:41:45] should be about in 2-3 hours [14:41:49] ouch [14:43:01] I hate to read/only because that is global [14:43:20] yeah [14:43:30] if it gets a lot worse, to allow it to catch up, it might not be a bad idea [14:43:43] Did new dumps start around 06:00 UTC? [14:43:53] Cause i noticed enwiki ones weren't running last night [14:44:04] I started them this morning [14:44:08] I do em once a month every month [14:44:15] yeah [14:44:20] same old mysqldump [14:44:25] same boring order, same everything [14:44:25] there's a bump in wait cpu, and a bit more load [14:44:34] but per domas, it's not as if the host is overloaded [14:44:35] far from it [14:44:45] yeah, I was looking at that, not even one core [14:45:18] can you think of any special reason to NOT have jobs.wikimedia on wikimedia-lb ? [14:45:56] where's it served from? [14:46:06] jobs 1H IN CNAME wikimediafoundation.org. [14:46:27] we got a request to change it to wikimedia-lb [14:46:36] because...? [14:46:51] "fix SSL cert mismatches." [14:47:26] and yes, it appears this is true: "The response given at therequested new DNS entry is the same as the response given over HTTP(in the clear) or if you override/ignore the cert mismatch warning. Soyou should be able to just change the entry and forget about it." [14:48:28] Hmm I see [14:49:10] CNAME for wikimediafoundation.org , and then ServerName jobs.wikimedia.org in wikimedia.conf , redirecting to wikimediafoundation.org [14:52:15] Request: GET http://meta.wikimedia.org/w/index.php?title=Requests_for_comment/Dispute_over_Turkish_representation_at_Turkic_Wikimedia_Conference_2012&action=history, from 10.64.0.128 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [14:52:16] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 14:51:44 GMT [14:56:36] ToAruShiroiNeko, do you also have the same problems reported by JuTa on #wikimedia-commons? [15:03:59] !log authdns-update to switch jobs.wm redirect to wikimedia-lb to fix SSL cert mismatch (RT-3071) [15:04:09] Logged the message, Master [15:05:14] Request: GET http://zh.wikipedia.org/wiki/Special:%E7%9B%91%E8%A7%86%E5%88%97%E8%A1%A8, from 10.64.0.138 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () [15:05:15] Error: ERR_CANNOT_FORWARD, errno [No Error] at Mon, 02 Jul 2012 15:04:40 GMT [15:07:50] any idea why while I am hitting http://en.wikipedia.org/wiki/Main_Page I get a different server each time ? [15:08:54] oh it is cached by different squids :-) [15:18:17] !log hashar synchronized docroot/bits/static-master '(bug 37245) docroot 'static-master' for beta bits' [15:18:26] Logged the message, Master [15:19:53] deploying https://gerrit.wikimedia.org/r/13888 [15:20:02] which add the /etc/wikimedia-realm detection [15:20:29] test.wikipedia.org load fine [15:21:54] !log hashar synchronized wmf-config/CommonSettings.php '/etc/wikimedia-realm detection https://gerrit.wikimedia.org/r/13888' [15:22:03] Logged the message, Master [15:22:17] works!!! 
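The change hashar deploys just above (Gerrit 13888) makes the shared configuration branch on the "realm" it is running in, so the same wmf-config tree can serve both production and labs/beta. The actual patch lives in Gerrit; the snippet below is only a sketch of the idea, and the default value, the docroot paths and the $wmgStaticDocroot name are assumptions for illustration, not the real settings.

```php
<?php
// Sketch of realm detection, loosely after the idea of Gerrit change 13888.
// Only the /etc/wikimedia-realm path comes from the log; the default,
// the docroot values and $wmgStaticDocroot are invented for illustration.
$wmfRealm  = 'production';              // assumed safe fallback
$realmFile = '/etc/wikimedia-realm';

if ( is_readable( $realmFile ) ) {
    $wmfRealm = trim( file_get_contents( $realmFile ) );
}

switch ( $wmfRealm ) {
    case 'labs':
        // Beta runs MediaWiki 'master', so point it at the
        // static-master docroot mentioned earlier (bug 37245).
        $wmgStaticDocroot = '/usr/local/apache/common/docroot/bits/static-master';
        break;
    default:
        $wmgStaticDocroot = '/usr/local/apache/common/docroot/bits';
        break;
}
```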
[15:22:31] * hashar does some erratic dance [15:23:08] You sure this time? :P [15:23:21] yup I have a test [15:23:37] spent the last half hour or so to verify CommonSettings;php [15:23:47] whatever broke the stylesheets must be https://gerrit.wikimedia.org/r/#/c/13322/ ;-] [15:27:18] db12 lag is over 2000 [15:27:58] maybe it's a e3 test to see if this reduces edit warring and unproductive flames [15:47:20] ^demon: yet another gerrit bug, the font does not handle ascii art https://gerrit.wikimedia.org/r/#/c/13892/ [15:48:08] <^demon> If you submit that to bugzilla, I'm going to whack you over the head with a pipe. [15:48:09] <^demon> :) [15:48:23] I used libcaca from Sam Hocevar [15:48:38] and it renders fine for me :) [15:50:27] who broke gerrit again [15:50:36] not loading for me [15:50:49] and works again [15:52:59] ^demon: http://imgur.com/Cd25K :) [16:05:36] afternoon guys, any idea what's going on with enwp at the moment please? I'm accessing from a different machine right now, and seeing almost 35 minutes lag on the db. Any clues? [16:06:07] that's just in my own contribs, i've not seen what recent changes is carrying right now. [16:06:47] well we don't know what caused the replag [16:06:51] we do know it's dropping now [16:07:09] gonna ask asher to have a look when he comes in, I couldn't see what the issue was [16:07:49] apergos: that's good news. Last I saw was around 2100 seconds, my maths is crud but I'm shooting that 1800 seconds = 30 mins, so it was over that at least. [16:08:12] we're at about 2000 now [16:08:19] it was 2150 or so [16:10:07] ok then, well I'll drop out for now apergos, and keep a watch on it. I'll be back on later from home, so I'll drop in later if things are still slow. [16:10:28] cheers for the information anyways, hope it sorts out :) [16:10:29] see ya [16:29:05] On https://pt.wikipedia.org/wiki/Especial:P%C3%A1ginas_vigiadas I got the following:Request: GET http://pt.wikipedia.org/wiki/Especial:P%C3%A1ginas_vigiadas, from 10.64.0.127 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 16:27:37 GMT [16:31:41] Looks like I got a server hiccup. [16:37:35] @replag db12 [16:37:39] djhartman: [db12: s1] db12: 1763s [16:40:31] @replag [16:40:33] Nathan2055: [s1] db36: 6s, db12: 1727s [16:41:06] @help [16:41:06] Type @commands for list of commands. This bot is running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 1.8.2.4 source code licensed under GPL and located at https://github.com/benapetr/wikimedia-bot [16:41:15] @commands [16:41:16] Commands: channellist, trusted, trustadd, trustdel, info, configure, infobot-link, infobot-share-trust+, infobot-share-trust-, infobot-share-off, infobot-share-on, infobot-off, refresh, infobot-on, drop, whoami, add, reload, suppress-off, suppress-on, help, RC-, recentchanges-on, language, infobot-ignore+, infobot-ignore-, recentchanges-off, logon, logoff, recentchanges-, recentchanges+, RC+ [17:41:03] Request: GET http://cs.wikinews.org/wiki/Kategorie:Afgh%C3%A1nist%C3%A1n, from 10.64.0.138 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () [17:41:07] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 17:40:40 GMT [17:59:16] Reedy: ready to deploy? [18:00:17] I spoke with Jamesofur about the Shop link extension. 
it's ready to deploy to enwiki, but is turned off by default [18:01:13] https://bugzilla.wikimedia.org/show_bug.cgi?id=37979 [18:01:29] yeah [18:01:39] x2 [18:02:10] Reedy: enwiki? [18:02:51] er [18:02:57] we still have replag on one slave [18:03:08] 598 db12 [18:03:17] it's been dropping for a few hours but [18:03:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.20wmf6 [18:03:30] heh [18:03:35] Logged the message, Master [18:04:28] apergos: these deploys have generally been non-events for the most part. I'm probably jinxing it now, but I wouldn't worry about this compounding any replag problmes [18:04:50] well I was thinking that replag might interfere [18:04:58] rather than the other way around... with users reporting issues [18:05:08] anyways as long as you have it in mind [18:05:20] k...thanks for the heads up! [18:05:24] yep [18:06:31] a lot of ooms appearing on 1.20wmf5 [18:06:42] obviously irrelevant [18:07:49] ignoring those, nothing new [18:09:29] ooms? [18:09:34] ooms for the poor [18:09:37] out of memory [18:09:39] :-D [18:09:43] nice one rob la [18:09:44] :) [18:10:37] alright...the rest in one fell swoop? [18:10:46] might aswell [18:10:56] swoop away [18:11:45] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: 273 wikipedias to 1.20wmf6 [18:11:54] Logged the message, Master [18:15:50] all still looks fine [18:16:05] just waiting for the wmf5 errors to fully go away [18:16:26] be nice to get this cldr noise shut up [18:17:09] cldr tldr [18:19:30] actually, cldr makes me think of Billy Mays commercials for CLR [18:19:43] 'ello peeps. Can I persuade someone to put a priority on this to keep the Greeks from running riot? https://bugzilla.wikimedia.org/show_bug.cgi?id=37608 [18:19:48] Justin case this is useful, I got the following error two times: Request: GET http://pt.wikibooks.org/w/index.php?title=WikiRPG&curid=5421&action=history, from 10.64.0.123 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 18:18:23 GMT [18:20:19] Reedy: I've checked most of the wikis, and they look fine. I say we deploy the Shop link extension and call it good [18:20:26] brianmc: does it fix the eurozone crisis? [18:20:47] no [18:21:06] nothing can fix that [18:21:24] * robla looks if it'll make the Germans mad, which might at least satiate some rioting [18:21:55] hmmm...I don't think the Germans will care [18:22:06] not if evil Finns like Nikerabbit continue to oppose even the slightest agreements [18:23:18] why was I highlighted? [18:23:37] !log reedy synchronized wmf-config/ 'Enable WikimediaShopLink on enwiki' [18:23:43] <^demon> Nikerabbit: Because you're Finnish, I suspect. [18:23:48] Logged the message, Master [18:24:06] PHP Warning: Invalid argument supplied for foreach() in /usr/local/apache/common-local/php-1.20wmf6/extensions/Collection/Collection.suggest.php on line 359 [18:24:10] Appearing for wmf5 also [18:26:39] Request: POST http://www.mediawiki.org/w/index.php?title=MediaWiki_1.20/Roadmap&action=submit, from 10.64.0.130 via cp1006.eqiad.wmnet (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno [No Error] at Mon, 02 Jul 2012 18:26:25 GMT [18:27:01] brianmc: is there any particular reason not to just deploy the enwiki configuration to *.wikinews.org? [18:27:21] needs some work with google webmaster tools also, IIRC [18:28:00] oh, so each wiki needs someone to go in and futz with the web tool? 
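The Collection warning Reedy quotes above ("Invalid argument supplied for foreach()") is PHP's generic complaint about iterating over something that is not an array, typically a false or null returned by a failed lookup. Without looking at the real line 359, the usual defensive pattern looks like the sketch below; the function and variable names are invented and this is not the actual Collection.suggest.php code.

```php
<?php
// Illustrative only: the generic pattern behind "Invalid argument
// supplied for foreach()" warnings. Names are invented; this is not
// the actual Collection.suggest.php code.
function renderSuggestions( $suggestions ) {
    // Callers may pass false or null when a lookup failed, so check
    // before iterating instead of letting foreach() emit a warning.
    if ( !is_array( $suggestions ) ) {
        return '';
    }
    $out = '';
    foreach ( $suggestions as $title ) {
        $out .= "<li>" . htmlspecialchars( $title ) . "</li>\n";
    }
    return "<ul>\n$out</ul>\n";
}

echo renderSuggestions( false );                           // ''
echo renderSuggestions( array( 'Main Page', 'Sandbox' ) ); // two-item list
```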
[18:28:54] I believe so [18:29:03] to tell google where the sitemap is [18:32:17] urgh [18:33:01] Yep, it needs webmaster tools access. [18:33:40] And, sadly, can't be deployed on *.wikinews - not all versions have introduced peer-review enforced with FlaggedRevs. [18:34:19] Reedy: could you review and push https://gerrit.wikimedia.org/r/13900 ? [18:34:45] !log reedy synchronized wmf-config/CommonSettings.php [18:34:56] Logged the message, Master [18:35:55] robla, you can deploy about 99% of the en.wn config to Greek Wikinews, but the critical part is persuading Google to recognise it. [18:37:39] brb [18:38:06] okee doke [18:41:25] I've no idea who has close contacts with Google neuz these days. Erik put me in touch with someone there about 3-4 years ago when we first started having 'editorial control' that exceeded their criteria. That person has been "promoted" a little bit. But, they were pretty good at sorting things out. [18:42:17] if this is a blocker for other things I can poke my contact there, I have a pending question anyways [18:42:51] if it can wait, let me know [18:43:00] It isn't a 'blocker' for anything, but if you've a google contact then a friendly "who do we speak to?" email is all it'll take. [18:43:14] ok. give me the one sentence summary? [18:43:21] (I"m on a phone meeting so only half here) [18:43:45] who do we speak to in order to get google to... .X" ... what is the magic incantation that goes there? [18:44:20] "Greek Wikinews have implemented an independent editorial review process; they're really keen to get articles in the main Google News Greek language index, who do we speak to?" [18:44:36] ok. is this true for any other wikinews project or only them? [18:45:28] en.wn have it, sr.wn and es.wn too. (English, Serbian, and Spanish). [18:45:43] En is already listed. [18:46:36] If you want to 'pass the buck', I'm happy to spend the time chatting over the non-technical aspects - brian.mcneil@wikinewsie.org [18:47:04] I am getting errors all over the place. [18:47:06] Request: GET http://meta.wikimedia.org/w/index.php?search=Editor+Survey+2011&title=Special%3ASearch, from 10.64.0.135 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [18:47:06] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 18:46:43 GMT [18:47:46] Request: GET http://en.wikipedia.org/wiki/Special:Watchlist, from 10.64.0.138 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () [18:47:47] * helderwiki too: Request: GET http://pt.wikipedia.org/wiki/Especial:P%C3%A1ginas_vigiadas, from 10.64.0.132 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 18:47:21 GMT [18:47:48] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 18:47:14 GMT [18:47:57] I think things are broken :P [18:48:55] FastLizard4: are these resolved after reloading ? [18:49:01] we are having some network issues at the moment [18:49:04] Yes [18:49:08] A reload works [18:49:10] +1 [18:51:54] cool … :( [18:51:55] brianmc: Google News has some web form to submit sites to, don't remember where it is though. [18:53:36] LeslieCarr: yeah, they've been intermittant for the afternoon [18:53:42] brianmc: the process is completely untransparent and can take months (or years even) to get approved, depending on the whims of whoever reviews the site. [18:53:50] robla, apergos, I've no problem if you simply point someone from Google in my direction. 
It'd be great to get the tech details documented on the greek wikinews 'bug', but the more 'human' aspect of being sure there's a proper peer-review process in place is my baby. The web form is meh, verging on okay, but assumes you're proffering a TLD. We're not, kaldari. Implementation of FlaggedRevs is language-by language. [18:53:52] yeah… i have a feeling it's due to the routers being borked [18:53:55] which i am working on [18:54:53] brianmc: he's pointing me to http://support.google.com/news/publisher/bin/answer.py?hl=en&answer=40787 [18:54:58] so I assume that's not helpful? [18:55:53] can anybody tell me how caching works in case pages contain user specific content? [18:56:10] kaldari, it's not what you know, it's who you know. apergos, it is 'almost' helpful. [18:56:25] Give your contact my email address. [18:56:46] Daniel_WMDE: how is the user specific content being generated? JS ? [18:56:54] Danwe_WMDE: logged in users don't get cached pages [18:57:50] 500s on mediawiki.org [18:58:00] brianmc: Good luck. We have just as much trouble getting the time of day from anyone at Google as you do :P [18:58:19] brion: no more info? [18:58:26] Request: GET http://www.mediawiki.org/wiki/Special:Contributions/Brion_VIBBER, from 10.64.0.141 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [18:58:26] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Mon, 02 Jul 2012 18:57:39 GMT [18:58:40] seems to have self-resolved [18:58:43] …so far [18:58:51] brianmc: see pm [18:58:51] I've been getting those all morning :) [18:58:58] I dunno. Getting help out of these guys has, in my experience, been a matter of respecting their role as gatekeepers and justifying your listing. [18:59:54] brianmc: and finding someone who will return your emails [19:01:45] I'm just hoping to satisfy my curiosity here and be educated: how come the Ganglia landing place shows up looking fine while there is replag? [19:02:21] Why wouldn't it? [19:03:36] because replag is bad(tm) and that would usually concide with high load somewhere that would show up on Ganglia? [19:03:50] replag on en.wiki is not a general worldwide famine [19:04:01] No, replag doesn't have to co-incide with high load [19:04:01] it's just a problem for en.wiki [19:04:40] squids down? [19:04:43] Replag is monitored by nagios [19:04:45] ah, so the causes would be below the ganglia 'problem' threshold? [19:05:14] you'd have to drill down to see that info [19:05:15] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20pmtpa&h=db12.pmtpa.wmnet&v=13&m=mysql_slave_lag&r=hour&z=small&jr=&js=&st=1341255854&vl=secs&ti=mysql_slave_lag&z=large [19:07:09] Reedy, so if I understand this properly: the en servers are too small an impact to the general server park health to make a dent in any of the clusters health as displayed on the Ganglia landing page? [19:07:22] no [19:07:36] :( [19:07:38] the landing page just doesn't show that info [19:07:44] ah [19:07:49] indeed [19:07:58] and we don't use ganglia to monitor the replag metric [19:08:13] such aggregated info is easier to find on https://gdash.wikimedia.org/ nowadays [19:08:27] for us non-techies [19:08:52] nothing about replag though, unless i'm mistaken [19:09:45] Reedy, wouldn't it make sense for the replag to be either caused by high IO load (monitored by Ganglia) or network load (monitored by Ganglia)? 
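Part of the answer to the Ganglia question is that replication lag is a per-slave replication metric, not a host-load metric: MediaWiki's load balancer reads it directly and routes reads away from lagged slaves, which is also what "reduce db32 read load to zero" earlier in the day amounted to. Below is a rough sketch of the kind of structure wmf-config/db.php manipulates; the host names, weights, user and the 30-second threshold are invented examples, and the fragment only makes sense inside MediaWiki's configuration.

```php
<?php
// Rough sketch of the sort of structure wmf-config/db.php manipulates.
// Host names, weights, user and thresholds are invented examples.
$wgLBFactoryConf = array(
    'class'        => 'LBFactory_Multi',
    'sectionsByDB' => array( 'enwiki' => 's1' ),
    'sectionLoads' => array(
        's1' => array(
            'db36' => 200,
            'db59' => 200,
            'db60' => 200,
            // "reduce db32 read load to zero": a weight of 0 keeps the
            // slave replicating but sends it no read queries.
            'db32' => 0,
        ),
    ),
    'serverTemplate' => array(
        'dbname'  => 'enwiki',
        'user'    => 'wikiuser',
        // Slaves lagged beyond this many seconds are skipped for reads
        // ("it can just depool itself in the normal way"), no matter how
        // idle the host looks in Ganglia.
        'max lag' => 30,
    ),
);
```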
[19:10:05] It could be [19:10:05] for instance https://gdash.wikimedia.org/dashboards/jobq/ at last provides whining users a weapon to complain about mass edits despite https://en.wikipedia.org/wiki/Wikipedia:PERF [19:10:10] feel free to do something actually useful btw, I'm really just satisfying my curiosity/learning something [19:10:14] nothing mission critial ;) [19:10:21] Or someone makes a schema change, which isn't shown on either of those places will cause the replag [19:10:55] that would go hand in hand with fairly massive disk IO wouldn't it? [19:11:14] Maybe [19:11:15] Maybe not [19:13:06] well, at least some hardware or network metric metric measured by Ganglia, wouldn't it? [19:16:13] https://upload.wikimedia.org/wikipedia/commons/d/d8/Wikimedia-servers-2010-12-28.svg -- why don't you use Varnish instead of the many squids? [19:16:13] :-) [19:16:13] Varnish is MUCH more efficient [19:17:44] I see you are already using it at bits.wikimedia.org [19:17:52] ToBeFree, have fun: http://markmail.org/search/?q=varnish%20list%3Aorg.wikimedia [19:17:56] Unfortunately, you just can't switch something that in 5 minutes [19:18:13] Reedy: so that's it? just disabling the cache whenever {{int}} or other evil stuff is being used? [19:18:48] Someone just told me there is some caching and I should ask here but so I assume there actually isn't [19:18:54] There is [19:18:59] Anything anonymous users see is cached [19:19:11] At the squid (so html) level [19:19:13] yes, because they fall back to the sites language [19:19:27] ToBeFree: we are working with varnish ab on testing improvements to streaming support and the persistent storage backend which need to be complete and stable before we can use it everywhere [19:19:28] what if they use uselang ? [19:19:40] it's a different url [19:19:49] so it'd be parsed by apache, and cached at that url [19:20:28] binasher: thank you for the information :-) [19:20:47] Reedy: so for annonymous users all the uselang variants are cached by the squid? [19:21:00] they would be, if they were used [19:21:10] I can't imagine it's too common of a usecase for anon users [19:21:24] and the logged in users don't even have parser cache in case int is used? [19:21:42] That would sound potentially right [19:22:12] but for pages where it isn't used they get the parser cached version... I guess I'll take a look at the 'int' parser function. Thanks. [19:33:41] Still getting ERR_CANNOT_FORWARD errors, BTW. [19:53:35] damn. missed him by 3 minutes.. [20:05:00] Our servers are currently experiencing a technical problem. again and again [20:14:39] Frakir: known issue [20:40:19] thedj: ^^ [20:40:33] wait, depends who you were after... [20:41:04] Reedy: already taken care of :D [20:57:13] bed time. have a good night [21:16:33] AaronSchulz: is the job_token uuid only intended to mark that a job has been popperd off by a single runner? [21:17:47] that it has been claimed, yes [21:21:33] binasher: btw, would it be sacrilege to store 128bit uuids as binary(16)? :) [21:24:02] ewwww binary fields [21:29:06] evening guys :) apergos - did you manage to get to the bottom of the lag on db12 earlier? I know you were sort of in the dark when I spoke to you a few hours back. [21:33:51] AaronSchulz: that should be fine [21:40:44] AaronSchulz: about deleting duplicate jobs by job_sha1 - is there any chance there will occasionally be lots (>10k) of one job? [21:40:44] suddenly... 
I feeling slow speed from upload [21:42:44] binasher: I can't think of any case where that happens [21:43:29] * AaronSchulz goes afk [21:45:07] I assume "it's just me"? [21:46:07] http://paste.debian.net/177473/ [21:46:24] esams don't like me atm [21:47:27] er guys, the ERR_CANNOT_FORWARD is back. [21:47:30] ERR_CANNOT_FORWARD [21:47:31] yes [21:47:32] Servers are down [21:47:37] jo [21:50:16] And things are MIA again - just to confirm :) [21:58:59] And right back down again [21:59:04] And down again [21:59:23] I'm getting a 502 bad gateway [21:59:24] I've been down for the last 15 minutes [21:59:27] to whomever if may concern [21:59:29] it [21:59:42] It came back up for a while but now it's back down [22:00:30] yikes [22:00:50] looks like the ops kids are working on something [22:01:17] bsod [22:01:34] i quote, "network fucked due to routers" [22:01:44] lol [22:01:45] they're working on it :) [22:01:53] ... [22:01:58] brion, that's quite possible, nagios' history shows that the servers entered a "period of scheduled downtime" at 21.49GMT, about 12 mins ago [22:02:11] brion: Your network would be really fucked if you didn't have any routers anymore ;-) [22:02:13] the server: I'M MELLLLLLLLLLLLLLTINNNNNNNNNNNG [22:02:16] *servers [22:02:19] heh [22:02:24] brion, add the entry to the SAL? [22:02:27] the "server" ツ [22:02:30] seems they forgot to note it :) [22:02:46] wikidown wiki down :( [22:02:54] they are running WM out of an 486 in brions basement [22:02:56] !log fun with routers in tampa, wikis down [22:02:59] see if that bot still works [22:03:07] Logged the message, Master [22:03:43] How about the commonest response from tech support, brion... "Have you tried switching it off, then back on again?" :P [22:03:44] {{sofixit}} [22:03:47] * AzaToth hides [22:04:00] is it plugged in? [22:04:03] lol [22:04:09] * definitley [22:04:12] ah, and I was going to report my latest error message. I see everyone is already busily trying to fix things. [22:04:13] did you connect the mains cable? [22:04:26] is the power-switch in "on"-position? [22:04:28] Have you switched it on yet? @_ [22:05:11] brion: I hope it's not the waffles now again... [22:05:25] everybody sacrifice waffles to the ops gods [22:05:50] whom is the ops god? [22:06:04] ya see, this is what happens when you try to do your eggs sunny side up on the server PSU :P [22:06:09] i'm not an expert on the router talk in there, but it sounds like a couple of vlans aren't talking to each other, hence inability for squids and app servers to communicate [22:06:13] hehe [22:06:21] :) [22:06:30] * DarkoNeko sacrifices some cookies [22:06:39] BarkingFish: better to do them scrambled? [22:06:52] i'm seeing some working things now, that's a good sign [22:07:11] AzaToth, do them as raisin pancakes, they take up less space :) [22:07:30] ok, the site is back, but I forgot what I was doing :P [22:07:46] I still can connect with fenari, so it shouldn't be totally hopeless [22:07:58] * AzaToth wonder if you can fry bacons on a server psu... [22:08:21] frwiki's back online [22:08:39] (usually, it's thevery moment I say that that everything comes back down again) [22:09:14] DarkoNeko: then: Shut up! ;) [22:09:20] alright, alright [22:09:40] with ops like these, who needs hackers to take us down ? /me runs like hell [22:10:32] looks like things are perking up now, seeing some things coming back and popping up as OK in the history, payments2, wiktionary-lb@esams... 
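On the earlier binary(16) question: packing a UUID's 32 hex digits into 16 raw bytes halves the column and index size, at the cost of a conversion at the application boundary. A minimal sketch of that conversion is below; the table context and the example token are arbitrary.

```php
<?php
// Convert between the textual UUID form and a 16-byte value suitable
// for a BINARY(16) column (e.g. something like a job token).

function uuidToBin( $uuid ) {
    // Strip the dashes, then pack the 32 hex digits into 16 raw bytes.
    return pack( 'H32', str_replace( '-', '', $uuid ) );
}

function binToUuid( $bin ) {
    $hex = bin2hex( $bin );
    // Re-insert the dashes in the 8-4-4-4-12 layout.
    return sprintf( '%s-%s-%s-%s-%s',
        substr( $hex, 0, 8 ),
        substr( $hex, 8, 4 ),
        substr( $hex, 12, 4 ),
        substr( $hex, 16, 4 ),
        substr( $hex, 20, 12 )
    );
}

$token = '0f8fad5b-d9cb-469f-a165-70867728950e';
$bin   = uuidToBin( $token );
var_dump( strlen( $bin ) );               // int(16)
var_dump( binToUuid( $bin ) === $token ); // bool(true)
```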
[22:11:19] enwp live [22:12:18] BarkingFish: it's "OK" until it's timed out again [22:13:38] on the contrary AzaToth - i'm seeing lots more stuff coming up green :) [22:13:43] thx ops. good recovery [22:14:19] gn8 folks [22:14:24] n8 DaBPunkt :) [22:16:12] < !log asher synchronized wmf-config/db.php 'temp pulling db36' [22:25:56] Logged the message, Master [22:43:48] How are we going guys? Everything fixed up and live yet? [22:44:20] Nothing was exactly broken [22:44:33] well except the routers [22:44:37] so they are now sort of working [22:44:42] Depends which issue we're talking about ;) [22:44:53] It's wikipedians faults for wanting old revision entries to have populated sha1 hashes [22:44:58] oh [22:45:00] yes [22:45:17] :D [22:45:54] LeslieCarr: Wikipedians are always to blame ;D [22:45:58] um yes [22:46:03] wikipedians killed the routers [22:46:07] why did you do that wikipedians ? [22:46:14] 00.00 [22:46:52] This is wierd. I'm seeing a large mass of red ! on nagios history, most of it coming from db's. [22:47:10] * BarkingFish puts the bunker up and prepares for things to start falling again. [22:47:47] ? [22:48:00] NTP issues [22:48:15] well, mostly [22:48:18] yeah, i just noticed that. I hate NTP. [22:48:24] Time sucks [22:49:43] Presumably leap second fallout [22:49:54] the only other one i see that I can understand is search pool 3 being out, no route to host [22:50:03] http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html [22:50:10] Here's how Google did theirs [22:50:22] i don't know anywhere near as much as you guys, I just spot shit and figure it out from what I can make of it. [22:50:52] The java related ones were fixed, as were the nfs ones [22:51:23] binasher: fancy fixing hte NTP CRITICAL: Offset unknown errors on the db hosts? [22:51:24] "We modified our internal NTP servers to gradually add a couple of milliseconds to every update, varying over a time window before the moment when the leap second actually happens. This meant that when it became time to add an extra second at midnight, our clocks had already taken this into account, by skewing the time over the course of the day." [22:52:17] sounds a lot of hassle ;) [22:53:26] Reedy: it doesn't need fixing [22:53:50] just nagios being fail? [22:55:25] tbh, i really can't see the issue. so there was one extra second in the day. big whoopee :) I don't see this much hassle when there's a whole extra flaming day in the year :P [22:57:33] It's because of the specific way leap seconds are implemented [22:58:13] The clock is set back one second so the same second happens twice [22:58:52] And it's not too surprising that there's code that assumes the clock always goes forwards [22:59:00] so what exactly happens if that is not done? Does the world implode or something? Do we all catch flu? 
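The Google approach quoted above can be made concrete with a little arithmetic: spread the extra second over a window before midnight, so that each NTP update lies by a slowly growing offset that reaches exactly one second at the leap. The sketch below uses a plain linear ramp for clarity; the real smear implementations use a smoother curve, and all the numbers here are illustrative.

```php
<?php
// Illustrative leap-"smear" offset: absorb one extra second over a
// window of $window seconds ending at the leap (midnight UTC).
// A linear ramp is used for clarity; only the idea matches the quote.

function smearOffset( $secondsUntilLeap, $window = 86400 ) {
    if ( $secondsUntilLeap >= $window ) {
        return 0.0;   // too early: clocks untouched
    }
    if ( $secondsUntilLeap <= 0 ) {
        return 1.0;   // leap has passed: the full second is absorbed
    }
    // Fraction of the window already elapsed, times one second.
    return ( $window - $secondsUntilLeap ) / $window;
}

// Ten hours, one hour and one minute before the leap second:
printf( "%.4f s\n", smearOffset( 10 * 3600 ) ); // 0.5833 s
printf( "%.4f s\n", smearOffset( 3600 ) );      // 0.9583 s
printf( "%.4f s\n", smearOffset( 60 ) );        // 0.9993 s
```

Because the clock never steps backwards under a smear, code that assumes monotonically increasing time (the futex/hrtimer-style breakage seen earlier in the day) never notices the leap at all.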
[22:59:02] :P [22:59:25] GMT and UTC get very slightly out of sync [22:59:30] It seems like an awful lot of cack over something so miniscule it barely gets noticed in passing :P [22:59:39] Per Tim's blog post, this could also be fixed by adding a leap minute every 600 years [23:00:17] Sorry, a leap hour [23:00:27] * AaronSchulz looks at http://us.php.net/manual/en/function.uniqid.php [23:01:00] See https://blog.wikimedia.org/2012/07/01/search-restored-after-leap-second-bug/ and https://en.wikipedia.org/wiki/Leap_second [23:01:16] or we could just abandon earth-centric time [23:01:23] use monotonically increasing unix timestamps FOREVER [23:03:54] why don't we all adopt a universal standard, like Outerworlds.com did, brion? We just set our servers to "Virtual Reality Time" (GMT-2H, Azores, Mid-Atlantic) - and we wouldn't have to worry about GMT :P [23:04:02] RoanKattouw: heh [23:04:16] http://en.wikipedia.org/wiki/Swatch_Internet_Time [23:04:42] Yeah, that would do it. Just use beats instead of GMT or UTC or whatever you want to call it :) [23:09:32] much easier now, it's @8, i'll be off @41. Just direct everyone to the Swatch website and we can forget messing then, brion :P [23:09:39] hehe
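For what it's worth, the ".beats" being joked about at the end are trivially computable: Biel Mean Time is UTC+1, the day is split into 1000 beats of 86.4 seconds each, and PHP exposes this directly as the 'B' format character of date(). A small sketch:

```php
<?php
// Swatch Internet Time: the day (in UTC+1, "Biel Mean Time") is split
// into 1000 ".beats" of 86.4 seconds each.

function swatchBeats( $unixTime ) {
    // Seconds since midnight in UTC+1.
    $bmtSeconds = ( $unixTime + 3600 ) % 86400;
    return (int)floor( $bmtSeconds / 86.4 );
}

$now = time();
printf( "@%03d (manual)\n", swatchBeats( $now ) );
// Built into PHP's date() as the 'B' format character:
printf( "@%s (date('B'))\n", date( 'B', $now ) );
```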