[00:03:08] did it with command line, but odd that the mirror option isn't in the dialogs [00:11:36] James_F: all good? [00:12:27] Working (partially) on beta for me [00:12:41] At least it's better than before [00:12:42] partially? [00:12:54] I think...hang on [00:13:20] Asking in -ve [00:14:46] greg-g: It may be our fault. [00:15:19] "good" [00:15:43] Well yeah. [00:25:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 498370 bytes in 9.440 second response time [00:25:57] (03PS2) 10Ori.livneh: hhvm: abstract out backports to a hhvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/123573 (owner: 10Hashar) [00:30:59] (03CR) 10Ori.livneh: [C: 032] hhvm: abstract out backports to a hhvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/123573 (owner: 10Hashar) [00:40:37] (03PS2) 10Ori.livneh: performance.wikimedia.org: use ScriptAlias rather than ScriptAliasMatch [operations/puppet] - 10https://gerrit.wikimedia.org/r/122868 [00:41:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:33] (03CR) 10Ori.livneh: [C: 032] performance.wikimedia.org: use ScriptAlias rather than ScriptAliasMatch [operations/puppet] - 10https://gerrit.wikimedia.org/r/122868 (owner: 10Ori.livneh) [00:48:12] PROBLEM - puppet disabled on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:12] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:48:12] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:12] RECOVERY - puppet disabled on labstore1001 is OK: OK [00:48:12] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.2 (protocol 2.0) [00:48:12] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [00:55:07] greg-g: OK to deploy to phase0 then? [01:08:36] !log krinkle synchronized php-1.23wmf21/resources 'I6e93d9ab0e4a926c09c' [01:08:44] Logged the message, Master [01:09:01] James_F: Krenair: ^ [01:14:31] ori: Do you know where bits.wikimedia.org is maintained? the performance.wm.o one is in puppet [01:16:23] is there somewhere I can see all our git repos? I know I have found a page in the past [01:16:31] or thought I knew it [01:16:37] chasemp: https://git.wikimedia.org/ [01:16:44] chasemp: https://github.org/wikimedia [01:16:51] chasemp: https://github.com/wikimedia [01:16:54] chasemp: https://gerrit.wikimedia.org/r/#/admin/projects/ [01:17:11] the latter one is most complete, but also hardest to use :) [01:17:31] Oh, it has gitblit links now [01:17:32] thank you, cool [01:17:32] nie [01:17:33] nice [01:17:53] ori: Hm.. bits index is in wmf-config/../docroot/bits [01:17:54] interesting [01:17:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 486675 bytes in 7.878 second response time [01:18:17] bd808: btw, any idea what's up with gitblit/antimony claiming critical/recovery in a loop constantly [01:19:19] Krinkle: Nope. I haven't been paying attention to it. I did see some folks complaining about git.wm.o being down for them a bit ago.
[01:36:26] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:58:52] Krinkle: it's extremely slow, and its response times often exceed the generous threshold set by the alert script [01:59:26] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:59:39] OK, so it's genuine. Not like the icinga test is using the wrong hostname or using a different lvs than external requests [01:59:40] Krinkle: I proposed a solution upstream that they think is sensible: [01:59:49] ori: looks suspicious because it says 'gitblit.wikimedia.org' [02:00:27] tl;dr: putting varnish in front and writing a gerrit stream-events subscriber that purges pages [02:00:48] but not enough hours in the day :/ [02:01:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [02:02:35] Krinkle: it's checking the host using its internal address, but using a host header to make sure the right vhost is served [02:02:47] Sounds like you need, *puts sunglasses on*, +2. [02:02:49] Gloria: Behave, please. [02:03:27] :) [02:20:22] did gerrit die again? [02:22:45] Is gerrit down and why? [02:24:51] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 02:24:51+00:00 [02:24:57] Logged the message, Master [02:31:51] because of PiRSquared [02:31:54] j/k [02:32:17] Bsadowski1: obviously it's your fault [02:32:38] seems to load, but slowly [02:36:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:36] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:26] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:50:26] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 02:50:26+00:00 [02:50:30] Logged the message, Master [03:12:24] ori: here [03:12:53] OK, so 5xx resps are spiking, mediawikiwiki has missing messages all over its interface, and db1047 is flapping [03:13:14] springle, ping re: db1047; we'll look at the other stuff [03:14:21] the mw error count graph doesn't show a current spike of fatals or exceptions, so the 5xx must be generated in varnish [03:15:33] hmph, I can't use this on eval.php to get the debug log output: > $wgDebugLogFile = '/dev/stderr' [03:15:52] what are you trying to do? [03:16:18] 5xx resps don't seem abnormally high, pace the alert, so that makes the missing messages the top priority [03:16:26] (looking at http://gdash.wikimedia.org/dashboards/reqerror/ ) [03:16:42] I'm trying to get debugging output from wfMessage( 'pagetitle' ) [03:16:43] ori: ok great [03:16:56] since you're looking at this, I might step away [03:17:01] I'm clearly very rusty on the shell front [03:18:54] i'm just going to post a message on Project:Support_desk saying we're aware of the issue and may let it stand for a bit as we diagnose [03:23:16] {{done}} [03:23:17] ori: do you know what the problem is with group 0?
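(A rough sketch of the purge subscriber ori proposes above, assuming a hypothetical Varnish in front of gitblit whose VCL accepts PURGE requests, a gerrit account with stream-events permission, and jq on the box; the account name and the purged URL scheme are illustrative, not gitblit's real ones:)

    # follow gerrit's event stream; on every ref update, purge the cached
    # pages for that repository from the front-end cache
    ssh -p 29418 purge-bot@gerrit.wikimedia.org gerrit stream-events |
    while read -r event; do
        # ref-updated events carry the project name in .refUpdate.project
        project=$(printf '%s' "$event" | jq -r '.refUpdate.project // empty')
        [ -n "$project" ] || continue
        curl -s -X PURGE "http://git.wikimedia.org/summary/${project}.git" >/dev/null
    done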
[03:23:53] * PiRSquared finds post [03:24:18] ugh https://www.mediawiki.org/wiki/Special:RecentChanges is ugly [03:24:59] https://www.mediawiki.org/wiki/Thread:Project:Support_desk/Missing_interface_messages [03:25:16] wooo LQT [03:25:51] PiRSquared: why did you say group 0? are you seeing this anywhere else? [03:25:55] testwiki [03:26:13] ah, yep [03:26:17] test2wiki [03:26:20] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 02:50:26+00:00 [02:50:26] [03:26:28] not sure what other wikis are group 0 [03:26:33] and earlier: !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 02:24:51+00:00 [03:27:26] MediaWiki.org testwiki test2wiki testwikidata [03:27:49] ok [03:27:52] interface messages are missing on all of them [03:28:07] all on 1.23wmf21 [03:28:36] !log Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [03:28:38] I assume it's related to the JSON thing? [03:28:42] Logged the message, Master [03:28:55] werdna: well, nothing is broken on http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page [03:29:21] so it is production-specific? [03:30:03] yes [03:30:06] /var/log/l10nupdatelog/l10nupdate.log is useful [03:30:07] on tin [03:32:06] <^demon|away> yo werdna, ori [03:32:08] <^demon|away> I saw the panic e-mails. [03:32:12] sup ^demon [03:32:35] I've bowed out. Everything's changed and I don't know how to shell anymore [03:32:37] http://p.defau.lt/?XBHp0vWrML918CKsbt2_KA is the log of the failed sync [03:32:39] <^demon> well I was playing playstation :p [03:32:59] PHP Warning: LU_Updater::readMessages: Unable to parse messages from file:///a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php in /a/common/php-1.23wmf21/extensions/LocalisationUpdate/Updater.php on line 63 [03:33:16] keep in mind that most of these errors / warnings were also logged for the 1.20 update, which did succeed, as far as we know [03:33:54] $reader = $readerFactory->getReader( $filename ); [03:33:57] I bet PagedTiffHandler now loads from json and so it doesn't recognise the format [03:33:59] or something [03:34:18] hmm this too [03:34:18] Warning: include(): Failed opening '/a/common/php-1.23wmf21/extensions/Wikidata/extensions/Wikibase/repo/Wikibase.i18n.php' for inclusion (include_path='/a/common/php-1.23wmf21/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/a/common/php-1.23wmf21:/usr/local/lib/php:/usr/share/php') in /a/common/php-1.23wmf21/includes/cache/LocalisationCache.php on line 517 [03:36:25] <^demon> LU on the cluster appears to have the json code now. [03:38:07] * ^demon is poking [03:40:11] hahaha [03:40:11] fenari: Failed to add the RSA host key for IP address '2620:0:860:2:208:80:152:165' to the list of known hosts (/home/l10nupdate/.ssh/known_ho). [03:40:18] fenari is a known ho? [03:40:46] <^demon> def. [03:41:13] yeah, so the cdb files are mismatched [03:42:55] * ^demon is doing some live debugging on tin. [03:42:59] <^demon> nobody scap or anything. [03:43:06] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 4 03:43:03 UTC 2014 (duration 43m 2s) [03:43:11] Logged the message, Master [03:43:14] <^demon> Probably a lie. [03:43:18] <^demon> ^ [03:43:19] does that count as "anything"? 
:p [03:43:24] * werdna slaps logmsgbot  [03:43:46] however, the JSON files are exactly synchronized [03:43:52] between tin and mw1001 (picked at random) [03:44:06] so the script that is supposed to run and update the cdb files based on the json contents is failing [03:45:29] <^demon> Well yes [03:45:53] <^demon> werdna pointed to the right bit earlier. [03:45:54] <^demon> PHP Warning: LU_Updater::readMessages: Unable to parse messages from file:///a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php in /a/common/php-1.23wmf21/extensions/LocalisationUpdate/Updater.php on line 63 [03:46:04] <^demon> (It's not limited to PagedTiffHandler tho) [03:47:12] <^demon> Somebody didn't test all this l10n update stuff. [03:48:03] <^demon> The exception's actually pretty interesting and gives a lot of info, shall pastebin. [03:48:20] <^demon> http://p.defau.lt/?bk_8Uw9pjQTdpi7aVG3pxQ [03:48:38] <^demon> So basically something's still feeding l10nupdate .php files but it's configured for json. [03:49:25] <^demon> Or it's trying to read the json as php. [03:49:30] <^demon> This is weird. [03:49:43] Here's a diff between LocalisationUpdate on wmf20 and wmf21 [03:49:43] http://p.defau.lt/?JaFVYdJT1aJyAr0ZVnaw8w [03:49:52] I've deleted all of the actual localisation update files [03:49:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 497320 bytes in 9.160 second response time [03:50:14] <^demon> Yep. [03:50:15] Full diff is here: http://p.defau.lt/?BZFyi_jsAYoeWHKFhIcxrQ [03:50:30] It looks as if LocalisationUpdate hasn't actually been changed at all [03:50:45] <^demon> Well the json-y stuff was merged a little further back I think. [03:51:15] <^demon> Like, a week ago? 2? [03:51:40] (please don't delete files) [03:52:09] <^demon> (from the diff) [03:52:27] ori: I mean I deleted them from the diff [03:52:43] ori: I am logged out of tin, I am not doing anything on any wmf servers [03:53:07] ^demon: here's my hypothesis: the LU_ReaderFactory takes a file name and gives back a reader [03:53:14] because the Localisation files have a .php extension [03:53:18] it gives a PHP reader [03:53:34] but the .php extension is a shim around the .json files [03:53:42] so the PHP reader doesn't recognise them as localisation files [03:53:44] and errors out [03:53:51] <^demon> That's about where I'm at with this too. [03:54:09] if ( preg_match( '/i18n\.php$/', $filename ) ) { [03:54:09] return new LU_PHPReader(); [03:54:09] } [03:54:12] yep [03:55:01] so how do we fix this? [03:55:09] <^demon> First lemme revert my debug hack to LU so it doesn't accidentally sync. [03:55:27] <^demon> Ok done [03:55:37] ^demon: we could always look for $fileName = __DIR__ . "/i18n/$csCode.json"; [03:56:09] <^demon> What if we moved the check for json above the php? [03:56:09] if it fails, then we look for that in the file, and if that's the case then we add i18n/*.json [03:56:11] ORRR [03:56:21] we just add i18n/*.json generally [03:56:24] <^demon> If we've got json we obvs. wanna ignore teh php. [03:56:29] and let the PHP error out anyways [03:56:33] yeah, or that [03:56:48] <^demon> Lemme try this on tin [03:57:19] it already exists on HEAD [03:57:19] // Json should take priority if both exist [03:57:19] unset( $this->php[$key] ); [04:00:10] <^demon> My idea didn't work :\ [04:02:09] $finder = new LU_Finder( $wgExtensionMessagesFiles, $wgMessagesDirs, $IP ); [04:02:17] are all of these extensions setting $wgMessagesDirs [04:03:47] <^demon> Most of them? 
[04:03:49] <^demon> Best I can tell. [04:03:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:04:12] and anyway this isn't extensions it's core [04:04:20] so why is core breaking? [04:04:35] Honestly I'd be tempted to back it out of 1.23wmf21 and get the l10n team to fix it [04:06:14] <^demon> Guess it's middle of the night for most of them. [04:06:20] <^demon> And RoanKattouw is _away [04:06:43] right, … tomorrow :P [04:07:53] this happened the other day [04:07:56] or something very similar, at least [04:08:02] and a manual l10nupdate fixed it [04:08:07] i'm suggesting we hold off for another minute or two [04:08:10] i'm still poking [04:09:37] <^demon> Could we roll those 4 wmf21 wikis back to wmf20? [04:12:42] well, a manual l10nupdate, if it fixes it, would be less invasive [04:13:02] anyways, cdb files and json files are now in sync everywhere, so it's not a synchronization issue [04:13:21] but the en cdb file for 1.21 is 1.4mb, whereas it's 2mb for 1.20 [04:13:50] <^demon> I ran it manually a few times. [04:13:54] <^demon> Same result as waiting for the cron. [04:15:32] I mean, I could set up L10nUpdate locally and poke at the breakage [04:15:41] but I figure it would be more efficient for the l10n team to fix it [04:16:01] <^demon> Well that's why I'm wondering if I roll the wikis back to wmf20 [04:16:20] <^demon> Will they get ok l10n to stop-gap until we can prod l10n ppl in their AM? [04:17:04] you could just roll mw.org back [04:17:07] the rest aren't important right? [04:17:07] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: mw.org back to 1.23wmf20 [04:17:11] <^demon> Trying [04:17:12] Logged the message, Master [04:17:38] MediaWiki.org testwiki test2wiki testwikidata [04:17:52] <^demon> Much better on mw.org [04:18:15] (03PS1) 10Chad: Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 [04:18:24] ^demon: no, wait [04:18:37] rolling back mediawiki.org was a good idea; there was no need to wait with the other wikis available for testing [04:18:41] but the other wikis are really far less urgent [04:18:48] so please limit it to just mediawiki.org [04:18:57] <^demon> Some people run tests against test2 and test.wikidata [04:19:04] <^demon> I'm going to leave test.wp broken for testing [04:19:21] <^demon> We only need one f'd up wiki. 
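(The shim hypothesis discussed above is easy to check from the shell: after the JSON migration, an extension's *.i18n.php is a small stub generated by maintenance/generateJsonI18n.php, while the real messages live in i18n/*.json, which the regex-based reader factory never selects. A sketch using the paths from the log; the grep is only a heuristic:)

    # a shim is tiny compared to a real message file, and points at i18n/
    ls -l /a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php
    grep -n i18n /a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php | head

    # the JSON files that LU_PHPReader knows nothing about
    ls /a/common/php-1.23wmf21/extensions/PagedTiffHandler/i18n/ | head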
[04:20:09] (03CR) 10Chad: [C: 032] Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 (owner: 10Chad) [04:20:10] so indeed the difference between the 1.21 and 1.20 cdb files represents core [04:20:16] (03Merged) 10jenkins-bot: Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 (owner: 10Chad) [04:20:48] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: unbreak test2.wp and test.wikidata as well [04:20:53] Logged the message, Master [04:28:14] ok, doing an l10nupdate that calls rebuildLocalizationCache with --force [04:28:47] because i suspect LocalisationCache::isExpired returning false is the issue [04:30:48] bbiab [04:37:26] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:38:46] whoa! 1970! [04:39:15] wow [04:39:22] * greg-g is now caught up [04:39:35] thanks ^demon ori werdna [04:39:46] no worries greg-g [04:41:10] <^demon> greg-g: I think we're ok enough for now, all but one of the wikis is un-broke. [04:41:30] <^demon> Enough for me to go |away again and not panic. But people should tread lightly with wmf21. [04:41:35] i think 1.21 will be unbroken shortly [04:41:48] ^demon: yep, go back to playstation [04:41:48] <^demon> 1.23wmf21 :) [04:41:50] but yeah, thanks ^demon [04:41:56] <^demon> We haven't had 1.21 in awhile ;-) [04:42:03] wikia is on 1.19 [04:42:07] but yeah :P [04:45:01] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 04:45:01+00:00 [04:45:04] Logged the message, Master [04:45:56] * greg-g waits for 21 [04:50:56] greg-g: you're like a college student outside a bar [04:54:18] werdna: POWER HOUR! [04:54:25] :p [04:54:46] australia's drinking age is 18! don't pander to us americans, we're on to you [04:55:36] also wouldn't say "college student" over here [04:56:07] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 04:56:06+00:00 [04:56:12] Logged the message, Master [04:56:15] to an Australian that means you live on campus and it includes food. [04:56:23] greg-g: you can go in and get wasted now, 21 is here! [04:56:33] nope [04:56:38] test.wikipedia.org still borked [04:57:11] yep :/ [04:57:18] welp, time for bed then [04:57:25] at least for me [04:57:59] gnight greg-g [04:58:35] * greg-g waves [05:03:47] oops. Just landed here. [05:04:12] ori: LocalisationCache::isExpired returning false --> was the only issue? [05:05:07] kart_: no, i tried updating with rebuildLocalisationCache.php --force to test that theory [05:07:06] ok. Will check with Niklas once he is up (meanwhile, will look at if I can do anything). [05:08:49] kart_: did you see my email? [05:11:50] werdna: yes. thanks! 
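(The manual rebuild ori describes would look roughly like this on tin; rebuildLocalisationCache.php and its --force flag are real, but the exact wrapper the l10nupdate cron invokes may differ, and the rebuilt CDB files still have to be synced to the apaches afterwards:)

    # bypass LocalisationCache::isExpired() and force-rebuild the l10n CDB
    # cache for a wiki on the affected 1.23wmf21 branch
    mwscript rebuildLocalisationCache.php --wiki=testwiki --force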
[05:12:43] yw [05:12:45] good luck :p [05:33:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 500359 bytes in 8.910 second response time [05:45:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 4 05:45:07 UTC 2014 (duration 18m 25s) [05:45:16] Logged the message, Master [05:55:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:54] morning ori [06:19:55] Nikerabbit: moin. [06:23:09] so nobody here anymore? sigh [06:23:16] I'm here [06:23:25] what's up? [06:23:54] Nikerabbit: ^ [06:24:38] paravoid: trying to catch up with the message failure last night [06:26:41] no clue :) [06:26:47] but if I can help somehow, do let me know [06:28:31] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Trying_tin.27s_code_on_testwiki not completely up to date [06:31:23] wow setting up servers in eqiad using carbon as the web proxy is blazingly fast [06:31:57] (and apt) [06:32:10] https://wikitech.wikimedia.org/wiki/Test.wikipedia.org also not competely up to date [06:32:49] but it looks like it is safe to do scap on 1.23wmf21 to test things [06:33:44] so this looks exactly the same issue as last time [06:37:14] https://wikitech.wikimedia.org/wiki/Configuration_files#extension-list_and_ExtensionMessages-XXX.php also not completely up to date [06:37:27] all three places seem to have wrong paths to "common" [06:39:29] you can add {{old}} to the wrong sections [06:48:10] mutante: good morning [06:48:35] mutante: the system_role/salt grain feature is broken on new installs :( [06:48:38] paravoid: hi [06:48:56] grain-ensure runs from the system_role before salt has been configured [06:49:00] and it needs a working minion [06:49:06] oh..hmm [06:49:08] but the minion is unconfigured at that point (doesn't have a master set up) [06:50:57] does it break the puppet run? [06:51:01] yes [06:51:10] grain-ensure blocks forever [06:51:21] I'm pretty sure there was a way to scap only one version... does someone know the command for that? [06:53:12] paravoid: grmbl..ok.. let me disable it and think about it more [06:53:26] or did you already comment it or something [06:53:43] I didn't [06:54:07] ok, thanks, i will (for now) [06:54:21] it's not just a dependency on the minion config, it also hangs if you haven't signed the salt key on the master [06:54:44] I think the best would be for grain-ensure to try once, then just fail [06:54:47] instead of blocking forever [06:54:54] but I have no idea how the salt API works [06:55:35] that makes sense (salt key).. and good hints..hmm [06:56:29] did not think enough about new install situation... [06:59:26] paravoid: unrelated question. did you rename yourself in LDAP/Gerrit in the past or did you have it from the beginning [06:59:59] re: having actual "Firstname Lastname" [07:00:13] ah it is no longer possible [07:00:40] ori-l_: you sleeping already? [07:00:41] rename [07:00:47] Ryan renamed me [07:01:20] gotcha. i think i also want to do it (need to check what happens to author name in old commits) [07:01:33] i'll catch up with him [07:04:56] /dev/sda1 2.8T 33M 2.8T 1% /srv/swift-storage/sda1 [07:05:01] ;) [07:06:13] ooh, you copied them all? [07:06:59] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 500716 bytes in 8.145 second response time [07:08:34] no [07:08:37] new boxes [07:08:38] with 3T disks [07:08:47] ms-be1013/4/5 [07:09:16] ah! 
nice [07:10:02] (03PS1) 10Dzahn: deactivating salt grain-ensure in global role [operations/puppet] - 10https://gerrit.wikimedia.org/r/123834 [07:11:30] so those were the new installs being blocked i suppose [07:12:03] yup [07:12:22] (03CR) 10Dzahn: [C: 032] "i'll try to find a solution later, just unblocking new installs for now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123834 (owner: 10Dzahn) [07:13:36] got it, there, should be unblocked now [07:14:41] nah, I killed it manually [07:14:48] fixed, salt, rerun puppet [07:14:55] alright [07:15:02] they're fine now, I just pinged you for future boxes [07:16:29] ok, yes. thanks, others would have ran into it next [07:17:17] btw, the strike in Germany is over now.. it seems i can actually fly on Monday [07:17:22] mutante: afaik old commits can't be changed [07:17:46] yay german strikes; not as serious as the french, but still striking ;) [07:17:59] PROBLEM - NTP on ms-be1014 is CRITICAL: NTP CRITICAL: Offset unknown [07:18:03] (or if they can be changed someone should tell Matma and Siebrand= [07:18:38] Nemo_bis: hmm.. ok ohloh has feature to merge multiple users into one for stats afair [07:19:31] Nemo_bis: http://www.bbc.co.uk/news/world-europe-26846479 [07:19:44] mutante: merkel is coming here next friday [07:19:46] it's gonna be fun [07:20:39] paravoid: yes! i heard from apergos .....much security [07:21:27] "nice" combo..all that.. strike in .de , Merkel going there... [07:23:10] paravoid: and our hotel is right in the good location where any assembly of over 3 people is not allowed ?:p [07:23:53] we'll see if they outlaw assemblies or "only"marches [07:23:59] RECOVERY - NTP on ms-be1014 is OK: NTP OK: Offset -0.007594466209 secs [07:24:07] mutante: yes, ohloh does [07:24:07] apergos: ah. morning! [07:24:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:30] yes, it is [07:24:46] didn't that _just_ recover [07:25:44] !log restarting gitblit [07:25:46] do any localisationupdate wikis hit git.wm.o btw [07:25:48] Logged the message, Master [07:26:21] dunno [07:29:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 518131 bytes in 9.571 second response time [07:29:12] <_joe_> paravoid: merkel is in athens on friday? I may join the protests, then [07:29:24] <_joe_> mutante: welcome back [07:33:55] ugh [07:34:20] I'm having 150ms pings to bast1001 and 10% packet loss [07:34:31] <_joe_> Nikerabbit: from where? [07:34:56] 7. 100ge5-2.core1.par2.he.net 0.0% 48 17.9 20.1 17.7 28.2 3.2 [07:35:00] 8. 
10ge15-1.core1.ash1.he.net 6.2% 48 167.6 157.3 136.8 167.6 7.1 [07:35:03] _joe_: Finland [07:35:06] <_joe_> I have 130 ms from my crappy DSL in italy [07:35:43] it definitely was better just a little while ago as I was debugging the message issues [07:36:02] anyway, now I'm done, doesn't matter [07:36:31] _joe_: hi:) [07:37:06] Nikerabbit: that argument / option was taken out iirc [07:37:28] 106ms 6% loss , from Germany [07:37:39] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:43:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:43:58] bleh he.net [07:44:17] (03PS1) 10Dzahn: give Chris Steipp access to release servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 [07:44:36] apergos: ^ ? [07:44:55] is it really just 1 line [07:44:58] he doesn't? [07:45:02] no [07:45:04] he asked for it [07:45:19] * apergos frowns [07:45:29] "noticed today that I don't have access to the releases server. Since I occasionally do the releases, it would be helpful " ? [07:45:36] 0.0 % packet loss for me, via ntt.net [07:45:39] makes sense to me [07:45:42] and yes it's one line [07:45:53] thx [07:46:09] someone (maybe jeff?) nicely turned that stuff into a class after we needed more than two people over there [07:46:34] but!! [07:46:37] don't merge don't merge [07:46:48] he is already there [07:46:57] mutante: ^^ [07:47:03] ok [07:47:14] see he's in there a few lines up [07:47:14] duuh, i'm blind [07:47:24] yea [07:47:54] hmm, let me check why he can't get access then [07:48:01] yep [07:49:17] (03CR) 10Dzahn: [C: 04-2] "Chris, re: #7188, this should already work, was about to add this duplicate.. checking your key" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 (owner: 10Dzahn) [08:07:11] !log deactivating cr1-eqiad<->HE peerings, transatlantic par2<->ash1 is congested [08:07:15] Logged the message, Master [08:07:17] Nikerabbit: better? [08:07:29] paravoid: let me check [08:09:21] paravoid: ah sorry I was actually doing mtr from Germany, that's now 90 ms no packet loss, from Finland it's now 120 ms and no packet loss, so definitely better [08:14:25] great, thanks [08:42:27] (03CR) 10Dzahn: [C: 031] Add an account for subbu on Parsoid / Cassandra test hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/123433 (owner: 10GWicke) [08:51:01] (03Abandoned) 10Dzahn: give Chris Steipp access to release servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 (owner: 10Dzahn) [08:59:19] (03CR) 10Dzahn: [C: 032] Clean up DSH groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/123213 (owner: 10MaxSem) [09:18:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 506654 bytes in 9.901 second response time [09:26:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:34] ^ so now that just keeps happening too often [09:27:01] it happened before but just every once in a while..
but that is worse [09:27:25] simple service restart doesn't cut it [09:28:16] ok [09:28:22] so how are you going to debug this further? :) [09:29:17] had the cause ever been identified? [09:33:00] java is using a lot of CPU when it times out. restarting it fixes it, for a little while, until it becomes busy again .. how to..restart in debug mode and attach with debugger or something? [09:33:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 506659 bytes in 9.786 second response time [09:34:47] <_joe_> mutante: with java apps you can see if they have JMX enabled and connect with jconsole [09:35:14] <_joe_> and look at the GC for example - very high cpu load usually means a lot of GC going on [09:35:30] anything in logs? I had a similar problem with solr, it was caused by GC going bonkers and the fix was to pick a different GC algorithm [09:36:47] <_joe_> MaxSem: it really depends on the memory usage profile usually [09:43:29] _joe_: jconsole 5043 .. nothing really happens [09:43:54] MaxSem: i just see the service restarts in syslog..hmm [09:44:28] " version of JConsole provided with the Java SE 6 platform can attach to any application that supports the Attach API. " [09:45:16] <_joe_> mutante: jconsole should give you info about the last GC event, for example [09:46:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:24] running jconsole with or without a , i don't get output or an error, just nothing [09:47:15] ah.. x11..ok [09:52:03] I guess we don't have any access log for gitblit do we ? [09:52:20] would be interesting to dig in and see which URLs are being hit, possibly with the time it took to process them [09:52:31] there might be some bad bot hitting a heavy-processing URI [09:56:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 503069 bytes in 9.656 second response time [09:58:35] hashar: yea, no, access.log in VirtualHost but nothing in it besides monitoring [09:59:29] mutante: and I guess we do not have logs for the misc varnish do we? [10:00:36] ah sorry misread your reply [10:05:01] <_joe_> hashar, mutante I bet it's a GC issue... 99% of java issues are GC-related; also, are we using java6 or java7? [10:05:51] 7. java version "1.7.0_25" [10:06:01] OpenJDK [10:06:03] I have shown the CPU graph to a java coworker here [10:06:12] he instantly told me: looks like Java GC. :D [10:06:43] started around 16:30 UTC yesterday [10:07:35] * hashar digs in puppet git log [10:11:14] nothing suspicious, bah [10:11:23] I don't have access on the machine :/ [10:23:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:26:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 496241 bytes in 9.580 second response time [10:32:15] try marksweep?
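(Two stock-JDK ways to get the GC visibility being discussed; jconsole can only attach remotely if the JVM was started with JMX flags, which would explain the silence above. Port 5043 is the one tried in the log; everything else is standard JDK tooling:)

    # no JMX needed: sample GC utilisation of the gitblit JVM every 2 seconds
    jstat -gcutil "$(pgrep -f gitblit)" 2000

    # for remote jconsole, the JVM must be started with something like:
    #   -Dcom.sun.management.jmxremote.port=5043
    #   -Dcom.sun.management.jmxremote.authenticate=false
    #   -Dcom.sun.management.jmxremote.ssl=false
    # then, from a machine with X:
    jconsole antimony.wikimedia.org:5043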
[10:38:39] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:46:18] (03PS1) 10Dzahn: gitblit: raise -Xmx to 8191M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:47:10] (03PS2) 10Dzahn: gitblit: raise -Xmx to 8191M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:48:54] (03CR) 10Dzahn: ""If the value of the -Xms parameter is smaller than the value of the -Xmx parameter, not all of the space that is reserved is immediately " [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [10:49:17] _joe_: hashar ^ [10:49:41] or even more than that? [10:50:00] I have no clue [10:50:04] Mem: 16353876k total [10:51:46] (03CR) 10Dzahn: ""Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [10:52:13] the machine has a lot of free mem space according to ganglia four more GB are probably fine :] [10:52:38] mutante: not sure you really need to set Xms but that is most probably harmless [10:52:42] (03PS3) 10Dzahn: gitblit: raise -Xmx to 8192M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:53:43] and as I understand it the JVM will allocate 8GB of memory, leaving only 8GB for the system / disk cache etc.. [10:54:41] <_joe_> I'm studying the docs for GC, sorry [10:54:53] <_joe_> I did that a few years ago... now all over again [10:55:21] i remember changing these for JIRA on Tomcat [10:55:43] but that was just following docs [10:56:00] thanks _joe_ [10:58:47] those quotes were from http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html [10:59:02] <_joe_> mutante: I'd suggest to use the incremental GC [10:59:05] <_joe_> -Xincgc [10:59:12] <_joe_> do NOT use G1 [10:59:37] heading out for lunch [11:00:44] _joe_: alright [11:01:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please add -Xincgc to enable the incremental GC which should not eat all the CPU up at times." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:01:45] <_joe_> mutante: you may also think of not raising the dedicated ram and just go with adding the minimum to 4096 and using incgc [11:01:52] _joe_: should we still raise -Xmx ? (we have the memory, right) [11:01:58] ah [11:02:16] <_joe_> "let's see how it goes" [11:02:57] <_joe_> with java(TM) it's the only strategy I know [11:04:31] (03PS4) 10Dzahn: gitblit: use incremental GC and add -Xms [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [11:06:12] better? 
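(What the patch above boils down to, as a sketch; the real invocation lives in gitblit's upstart config, its remaining arguments are omitted, and the 4 GB figures come from the discussion:)

    # before: default collector, heap grows on demand up to -Xmx
    java -Xmx4096M -jar gitblit.jar ...

    # after: pre-allocate the heap and use the incremental collector,
    # trading some throughput for shorter GC pauses
    java -Xms4096M -Xmx4096M -Xincgc -jar gitblit.jar ...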
[11:08:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [11:09:13] (03CR) 10Giuseppe Lavagetto: [C: 031] gitblit: use incremental GC and add -Xms [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:10:05] <_joe_> I did not remember how annoying the d-i can be [11:11:36] (03CR) 10Dzahn: [C: 032] "this hopefully fixes the frequent git.wm.org timeouts, thanks for advice Giuseppe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:13:23] !log restarting gitblit with new option to use incremental GC in an attempt to fix timeouts caused by GC eating CPU [11:13:28] Logged the message, Master [11:14:18] ok, there it is, running with -Xincgc [11:16:29] <_joe_> let's see what happens in the next few hours [11:16:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:22:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:26:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:26:55] <_joe_> mutante: I'm keeping an eye on the CPU/RAM situation for gitlbit, we're still far from hitting a major GC and cpu usage is low [11:27:50] _joe_: cool!:) [11:31:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [11:39:11] (03PS3) 10Dzahn: decom: hume [operations/puppet] - 10https://gerrit.wikimedia.org/r/122605 (owner: 10Matanya) [11:44:01] 10:29 < _joe_> paravoid: merkel is in athens on friday? I may join the protests, then [11:44:08] er [11:44:13] sorry :) [11:44:14] ?????? [11:44:22] lol? [11:47:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:48:22] ok [11:48:26] I think we need a new interview question ;-) [11:56:51] wants to kill hume and then leave (more or less:) [11:57:10] all the crons are gone , checked.. the /home [11:57:23] is on nas1-a.pmtpa , not local [12:01:49] (03CR) 10Dzahn: [C: 032] "mwdeploy/apache crons: all disabled or gone. root cleanup crons: identical on terbium. 
users: have been warned on Monday, no replies, home" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122605 (owner: 10Matanya) [12:06:08] !log hume - disable puppet/salt/monitoring [12:06:13] Logged the message, Master [12:08:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:26:46] (03PS1) 10Dzahn: puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 [12:26:48] (03CR) 10jenkins-bot: [V: 04-1] puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 (owner: 10Dzahn) [12:27:57] (03PS2) 10Dzahn: puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 [12:32:03] !log hume - shutting down [12:32:07] Logged the message, Master [12:37:46] (03PS1) 10Dzahn: replace hume with terbium in a comment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123853 [12:45:27] (03PS1) 10QChris: Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 [12:45:38] (03CR) 10Dzahn: [C: 031] "shut down" [operations/dns] - 10https://gerrit.wikimedia.org/r/122609 (owner: 10Matanya) [12:49:33] _joe_: hashar , ^demon|away: looks fixed to me https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?host=antimony&service=gitblit.wikimedia.org [12:49:57] * mutante waves (had another half day) [12:50:04] mutante: :-] [12:50:09] mutante: congrats! [12:50:54] (03PS1) 10QChris: Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 [12:51:33] _joe_ had the right one about incremental GC [12:51:53] hashar: and "Connection to hume closed. [12:52:00] have a nice weekend [12:52:06] mutante: awesome!!! [12:52:14] mutante: rest well and thanks for all the cleanup! [12:52:18] <_joe_> mutante: very happy about this :) [12:53:34] (03CR) 10Dzahn: "i say this fixed it, see the trends graph in Icinga. https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?host=antimony&service=gitblit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [12:54:10] _joe_: :) thanks, cya later [13:17:46] (03PS2) 10Ottomata: Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 (owner: 10QChris) [13:17:52] (03CR) 10Ottomata: [C: 032 V: 032] Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 (owner: 10QChris) [13:18:26] (03PS2) 10Ottomata: Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 (owner: 10QChris) [13:18:31] (03CR) 10Ottomata: [C: 032 V: 032] Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 (owner: 10QChris) [13:39:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:41:51] <_joe_> mmmh this is screenshot material :) [13:42:50] :O [13:47:58] (03PS3) 10Cmjohnson: remove all Tampa appservers from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/123211 (owner: 10Dzahn) [13:48:07] (03CR) 10Cmjohnson: [C: 031] remove all Tampa appservers from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/123211 (owner: 10Dzahn) [13:49:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:20] (03CR) 10Cmjohnson: [C: 031] hume: decom, left mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/122609 (owner: 10Matanya) [13:53:55] poor gitblit [13:54:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "I see no issues here." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:09:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is probably the best solution." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:16:33] (03CR) 10Alexandros Kosiaris: [C: 032] Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:17:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 228026 bytes in 8.629 second response time [14:17:29] (03PS2) 10Ottomata: Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 [14:17:35] thanks [14:17:38] (03CR) 10Ottomata: [C: 032 V: 032] Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:18:01] (03CR) 10Alexandros Kosiaris: [C: 032] "Not sure if we will ever call base::firewall with ensure='absent' but technically fine, so LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:18:47] k, thanks akosiaris, I think we might want to do that in case we're like, ok great, let's apply this thing, and then it breaks a buncha stuff we didn't think of, so we need to revert [14:19:17] ok, let's hope we won't :-) [14:19:30] (03PS3) 10Ottomata: Adding $ensure parameter to base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 [14:19:30] :) [14:19:42] (03CR) 10Ottomata: [C: 032 V: 032] Adding $ensure parameter to base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:20:00] (03CR) 10Ottomata: Ferm rules for stat1003 (and rsyncd on any stat* server) [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:20:01] as far as https://gerrit.wikimedia.org/r/#/c/123702 [14:20:12] why in site.pp ? [14:20:46] i wasn't sure where else to put it, since it is pretty dependent on whether or not the machine has a public IP [14:20:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch is correct, however I disagree with enabling ssh." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:21:01] I mean, does it not have a role that needs this ? [14:21:06] and the role classes don't necessarily specify that [14:21:14] ok, what is the use case then ? [14:21:16] it does not have a role that needs it, it is more for users' convenience [14:21:37] in a chat yesterday [14:21:42] i could go either way on this [14:21:48] stat1 has a public IP and users are used to just ssh-ing in [14:21:56] we could make them go through bastions [14:22:10] ouch. I did not know that about stat1 [14:22:20] well, most machines with public IPs people just ssh into [14:22:20] i think [14:22:35] yeah, I do the same sometimes [14:22:36] coren said that since we use passwordless entry, he didn't think it made much of a difference [14:22:39] depending on machine [14:22:49] and i'm trying to minimize annoyances to stat1 users now [14:22:57] probably many of them only have shell accounts on this machine [14:23:02] and have never messed with their ssh configs before [14:23:21] so users that do not exist on bastions but do exist on stat1 [14:23:24] how nice... [14:23:48] ok, a couple of ways we can handle that. [14:23:50] <_joe_> ottomata: the problem is also compromised clients - If people really need shell access, then ok [14:23:58] ha, yeah, i'm with you, if you think we should not allow this [14:23:59] i'm fine with that [14:24:12] I can deal with getting people's .ssh/configs configured properly [14:24:28] oh, but someone was saying windows users might have a hard time too?
[14:24:35] dunno. [14:24:37] oh for the love of ... [14:24:44] we got that too :-( [14:24:52] ottomata: putty is highly configurable, should not be an issue [14:25:04] hashar to the rescue :-) [14:25:08] ok hashar, yeah, and I know that Erik Zachte uses windows and connects fine to stat1002.eqiad [14:25:16] so it must be possible [14:25:19] :p [14:25:33] and my lame recommendation would be to require everyone to use the bastions to then bunny hop to the machine [14:25:42] well it might be the best possible time to fix this [14:25:55] anyway, ja, i'd prefer to have fewer things to annoy users during this migration, but if you guys think we shouldn't allow direct ssh, I'm ok with that [14:26:03] they are going to anyway change something in their habits or configs [14:26:12] as is, barely [14:26:13] just a hostname [14:26:18] everything else should be identical [14:26:34] with the proper ssh config, that is transparent [14:26:38] true [14:26:41] heh, you'd be surprised how difficult it is to do that sometimes :-) [14:26:45] ha, yeah [14:27:04] a lot of these folks are not big shell users either, so it's hard for them [14:27:11] is stat1 accessed by non wmf folks? [14:27:12] they are researchers, used to SQL and R or whatever [14:27:21] non employees? i am not sure [14:27:38] you can see the list of accounts we are keeping for stat1003 here: [14:27:38] http://etherpad.wikimedia.org/p/stat1_accounts [14:27:42] under the Keep heading [14:28:31] diederik ? [14:28:41] <_joe_> ottomata: let me try to understand how sshuttle works with our network [14:28:42] is he still with WMF ? [14:28:45] yeah he still volunteers and does stuff and has an NDA, so i guess so [14:28:51] a ok then [14:28:53] i think he asked to keep it [14:28:55] <_joe_> ottomata: I do remember it worked under windows too [14:29:29] so if that is staff / under-NDA folks I guess you can get them to have access on the bastions then hop to stat1 [14:29:45] I'd prefer that too if possible [14:29:55] that is not ideal though since from bastion you get access to the whole cluster :/ [14:30:01] but that is a different topic [14:30:42] whoa sshuttle... [14:30:43] cool [14:30:53] ok cool, i'm fine with that [14:30:57] deep down i'd prefer that as well :) [14:31:04] ok ok ok [14:31:07] no direct ssh then :) [14:31:32] :-) [14:31:40] * akosiaris just got happier  [14:34:16] (03PS2) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:36:03] * _joe_ too [14:36:32] btw put a note that at some point base::firewall should be moved from site.pp to some other base class (hopefully some day standard but one step at a time) [14:37:16] <_joe_> ottomata: sshuttle - I'm not sure it will work, I'll try it better next week - now I have my desktop running sid and I'm off this awful apple BSD [14:37:22] (03PS3) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:37:30] aye, i'm not going to recommend folks to use sshuttle [14:37:51] but I might try it for some of my own devices (i.e. connecting to hadoop web services without annoying browser proxies that hardly work) [14:38:06] <_joe_> akosiaris: modules/base/firewall.pp would seem a nice place for it [14:38:57] I meant being used directly in node definitions in site.pp [14:39:00] like in this case [14:39:09] but it is an intermediate step I hope :) [14:39:47] aye ja [14:39:58] ok cool, +2 ?
:) [14:40:09] <_joe_> akosiaris: yep - we should move towards the point where it's easy to switch to an ENC [14:40:51] (03CR) 10Alexandros Kosiaris: [C: 032] Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:41:09] _joe_: yeah, it is going to take time though :-( [14:41:45] (03PS4) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:41:51] (03CR) 10Ottomata: [C: 032 V: 032] Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:51:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:53] * ^d sighs [14:56:48] hey ^d :-D [14:57:16] ^d: _joe_ and mutante slightly raised gitblit heap , it doesn't help apparently [14:57:32] high cpu load since yesterday 4:30pm UTC. [14:58:59] <_joe_> hashar: mmmh [14:59:07] <_joe_> hashar: it has been ok for some time [14:59:10] <_joe_> let me check [14:59:49] <_joe_> hashar: we did not raise the heap a lot - just changed the GC strategy [15:00:29] ah sorry [15:01:11] gerrit / git blit is misbehaving [15:01:33] <_joe_> and it seems to have helped for some time https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=antimony.wikimedia.org&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1396623650&g=cpu_report&z=large&c=Miscellaneous%20eqiad [15:01:47] https://gerrit.wikimedia.org/r/#/c/123826/ [15:02:11] <^d> _joe_: The fact that nagios requests / every minute or two for status checks doesn't help. [15:02:14] been having so many problems since last night with our jenkins unable to get core from git.wikimedia.org [15:02:17] <^d> That main page isn't really the fastest thing ever. [15:02:18] or wikibase frm gerrit [15:02:28] <_joe_> ^d: cant beleive that's the problem :) [15:02:42] <^d> aude: gerrit is fine. [15:02:47] just not sure why it started 24 hours ago though [15:02:56] could it be some bot hitting some bad URL ? [15:02:58] <_joe_> aude: gerrit is fine as far as I can tell [15:03:06] seems ok now, although last night it died also a few times [15:03:08] access.log did not yield anything interesting apparently [15:03:09] <^d> _joe_: I disagree. [15:03:25] i submitted a chain of 4 patches, a minute later gerrit died [15:03:28] it came back [15:03:41] i had do fix something in my first patch, so submitted 4 again [15:03:45] <^d> access.log shows the nagios plugin hammering away at gitblit about every minute. [15:03:47] minute later, gerrit died again [15:03:51] <^d> It's the only thing in the logs recently. [15:03:57] <_joe_> ^d: if 1 req/min is an issue, then the software is abysmal [15:03:59] <_joe_> come on. [15:04:08] problem mostly has been git.wikimedia.org though [15:04:10] <_joe_> ^d: it's an application-side problem [15:04:14] <^d> _joe_: It is abysmal. [15:04:19] <^d> It was a mistake for me to use it. [15:04:26] I disagree chad [15:04:28] <_joe_> it seems that gitlbit has some memory leak [15:04:40] giblet is already better than gitweb (which is really the end of the abyss) [15:04:44] <_joe_> and thus GC is no use [15:04:55] <_joe_> well, gitlab is nice [15:05:00] <^d> hashar: gitweb was so bad because it was running on the same host as gerrit and was overloaded. 
[15:05:03] <_joe_> but I guess it scales even worse [15:05:04] <^d> it wouldn't be nearly as bad now. [15:05:05] could it be some packages that have been updated on antimony yesterday ? [15:05:08] <^d> plus, I want to use cgit. [15:05:11] <^d> but I digress. [15:05:15] :-D [15:05:41] before migration downloading dumps has become pretty faster for me [15:05:44] <^d> _joe_: We're literally gitblit's largest user :p [15:05:45] *after [15:05:54] <^d> mutante|away wasn't joking when he said it's meant for small-medium teams. [15:06:58] <_joe_> !log restarting gitblit as it has eaten up all of its ram again and is trashing cpu [15:07:02] Logged the message, Master [15:07:28] <_joe_> now it will magically work again [15:08:02] <_joe_> ^d: gitweb should be way better. hell, anything would be better. NO ONE was calling it apart from icinga. [15:08:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 224419 bytes in 8.847 second response time [15:08:55] <^d> All those NPEs probably aren't helping either, but are mostly a red herring. [15:09:50] <_joe_> ^d: they are the reason of the memory leak, most possibly [15:10:08] <_joe_> be back soon [15:10:30] <^d> I'm going to look at setting up cgit alongside gitblit on antimony right now. [15:12:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:42] well it would be nice to find out why it started dieing all of a sudden [15:12:57] <^d> It'll take less time to replace it than figure that out and fix it. [15:13:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 224419 bytes in 8.255 second response time [15:13:06] <^d> gitblit sucks. [15:13:46] <_joe_> ok... [15:14:03] <_joe_> ^d: where are the logs of the java app? [15:14:14] <^d> /var/log/upstart/gitblit.log [15:15:35] <_joe_> oh, ok, upstart [15:16:13] RECOVERY - Disk space on stat1002 is OK: DISK OK [15:16:53] <_joe_> and right when one would need stack traces from java, they're not there. [15:17:30] you know, i really like everything about upstart, except for /var/log/upstart [15:17:45] <^d> _joe_: Look earlier in the file. [15:17:51] well I am off for now. see you all on mondayç [15:17:56] <_joe_> ^d: ok [15:18:07] <_joe_> hashar: see ya [15:20:19] <^d> _joe_: This is the NPE I was seeing http://p.defau.lt/?iwzuJEwzJsdlXdqQvIjZPQ [15:21:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:17] <_joe_> ^d: com.gitblit.wicket.pages.RawPage$1.respond(RawPage.java:173) - but on their site I cannot find a code browser [15:25:12] <^d> On what site? gitblits? [15:25:31] <_joe_> yes [15:26:04] <^d> He's got some links to demos of it on redhat's site. [15:26:06] <^d> https://demo-gitblit.rhcloud.com/ [15:26:24] <_joe_> yes I just want to look at the source code [15:26:28] <_joe_> :) [15:26:40] <_joe_> I guess I need to download the jar and disassemble it [15:27:25] <^d> Ohhh [15:27:26] <^d> duh [15:27:52] <^d> https://github.com/gitblit/gitblit [15:27:52] <_joe_> github, maybe [15:28:04] <_joe_> yes, seen that [15:28:10] <_joe_> isn't it funny? [15:28:30] <^d> Yeah, he doesn't use his own service for his code :) [15:35:55] ^d: Everybody knows that dog food tastes bad :) [15:37:21] bd808: are you sure? there was some research showing guests to a party couldn't tell it from human bad food [15:37:25] <_joe_> bd808: so why we're eating dog food exactly? :P [15:37:41] it's as bad as McDonalds and costs less? [15:37:52] Plus more bone meal! 
[15:38:25] <^d> _joe_: it's a saying :p [15:38:28] dog food quip was re: gitblit code being found on github [15:38:31] <^d> "eating your own dog food" [15:39:00] https://en.wikipedia.org/wiki/Eating_your_own_dog_food [15:39:03] <_joe_> ok, I'm opening a ticket on this [15:39:10] <_joe_> bd808: yes I know [15:39:39] <_joe_> bd808: mine was a joke on the fact that the gitblit dev uses github and google code, while we use his dogfood [15:40:11] :/ yeah. Git repo browsers are all sucky in my experience [15:40:26] But I haven't looked at them for a couple of years [15:40:50] <_joe_> bd808: gitlab is nice to look at, but it has a few hiccups on its own [15:40:58] <_joe_> 1) does not fit our needs [15:41:03] <_joe_> 2) ... [15:42:17] <_joe_> anyway, anything would be better [15:42:56] I guess I should spend some time poking around in http://fab.wmflabs.org/diffusion/MW/browse/master/ since that's the heir apparent for "fix all the things that such with a new thing" [15:43:03] s/such/suck/ [15:44:58] <^d> bd808: I want to spend some time getting it right. [15:45:02] <^d> So we can *really* use it. [15:45:29] bd808: what about? https://github.com/jonashaag/klaus [15:45:34] was just googling and found that [15:45:37] python! [15:45:41] everybody's favorite! [15:50:46] <^d> ottomata: well phabricator also does browsing built in. [15:50:50] <^d> one tool to rule them all and such. [15:51:21] it does and you can go ?grep='foo' type stuff in the url [15:51:37] or do a blame in the ui to see who changed what [15:51:38] :) [15:54:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 226063 bytes in 8.085 second response time [15:54:54] <_joe_> I was guessing - one reason of this crazy gitblit behaviour could be one of the pmtpa migrations [15:57:32] <_joe_> chasemp: I was reading your email about admins.pp, I'll have some questions for you on monday/tuesday [15:57:45] (03PS2) 10Manybubbles: WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 [15:59:50] _joe_: I will be in athens and/or on a plane [16:00:05] but I just realized so will you? [16:00:21] yep [16:00:29] everyone but ottomata :( [16:00:52] * bd808 wishes he was opsen just for that trip [16:00:52] _joe_: either way sounds good, I'm just excited someone read the thing [16:01:31] haha [16:02:40] (03PS3) 10Manybubbles: WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 [16:06:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:46] so, who's the point person for wikitech 2fa? [16:06:56] Coren: ^ [16:07:55] greg-g: as in need to reset someone who lost theirs? [16:08:11] yeah [16:08:16] i saw a ticket for that in the past month, lemme see if i can refind it [16:08:30] I thought I saved my token, but apparently [16:08:31] not [16:08:51] https://rt.wikimedia.org/Ticket/Display.html?id=6969 [16:09:10] I have my old phone, thankfully, so I can log in, but it won't let me reset the credentials without 'the token' (so to migrate phones I need to have 'the token'? or is there a better way?) [16:09:24] i think you have to just have it wiped out of the database and start over =] [16:09:39] greg-g: but how do i know you are youuuuuuu [16:09:49] bah [16:10:01] this isn't how google's works!
[16:10:02] <_joe_> !log restarting gitblit, for the last time today [16:10:06] Logged the message, Master [16:10:09] * greg-g is not a google lover [16:10:09] heh [16:10:41] <_joe_> greg-g: well, I don't love them either, but you cannot deny their technical proficiency [16:11:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 223539 bytes in 9.067 second response time [16:11:51] RobH: that misc server quote, is that good to go? [16:12:12] lemme take a look [16:12:59] if it is, please compile a single PDF and calculate the total, then i'll request approval [16:13:10] would be good to have that go out today [16:24:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:44] (03CR) 10coren: [C: 032] Labs: -pmtpa support from role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 (owner: 10coren) [16:29:54] (03PS2) 10coren: Labs: -pmtpa support from role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 [16:36:31] * Coren pokes Jenkins. Oy! [16:40:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:47:30] Is something known to be wrong with Jenkins? [16:48:09] (03CR) 10coren: [C: 032 V: 032] "+V because Jenkins seems ill and it was already okay before the needless rebase." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 (owner: 10coren) [16:51:05] (03PS1) 10coren: Tool Labs: Fixes for webgrid [operations/puppet] - 10https://gerrit.wikimedia.org/r/123880 [17:00:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 218590 bytes in 9.196 second response time [17:08:05] Coren: funny you should mention that, Jenkins told me "starting gate and submit jobs" for a +2'd change over 30 minutes ago, it hasn't complete [17:08:06] d [17:11:16] what's up with the lvs300x's? [17:12:14] (03CR) 10Krinkle: Use the BetaFeatures whitelist for production to avoid accidental deploys (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [17:17:32] (03PS4) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 [17:17:36] (03CR) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [17:19:53] greg-g: just not set up yet I think [17:20:35] mark: kk [17:31:15] chrismcmahon: It looks dead to me. [17:33:10] (03PS1) 10Tim Landscheidt: Tools: Use apt::repository instead of file resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/123882 [17:59:06] (03CR) 10Ottomata: [C: 031] WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [17:59:43] (03CR) 10Manybubbles: [C: 04-1] "-1 until I'm happy with it in beta. It's already deployed there, but not being used right yet."
[operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [18:08:57] greg-g: ^d ori ping [18:10:05] ello [18:10:58] why did we switch test2 / test wikidata back to wmf20? [18:11:05] i see 03:28 ori: Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [18:11:06] bugzzzzzz [18:11:24] so is the issue fixed? when do we try again? [18:11:54] it's mostly fixed, and yeah, will be rolling them back today [18:12:07] ok, great [18:12:20] (only "mostly" because we need to make sure we know how to prevent it in the future) [18:12:34] * aude nods [18:12:38] is there a bug ticket? [18:13:57] not yet, I don't believe, I've pinged Nike-rabbit to write up the outage report / report follow-on tickets [18:14:25] presumably that will have to wait till tuesday? [18:14:34] Nemo_bis: why? [18:15:05] he works for wmf on tue and thu now [18:15:13] AFAIK [18:15:28] ohhhh [18:17:29] greg-g: ok [18:18:34] seems nothing for us to worry about, though it would be good to know what the issues are and if there's anything we need to do [18:19:07] aude: lemme paste what I have [18:20:37] ok [18:20:43] i can try to read backscroll [18:25:31] aude: I meant the email /me was distracted for a second... getting to it now [18:25:38] * greg-g will create the stub incident report [18:27:04] ugh, it's a mess :) [18:28:10] greg-g: ok [18:30:15] heya akosiaris, re our earlier discussion [18:30:23] should anyone with a shell account also have bastion access? [18:30:26] they should, right? [18:30:40] and, bastion access is granted via admins::restricted, yes? [18:33:01] greg-g: and maybe answer to https://bugzilla.wikimedia.org/show_bug.cgi?id=63517 (if it's in any way related) [18:41:42] (03PS1) 10Ottomata: Adding stat1003 accounts to admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:42:05] aude: it's maybe kinda related but not really :) [18:42:56] greg-g: ok [18:43:04] !log redeployed updated patch for bug63251 to fix a reported bug [18:43:09] Logged the message, Master [18:43:50] ok, siebrand answered [18:44:24] is jenkins dead again? [18:44:45] is, will be fixed shortly/should be fixed now [18:44:55] (it's in the process of being fixed) [18:49:09] aude: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140403-Deploy [18:49:50] siebrand, thanks for assistance on the etl thing [18:50:07] siebrand: you should put #wikimedia-mobile on your default channel list, too! [18:50:10] dr0ptp4kt: No problem. I came across some weird shit. [18:51:47] greg-g: thanks [18:55:06] hm, anything in particular broke? [18:56:26] 03:28 ori: Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [18:56:29] OIC [18:56:47] (03PS2) 10Ottomata: Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:57:04] (03PS3) 10Ottomata: Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:59:27] (03CR) 10Ottomata: [C: 032 V: 032] Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 (owner: 10Ottomata) [19:02:00] i'm confused by the use of ssh -W to access labs guests. someone got an example command line which doesn't use an ssh config file? [19:02:05] uh; anyone want to look at why zuul isn't submitting jobs to jenkins? [19:03:47] mwalker: Krinkle is already looking into it [19:05:52] !log Zuul / Jenkins stalled again.
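[Editor's note: a hedged answer to the ssh -W question above — one command line, no config file needed. -W %h:%p asks the bastion's sshd to forward the connection's stdio straight to the target host and port, replacing the older "ProxyCommand ssh bastion nc %h %p" idiom. The user and host names here are illustrative assumptions, not taken from this log.]

    ssh -o ProxyCommand='ssh -W %h:%p user@bastion.wmflabs.org' user@instance-name.eqiad.wmflabs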
[19:05:57] Logged the message, Master [19:05:58] Krinkle: that is a mess :D [19:06:20] hashar: Jenkins is executing some jobs though, also things triggered by timer (e.g. beta-config) are running fine every 10 minutes [19:06:27] So it's very strange [19:06:34] the beta config jobs are timed by Jenkins [19:06:40] using a different scheduler [19:06:49] a few days ago we had a similar issue [19:07:01] for some reason Jenkins was no longer allocating jobs to the gallium slave [19:07:23] !log Jenkins unpooling gallium slave [19:07:28] Logged the message, Master [19:08:50] ahh [19:12:32] bd808: So, ExtensionMessages [19:12:40] As I *think* I understand it, the issue is this [19:12:52] tin has two MW copies [19:13:13] The one in /a/common that's updated from git, and the one in /usr/local/apache/common that's updated from the former through the sync scripts [19:13:20] i.e. it is both the sync master and a sync slave [19:13:29] mwscript uses the slave copy [19:13:29] * bd808 agrees [19:13:48] So when we rebuild ExtensionMessages.php, we run mwscript mergeMessageFileList.php and that runs on the slave copy [19:14:08] Which does not have enough information to create the correct ExtensionMessages.php [19:14:46] So instead, we should either 1) run that script against the master copy, or 2) run sync-common or whatever the equivalent is in the Brave New Python World on tin before running mergeMessageFileList.php [19:16:25] I think #2 might be more feasible than #1 but I don't really know how to do either any more given that a bunch of stuff has been ported and rewritten and whatnot [19:16:29] !log restarting Jenkins [19:16:34] Logged the message, Master [19:16:47] Hmmm… there's something not quite right there though. On a Thursday deploy mw-update-l10n runs before /usr/local/apache/common-local has been updated. [19:17:02] And it does see the new branch [19:17:17] So I think … I need to look at how it does that [19:17:21] Right [19:19:17] unsurprisingly jenkins is dying ... [19:19:26] RoanKattouw: if [ -d "$MW_COMMON_SOURCE" ]; then MW_COMMON_DIR_USE=$MW_COMMON_SOURCE [19:20:02] So when running on tin, where MW_COMMON_SOURCE exists, it looks to me like that is used as the root for mwscript [19:20:44] Well that at least picks the version of multiversion/MWScript.php to run [19:22:21] !log Jenkins is processing jobs again. Queue unchanged so it will resume everything [19:22:25] Logged the message, Master [19:22:47] (03PS1) 10Ottomata: Reverting previous change, we will figure this out next week. [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 [19:23:18] Krinkle: so that is fixed. For some reason Jenkins stopped processing jobs :-/ Must be a bug in the gearman plugin. [19:23:24] Krinkle: fix is to restart Jenkins [19:23:47] hashar: What level of restart are we talking? [19:24:01] what level of restart do you know? [19:24:01] :D [19:24:09] basically: /etc/init.d/jenkins stop [19:24:13] wait a bit [19:24:24] verify whether jenkins is still around with: ps -u jenkins f [19:24:26] Can you e-mail qa-list (sorry if you did that already, will look) with a small set of steps (e.g. unqueue slave X using this technique, do the restart with this, then repool over there) [19:24:38] for some reason Jenkins tends to not be killed properly. So sometimes you need to kill -9 it :( [19:24:40] Oh, you mean the entire lib, not just the gallium slave? [19:24:44] then /etc/init.d/jenkins start [19:24:51] I thought you just unqueue, restart slave, repool? [19:24:56] That's what fixed it last time, right?
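[Editor's note: hashar's restart procedure from the exchange above, consolidated into one sketch. The stop/check/start commands are as stated in the conversation; the sleep duration and the pkill fallback are assumptions standing in for "wait a bit" and "kill -9 it".]

    /etc/init.d/jenkins stop
    sleep 30                  # "wait a bit"
    ps -u jenkins f           # verify whether jenkins is still around
    # if the process refuses to die, as it sometimes does, force it:
    # pkill -9 -u jenkins java
    /etc/init.d/jenkins start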
[19:25:01] yeah that did [19:25:06] but not this time [19:25:22] and for some reason lanthanum does not seem to be executing jobs right now :( [19:26:02] (03PS1) 10coren: Labs: add wb_property_info to replication [operations/software] - 10https://gerrit.wikimedia.org/r/123902 [19:26:15] hashar: I've bumped executors to 12 on gallium. Will monitor closely, and put back to 5 (maybe 7) in a bit. [19:26:32] RoanKattouw: Oh. Your option #1 (sync local before running mw-update-l10n) is actually what happens right now. [19:26:53] That was my #2, right? [19:26:54] Krinkle: ah no please [19:26:55] Hmm, weird [19:26:58] Krinkle: that is going to kill the server [19:26:59] (03CR) 10coren: [C: 032] Labs: add wb_property_info to replication [operations/software] - 10https://gerrit.wikimedia.org/r/123902 (owner: 10coren) [19:27:31] (03PS1) 10Tim Landscheidt: Use apt::repository instead of file resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/123903 [19:27:41] RoanKattouw: Yes, #2. But I think that both #1 & #2 happen today [19:27:43] hashar: Ow.. Yeah, might've exaggerated a bit [19:28:03] !log Jenkins: unpooled slave agent on lanthanum, killed the java agent on it and repooled it. [19:28:07] Logged the message, Master [19:28:10] weird bug [19:28:16] (03CR) 10coren: [C: 032] Tool Labs: Fixes for webgrid [operations/puppet] - 10https://gerrit.wikimedia.org/r/123880 (owner: 10coren) [19:28:24] RoanKattouw: #2 happens here -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/main.py#L132 [19:28:35] Krinkle: yeah gallium is a bit starved because it runs jenkins + zuul. So you don't want too many jobs to run on it [19:28:52] Krinkle: turns out lanthanum was bugged and had no executor slots left :/ [19:28:58] hashar: Hm.. do we have like a monitor for the slaves and their continued uptime and link with jenkins? [19:29:12] RoanKattouw: and #1 happens here -- https://github.com/wikimedia/operations-puppet/blob/production/files/misc/scripts/mwscript#L8 [19:29:26] so that we know if lanthanum goes AWOL again [19:29:42] Krinkle: well that seems to be an issue with any slaves :( [19:29:49] and ideally also something to detect this weird "gallium stops processing jenkins jobs" though that's probably harder [19:30:15] I am looking at the thread view in https://integration.wikimedia.org/ci/monitoring [19:30:19] because during that time it still processes timed jobs and jobs for labs, and the web interface is also working fine [19:30:21] there is a thread per executor [19:30:47] it's just the jobs for its slave that stop working, even though it says "slave: gallium: OK. 7 idle executors" [19:30:49] as if nothing is wrong [19:31:26] yup [19:31:41] so either there is something weird in Jenkins or the executors are not registered properly in gearman [19:33:15] (03PS2) 10Ottomata: Reverting previous change, we will figure this out next week. [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 [19:33:36] (03CR) 10Ottomata: [C: 032 V: 032] "Reverting this after IRC discussions. See pending email..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 (owner: 10Ottomata) [19:36:50] (03PS1) 10Ottomata: Disabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123907 [19:38:17] Krinkle: that is a bug in gearman definitely. [19:38:34] lanthanum only has a few jobs registered [19:38:50] Krinkle: http://paste.openstack.org/show/75111/ [19:39:09] bd808: Right....
odd [19:39:54] (03CR) 10Ottomata: [C: 032 V: 032] Disabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123907 (owner: 10Ottomata) [19:41:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:31] RoanKattouw: Do we have confirmation that ExtensionMessages-1.23wmf21.php was missing data or just that the l10n cdb files were missing data? [19:42:40] So, *last week* [19:42:49] When this happened for newly migrated extensions [19:42:56] bd808: RoanKattouw btw, when y'all are "done" here (heh) could one of you put us back to wmf21 on phase0 wikis? (if you deem it appropriate given your conversation, which I haven't watched) [19:42:59] ExtensionMessages-1.23wmf20.php was definitely missing data [19:44:32] (03PS1) 10Ottomata: No need to remove ferm confs on base::firewall ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123913 [19:44:54] (03CR) 10Ottomata: [C: 032 V: 032] No need to remove ferm confs on base::firewall ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123913 (owner: 10Ottomata) [19:46:13] RoanKattouw: Ok. Next question is: was this broken in the initial scap or the subsequent l10nupdate run? Both times it was noticed after l10nupdate ran, if I understand the timelines. [19:46:20] I don't know [19:46:24] It might be the LU run for all I know [19:46:32] Timeline-wise that seems more plausible [19:46:50] Because both times it broke on a Thursday (wmfN+1 day) but in the evening PDT (LU time) [19:46:58] Is there an equivalent to .pep8 for puppet-lint to silence individual rules for a directory? [19:47:45] RoanKattouw: Right.
And when it was messed up during scap previously I heard about it within 30 minutes each time from folks using mw.o or the test sites [19:48:26] i'm sure test2 and test wikidata were fine when they were switched [19:48:33] I started to read through the l10nupdate code last week but never really grokked what it did [19:48:40] no idea what happened later [19:48:53] PROBLEM - SSH on stat1003 is CRITICAL: Connection timed out [19:49:53] RECOVERY - SSH on stat1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.2 (protocol 2.0) [19:51:21] (03PS1) 10Ottomata: base::firewall always needs defs, but main-input should be absent if ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123915 [19:51:25] bd808: Yeah so that suggests LU is breaking it instead [19:51:37] (03PS1) 10coren: Labs: automate federated table maintenance [operations/software] - 10https://gerrit.wikimedia.org/r/123916 [19:51:38] So the question is, 1) do the LU scripts rebuild ExtensionMessages, and 2) WHYYYYY [19:51:44] because they shouldn't need to AFAICT [19:51:57] (03CR) 10Ottomata: [C: 032 V: 032] base::firewall always needs defs, but main-input should be absent if ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123915 (owner: 10Ottomata) [19:52:49] (03CR) 10coren: [C: 032 V: 032] "Tested to run (and, indeed, ran)" [operations/software] - 10https://gerrit.wikimedia.org/r/123916 (owner: 10coren) [19:53:09] RoanKattouw: Was it the ExtensionMessages in /a/common that was bad last week or the copy in /usr/local/apache/common-local? [19:53:34] Because the local sync happens before it is built. [19:54:37] The /a/common one was wrong [19:54:52] Perplexing [19:54:55] I didn't inspect the slave copy but the site's behavior suggested that one was wrong as well [19:55:18] I see where you're going with this though: maybe the file is rebuilt but the rebuilt version is then not synced [19:55:24] We should have those damn files under version control so we can see when they change [19:55:47] Ideally generated files like these aren't under version control, right? [19:55:56] I mean that holds up so long as their generation isn't broken [19:58:37] Well, ideally they aren't in upstream version control, but there's really no good reason that they aren't versioned and tracked in the local clone. Other than that making the local clone diverge from the origin would break some things Sam does there to roll out new branches. [19:58:59] The same with PrivateSettings [20:00:07] And actually for these files I don't see why they couldn't be added to the origin repo [20:00:31] Right [20:00:33] We do something similar with the bits webroot where generated content is tracked [20:07:30] RoanKattouw: One of the things that confuses me about l10nupdate is that it maintains checkouts of core and extensions in /var/lib/l10nupdate/mediawiki. How are these used when it runs? [20:12:18] * bd808 sees this has something to do with extensions/LocalisationUpdate [20:15:51] RoanKattouw: Could it be that LU pulls in new content from translatewiki and that's what is missing from ExtensionMessages? [20:35:44] bd808: So LU uses the checkouts in /var/lib/l10nupdate to get the values of the i18n messages in master [20:35:51] Those checkouts are master checkouts [20:36:04] (Sorry for the delay, was eating lunch) [20:36:05] RoanKattouw: I'm at least going to blame LU for some of the breakage.
Last night's log is full of warnings about "LU_Updater::readMessages: Unable to parse messages from …/something.i18n.php" that seem to be caused by LU_PHPReader not understanding how to deal with the shim files that are in use for some of the extensions (e.g. PagedTiffHandler) [20:36:12] No that's expected [20:36:18] I explained on the wmf-engineering thread [20:36:20] What happens is this: [20:36:30] * LU wants to update i18n from master [20:36:53] * LU uses $wgExtensionMessagesFiles and $wgMessagesDirs to determine where to get messages from [20:37:09] * If you're on wmf{N-1}, this information might be outdated relative to master [20:37:16] * LU applies outdated information to the master checkout [20:37:28] * LU tries to read a shim .i18n.php file believing it to be a real .i18n.php file [20:37:53] This only happens 1) during the wmf{N-1} run for extensions that were converted in wmfN and 2) during the wmfN run for extensions converted very recently in master [20:38:05] In practice I've only observed #1, I'm theorizing that #2 should occur as well but I haven't observed it [20:38:27] but that might be because I've only been observing on Thursday evenings, where the number of hours that Siebrand has been awake between the wmfN cut and me observing is essentially zero [20:38:38] s/where/when/ [20:39:30] bd808: So here's another theory [20:39:37] Ok. So that would just keep new messages from being merged ^ [20:39:44] Yes, exactly [20:39:45] And only for one week [20:39:58] Because wmf{N+1} will have the right config info [20:41:25] bd808: OK, so alternate theory: 1) initial scap on Thu morning misgenerates ExtensionMessages-wmfN.php, 2) another bug causes the l10n cache to not be rebuilt despite this change, so the brokenness of ExtensionMessages-wmfN.php is not exposed, 3) LU runs, 4) l10n cache is rebuilt in response to LU's changes, 5) broken ExtensionMessages-wmfN.php is now exposed [20:41:53] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1532.0689802 [20:42:59] Here's my -1 for that theory: it would mean that there is no l10n cache at all for the new branch and we would have noticed that right away [20:44:37] (03PS1) 10Tim Landscheidt: Tools: Remove lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 [20:44:45] Which wiki did Chad leave broken? testwiki? [20:44:48] yeah [20:45:39] Has anyone tried to identify *what* is broken? Like what pages are showing bad messages? [20:45:55] all? [20:46:12] * bd808 disagrees [20:46:18] https://test.wikipedia.org/wiki/Special:Version LGTM [20:46:22] wait, what.......... [20:46:25] that's different than last night [20:46:40] everything on Main Page was broken [20:46:47] all the tabs and sidebar links [20:47:04] so the plot thickens [20:47:09] Yeah last night all core messages were broken [20:47:10] I'm assuming it's because niklas kicked LU last night [20:47:14] Niklas did something overnight [20:47:16] well, not lu [20:47:20] He didn't kick LU [20:47:21] but the l10ncache [20:47:27] "This seems to be confirmed by the fact that running "mw-update-l10n", "sync-l10nupdate" and "sync-l10nupdate-1 1.23wmf21" fixed test.wikipedia.org.
" [20:47:31] that [20:47:42] https://wikitech.wikimedia.org/wiki/Incident_documentation/20140403-Deploy :) :) [20:47:53] So he basically re-ran scap in a controlled manner [20:47:57] * greg-g nods [20:48:23] (03PS1) 10Manybubbles: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 [20:48:31] OK so it looks like things are working now then [20:48:39] And thus wiped out any chance of figuring out what went wrong until it happens again next Thursday [20:48:42] By basically running scap twice [20:48:50] Yeah :( [20:48:52] My apologies [20:48:56] I told him to do that [20:49:16] Which was the classic bootstrap problem that kept happening in bug 51174 [20:49:32] A second scap would magically fix everything [20:50:18] * bd808 wants to see l10n caching burn in a fire and be reborn from the ashes [20:50:22] easy fix! have scap call scap-1 and scap-again which just calls scap-1! [20:50:24] bd808: The "fixes [20:50:32] bd808: The "fixes" associated with that bug look very scary [20:50:45] They were shot in the dark hacks [20:50:46] If the EM file isn't being built correctly and we just cp the one from the previous version in, that would be bad [20:51:02] AFAICT that's what's happening now: the initial EM for wmfN+1 is just identical to the wmfN one [20:51:10] Yeah. THat was Sam's quick and dirty hack [20:51:38] hmm… like the wrong version is being picked by multiverison? [20:52:26] Maybe? [20:52:29] * bd808 goes to re-read mw-update-l10n yet again [20:52:45] Is wikiversions.dat being updated *and synced* at the right time? [20:53:39] It's still weird that the cache build works [20:54:04] Maybe the initial l10n cache is being built in a different way and so it gets build correctly, but any subsequent rebuilds (like LU) fail? [20:54:16] So mwversionsinuse reads the json version of wikiversions from /usr/local/apache/common-local [20:55:14] During a scap this file would have been copied over from /a/common just before mw-update-l10n runs [20:55:39] (03CR) 10Jforrester: [C: 04-1] "Whoops, this should cover Search too." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [20:55:43] If there wasn't data for the new version there would be no l10n cache built [20:56:10] (03CR) 10Tim Landscheidt: [C: 04-1] "Needs to be tested on Toolsbeta first." [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt) [21:00:56] greg-g: Do you still want group0 back to wmf21? [21:14:53] bd808: it'd be nice, yeah, if you think it's safe :) [21:15:29] We can always roll it back again. I'll make the patch and scap it [21:15:35] ty, sir [21:15:59] (03PS1) 10BryanDavis: Revert "Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 [21:17:46] (03CR) 10BryanDavis: [C: 032] "We are going to try this again and hope that it works this time. testwiki seems fixed by things that Niklas did last night." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 (owner: 10BryanDavis) [21:17:54] (03Merged) 10jenkins-bot: Revert "Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 (owner: 10BryanDavis) [21:19:56] !log bd808 Started scap: Group0 to 1.23wmf21 (again) [21:20:02] Logged the message, Master [21:20:58] greg-g: If you look at https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor right now you can see where I added a marker for scap runs starting (and ending) [21:23:07] RoanKattouw, greg-g: fingers crossed. Scap is definitely rebuilding 1.23wmf21 cache. 1.23wmf20 was unchanged [21:27:04] Updated 366 JSON file(s) in '/a/common/php-1.23wmf21/cache/l10n'. [21:34:32] !log bd808 Finished scap: Group0 to 1.23wmf21 (again) (duration: 14m 35s) [21:34:38] Logged the message, Master [21:35:36] greg-g: https://www.mediawiki.org/wiki/Special:Version looks ok. [21:35:44] w00t [21:35:46] thanks bd808 [21:37:09] aude: group0 is back to 1.23wmf21. [22:05:15] (03PS7) 10Hoo man: Introduce an admins::release user group [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 [22:05:54] (03CR) 10Hoo man: "Bump... anyone? The users already confirmed per email that they're fine with this, so I don't see what's blocking this..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [22:42:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
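[Editor's note: a rough sketch of "option #2" from the ExtensionMessages discussion above — sync the local copy on tin first, then regenerate the file against up-to-date code. The wiki name, flags, and output path are illustrative assumptions; the real mw-update-l10n wiring differs in detail.]

    sync-common    # bring /usr/local/apache/common-local up to date with /a/common first
    mwscript mergeMessageFileList.php --wiki=testwiki \
        --list-file=/a/common/wmf-config/extension-list \
        --output=/a/common/wmf-config/ExtensionMessages-1.23wmf21.php
    # then rebuild the l10n cache against the regenerated file and sync it out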