[00:03:08] did it with command line, but odd that the mirror option isn't in the dialogs [00:11:36] James_F: all good? [00:12:27] Working (partially) on beta for me [00:12:41] At least it's better than before [00:12:42] partially? [00:12:54] I think...hang on [00:13:20] Asking in -ve [00:14:46] greg-g: It may be our fault. [00:15:19] "good" [00:15:43] Well yeah. [00:25:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 498370 bytes in 9.440 second response time [00:25:57] (03PS2) 10Ori.livneh: hhvm: abstract out backports to a hhvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/123573 (owner: 10Hashar) [00:30:59] (03CR) 10Ori.livneh: [C: 032] hhvm: abstract out backports to a hhvm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/123573 (owner: 10Hashar) [00:40:37] (03PS2) 10Ori.livneh: performance.wikimedia.org: use ScriptAlias rather than ScriptAliasMatch [operations/puppet] - 10https://gerrit.wikimedia.org/r/122868 [00:41:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:33] (03CR) 10Ori.livneh: [C: 032] performance.wikimedia.org: use ScriptAlias rather than ScriptAliasMatch [operations/puppet] - 10https://gerrit.wikimedia.org/r/122868 (owner: 10Ori.livneh) [00:48:12] PROBLEM - puppet disabled on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:12] PROBLEM - SSH on labstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:48:12] PROBLEM - RAID on labstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:48:12] RECOVERY - puppet disabled on labstore1001 is OK: OK [00:48:12] RECOVERY - SSH on labstore1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.2 (protocol 2.0) [00:48:12] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 60 logical, 60 physical [00:55:07] greg-g: OK to deploy to phase0 then? [01:08:36] !log krinkle synchronized php-1.23wmf21/resources 'I6e93d9ab0e4a926c09c' [01:08:44] Logged the message, Master [01:09:01] James_F: Krenair: ^ [01:14:31] ori: Do you know where bits.wikimedia.org is maintained? the performance.wm.o one is in puppet [01:16:23] is there somewhere I can see all our git repos? I know I have found a page in the past [01:16:31] or thought I knew it [01:16:37] chasemp: https://git.wikimedia.org/ [01:16:44] chasemp: https://github.org/wikimedia [01:16:51] chasemp: https://github.com/wikimedia [01:16:54] chasemp: https://gerrit.wikimedia.org/r/#/admin/projects/ [01:17:11] the latter one is most complete, but also hardest to use :) [01:17:31] Oh, it has gitblit links now [01:17:32] thank you, cool [01:17:32] nie [01:17:33] nice [01:17:53] ori: Hm.. bits index is in wmf-config/../docroot/bits [01:17:54] interesting [01:17:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 486675 bytes in 7.878 second response time [01:18:17] bd808: btw, any idea what's up with gitblit/antimony claiming critical/recovery in a loop constantly [01:19:19] Krinkle: Nope. I haven't been paying attention to it. I did see some folks complaining about git.wm.o being down for them a bit ago.
[01:36:26] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:36:26] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [01:58:52] Krinkle: it's extremely slow, and its response times often exceed the generous threshold set by the alert script [01:59:26] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [01:59:39] OK, so it's genuine. Not like the icinga test is using the wrong hostname or using a different lvs than external requests [01:59:40] Krinkle: I proposed a solution upstream that they think is sensible: [01:59:49] ori: looks suspicious because it says 'gitblit.wikimedia.org' [02:00:27] tl;dr: putting varnish in front and writing a gerrit stream-events subscriber that purges pages [02:00:48] but not enough hours in the day :/ [02:01:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [02:02:35] Krinkle: it's checking the host using its internal address, but using a host header to make sure the right vhost is served [02:02:47] Sounds like you need, *puts sunglasses on*, +2. [02:02:49] Gloria: Behave, please. [02:03:27] :) [02:20:22] did gerrit die again? [02:22:45] Is gerrit down and why? [02:24:51] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 02:24:51+00:00 [02:24:57] Logged the message, Master [02:31:51] because of PiRSquared [02:31:54] j/k [02:32:17] Bsadowski1: obviously it's your fault [02:32:38] seems to load, but slowly [02:36:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:36] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:47:26] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:50:26] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 02:50:26+00:00 [02:50:30] Logged the message, Master [03:12:24] ori: here [03:12:53] OK, so 5xx resps are spiking, mediawikiwiki has missing messages all over its interface, and db1047 is flapping [03:13:14] springle, ping re: db1047; we'll look at the other stuff [03:14:21] the mw error count graph doesn't show a current spike of fatals or exceptions, so the 5xx must be generated in varnish [03:15:33] hmph, I can't use this on eval.php to get the debug log output: > $wgDebugLogFile = '/dev/stderr' [03:15:52] what are you trying to do? [03:16:18] 5xx resps don't seem abnormally high, pace the alert, so that makes the missing messages the top priority [03:16:26] (looking at http://gdash.wikimedia.org/dashboards/reqerror/ ) [03:16:42] I'm trying to get debugging output from wfMessage( 'pagetitle' ) [03:16:43] ori: ok great [03:16:56] since you're looking at this, I might step away [03:17:01] I'm clearly very rusty on the shell front [03:18:54] i'm just going to post a message on Project:Support_desk saying we're aware of the issue and may let it stand for a bit as we diagnose [03:23:16] {{done}} [03:23:17] ori: do you know what the problem is with group 0?
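(A rough sketch of the purge subscriber ori proposes above, assuming a hypothetical Varnish in front of gitblit whose VCL accepts PURGE requests, a gerrit account with stream-events permission, and jq on the box; the account name and the purged URL scheme are illustrative, not gitblit's real ones:)

    # follow gerrit's event stream; on every ref update, purge the cached
    # pages for that repository from the front-end cache
    ssh -p 29418 purge-bot@gerrit.wikimedia.org gerrit stream-events |
    while read -r event; do
        # ref-updated events carry the project name in .refUpdate.project
        project=$(printf '%s' "$event" | jq -r '.refUpdate.project // empty')
        [ -n "$project" ] || continue
        curl -s -X PURGE "http://git.wikimedia.org/summary/${project}.git" >/dev/null
    done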
[03:23:53] * PiRSquared finds post [03:24:18] ugh https://www.mediawiki.org/wiki/Special:RecentChanges is ugly [03:24:59] https://www.mediawiki.org/wiki/Thread:Project:Support_desk/Missing_interface_messages [03:25:16] wooo LQT [03:25:51] PiRSquared: why did you say group 0? are you seeing this anywhere else? [03:25:55] testwiki [03:26:13] ah, yep [03:26:17] test2wiki [03:26:20] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 02:50:26+00:00 [02:50:26] [03:26:28] not sure what other wikis are group 0 [03:26:33] and earlier: !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 02:24:51+00:00 [03:27:26] MediaWiki.org testwiki test2wiki testwikidata [03:27:49] ok [03:27:52] interface messages are missing on all of them [03:28:07] all on 1.23wmf21 [03:28:36] !log Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [03:28:38] I assume it's related to the JSON thing? [03:28:42] Logged the message, Master [03:28:55] werdna: well, nothing is broken on http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page [03:29:21] so it is production-specific? [03:30:03] yes [03:30:06] /var/log/l10nupdatelog/l10nupdate.log is useful [03:30:07] on tin [03:32:06] <^demon|away> yo werdna, ori [03:32:08] <^demon|away> I saw the panic e-mails. [03:32:12] sup ^demon [03:32:35] I've bowed out. Everything's changed and I don't know how to shell anymore [03:32:37] http://p.defau.lt/?XBHp0vWrML918CKsbt2_KA is the log of the failed sync [03:32:39] <^demon> well I was playing playstation :p [03:32:59] PHP Warning: LU_Updater::readMessages: Unable to parse messages from file:///a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php in /a/common/php-1.23wmf21/extensions/LocalisationUpdate/Updater.php on line 63 [03:33:16] keep in mind that most of these errors / warnings were also logged for the 1.20 update, which did succeed, as far as we know [03:33:54] $reader = $readerFactory->getReader( $filename ); [03:33:57] I bet PagedTiffHandler now loads from json and so it doesn't recognise the format [03:33:59] or something [03:34:18] hmm this too [03:34:18] Warning: include(): Failed opening '/a/common/php-1.23wmf21/extensions/Wikidata/extensions/Wikibase/repo/Wikibase.i18n.php' for inclusion (include_path='/a/common/php-1.23wmf21/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/a/common/php-1.23wmf21:/usr/local/lib/php:/usr/share/php') in /a/common/php-1.23wmf21/includes/cache/LocalisationCache.php on line 517 [03:36:25] <^demon> LU on the cluster appears to have the json code now. [03:38:07] * ^demon is poking [03:40:11] hahaha [03:40:11] fenari: Failed to add the RSA host key for IP address '2620:0:860:2:208:80:152:165' to the list of known hosts (/home/l10nupdate/.ssh/known_ho). [03:40:18] fenari is a known ho? [03:40:46] <^demon> def. [03:41:13] yeah, so the cdb files are mismatched [03:42:55] * ^demon is doing some live debugging on tin. [03:42:59] <^demon> nobody scap or anything. [03:43:06] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 4 03:43:03 UTC 2014 (duration 43m 2s) [03:43:11] Logged the message, Master [03:43:14] <^demon> Probably a lie. [03:43:18] <^demon> ^ [03:43:19] does that count as "anything"? 
:p [03:43:24] * werdna slaps logmsgbot  [03:43:46] however, the JSON files are exactly synchronized [03:43:52] between tin and mw1001 (picked at random) [03:44:06] so the script that is supposed to run and update the cdb files based on the json contents is failing [03:45:29] <^demon> Well yes [03:45:53] <^demon> werdna pointed to the right bit earlier. [03:45:54] <^demon> PHP Warning: LU_Updater::readMessages: Unable to parse messages from file:///a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php in /a/common/php-1.23wmf21/extensions/LocalisationUpdate/Updater.php on line 63 [03:46:04] <^demon> (It's not limited to PagedTiffHandler tho) [03:47:12] <^demon> Somebody didn't test all this l10n update stuff. [03:48:03] <^demon> The exception's actually pretty interesting and gives a lot of info, shall pastebin. [03:48:20] <^demon> http://p.defau.lt/?bk_8Uw9pjQTdpi7aVG3pxQ [03:48:38] <^demon> So basically something's still feeding l10nupdate .php files but it's configured for json. [03:49:25] <^demon> Or it's trying to read the json as php. [03:49:30] <^demon> This is weird. [03:49:43] Here's a diff between LocalisationUpdate on wmf20 and wmf21 [03:49:43] http://p.defau.lt/?JaFVYdJT1aJyAr0ZVnaw8w [03:49:52] I've deleted all of the actual localisation update files [03:49:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 497320 bytes in 9.160 second response time [03:50:14] <^demon> Yep. [03:50:15] Full diff is here: http://p.defau.lt/?BZFyi_jsAYoeWHKFhIcxrQ [03:50:30] It looks as if LocalisationUpdate hasn't actually been changed at all [03:50:45] <^demon> Well the json-y stuff was merged a little further back I think. [03:51:15] <^demon> Like, a week ago? 2? [03:51:40] (please don't delete files) [03:52:09] <^demon> (from the diff) [03:52:27] ori: I mean I deleted them from the diff [03:52:43] ori: I am logged out of tin, I am not doing anything on any wmf servers [03:53:07] ^demon: here's my hypothesis: the LU_ReaderFactory takes a file name and gives back a reader [03:53:14] because the Localisation files have a .php extension [03:53:18] it gives a PHP reader [03:53:34] but the .php extension is a shim around the .json files [03:53:42] so the PHP reader doesn't recognise them as localisation files [03:53:44] and errors out [03:53:51] <^demon> That's about where I'm at with this too. [03:54:09] if ( preg_match( '/i18n\.php$/', $filename ) ) { [03:54:09] return new LU_PHPReader(); [03:54:09] } [03:54:12] yep [03:55:01] so how do we fix this? [03:55:09] <^demon> First lemme revert my debug hack to LU so it doesn't accidentally sync. [03:55:27] <^demon> Ok done [03:55:37] ^demon: we could always look for $fileName = __DIR__ . "/i18n/$csCode.json"; [03:56:09] <^demon> What if we moved the check for json above the php? [03:56:09] if it fails, then we look for that in the file, and if that's the case then we add i18n/*.json [03:56:11] ORRR [03:56:21] we just add i18n/*.json generally [03:56:24] <^demon> If we've got json we obvs. wanna ignore teh php. [03:56:29] and let the PHP error out anyways [03:56:33] yeah, or that [03:56:48] <^demon> Lemme try this on tin [03:57:19] it already exists on HEAD [03:57:19] // Json should take priority if both exist [03:57:19] unset( $this->php[$key] ); [04:00:10] <^demon> My idea didn't work :\ [04:02:09] $finder = new LU_Finder( $wgExtensionMessagesFiles, $wgMessagesDirs, $IP ); [04:02:17] are all of these extensions setting $wgMessagesDirs [04:03:47] <^demon> Most of them? 
[04:03:49] <^demon> Best I can tell. [04:03:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:04:12] and anyway this isn't extensions it's core [04:04:20] so why is core breaking? [04:04:35] Honestly I'd be tempted to back it out of 1.23wmf21 and get the l10n team to fix it [04:06:14] <^demon> Guess it's middle of the night for most of them. [04:06:20] <^demon> And RoanKattouw is _away [04:06:43] right, … tomorrow :P [04:07:53] this happened the other day [04:07:56] or something very similar, at least [04:08:02] and a manual l10nupdate fixed it [04:08:07] i'm suggesting we hold off for another minute or two [04:08:10] i'm still poking [04:09:37] <^demon> Could we roll those 4 wmf21 wikis back to wmf20? [04:12:42] well, a manual l10nupdate, if it fixes it, would be less invasive [04:13:02] anyways, cdb files and json files are now in sync everywhere, so it's not a synchronization issue [04:13:21] but the en cdb file for 1.21 is 1.4mb, whereas it's 2mb for 1.20 [04:13:50] <^demon> I ran it manually a few times. [04:13:54] <^demon> Same result as waiting for the cron. [04:15:32] I mean, I could set up L10nUpdate locally and poke at the breakage [04:15:41] but I figure it would be more efficient for the l10n team to fix it [04:16:01] <^demon> Well that's why I'm wondering if I roll the wikis back to wmf20 [04:16:20] <^demon> Will they get ok l10n to stop-gap until we can prod l10n ppl in their AM? [04:17:04] you could just roll mw.org back [04:17:07] the rest aren't important right? [04:17:07] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: mw.org back to 1.23wmf20 [04:17:11] <^demon> Trying [04:17:12] Logged the message, Master [04:17:38] MediaWiki.org testwiki test2wiki testwikidata [04:17:52] <^demon> Much better on mw.org [04:18:15] (03PS1) 10Chad: Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 [04:18:24] ^demon: no, wait [04:18:37] rolling back mediawiki.org was a good idea; there was no need to wait with the other wikis available for testing [04:18:41] but the other wikis are really far less urgent [04:18:48] so please limit it to just mediawiki.org [04:18:57] <^demon> Some people run tests against test2 and test.wikidata [04:19:04] <^demon> I'm going to leave test.wp broken for testing [04:19:21] <^demon> We only need one f'd up wiki. 
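(The shim hypothesis discussed above is easy to check from the shell: after the JSON migration, an extension's *.i18n.php is a small stub generated by maintenance/generateJsonI18n.php, while the real messages live in i18n/*.json, which the regex-based reader factory never selects. A sketch using the paths from the log; the grep is only a heuristic:)

    # a shim is tiny compared to a real message file, and points at i18n/
    ls -l /a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php
    grep -n i18n /a/common/php-1.23wmf21/extensions/PagedTiffHandler/PagedTiffHandler.i18n.php | head

    # the JSON files that LU_PHPReader knows nothing about
    ls /a/common/php-1.23wmf21/extensions/PagedTiffHandler/i18n/ | head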
[04:20:09] (03CR) 10Chad: [C: 032] Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 (owner: 10Chad) [04:20:10] so indeed the difference between the 1.21 and 1.20 cdb files represents core [04:20:16] (03Merged) 10jenkins-bot: Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123831 (owner: 10Chad) [04:20:48] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: unbreak test2.wp and test.wikidata as well [04:20:53] Logged the message, Master [04:28:14] ok, doing an l10nupdate that calls rebuildLocalizationCache with --force [04:28:47] because i suspect LocalisationCache::isExpired returning false is the issue [04:30:48] bbiab [04:37:26] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:37:26] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [04:38:46] whoa! 1970! [04:39:15] wow [04:39:22] * greg-g is now caught up [04:39:35] thanks ^demon ori werdna [04:39:46] no worries greg-g [04:41:10] <^demon> greg-g: I think we're ok enough for now, all but one of the wikis is un-broke. [04:41:30] <^demon> Enough for me to go |away again and not panic. But people should tread lightly with wmf21. [04:41:35] i think 1.21 will be unbroken shortly [04:41:48] ^demon: yep, go back to playstation [04:41:48] <^demon> 1.23wmf21 :) [04:41:50] but yeah, thanks ^demon [04:41:56] <^demon> We haven't had 1.21 in awhile ;-) [04:42:03] wikia is on 1.19 [04:42:07] but yeah :P [04:45:01] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-04 04:45:01+00:00 [04:45:04] Logged the message, Master [04:45:56] * greg-g waits for 21 [04:50:56] greg-g: you're like a college student outside a bar [04:54:18] werdna: POWER HOUR! [04:54:25] :p [04:54:46] australia's drinking age is 18! don't pander to us americans, we're on to you [04:55:36] also wouldn't say "college student" over here [04:56:07] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-04 04:56:06+00:00 [04:56:12] Logged the message, Master [04:56:15] to an Australian that means you live on campus and it includes food. [04:56:23] greg-g: you can go in and get wasted now, 21 is here! [04:56:33] nope [04:56:38] test.wikipedia.org still borked [04:57:11] yep :/ [04:57:18] welp, time for bed then [04:57:25] at least for me [04:57:59] gnight greg-g [04:58:35] * greg-g waves [05:03:47] oops. Just landed here. [05:04:12] ori: LocalisationCache::isExpired returning false --> was the only issue? [05:05:07] kart_: no, i tried updating with rebuildLocalisationCache.php --force to test that theory [05:07:06] ok. Will check with Niklas once he is up (meanwhile, will look at if I can do anything). [05:08:49] kart_: did you see my email? [05:11:50] werdna: yes. thanks! 
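(The manual rebuild ori describes would look roughly like this on tin; rebuildLocalisationCache.php and its --force flag are real, but the exact wrapper the l10nupdate cron invokes may differ, and the rebuilt CDB files still have to be synced to the apaches afterwards:)

    # bypass LocalisationCache::isExpired() and force-rebuild the l10n CDB
    # cache for a wiki on the affected 1.23wmf21 branch
    mwscript rebuildLocalisationCache.php --wiki=testwiki --force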
[05:12:43] yw [05:12:45] good luck :p [05:33:56] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 500359 bytes in 8.910 second response time [05:45:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 4 05:45:07 UTC 2014 (duration 18m 25s) [05:45:16] Logged the message, Master [05:55:56] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:54] morning ori [06:19:55] Nikerabbit: moin. [06:23:09] so nobody here anymore? sigh [06:23:16] I'm here [06:23:25] what's up? [06:23:54] Nikerabbit: ^ [06:24:38] paravoid: trying to catch up with the message failure last night [06:26:41] no clue :) [06:26:47] but if I can help somehow, do let me know [06:28:31] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Trying_tin.27s_code_on_testwiki not completely up to date [06:31:23] wow setting up servers in eqiad using carbon as the web proxy is blazingly fast [06:31:57] (and apt) [06:32:10] https://wikitech.wikimedia.org/wiki/Test.wikipedia.org also not competely up to date [06:32:49] but it looks like it is safe to do scap on 1.23wmf21 to test things [06:33:44] so this looks exactly the same issue as last time [06:37:14] https://wikitech.wikimedia.org/wiki/Configuration_files#extension-list_and_ExtensionMessages-XXX.php also not completely up to date [06:37:27] all three places seem to have wrong paths to "common" [06:39:29] you can add {{old}} to the wrong sections [06:48:10] mutante: good morning [06:48:35] mutante: the system_role/salt grain feature is broken on new installs :( [06:48:38] paravoid: hi [06:48:56] grain-ensure runs from the system_role before salt has been configured [06:49:00] and it needs a working minion [06:49:06] oh..hmm [06:49:08] but the minion is unconfigured at that point (doesn't have a master set up) [06:50:57] does it break the puppet run? [06:51:01] yes [06:51:10] grain-ensure blocks forever [06:51:21] I'm pretty sure there was a way to scap only one version... does someone know the command for that? [06:53:12] paravoid: grmbl..ok.. let me disable it and think about it more [06:53:26] or did you already comment it or something [06:53:43] I didn't [06:54:07] ok, thanks, i will (for now) [06:54:21] it's not just a dependency on the minion config, it also hangs if you haven't signed the salt key on the master [06:54:44] I think the best would be for grain-ensure to try once, then just fail [06:54:47] instead of blocking forever [06:54:54] but I have no idea how the salt API works [06:55:35] that makes sense (salt key).. and good hints..hmm [06:56:29] did not think enough about new install situation... [06:59:26] paravoid: unrelated question. did you rename yourself in LDAP/Gerrit in the past or did you have it from the beginning [06:59:59] re: having actual "Firstname Lastname" [07:00:13] ah it is no longer possible [07:00:40] ori-l_: you sleeping already? [07:00:41] rename [07:00:47] Ryan renamed me [07:01:20] gotcha. i think i also want to do it (need to check what happens to author name in old commits) [07:01:33] i'll catch up with him [07:04:56] /dev/sda1 2.8T 33M 2.8T 1% /srv/swift-storage/sda1 [07:05:01] ;) [07:06:13] ooh, you copied them all? [07:06:59] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 500716 bytes in 8.145 second response time [07:08:34] no [07:08:37] new boxes [07:08:38] with 3T disks [07:08:47] ms-be1013/4/5 [07:09:16] ah! 
nice [07:10:02] (03PS1) 10Dzahn: deactivating salt grain-ensure in global role [operations/puppet] - 10https://gerrit.wikimedia.org/r/123834 [07:11:30] so those were the new installs being blocked i suppose [07:12:03] yup [07:12:22] (03CR) 10Dzahn: [C: 032] "i'll try to find a solution later, just unblocking new installs for now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123834 (owner: 10Dzahn) [07:13:36] got it, there, should be unblocked now [07:14:41] nah, I killed it manually [07:14:48] fixed, salt, rerun puppet [07:14:55] alright [07:15:02] they're fine now, I just pinged you for future boxes [07:16:29] ok, yes. thanks, others would have ran into it next [07:17:17] btw, the strike in Germany is over now.. it seems i can actually fly on Monday [07:17:22] mutante: afaik old commits can't be changed [07:17:46] yay german strikes; not as serious as the french, but still striking ;) [07:17:59] PROBLEM - NTP on ms-be1014 is CRITICAL: NTP CRITICAL: Offset unknown [07:18:03] (or if they can be changed someone should tell Matma and Siebrand= [07:18:38] Nemo_bis: hmm.. ok ohloh has feature to merge multiple users into one for stats afair [07:19:31] Nemo_bis: http://www.bbc.co.uk/news/world-europe-26846479 [07:19:44] mutante: merkel is coming here next friday [07:19:46] it's gonna be fun [07:20:39] paravoid: yes! i heard from apergos .....much security [07:21:27] "nice" combo..all that.. strike in .de , Merkel going there... [07:23:10] paravoid: and our hotel is right in the good location where any assembly of over 3 people is not allowed ?:p [07:23:53] we'll see if they outlaw assemblies or "only"marches [07:23:59] RECOVERY - NTP on ms-be1014 is OK: NTP OK: Offset -0.007594466209 secs [07:24:07] mutante: yes, ohloh does [07:24:07] apergos: ah. morning! [07:24:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:30] yes, it is [07:24:46] didn't that _just_ recover [07:25:44] !log restarting gitblit [07:25:46] do any localisationupdate wikis hit git.wm.o btw [07:25:48] Logged the message, Master [07:26:21] dunno [07:29:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 518131 bytes in 9.571 second response time [07:29:12] <_joe_> paravoid: merkel is in athens on friday? I may join the protests, then [07:29:24] <_joe_> mutante: welcome back [07:33:55] ugh [07:34:20] I'm having 150ms pings to bast1001 and 10% packet loss [07:34:31] <_joe_> Nikerabbit: from where? [07:34:56] 7. 100ge5-2.core1.par2.he.net 0.0% 48 17.9 20.1 17.7 28.2 3.2 [07:35:00] 8. 
10ge15-1.core1.ash1.he.net 6.2% 48 167.6 157.3 136.8 167.6 7.1 [07:35:03] _joe_: Finland [07:35:06] <_joe_> I have 130 ms from my crappy DSL in italy [07:35:43] it definitely was better just a little while ago as I was debugging the message issues [07:36:02] anyway, now I'm done, doesn't matter [07:36:31] _joe_: hi:) [07:37:06] Nikerabbit: that argument / option was taken out iirc [07:37:28] 106ms 6% loss , from Germany [07:37:39] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:37:39] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [07:43:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:43:58] bleh he.net [07:44:17] (03PS1) 10Dzahn: give Chris Steipp access to release servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 [07:44:36] apergos: ^ ? [07:44:55] is it really just 1 line [07:44:58] he doesn't? [07:45:02] no [07:45:04] he asked for it [07:45:19] * apergos frowns [07:45:29] "noticed today that I don't have access to the releases server. Since I occasionally do the releases, it would be helpful " ? [07:45:36] 0.0 % packet loss for me, via ntt.net [07:45:39] makes sense to me [07:45:42] and yes it's one line [07:45:53] thx [07:46:09] someone (maybe jeff?) nicely turned that stuff into a class after we needed more than two people over there [07:46:34] but!! [07:46:37] don't merge don't merge [07:46:48] he is already there [07:46:57] mutante: ^^ [07:47:03] ok [07:47:14] see he's in there a few lines up [07:47:14] duuh, i'm blind [07:47:24] yea [07:47:54] hmm, let me check why he can't get access then [07:48:01] yep [07:49:17] (03CR) 10Dzahn: [C: 04-2] "Chris, re: #7188, this should already work, was about to add this duplicate.. checking your key" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 (owner: 10Dzahn) [08:07:11] !log deactivating cr1-eqiad<->HE peerings, transatlantic par2<->ash1 is congested [08:07:15] Logged the message, Master [08:07:17] Nikerabbit: better? [08:07:29] paravoid: let me check [08:09:21] paravoid: ah sorry I was actually doing mtr from Germany, that's now 90 ms no packet loss, from Finland it's now 120 ms and no packet loss, so definitely better [08:14:25] great, thanks [08:42:27] (03CR) 10Dzahn: [C: 031] Add an account for subbu on Parsoid / Cassandra test hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/123433 (owner: 10GWicke) [08:51:01] (03Abandoned) 10Dzahn: give Chris Steipp access to release servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/123835 (owner: 10Dzahn) [08:59:19] (03CR) 10Dzahn: [C: 032] Clean up DSH groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/123213 (owner: 10MaxSem) [09:18:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 506654 bytes in 9.901 second response time [09:26:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:34] ^ so now that just keeps happening too often [09:27:01] it happened before but just every once in a while..
but that is worse [09:27:25] simple service restart doesn't cut it [09:28:16] ok [09:28:22] so how are you going to debug this further? :) [09:29:17] had the cause ever been identified? [09:33:00] java is using a lot of CPU when it times out. restarting it fixes it, for a little while, until it becomes busy again .. how to..restart in debug mode and attach with debugger or something? [09:33:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 506659 bytes in 9.786 second response time [09:34:47] <_joe_> mutante: with java apps you can see if they have JMX enabled and connect with jconsole [09:35:14] <_joe_> and look at the GC for example - very high cpu load usually means a lot of GC going on [09:35:30] anything in logs? I had a similar problem with solr, it was caused by GC going bonkers and the fix was to pick a different GC algorithm [09:36:47] <_joe_> MaxSem: it really depends on the memory usage profile usually [09:43:29] _joe_: jconsole 5043 .. nothing really happens [09:43:54] MaxSem: i just see the service restarts in syslog..hmm [09:44:28] " version of JConsole provided with the Java SE 6 platform can attach to any application that supports the Attach API. " [09:45:16] <_joe_> mutante: jconsole should give you info about the last GC event, for example [09:46:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:24] running jconsole with or without a , i don't get output or an error, just nothing [09:47:15] ah.. x11..ok [09:52:03] I guess we don't have any access log for gitblit do we ? [09:52:20] would be interesting to dig in and see which URLs are being hit, possibly with the time it took to process them [09:52:31] there might be some bad bot hitting a heavy-processing URI [09:56:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 503069 bytes in 9.656 second response time [09:58:35] hashar: yea, no, access.log in VirtualHost but nothing in it besides monitoring [09:59:29] mutante: and I guess we do not have logs for the misc varnish do we? [10:00:36] ah sorry misread your reply [10:05:01] <_joe_> hashar, mutante I bet it's a GC issue... 99% of java issues are GC-related; also, are we using java6 or java7? [10:05:51] 7. java version "1.7.0_25" [10:06:01] OpenJDK [10:06:03] I have shown the CPU graph to a java coworker here [10:06:12] he instantly told me: looks like Java GC. :D [10:06:43] started around 16:30 UTC yesterday [10:07:35] * hashar digs in puppet git log [10:11:14] nothing suspicious, bah [10:11:23] I don't have access on the machine :/ [10:23:09] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:26:09] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 496241 bytes in 9.580 second response time [10:32:15] try marksweep?
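(Two stock-JDK ways to get the GC visibility being discussed; jconsole can only attach remotely if the JVM was started with JMX flags, which would explain the silence above. Port 5043 is the one tried in the log; everything else is standard JDK tooling:)

    # no JMX needed: sample GC utilisation of the gitblit JVM every 2 seconds
    jstat -gcutil "$(pgrep -f gitblit)" 2000

    # for remote jconsole, the JVM must be started with something like:
    #   -Dcom.sun.management.jmxremote.port=5043
    #   -Dcom.sun.management.jmxremote.authenticate=false
    #   -Dcom.sun.management.jmxremote.ssl=false
    # then, from a machine with X:
    jconsole antimony.wikimedia.org:5043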
[10:38:39] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:38:39] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [10:46:18] (03PS1) 10Dzahn: gitblit: raise -Xmx to 8191M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:47:10] (03PS2) 10Dzahn: gitblit: raise -Xmx to 8191M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:48:54] (03CR) 10Dzahn: ""If the value of the -Xms parameter is smaller than the value of the -Xmx parameter, not all of the space that is reserved is immediately " [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [10:49:17] _joe_: hashar ^ [10:49:41] or even more than that? [10:50:00] I have no clue [10:50:04] Mem: 16353876k total [10:51:46] (03CR) 10Dzahn: ""Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [10:52:13] the machine has a lot of free mem space according to ganglia four more GB are probably fine :] [10:52:38] mutante: not sure you really need to set Xms but that is most probably harmless [10:52:42] (03PS3) 10Dzahn: gitblit: raise -Xmx to 8192M, set -Xms to same [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [10:53:43] and as I understand it the JVM will allocate 8GB of memory, leaving only 8GB for the system / disk cache etc.. [10:54:41] <_joe_> I'm studying the docs for GC, sorry [10:54:53] <_joe_> I did that a few years ago... now all over again [10:55:21] i remember changing these for JIRA on Tomcat [10:55:43] but that was just following docs [10:56:00] thanks _joe_ [10:58:47] those quotes were from http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html [10:59:02] <_joe_> mutante: I'd suggest to use the incremental GC [10:59:05] <_joe_> -Xincgc [10:59:12] <_joe_> do NOT use G1 [10:59:37] heading out for lunch [11:00:44] _joe_: alright [11:01:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please add -Xincgc to enable the incremental GC which should not eat all the CPU up at times." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:01:45] <_joe_> mutante: you may also think of not raising the dedicated ram and just go with adding the minimum to 4096 and using incgc [11:01:52] _joe_: should we still raise -Xmx ? (we have the memory, right) [11:01:58] ah [11:02:16] <_joe_> "let's see how it goes" [11:02:57] <_joe_> with java(TM) it's the only strategy I know [11:04:31] (03PS4) 10Dzahn: gitblit: use incremental GC and add -Xms [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 [11:06:12] better? 
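(What the patch above boils down to, as a sketch; the real invocation lives in gitblit's upstart config, its remaining arguments are omitted, and the 4 GB figures come from the discussion:)

    # before: default collector, heap grows on demand up to -Xmx
    java -Xmx4096M -jar gitblit.jar ...

    # after: pre-allocate the heap and use the incremental collector,
    # trading some throughput for shorter GC pauses
    java -Xms4096M -Xmx4096M -Xincgc -jar gitblit.jar ...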
[11:08:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [11:09:13] (03CR) 10Giuseppe Lavagetto: [C: 031] gitblit: use incremental GC and add -Xms [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:10:05] <_joe_> I did not remember how annoying the d-i can be [11:11:36] (03CR) 10Dzahn: [C: 032] "this hopefully fixes the frequent git.wm.org timeouts, thanks for advice Giuseppe" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [11:13:23] !log restarting gitblit with new option to use incremental GC in an attempt to fix timeouts caused by GC eating CPU [11:13:28] Logged the message, Master [11:14:18] ok, there it is, running with -Xincgc [11:16:29] <_joe_> let's see what happens in the next few hours [11:16:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:22:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:26:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:26:55] <_joe_> mutante: I'm keeping an eye on the CPU/RAM situation for gitlbit, we're still far from hitting a major GC and cpu usage is low [11:27:50] _joe_: cool!:) [11:31:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [11:39:11] (03PS3) 10Dzahn: decom: hume [operations/puppet] - 10https://gerrit.wikimedia.org/r/122605 (owner: 10Matanya) [11:44:01] 10:29 < _joe_> paravoid: merkel is in athens on friday? I may join the protests, then [11:44:08] er [11:44:13] sorry :) [11:44:14] ?????? [11:44:22] lol? [11:47:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [11:48:22] ok [11:48:26] I think we need a new interview question ;-) [11:56:51] wants to kill hume and then leave (more or less:) [11:57:10] all the crons are gone , checked.. the /home [11:57:23] is on nas1-a.pmtpa , not local [12:01:49] (03CR) 10Dzahn: [C: 032] "mwdeploy/apache crons: all disabled or gone. root cleanup crons: identical on terbium. 
users: have been warned on Monday, no replies, home" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122605 (owner: 10Matanya) [12:06:08] !log hume - disable puppet/salt/monitoring [12:06:13] Logged the message, Master [12:08:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:26:46] (03PS1) 10Dzahn: puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 [12:26:48] (03CR) 10jenkins-bot: [V: 04-1] puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 (owner: 10Dzahn) [12:27:57] (03PS2) 10Dzahn: puppetize apache-graceful-all [operations/puppet] - 10https://gerrit.wikimedia.org/r/123852 [12:32:03] !log hume - shutting down [12:32:07] Logged the message, Master [12:37:46] (03PS1) 10Dzahn: replace hume with terbium in a comment [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/123853 [12:45:27] (03PS1) 10QChris: Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 [12:45:38] (03CR) 10Dzahn: [C: 031] "shut down" [operations/dns] - 10https://gerrit.wikimedia.org/r/122609 (owner: 10Matanya) [12:49:33] _joe_: hashar , ^demon|away: looks fixed to me https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?host=antimony&service=gitblit.wikimedia.org [12:49:57] * mutante waves (had another half day) [12:50:04] mutante: :-] [12:50:09] mutante: congrats! [12:50:54] (03PS1) 10QChris: Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 [12:51:33] _joe_ had the right one about incremental GC [12:51:53] hashar: and "Connection to hume closed. [12:52:00] have a nice weekend [12:52:06] mutante: awesome!!! [12:52:14] mutante: rest well and thanks for all the cleanup! [12:52:18] <_joe_> mutante: very happy about this :) [12:53:34] (03CR) 10Dzahn: "i say this fixed it, see the trends graph in Icinga. https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?host=antimony&service=gitblit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/123848 (owner: 10Dzahn) [12:54:10] _joe_: :) thanks, cya later [13:17:46] (03PS2) 10Ottomata: Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 (owner: 10QChris) [13:17:52] (03CR) 10Ottomata: [C: 032 V: 032] Remove group writability for analitycs files /a/squid, and /a/log [operations/puppet] - 10https://gerrit.wikimedia.org/r/123855 (owner: 10QChris) [13:18:26] (03PS2) 10Ottomata: Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 (owner: 10QChris) [13:18:31] (03CR) 10Ottomata: [C: 032 V: 032] Update docu around analytics rsync source [operations/puppet] - 10https://gerrit.wikimedia.org/r/123857 (owner: 10QChris) [13:39:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:39:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [13:41:51] <_joe_> mmmh this is screenshot material :) [13:42:50] :O [13:47:58] (03PS3) 10Cmjohnson: remove all Tampa appservers from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/123211 (owner: 10Dzahn) [13:48:07] (03CR) 10Cmjohnson: [C: 031] remove all Tampa appservers from DHCP [operations/puppet] - 10https://gerrit.wikimedia.org/r/123211 (owner: 10Dzahn) [13:49:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:20] (03CR) 10Cmjohnson: [C: 031] hume: decom, left mgmt [operations/dns] - 10https://gerrit.wikimedia.org/r/122609 (owner: 10Matanya) [13:53:55] poor gitblit [13:54:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "I see no issues here." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:09:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is probably the best solution." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:16:33] (03CR) 10Alexandros Kosiaris: [C: 032] Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:17:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 228026 bytes in 8.629 second response time [14:17:29] (03PS2) 10Ottomata: Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 [14:17:35] thanks [14:17:38] (03CR) 10Ottomata: [C: 032 V: 032] Adding $srange parameter to ferm::service [operations/puppet] - 10https://gerrit.wikimedia.org/r/123698 (owner: 10Ottomata) [14:18:01] (03CR) 10Alexandros Kosiaris: [C: 032] "Not sure if we will ever call base::firewall with ensure='absent' but technically fine, so LGTM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:18:47] k, thanks akosiaris, I think we might want to do that in case we're like, ok great, let's apply this thing, and then it breaks a buncha stuff we didn't think of, so we need to revert [14:19:17] ok, let's hope we won't :-) [14:19:30] (03PS3) 10Ottomata: Adding $ensure parameter to base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 [14:19:30] :) [14:19:42] (03CR) 10Ottomata: [C: 032 V: 032] Adding $ensure parameter to base::firewall [operations/puppet] - 10https://gerrit.wikimedia.org/r/123701 (owner: 10Ottomata) [14:20:00] (03CR) 10Ottomata: Ferm rules for stat1003 (and rsyncd on any stat* server) [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:20:01] as far as https://gerrit.wikimedia.org/r/#/c/123702 [14:20:12] why in site.pp ? [14:20:46] i wasn't sure where else to put it, since it is pretty dependent on whether or not the machine has a public IP [14:20:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The patch is correct, however I disagree with enabling ssh." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:21:01] I mean, does it not have a role that needs this ? [14:21:06] and the role classes don't necessarily specify that [14:21:14] ok, what is the use case then ? [14:21:16] it does not have a role that needs it, it is more for users' convenience [14:21:37] in a chat yesterday [14:21:42] i could go either way on this [14:21:48] stat1 has a public IP and users are used to just ssh-ing in [14:21:56] we could make them go through bastions [14:22:10] ouch. I did not know that about stat1 [14:22:20] well, most machines with public IPs people just ssh into [14:22:20] i think [14:22:35] yeah, I do the same sometimes [14:22:36] coren said that since we use passwordless entry, he didn't think it made much of a difference [14:22:39] depending on machine [14:22:49] and i'm trying to minimize annoyances to stat1 users now [14:22:57] probably many of them only have shell accounts on this machine [14:23:02] and have never messed with their ssh configs before [14:23:21] so users that do not exist on bastions but do exist on stat1 [14:23:24] how nice... [14:23:48] ok, a couple of ways we can handle that. [14:23:50] <_joe_> ottomata: the problem is also compromised clients - If people really need shell access, then ok [14:23:58] ha, yeah, i'm with you, if you think we should not allow this [14:23:59] i'm fine with that [14:24:12] I can deal with getting people's .ssh/configs configured properly [14:24:28] oh, but someone was saying windows users might have a hard time too?
[14:24:35] dunno. [14:24:37] oh for the love of ... [14:24:44] we got that too :-( [14:24:52] ottomata: putty is highly configurable, should not be an issue [14:25:04] hashar to the rescue :-) [14:25:08] ok hashar, yeah, and I know that Erik Zachte uses windows and connects fine to stat1002.eqiad [14:25:16] so it must be possible [14:25:19] :p [14:25:33] and my lame recommendation would be to require everyone to use the bastions to then bunny hop to the machine [14:25:42] well it might be the best possible time to fix this [14:25:55] anyway, ja, i'd prefer to have fewer things to annoy users during this migration, but if you guys think we shouldn't allow direct ssh, I'm ok with that [14:26:03] they are going to anyway change something in their habits or configs [14:26:12] as is, barely [14:26:13] just a hostname [14:26:18] everything else should be identical [14:26:34] with the proper ssh config, that is transparent [14:26:38] true [14:26:41] heh, you'd be surprised how difficult it is to do that sometimes :-) [14:26:45] ha, yeah [14:27:04] a lot of these folks are not big shell users either, so it's hard for them [14:27:11] is stat1 accessed by non wmf folks? [14:27:12] they are researchers, used to SQL and R or whatever [14:27:21] non employees? i am not sure [14:27:38] you can see the list of accounts we are keeping for stat1003 here: [14:27:38] http://etherpad.wikimedia.org/p/stat1_accounts [14:27:42] under the Keep heading [14:28:31] diederik ? [14:28:41] <_joe_> ottomata: let me try to understand how sshuttle works with our network [14:28:42] is he still with WMF ? [14:28:45] yeah he still volunteers and does stuff and has an NDA, so i guess so [14:28:51] a ok then [14:28:53] i think he asked to keep it [14:28:55] <_joe_> ottomata: I do remember it worked under windows too [14:29:29] so if that is staff / under-NDA folks I guess you can get them to have access on the bastions then hop to stat1 [14:29:45] I'd prefer that too if possible [14:29:55] that is not ideal though since from bastion you get access to the whole cluster :/ [14:30:01] but that is a different topic [14:30:42] whoa sshuttle... [14:30:43] cool [14:30:53] ok cool, i'm fine with that [14:30:57] deep down i'd prefer that as well :) [14:31:04] ok ok ok [14:31:07] no direct ssh then :) [14:31:32] :-) [14:31:40] * akosiaris just got happier  [14:34:16] (03PS2) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:36:03] * _joe_ too [14:36:32] btw put a note that at some point base::firewall should be moved from site.pp to some other base class (hopefully some day standard but one step at a time) [14:37:16] <_joe_> ottomata: sshuttle - I'm not sure it will work, I'll try it better next week - now I have my desktop running sid and I'm off this awful apple BSD [14:37:22] (03PS3) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:37:30] aye, i'm not going to recommend folks to use sshuttle [14:37:51] but I might try it for some of my own devices (i.e. connecting to hadoop web services without annoying browser proxies that hardly work) [14:38:06] <_joe_> akosiaris: modules/base/firewall.pp would seem a nice place for it [14:38:57] I meant being used directly in node definitions in site.pp [14:39:00] like in this case [14:39:09] but it is an intermediate step I hope :) [14:39:47] aye ja [14:39:58] ok cool, +2 ?
:) [14:40:09] <_joe_> akosiaris: yep - we should move towards the point where it's easy to switch to an ENC [14:40:51] (03CR) 10Alexandros Kosiaris: [C: 032] Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:41:09] _joe_: yeah, it is going to take time though :-( [14:41:45] (03PS4) 10Ottomata: Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 [14:41:51] (03CR) 10Ottomata: [C: 032 V: 032] Including base::firewall on stat1003, allowing rsyncd traffic within internal network [operations/puppet] - 10https://gerrit.wikimedia.org/r/123702 (owner: 10Ottomata) [14:51:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:53] * ^d sighs [14:56:48] hey ^d :-D [14:57:16] ^d: _joe_ and mutante slightly raised gitblit heap , it doesn't help apparently [14:57:32] high cpu load since yesterday 4:30pm UTC. [14:58:59] <_joe_> hashar: mmmh [14:59:07] <_joe_> hashar: it has been ok for some time [14:59:10] <_joe_> let me check [14:59:49] <_joe_> hashar: we did not raise the heap a lot - just changed the GC strategy [15:00:29] ah sorry [15:01:11] gerrit / git blit is misbehaving [15:01:33] <_joe_> and it seems to have helped for some time https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=antimony.wikimedia.org&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1396623650&g=cpu_report&z=large&c=Miscellaneous%20eqiad [15:01:47] https://gerrit.wikimedia.org/r/#/c/123826/ [15:02:11] <^d> _joe_: The fact that nagios requests / every minute or two for status checks doesn't help. [15:02:14] been having so many problems since last night with our jenkins unable to get core from git.wikimedia.org [15:02:17] <^d> That main page isn't really the fastest thing ever. [15:02:18] or wikibase frm gerrit [15:02:28] <_joe_> ^d: cant beleive that's the problem :) [15:02:42] <^d> aude: gerrit is fine. [15:02:47] just not sure why it started 24 hours ago though [15:02:56] could it be some bot hitting some bad URL ? [15:02:58] <_joe_> aude: gerrit is fine as far as I can tell [15:03:06] seems ok now, although last night it died also a few times [15:03:08] access.log did not yield anything interesting apparently [15:03:09] <^d> _joe_: I disagree. [15:03:25] i submitted a chain of 4 patches, a minute later gerrit died [15:03:28] it came back [15:03:41] i had do fix something in my first patch, so submitted 4 again [15:03:45] <^d> access.log shows the nagios plugin hammering away at gitblit about every minute. [15:03:47] minute later, gerrit died again [15:03:51] <^d> It's the only thing in the logs recently. [15:03:57] <_joe_> ^d: if 1 req/min is an issue, then the software is abysmal [15:03:59] <_joe_> come on. [15:04:08] problem mostly has been git.wikimedia.org though [15:04:10] <_joe_> ^d: it's an application-side problem [15:04:14] <^d> _joe_: It is abysmal. [15:04:19] <^d> It was a mistake for me to use it. [15:04:26] I disagree chad [15:04:28] <_joe_> it seems that gitlbit has some memory leak [15:04:40] giblet is already better than gitweb (which is really the end of the abyss) [15:04:44] <_joe_> and thus GC is no use [15:04:55] <_joe_> well, gitlab is nice [15:05:00] <^d> hashar: gitweb was so bad because it was running on the same host as gerrit and was overloaded. 
[15:05:03] <_joe_> but I guess it scales even worse [15:05:04] <^d> it wouldn't be nearly as bad now. [15:05:05] could it be some packages that have been updated on antimony yesterday ? [15:05:08] <^d> plus, I want to use cgit. [15:05:11] <^d> but I digress. [15:05:15] :-D [15:05:41] before migration downloading dumps has become pretty faster for me [15:05:44] <^d> _joe_: We're literally gitblit's largest user :p [15:05:45] *after [15:05:54] <^d> mutante|away wasn't joking when he said it's meant for small-medium teams. [15:06:58] <_joe_> !log restarting gitblit as it has eaten up all of its ram again and is trashing cpu [15:07:02] Logged the message, Master [15:07:28] <_joe_> now it will magically work again [15:08:02] <_joe_> ^d: gitweb should be way better. hell, anything would be better. NO ONE was calling it apart from icinga. [15:08:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 224419 bytes in 8.847 second response time [15:08:55] <^d> All those NPEs probably aren't helping either, but are mostly a red herring. [15:09:50] <_joe_> ^d: they are the reason of the memory leak, most possibly [15:10:08] <_joe_> be back soon [15:10:30] <^d> I'm going to look at setting up cgit alongside gitblit on antimony right now. [15:12:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:42] well it would be nice to find out why it started dieing all of a sudden [15:12:57] <^d> It'll take less time to replace it than figure that out and fix it. [15:13:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 224419 bytes in 8.255 second response time [15:13:06] <^d> gitblit sucks. [15:13:46] <_joe_> ok... [15:14:03] <_joe_> ^d: where are the logs of the java app? [15:14:14] <^d> /var/log/upstart/gitblit.log [15:15:35] <_joe_> oh, ok, upstart [15:16:13] RECOVERY - Disk space on stat1002 is OK: DISK OK [15:16:53] <_joe_> and right when one would need stack traces from java, they're not there. [15:17:30] you know, i really like everything about upstart, except for /var/log/upstart [15:17:45] <^d> _joe_: Look earlier in the file. [15:17:51] well I am off for now. see you all on mondayç [15:17:56] <_joe_> ^d: ok [15:18:07] <_joe_> hashar: see ya [15:20:19] <^d> _joe_: This is the NPE I was seeing http://p.defau.lt/?iwzuJEwzJsdlXdqQvIjZPQ [15:21:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:17] <_joe_> ^d: com.gitblit.wicket.pages.RawPage$1.respond(RawPage.java:173) - but on their site I cannot find a code browser [15:25:12] <^d> On what site? gitblits? [15:25:31] <_joe_> yes [15:26:04] <^d> He's got some links to demos of it on redhat's site. [15:26:06] <^d> https://demo-gitblit.rhcloud.com/ [15:26:24] <_joe_> yes I just want to look at the source code [15:26:28] <_joe_> :) [15:26:40] <_joe_> I guess I need to download the jar and disassemble it [15:27:25] <^d> Ohhh [15:27:26] <^d> duh [15:27:52] <^d> https://github.com/gitblit/gitblit [15:27:52] <_joe_> github, maybe [15:28:04] <_joe_> yes, seen that [15:28:10] <_joe_> isn't it funny? [15:28:30] <^d> Yeah, he doesn't use his own service for his code :) [15:35:55] ^d: Everybody knows that dog food tastes bad :) [15:37:21] bd808: are you sure? there was some research showing guests to a party couldn't tell it from human bad food [15:37:25] <_joe_> bd808: so why we're eating dog food exactly? :P [15:37:41] it's as bad as McDonalds and costs less? [15:37:52] Plus more bone meal! 
[15:38:25] <^d> _joe_: it's a saying :p [15:38:28] dog food quip was re: gitblit code being found on github [15:38:31] <^d> "eating your own dog food" [15:39:00] https://en.wikipedia.org/wiki/Eating_your_own_dog_food [15:39:03] <_joe_> ok, I'm opening a ticket on this [15:39:10] <_joe_> bd808: yes I know [15:39:39] <_joe_> bd808: mine was a joke on the fact that the gitblit dev uses github and google code, while we use his dogfood [15:40:11] :/ yeah. Git repo browsers are all sucky in my experience [15:40:26] But I haven't looked at them for a couple of years [15:40:50] <_joe_> bd808: gitlab is nice to look at, but it has a few hiccups on its own [15:40:58] <_joe_> 1) does not fit our needs [15:41:03] <_joe_> 2) ... [15:42:17] <_joe_> anyway, anything would be better [15:42:56] I guess I should spend some time poking around in http://fab.wmflabs.org/diffusion/MW/browse/master/ since that's the heir apparent for "fix all the things that such with a new thing" [15:43:03] s/such/suck/ [15:44:58] <^d> bd808: I want to spend some time getting it right. [15:45:02] <^d> So we can *really* use it. [15:45:29] bd808: what about? https://github.com/jonashaag/klaus [15:45:34] was just googling and found that [15:45:37] python! [15:45:41] everybody's favorite! [15:50:46] <^d> ottomata: well phabricator also does browsing built in. [15:50:50] <^d> one tool to rule them all and such. [15:51:21] it does and you can go ?grep='foo' type stuff in the url [15:51:37] or do a blame in the ui to see who changed what [15:51:38] :) [15:54:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 226063 bytes in 8.085 second response time [15:54:54] <_joe_> I was guessing - one reason of this crazy gitblit behaviour could be one of the pmtpa migrations [15:57:32] <_joe_> chasemp: I was reading your email about admins.pp, I'll have some questions for you on monday/tuesday [15:57:45] (03PS2) 10Manybubbles: WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 [15:59:50] _joe_: I will be in athens and/or on a plane [16:00:05] but I just realized so will you? [16:00:21] yep [16:00:29] everyone but ottomata :( [16:00:52] * bd808 wishes he was opsen just for that trip [16:00:52] _joe_: either way sounds good, I'm just excited someone read the thing [16:01:31] haha [16:02:40] (03PS3) 10Manybubbles: WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 [16:06:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:46] so, who's the point person for wikitech 2fa? [16:06:56] Coren: ^ [16:07:55] greg-g: as in need to reset someone who lost theirs? [16:08:11] yeah [16:08:16] i saw a ticket for that in the past month, lemme see if i can refind it [16:08:30] I thought I saved my token, but apparently [16:08:31] not [16:08:51] https://rt.wikimedia.org/Ticket/Display.html?id=6969 [16:09:10] I have my old phone, thankfully, so I can log in, but it won't let me reset the credentials without 'the token' (so to migrate phones I need to have 'the token'? or is there a better way?) [16:09:24] i think you have to just have it wiped out of the database and start over =] [16:09:39] greg-g: but how do i know you are youuuuuuu [16:09:49] bah [16:10:01] this isn't how google's works!
[16:10:02] <_joe_> !log restarting gitblit, for the last time today [16:10:06] Logged the message, Master [16:10:09] * greg-g is not a google lover [16:10:09] heh [16:10:41] <_joe_> greg-g: well, I don't love them either, but you cannot deny their technical proficiency [16:11:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 223539 bytes in 9.067 second response time [16:11:51] RobH: that misc server quote, is that good to go? [16:12:12] lemme take a look [16:12:59] if it is, please compile a single PDF and calculate the total, then i'll request approval [16:13:10] would be good to have that go out today [16:24:03] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:44] (03CR) 10coren: [C: 032] Labs: -pmtpa support from role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 (owner: 10coren) [16:29:54] (03PS2) 10coren: Labs: -pmtpa support from role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 [16:36:31] * Coren pokes Jenkins. Oy! [16:40:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:40:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [16:47:30] Is something known to be wrong with Jenkins? [16:48:09] (03CR) 10coren: [C: 032 V: 032] "+V because Jenkins seems ill and it was already okay before the needless rebase." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123681 (owner: 10coren) [16:51:05] (03PS1) 10coren: Tool Labs: Fixes for webgrid [operations/puppet] - 10https://gerrit.wikimedia.org/r/123880 [17:00:03] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 218590 bytes in 9.196 second response time [17:08:05] Coren: funny you should mention that, Jenkins told me "starting gate and submit jobs" for a +2'd change over 30 minutes ago, it hasn't complete [17:08:06] d [17:11:16] what's up with the lvs300x's? [17:12:14] (03CR) 10Krinkle: Use the BetaFeatures whitelist for production to avoid accidental deploys (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [17:17:32] (03PS4) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 [17:17:36] (03CR) 10Jforrester: Use the BetaFeatures whitelist for production to avoid accidental deploys (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [17:19:53] greg-g: just not set up yet I think [17:20:35] mark: kk [17:31:15] chrismcmahon: It looks dead to me. [17:33:10] (03PS1) 10Tim Landscheidt: Tools: Use apt::repository instead of file resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/123882 [17:59:06] (03CR) 10Ottomata: [C: 031] WIP: Deploy experimental highlighter [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [17:59:43] (03CR) 10Manybubbles: [C: 04-1] "-1 until I'm happy with it in beta. It's already deployed there, but not being used right yet."
[operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/123704 (owner: 10Manybubbles) [18:08:57] greg-g: ^d ori ping [18:10:05] ello [18:10:58] why did we switch test2 / test wikidata back to wmf20? [18:11:05] i see 03:28 ori: Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [18:11:06] bugzzzzzz [18:11:24] so is the issue fixed? when do we try again? [18:11:54] it's mostly fixed, and yeah, will be rolling them back today [18:12:07] ok, great [18:12:20] (only "mostly" because we need to make sure we know how to prevent it in the future) [18:12:34] * aude nods [18:12:38] is there a bug ticket? [18:13:57] not yet, I don't believe, I've pinged Nike-rabbit to write up the outage report / report follow-on tickets [18:14:25] presumably that will have to wait till tuesday? [18:14:34] Nemo_bis: why? [18:15:05] he works for wmf on tue and thu now [18:15:13] AFAIK [18:15:28] ohhhh [18:17:29] greg-g: ok [18:18:34] seems nothing for us to worry about, though it would be good to know what the issues are and if there's anything we need to do [18:19:07] aude: lemme paste what I have [18:20:37] ok [18:20:43] i can try to read backscroll [18:25:31] aude: I meant the email /me was distracted for a second... getting to it now [18:25:38] * greg-g will create the stub incident report [18:27:04] ugh, it's a mess :) [18:28:10] greg-g: ok [18:30:15] heya akosiaris, re our earlier discussion [18:30:23] should anyone with a shell account also have bastion access? [18:30:26] they should, right? [18:30:40] and, bastion access is granted via admins::restricted, yes? [18:33:01] greg-g: and maybe answer to https://bugzilla.wikimedia.org/show_bug.cgi?id=63517 (if it's in any way related) [18:41:42] (03PS1) 10Ottomata: Adding stat1003 accounts to admins::restricted [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:42:05] aude: it's maybe kinda related but not really :) [18:42:56] greg-g: ok [18:43:04] !log redeployed updated patch for bug63251 to fix a reported bug [18:43:09] Logged the message, Master [18:43:50] ok, siebrand answered [18:44:24] is jenkins dead again? [18:44:45] is, will be fixed shortly/should be fixed now [18:44:55] (it's in the process of being fixed) [18:49:09] aude: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140403-Deploy [18:49:50] siebrand, thanks for assistance on the etl thing [18:50:07] siebrand: you should put #wikimedia-mobile on your default channel list, too! [18:50:10] dr0ptp4kt: No problem. I came across some weird shit. [18:51:47] greg-g: thanks [18:55:06] hm, anything in particular broke? [18:56:26] 03:28 ori: Interface messages are missing on group0 / 1.23wmf21 wikis (mediawikiwiki, testwiki, test2wiki, and testwikidata) [18:56:29] OIC [18:56:47] (03PS2) 10Ottomata: Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:57:04] (03PS3) 10Ottomata: Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 [18:59:27] (03CR) 10Ottomata: [C: 032 V: 032] Adding stat1003 accounts to bast1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123895 (owner: 10Ottomata) [19:02:00] i'm confused by the use of ssh -W to access labs guests. someone got an example command line which doesn't use an ssh config file? [19:02:05] uh; anyone want to look at why zuul isn't submitting jobs to jenkins? [19:03:47] mwalker: Krinkle is already looking into it [19:05:52] !log Zuul / Jenkins stalled again.
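[Editor's note: a hedged answer to the ssh -W question above — one command line, no config file needed. -W %h:%p asks the bastion's sshd to forward the connection's stdio straight to the target host and port, replacing the older "ProxyCommand ssh bastion nc %h %p" idiom. The user and host names here are illustrative assumptions, not taken from this log.]

    ssh -o ProxyCommand='ssh -W %h:%p user@bastion.wmflabs.org' user@instance-name.eqiad.wmflabs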
[19:05:57] Logged the message, Master [19:05:58] Krinkle: that is a mess :D [19:06:20] hashar: Jenkins is executing some jobs though, also things triggered by timer (e.g. beta-config) are running fine every 10 minutes [19:06:27] So it's very strange [19:06:34] the beta config jobs are timed by Jenkins [19:06:40] using a different scheduler [19:06:49] a few days ago we had a similar issue [19:07:01] for some reason Jenkins was no longer allocating jobs to the gallium slave [19:07:23] !log Jenkins unpooling gallium slave [19:07:28] Logged the message, Master [19:08:50] ahh [19:12:32] bd808: So, ExtensionMessages [19:12:40] As I *think* I understand it, the issue is this [19:12:52] tin has two MW copies [19:13:13] The one in /a/common that's updated from git, and the one in /usr/local/apache/common that's updated from the former through the sync scripts [19:13:20] i.e. it is both the sync master and a sync slave [19:13:29] mwscript uses the slave copy [19:13:29] * bd808 agrees [19:13:48] So when we rebuild ExtensionMessages.php, we run mwscript mergeMessageFileList.php and that runs on the slave copy [19:14:08] Which does not have enough information to create the correct ExtensionMessages.php [19:14:46] So instead, we should either 1) run that script against the master copy, or 2) run sync-common or whatever the equivalent is in the Brave New Python World on tin before running mergeMessageFileList.php [19:16:25] I think #2 might be more feasible than #1 but I don't really know how to do either any more given that a bunch of stuff has been ported and rewritten and whatnot [19:16:29] !log restarting Jenkins [19:16:34] Logged the message, Master [19:16:47] Hmmm… there's something not quite right there though. On a Thursday deploy mw-update-l10n runs before /usr/local/apache/common-local has been updated. [19:17:02] And it does see the new branch [19:17:17] So I think … I need to look at how it does that [19:17:21] Right [19:19:17] unsurprisingly jenkins is dying ... [19:19:26] RoanKattouw: if [ -d "$MW_COMMON_SOURCE" ]; then MW_COMMON_DIR_USE=$MW_COMMON_SOURCE [19:20:02] So when running on tin, where MW_COMMON_SOURCE exists, it looks to me like that is used as the root for mwscript [19:20:44] Well that at least picks the version of multiversion/MWScript.php to run [19:22:21] !log Jenkins is processing jobs again. Queue unchanged so it will resume everything [19:22:25] Logged the message, Master [19:22:47] (03PS1) 10Ottomata: Reverting previous change, we will figure this out next week. [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 [19:23:18] Krinkle: so that is fixed. For some reason Jenkins stopped processing jobs :-/ Must be a bug in the gearman plugin. [19:23:24] Krinkle: fix is to restart Jenkins [19:23:47] hashar: What level of restart are we talking? [19:24:01] what level of restart do you know? [19:24:01] :D [19:24:09] basically: /etc/init.d/jenkins stop [19:24:13] wait a bit [19:24:24] verify whether jenkins is still around with: ps -u jenkins f [19:24:26] Can you e-mail qa-list (sorry if you did that already, will look) with a small set of steps (e.g. unqueue slave X using this technique, do the restart with this, then repool over there) [19:24:38] for some reason Jenkins tends to not be killed properly. So sometimes you need to kill -9 it :( [19:24:40] Oh, you mean the entire lib, not just the gallium slave? [19:24:44] then /etc/init.d/jenkins start [19:24:51] I thought you just unqueue, restart slave, repool? [19:24:56] That's what fixed it last time, right?
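[Editor's note: hashar's restart procedure from the exchange above, consolidated into one sketch. The stop/check/start commands are as stated in the conversation; the sleep duration and the pkill fallback are assumptions standing in for "wait a bit" and "kill -9 it".]

    /etc/init.d/jenkins stop
    sleep 30                  # "wait a bit"
    ps -u jenkins f           # verify whether jenkins is still around
    # if the process refuses to die, as it sometimes does, force it:
    # pkill -9 -u jenkins java
    /etc/init.d/jenkins start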
[19:25:01] yeah that did [19:25:06] but not this time [19:25:22] and for some reason lanthanum does not seem to be executing jobs right now :( [19:26:02] (03PS1) 10coren: Labs: add wb_property_info to replication [operations/software] - 10https://gerrit.wikimedia.org/r/123902 [19:26:15] hashar: I've bumped executors to 12 on gallium. Will monitor closely, and put back to 5 (maybe 7) in a bit. [19:26:32] RoanKattouw: Oh. Your option #1 (sync local before running mw-update-l10n) is actually what happens right now. [19:26:53] That was my #2, right? [19:26:54] Krinkle: ah no please [19:26:55] Hmm, weird [19:26:58] Krinkle: that is going to kill the server [19:26:59] (03CR) 10coren: [C: 032] Labs: add wb_property_info to replication [operations/software] - 10https://gerrit.wikimedia.org/r/123902 (owner: 10coren) [19:27:31] (03PS1) 10Tim Landscheidt: Use apt::repository instead of file resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/123903 [19:27:41] RoanKattouw: Yes, #2. But I think that both #1 & #2 happen today [19:27:43] hashar: Ow.. Yeah, might've exaggerated a bit [19:28:03] !log Jenkins: unpooled slave agent on lanthanum, killed the java agent on it and repooled it. [19:28:07] Logged the message, Master [19:28:10] weird bug [19:28:16] (03CR) 10coren: [C: 032] Tool Labs: Fixes for webgrid [operations/puppet] - 10https://gerrit.wikimedia.org/r/123880 (owner: 10coren) [19:28:24] RoanKattouw: #2 happens here -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/main.py#L132 [19:28:35] Krinkle: yeah gallium is a bit starved because it runs jenkins + zuul. So you don't want too many jobs to run on it [19:28:52] Krinkle: turns out lanthanum was bugged and had no executor slots left :/ [19:28:58] hashar: Hm.. do we have like a monitor for the slaves and their continued uptime and link with jenkins? [19:29:12] RoanKattouw: and #1 happens here -- https://github.com/wikimedia/operations-puppet/blob/production/files/misc/scripts/mwscript#L8 [19:29:26] so that we know if lanthanum goes AWOL again [19:29:42] Krinkle: well that seems to be an issue with any slaves :( [19:29:49] and ideally also something to detect this weird "gallium stops processing jenkins jobs" though that's probably harder [19:30:15] I am looking at the thread view in https://integration.wikimedia.org/ci/monitoring [19:30:19] because during that time it still processes timed jobs and jobs for labs, and the web interface is also working fine [19:30:21] there is a thread per executor [19:30:47] it's just the jobs for its slave that stop working, even though it says "slave: gallium: OK. 7 idle executors" [19:30:49] as if nothing is wrong [19:31:26] yup [19:31:41] so either there is something weird in Jenkins or the executors are not registered properly in gearman [19:33:15] (03PS2) 10Ottomata: Reverting previous change, we will figure this out next week. [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 [19:33:36] (03CR) 10Ottomata: [C: 032 V: 032] "Reverting this after IRC discussions. See pending email..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/123901 (owner: 10Ottomata) [19:36:50] (03PS1) 10Ottomata: Disabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123907 [19:38:17] Krinkle: that is a bug in gearman definitely. [19:38:34] lanthanum only has a few jobs registered [19:38:50] Krinkle: http://paste.openstack.org/show/75111/ [19:39:09] bd808: Right....
odd [19:39:54] (03CR) 10Ottomata: [C: 032 V: 032] Disabling base::firewall on stat1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/123907 (owner: 10Ottomata) [19:41:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [19:41:31] RoanKattouw: Do we have confirmation that ExtensionMessages-1.23wmf21.php was missing data or just that the l10n cdb files were missing data? [19:42:40] So, *last week* [19:42:49] When this happened for newly migrated extensions [19:42:56] bd808: RoanKattouw btw, when y'all are "done" here (heh) could one of you put us back to wmf21 on phase0 wikis? (if you deem it appropriate given your conversation, which I haven't watched) [19:42:59] ExtensionMessages-1.23wmf20.php was definitely missing data [19:44:32] (03PS1) 10Ottomata: No need to remove ferm confs on base::firewall ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123913 [19:44:54] (03CR) 10Ottomata: [C: 032 V: 032] No need to remove ferm confs on base::firewall ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123913 (owner: 10Ottomata) [19:46:13] RoanKattouw: Ok. Next question is: was this broken in the initial scap or the subsequent l10nupdate run? Both times it was noticed after l10nupdate ran, if I understand the timelines. [19:46:20] I don't know [19:46:24] It might be the LU run for all I know [19:46:32] Timeline-wise that seems more plausible [19:46:50] Because both times it broke on a Thursday (wmfN+1 day) but in the evening PDT (LU time) [19:46:58] Is there an equivalent to .pep8 for puppet-lint to silence individual rules for a directory? [19:47:45] RoanKattouw: Right.
And when it was messed up during scap previously I heard about it within 30 minutes each time from folks using mw.o or the test sites [19:48:26] i'm sure test2 and test wikidata were fine when they were switched [19:48:33] I started to read through the l10nupdate code last week but never really grokked what it did [19:48:40] no idea what happened later [19:48:53] PROBLEM - SSH on stat1003 is CRITICAL: Connection timed out [19:49:53] RECOVERY - SSH on stat1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.2 (protocol 2.0) [19:51:21] (03PS1) 10Ottomata: base::firewall always needs defs, but main-input should be absent if ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123915 [19:51:25] bd808: Yeah so that suggests LU is breaking it instead [19:51:37] (03PS1) 10coren: Labs: automate federated table maintenance [operations/software] - 10https://gerrit.wikimedia.org/r/123916 [19:51:38] So the question is, 1) do the LU scripts rebuild ExtensionMessages, and 2) WHYYYYY [19:51:44] because they shouldn't need to AFAICT [19:51:57] (03CR) 10Ottomata: [C: 032 V: 032] base::firewall always needs defs, but main-input should be absent if ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/123915 (owner: 10Ottomata) [19:52:49] (03CR) 10coren: [C: 032 V: 032] "Tested to run (and, indeed, ran)" [operations/software] - 10https://gerrit.wikimedia.org/r/123916 (owner: 10coren) [19:53:09] RoanKattouw: Was it the ExtensionMessages in /a/common that was bad last week or the copy in /usr/local/apache/common-local? [19:53:34] Because the local sync happens before it is built. [19:54:37] The /a/common one was wrong [19:54:52] Perplexing [19:54:55] I didn't inspect the slave copy but the site's behavior suggested that one was wrong as well [19:55:18] I see where you're going with this though: maybe the file is rebuilt but the rebuilt version is then not synced [19:55:24] We should have those damn files under version control so we can see when they change [19:55:47] Ideally generated files like these aren't under version control, right? [19:55:56] I mean that holds up so long as their generation isn't broken [19:58:37] Well, ideally they aren't in upstream version control, but there's really no good reason that they aren't versioned and tracked in the local clone. Other than that making the local clone diverge from the origin would break some things Sam does there to roll out new branches. [19:58:59] The same with PrivateSettings [20:00:07] And actually for these files I don't see why they couldn't be added to the origin repo [20:00:31] Right [20:00:33] We do something similar with the bits webroot where generated content is tracked [20:07:30] RoanKattouw: One of the things that confuses me about l10nupdate is that it maintains checkouts of core and extensions in /var/lib/l10nupdate/mediawiki. How are these used when it runs? [20:12:18] * bd808 sees this has something to do with extensions/LocalisationUpdate [20:15:51] RoanKattouw: Could it be that LU pulls in new content from translatewiki and that's what is missing from ExtensionMessages? [20:35:44] bd808: So LU uses the checkouts in /var/lib/l10nupdate to get the values of the i18n messages in master [20:35:51] Those checkouts are master checkouts [20:36:04] (Sorry for the delay, was eating lunch) [20:36:05] RoanKattouw: I'm at least going to blame LU for some of the breakage.
Last night's log is full of warnings about "LU_Updater::readMessages: Unable to parse messages from …/something.i18n.php" that seem to be caused by LU_PHPReader not understanding how to deal with the shim files that are in use for some of the extensions (e.g. PagedTiffHandler) [20:36:12] No that's expected [20:36:18] I explained on the wmf-engineering thread [20:36:20] What happens is this: [20:36:30] * LU wants to update i18n from master [20:36:53] * LU uses $wgExtensionMessagesFiles and $wgMessagesDirs to determine where to get messages from [20:37:09] * If you're on wmf{N-1}, this information might be outdated relative to master [20:37:16] * LU applies outdated information to the master checkout [20:37:28] * LU tries to read a shim .i18n.php file believing it to be a real .i18n.php file [20:37:53] This only happens 1) during the wmf{N-1} run for extensions that were converted in wmfN and 2) during the wmfN run for extensions converted very recently in master [20:38:05] In practice I've only observed #1, I'm theorizing that #2 should occur as well but I haven't observed it [20:38:27] but that might be because I've only been observing on Thursday evenings, where the number of hours that Siebrand has been awake between the wmfN cut and me observing is essentially zero [20:38:38] s/where/when/ [20:39:30] bd808: So here's another theory [20:39:37] Ok. So that would just keep new messages from being merged ^ [20:39:44] Yes, exactly [20:39:45] And only for one week [20:39:58] Because wmf{N+1} will have the right config info [20:41:25] bd808: OK, so alternate theory: 1) initial scap on Thu morning misgenerates ExtensionMessages-wmfN.php, 2) another bug causes the l10n cache to not be rebuilt despite this change, so the brokenness of ExtensionMessages-wmfN.php is not exposed, 3) LU runs, 4) l10n cache is rebuilt in response to LU's changes, 5) broken ExtensionMessages-wmfN.php is now exposed [20:41:53] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1532.0689802 [20:42:59] Here's my -1 for that theory: it would mean that there is no l10n cache at all for the new branch and we would have noticed that right away [20:44:37] (03PS1) 10Tim Landscheidt: Tools: Remove lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 [20:44:45] Which wiki did Chad leave broken? testwiki? [20:44:48] yeah [20:45:39] Has anyone tried to identify *what* is broken? Like what pages are showing bad messages? [20:45:55] all? [20:46:12] * bd808 disagrees [20:46:18] https://test.wikipedia.org/wiki/Special:Version LGTM [20:46:22] wait, what.......... [20:46:25] that's different than last night [20:46:40] everything on Main Page was broken [20:46:47] all the tabs and sidebar links [20:47:04] so the plot thickens [20:47:09] Yeah last night all core messages were broken [20:47:10] I'm assuming it's because niklas kicked LU last night [20:47:14] Niklas did something overnight [20:47:16] well, not lu [20:47:20] He didn't kick LU [20:47:21] but the l10ncache [20:47:27] "This seems to be confirmed by the fact that running "mw-update-l10n", "sync-l10nupdate" and "sync-l10nupdate-1 1.23wmf21" fixed test.wikipedia.org.
" [20:47:31] that [20:47:42] https://wikitech.wikimedia.org/wiki/Incident_documentation/20140403-Deploy :) :) [20:47:53] So he basically re-ran scap in a controlled manner [20:47:57] * greg-g nods [20:48:23] (03PS1) 10Manybubbles: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 [20:48:31] OK so it looks like things are working now then [20:48:39] And thus wiped out any chance of figuring out what went wrong until it happens again next Thursday [20:48:42] By basically running scap twice [20:48:50] Yeah :( [20:48:52] My apologies [20:48:56] I told him to do that [20:49:16] Which was the classic bootstrap problem that kept happening in bug 51174 [20:49:32] A second scap would magically fix everything [20:50:18] * bd808 wants to see l10n caching burn in a fire and be reborn from the ashes [20:50:22] easy fix! have scap call scap-1 and scap-again which just calls scap-1! [20:50:24] bd808: The "fixes [20:50:32] bd808: The "fixes" associated with that bug look very scary [20:50:45] They were shot in the dark hacks [20:50:46] If the EM file isn't being built correctly and we just cp the one from the previous version in, that would be bad [20:51:02] AFAICT that's what's happening now: the initial EM for wmfN+1 is just identical to the wmfN one [20:51:10] Yeah. THat was Sam's quick and dirty hack [20:51:38] hmm… like the wrong version is being picked by multiverison? [20:52:26] Maybe? [20:52:29] * bd808 goes to re-read mw-update-l10n yet again [20:52:45] Is wikiversions.dat being updated *and synced* at the right time? [20:53:39] It's still weird that the cache build works [20:54:04] Maybe the initial l10n cache is being built in a different way and so it gets build correctly, but any subsequent rebuilds (like LU) fail? [20:54:16] So mwversionsinuse reads the json version of wikiversions from /usr/local/apache/common-local [20:55:14] During a scap this file would have been copied over from /a/common just before mw-update-l10n runs [20:55:39] (03CR) 10Jforrester: [C: 04-1] "Whoops, this should cover Search too." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121892 (owner: 10Jforrester) [20:55:43] If there wasn't data for the new version there would be no l10n cache built [20:56:10] (03CR) 10Tim Landscheidt: [C: 04-1] "Needs to be tested on Toolsbeta first." [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt) [21:00:56] greg-g: Do you still want group0 back to wmf21? [21:14:53] bd808: it'd be nice, yeah, if you think it's safe :) [21:15:29] We can always roll it back again. I'll make the patch and scap it [21:15:35] ty, sir [21:15:59] (03PS1) 10BryanDavis: Revert "Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 [21:17:46] (03CR) 10BryanDavis: [C: 032] "We are going to try this again and hope that it works this time. testwiki seems fixed by things that Niklas did last night." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 (owner: 10BryanDavis) [21:17:54] (03Merged) 10jenkins-bot: Revert "Roll mw.org, test2.wp, test.wikidata back to 1.23wmf20" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124010 (owner: 10BryanDavis) [21:19:56] !log bd808 Started scap: Group0 to 1.23wmf21 (again) [21:20:02] Logged the message, Master [21:20:58] greg-g: If you look at https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor right now you can see where I added a marker for scap runs starting (and ending) [21:23:07] RoanKattouw, greg-g: fingers crossed. Scap is definitely rebuilding 1.23wmf21 cache. 1.23wmf20 was unchanged [21:27:04] Updated 366 JSON file(s) in '/a/common/php-1.23wmf21/cache/l10n'. [21:34:32] !log bd808 Finished scap: Group0 to 1.23wmf21 (again) (duration: 14m 35s) [21:34:38] Logged the message, Master [21:35:36] greg-g: https://www.mediawiki.org/wiki/Special:Version looks ok. [21:35:44] w00t [21:35:46] thanks bd808 [21:37:09] aude: group0 is back to 1.23wmf21. [22:05:15] (03PS7) 10Hoo man: Introduce an admins::release user group [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 [22:05:54] (03CR) 10Hoo man: "Bump... anyone? The users already confirmed per email that they're fine with this, so I don't see what's blocking this..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/116019 (owner: 10Hoo man) [22:42:23] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [22:42:23] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
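[Editor's note: a rough sketch of "option #2" from the ExtensionMessages discussion above — sync the local copy on tin first, then regenerate the file against up-to-date code. The wiki name, flags, and output path are illustrative assumptions; the real mw-update-l10n wiring differs in detail.]

    sync-common    # bring /usr/local/apache/common-local up to date with /a/common first
    mwscript mergeMessageFileList.php --wiki=testwiki \
        --list-file=/a/common/wmf-config/extension-list \
        --output=/a/common/wmf-config/ExtensionMessages-1.23wmf21.php
    # then rebuild the l10n cache against the regenerated file and sync it out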