[00:00:36] James_F: The VE config change is now live according to Erik B
[00:01:03] greg-g: can i push out another config update for zerowiki in a bit?
[00:02:11] RoanKattouw: Yes, I know. :-)
[00:02:30] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.475 second response time
[00:03:50] http://privatekeycheck.com/ :D
[00:07:08] hoo: thx, made me smile
[00:07:41] :)
[00:08:02] yurikR: suppose so
[00:12:23] (PS1) Yurik: zerowiki enable sysop to add/remove all groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125045
[00:12:44] greg-g: ^
[00:13:24] yurikR: is there local wiki consensus? :P
[00:13:31] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 227867 bytes in 8.242 second response time
[00:14:49] greg-g: yes - the whole community of 1 users has agreed to it
[00:15:06] yurikR: :)
[00:15:50] 100% yes, 100% participation... a rare occurrence
[00:15:55] yurikR: Link to discussion? You know, for recording purposes :p
[00:16:23] JohnLewis: +1
[00:17:21] (CR) Yurik: [C: 2] "JohnLewis & greg-g concurred" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125045 (owner: Yurik)
[00:17:39] there!
[00:18:28] yurikR: If we're the community, then sure.
[00:18:47] bummer... i just can't please everyone, can i?
[00:18:52] true wiki spirit
[00:18:55] (Merged) jenkins-bot: zerowiki enable sysop to add/remove all groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125045 (owner: Yurik)
[00:19:11] yurikR: 'Sorry, en.zero.wikipedia.org is only supported by select mobile carriers and is not available from your mobile carrier.' :(
[00:20:39] !log yurik synchronized wmf-config/InitialiseSettings.php
[00:20:44] Logged the message, Master
[00:20:50] wow, how come it went so fast now
[00:21:18] Because the wiki has Zero weight :D
[01:05:31] !log Undid local patch to "grunt-lib-phantomjs/phantomjs/main.js" (for bug 63579) in "/srv/deployment/integration/slave-scripts" on gallium
[01:05:39] Logged the message, Master
[01:06:18] !log git-deploy: Deploying integration/slave-scripts 'If2539ccb3152bd0'
[01:06:22] Logged the message, Master
[01:44:50] (PS1) Ori.livneh: Use '$channel' to branch on interpreter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125062
[01:46:31] (PS2) Ori.livneh: Use '$channel' to branch on interpreter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125062
[01:47:26] (CR) Ori.livneh: [C: 2] Use '$channel' to branch on interpreter [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125062 (owner: Ori.livneh)
[01:48:03] !log ori updated /a/common to {{Gerrit|I697f7e4a6}}: Use '$channel' to branch on interpreter
[01:48:10] Logged the message, Master
[01:48:44] !log ori synchronized wmf-config/CommonSettings.php 'I697f7e4a6: Use to branch on interpreter'
[01:48:49] Logged the message, Master
[01:49:17] !log ori synchronized wmf-config/InitialiseSettings-labs.php 'I697f7e4a6: Use to branch on interpreter'
[01:49:20] Logged the message, Master
[02:00:45] (PS1) Ori.livneh: Update multiversion regexp for *.beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125063
[02:00:54] (CR) jenkins-bot: [V: -1] Update multiversion regexp for *.beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125063 (owner: Ori.livneh)
[02:06:54] (PS1) Yurik: ZeroWiki: Add extra page to whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125064
[02:13:21] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 2787 MB (2% inode=99%):
[02:18:00] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:18:00] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:18:00] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:18:00] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[02:18:21] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3710 MB (3% inode=99%):
[02:22:13] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-10 02:22:11+00:00
[02:22:19] Logged the message, Master
[02:48:30] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:49:40] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:50:21] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[02:50:31] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[03:00:21] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:00:48] (PS2) Ori.livneh: Update multiversion regexp for *.beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125063
[03:01:19] (CR) Ori.livneh: [C: 2] Update multiversion regexp for *.beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125063 (owner: Ori.livneh)
[03:01:29] !log ori updated /a/common to {{Gerrit|Ibdbac982b}}: Update multiversion regexp for *.beta-hhvm.wmflabs.org
[03:01:37] Logged the message, Master
[03:01:59] !log ori synchronized multiversion/MWMultiVersion.php 'Ibdbac982b: Update multiversion regexp for *.beta-hhvm.wmflabs.org'
[03:02:03] Logged the message, Master
[03:19:21] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-10 03:19:18+00:00
[03:19:26] Logged the message, Master
[03:31:16] (PS1) Ori.livneh: beta: correct wgCanonicalServer string format [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125070
[03:34:00] (PS2) Ori.livneh: beta: correct wgCanonicalServer string format [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125070
[03:34:22] (CR) Ori.livneh: [C: 2] beta: correct wgCanonicalServer string format [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125070 (owner: Ori.livneh)
[04:11:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 10 04:11:47 UTC 2014 (duration 11m 46s)
[04:11:57] Logged the message, Master
[05:19:00] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:19:00] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:19:00] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:19:00] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[05:19:45] (CR) Giuseppe Lavagetto: "Fair enough :)" [operations/puppet] - https://gerrit.wikimedia.org/r/125025 (owner: Ori.livneh)
[06:15:35] !log Some interface messages are missing on wikidata.org. Started a manual l10nupdate.
[06:15:39] Logged the message, Master
[06:24:58] (PS1) Ori.livneh: Avoid using bits on beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125073
[06:26:11] (CR) Ori.livneh: [C: 2] Avoid using bits on beta-hhvm.wmflabs.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125073 (owner: Ori.livneh)
[06:26:55] !log ori updated /a/common to {{Gerrit|I20bbe05cc}}: Avoid using bits on beta-hhvm.wmflabs.org
[06:27:00] Logged the message, Master
[06:27:30] !log ori synchronized wmf-config/CommonSettings.php 'I20bbe05cc: Avoid using bits on beta-hhvm.wmflabs.org'
[06:27:34] Logged the message, Master
[06:27:35] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-10 06:27:30+00:00
[06:27:40] Logged the message, Master
[06:27:57] ori: yeah no dice
[06:28:18] still broken
[06:28:23] that was wmf20
[06:28:25] though diffs now work
[06:28:26] wikidata.org is wmf21
[06:35:53] Jasper_Deng: give it another few minutes to finish rebuilding the wmf21 cache
[06:46:01] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-10 06:45:59+00:00
[06:46:06] Logged the message, Master
[06:47:20] Jasper_Deng: ori still broken on that wikidata recentchanges page
[06:47:28] I don't see it anywhere else on a wmf21 wiki
[06:47:42] that's for Lydia to know then
[06:47:51] I'm falling asleep at the keyboard, mind poking her for me?
[06:48:32] * greg-g zonks out
[06:48:34] g'night
[07:19:09] (PS2) BBlack: Add proxy support for 437-01. [operations/puppet] - https://gerrit.wikimedia.org/r/124887 (owner: Dr0ptp4kt)
[07:19:57] (CR) BBlack: [C: 2 V: 2] Add proxy support for 437-01. [operations/puppet] - https://gerrit.wikimedia.org/r/124887 (owner: Dr0ptp4kt)
[07:21:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 10 07:21:16 UTC 2014 (duration 7m 21s)
[07:21:26] Logged the message, Master
[07:39:31] localisation update fail????
[07:41:12] ori: ? any idea?
[07:42:49] aude: nope. i haven't investigated it. i ran a manual update because that fixed things before. it didn't, so the problem could be unrelated.
[07:45:02] i will put wikidata back on wmf20, and then we can investigate
[07:45:08] test wikidata would still be broken
[07:45:23] it's like all the wikibase messages are missing
[07:46:36] (PS1) Aude: Put wikidata back on wmf/1.23wmf20 due to localisation cache issues [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125091
[07:46:45] i might have an idea
[07:47:07] extension messages might be out of sync with the version of wikibase in wmf21
[07:48:18] i don't know quite how to fix and don't want to touch it myself, but probably need extension messages rebuilt or just keep wikidata on wmf20
[07:48:28] and then on wmf22 next week
[07:48:44] well, i rebuilt them by forcing an l10nupdate, it didn't help
[07:48:46] nevermind
[07:48:53] hm?
[07:48:59] there are also messages in client (e.g. wikipedia)
[07:49:08] so we can't just skip wmf21
[07:49:29] like the 13th floor of some hotels
[07:49:33] (CR) Aude: [C: 2] Put wikidata back on wmf/1.23wmf20 due to localisation cache issues [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125091 (owner: Aude)
[07:49:49] we can take time to figure out and when people are around
[07:50:36] (PS8) Rush: module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608
[07:50:52] (Merged) jenkins-bot: Put wikidata back on wmf/1.23wmf20 due to localisation cache issues [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125091 (owner: Aude)
[07:52:08] (CR) ArielGlenn: [C: 1] "straight copy, and change of contents to follow separately as I understand it, then this is good to go by me." [operations/puppet] - https://gerrit.wikimedia.org/r/123852 (owner: Dzahn)
[07:52:21] (CR) Andrew Bogott: [C: 1] module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608 (owner: Rush)
[07:53:09] !log aude synchronized wikiversions.json 'Put Wikidata back on 1.23wmf20, due to localisation cache issues'
[07:53:15] Logged the message, Master
[07:54:08] ori: is there more i have to do than wikiversions.json? rebuild cdb?
[07:54:35] (PS9) Rush: module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608
[07:54:57] i think so
[07:55:11] (PS1) Alexandros Kosiaris: ignore vnet interfaces in check_eth [operations/puppet] - https://gerrit.wikimedia.org/r/125094
[07:55:43] sync-wikiversions
[07:55:49] yes
[07:55:53] also, i think i know what might be up
[07:55:58] (PS1) Springle: Enable engine condition pushdown for dbstore [operations/puppet] - https://gerrit.wikimedia.org/r/125096
[07:56:22] ok?
[07:56:27] i'll do that and then we can look
[07:56:37] (PS10) Rush: module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608
[07:57:03] !log aude rebuilt wikiversions.cdb and synchronized wikiversions files: Rebuild wikiversions and put wikidata on 1.23wmf20
[07:57:08] Logged the message, Master
[07:57:25] (CR) Alexandros Kosiaris: [C: 2] ignore vnet interfaces in check_eth [operations/puppet] - https://gerrit.wikimedia.org/r/125094 (owner: Alexandros Kosiaris)
[07:59:03] (CR) Andrew Bogott: [C: 1] module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608 (owner: Rush)
[07:59:29] but still seems broken
[07:59:40] PROBLEM - check configured eth on virt1007 is CRITICAL: virbr0 reporting no carrier.
[07:59:42] (CR) Rush: [C: 2] module to manage new python-diamond package [operations/puppet] - https://gerrit.wikimedia.org/r/124608 (owner: Rush)
[08:00:24] (PS2) Hashar: contint: apply maven settings on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/124822
[08:03:24] (PS2) Springle: Enable engine condition pushdown for dbstore [operations/puppet] - https://gerrit.wikimedia.org/r/125096
[08:03:57] w/o js is good
[08:04:15] just js is broken
[08:05:31] PROBLEM - check configured eth on virt1006 is CRITICAL: virbr0 reporting no carrier.
[08:05:45] (CR) Springle: [C: 2] Enable engine condition pushdown for dbstore [operations/puppet] - https://gerrit.wikimedia.org/r/125096 (owner: Springle)
[08:07:20] PROBLEM - check configured eth on virt1004 is CRITICAL: virbr0 reporting no carrier.
[08:08:00] PROBLEM - check configured eth on virt1003 is CRITICAL: virbr0 reporting no carrier.
[08:09:42] trying to touch stuff
[08:09:54] !log aude synchronized php-1.23wmf20/extensions/Wikidata
[08:09:59] Logged the message, Master
[08:12:01] https://www.wikidata.org/wiki/Q60?debug=true is broken also
[08:12:06] still
[08:17:38] (PS3) Dzahn: puppetize apache-graceful-all [operations/puppet] - https://gerrit.wikimedia.org/r/123852
[08:19:26] trying to touch more js
[08:19:34] !log aude synchronized php-1.23wmf20/extensions/Wikidata
[08:20:00] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:20:00] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:20:00] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:20:00] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[08:22:07] js still broken
[08:22:15] * aude not sure what else to do :(
[08:22:21] (PS3) Hashar: contint: apply maven settings on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/124822
[08:22:21] ori: still around?
[08:22:40] messages work in non-js
[08:22:42] broken in js
[08:23:25] (PS4) Hashar: contint: apply maven settings on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/124822
[08:24:18] run scap, but seems scary
[08:24:20] ?
[08:26:21] PROBLEM - check configured eth on virt1005 is CRITICAL: virbr0 reporting no carrier.
[08:26:44] aude: step 1: take a break for a moment.
[08:27:15] ok
[08:27:20] * aude trying to debug
[08:27:21] it's no good to try everything in a panic; that's usually how our biggest outages happen (a careless, rushed fix for a small outage)
[08:27:28] of course
[08:27:53] are you on the engineering list?
[08:27:56] no
[08:27:57] (CR) Hashar: [C: 1 V: 2] "Cherry picked on integration-puppetmaster.eqiad.wmflabs PS2-4 were meant to fix a permission issue (file belonged to root:root) which cau" [operations/puppet] - https://gerrit.wikimedia.org/r/124822 (owner: Hashar)
[08:28:10] ops
[08:32:52] we also have stuff like "<wikibase-dataitem>" in wikivoyage
[08:33:13] so i suppose we need to get wmf21 fixed ... a blocker for wikipedia
[08:34:36] (CR) Alexandros Kosiaris: [C: 2] contint: apply maven settings on labs slaves [operations/puppet] - https://gerrit.wikimedia.org/r/124822 (owner: Hashar)
[08:34:41] (PS1) BBlack: add public service IPs to lvs300[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/125101
[08:36:40] (PS2) BBlack: add public service IPs to lvs300[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/125101
[08:36:57] (CR) BBlack: [C: 2 V: 2] add public service IPs to lvs300[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/125101 (owner: BBlack)
[08:37:25] aude: i forwarded the last message to the wikidata list just now
[08:38:06] ok
[08:39:07] when i do wfMessage( 'wikidata-edit' ) .... in wmf21, broken of course
[08:39:21] in wmf20, it's correct on php side, as we see on wikidata
[08:39:41] just wonder why the js doesn't see them
[08:42:08] !log forcing Bugzilla logout for all users
[08:42:13] Logged the message, Master
[08:43:30] aude: try /usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php --wiki=wikidatawiki
[08:43:41] ok
[08:44:20] aude: as l10nupdate user ideally. have you started yet?
[08:44:59] not yet
[08:45:20] i'll try on testwikidata first, even though it will do no good
[08:47:21] nope
[08:48:14] Does anybody know what server you connect to when you save a page?
[08:48:27] It's for a school who wants to locate the vandals. :)
[08:50:54] (PS1) Rush: collector definition. CPU and Network collector enable. [operations/puppet] - https://gerrit.wikimedia.org/r/125104
[08:55:09] !log ms6 - shutdown -h now
[08:55:14] Logged the message, Master
[08:55:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[08:58:29] (PS1) BBlack: add lvs300[1234] to lvs::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/125105
[08:58:29] akosiaris: regarding "virbr0 reporting no carrier", is it important that I sort out and understand that, or can we just turn off that check?
[08:58:45] hashar: http://en.wikipedia.beta-hhvm.wmflabs.org/wiki/Main_Page
[08:58:49] in other words… is that check a hard-coded enumeration of interfaces, or does it automatically check every interface that it can find?
[08:59:35] …actually, was that one you or cmjohnson1?
[09:00:50] (CR) BBlack: [C: 2 V: 2] add lvs300[1234] to lvs::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/125105 (owner: BBlack)
[09:02:35] andrewbogott: it checks every interface it can find but we can skip interfaces (we already skip alias interfaces and vnets). But it may be better to actually know why that virbr0 interface is there (and is not there in some hosts).
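(Editor's note: the behaviour described at 09:02:35, checking every interface found while skipping some by name, can be sketched roughly as below. This is not the real check_eth code, which does not appear in this log. The skip patterns, the function names and the carrier data are assumptions made up for illustration.)

```python
import fnmatch

# Hypothetical skip patterns: loopback, alias interfaces such as "eth0:1",
# and vnet interfaces. The actual check_eth skip list is not shown in the log.
SKIP_PATTERNS = ["lo", "*:*", "vnet*"]


def interfaces_to_check(names, skip_patterns=SKIP_PATTERNS):
    """Return the interface names that do not match any skip pattern."""
    return [
        name
        for name in names
        if not any(fnmatch.fnmatch(name, pattern) for pattern in skip_patterns)
    ]


def find_no_carrier(carrier_by_iface, skip_patterns=SKIP_PATTERNS):
    """Return the non-skipped interfaces that report no carrier.

    carrier_by_iface maps an interface name to True (carrier present)
    or False (no carrier).
    """
    checked = interfaces_to_check(sorted(carrier_by_iface), skip_patterns)
    return [name for name in checked if not carrier_by_iface[name]]


# Made-up sample state resembling the virt100x alerts above.
sample = {"lo": True, "eth0": True, "eth0:1": False, "vnet3": False, "virbr0": False}
print(find_no_carrier(sample))
```

(With this sample only virbr0 is reported. Adding a "virbr*" pattern would silence the alert, which is the "just turn off that check" option; the reply in the channel leans toward finding out why the interface exists first.)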
[09:02:59] ori: you are the best :-]
[09:03:12] ori: did you get the cssmin corruption solved / figured out ?
[09:04:20] hashar: no, it went away on its own. i did my usual thing of changing things before i completely understood them. there are a lot of configuration values that are set on the basis of the request host and i didn't handle all of them
[09:04:47] !log Jenkins bunch of jobs are not being triggered properly. Taking traces.
[09:04:52] Logged the message, Master
[09:05:15] ori: that would explain. You might want to restart the bits varnish to clear out RL entries
[09:05:34] i ended up with something slightly kludgy, which is to add '$channel' (like the chrome release channels -- not a perfect analogy, i admit) as one of the string patterns that get replaced (with 'beta-hhvm' or 'beta' or 'stable')
[09:05:44] had to stop using bits, too, actually
[09:05:50] :-D
[09:05:56] the req-host rewriting for bits load.php urls needs literal domain name suffix (any regular expression syntax is escaped), and it only allows for a maximum of three dynamic parts that are passed to the application stack
[09:06:10] so given the suffix 'beta.wmflabs.org', you can have en.m.wikipedia.beta.wmflabs.org (the three dynamic parts being 'en', 'm', and 'wikipedia'). if i set the suffix to just 'wmflabs.org' then en.wikipedia.beta-hhvm.wmflabs.org would work but en.m would not
[09:06:28] akosiaris: OK, I'll make an attempt at understanding :)
[09:06:39] changing that means changing the vcl for prod too so i didn't want to touch that
[09:07:05] hashar: if you are wondering how i typed that so quickly it's because i copy-pasted from a chat with aaron earlier :P
[09:07:19] * hashar drops packets
[09:07:20] :D
[09:08:20] hashar: i think in hindsight you were right, though
[09:09:19] at this point the configuration for the hhvm-powered wiki and the php-powered ones are a little different -- not a lot, but enough to make performance comparisons meaningless
[09:09:45] so it would have probably been better to have the varnishes delegate to a separate set of apache instances
[09:11:08] (CR) Dzahn: [C: 2] "it's down, bye" [operations/dns] - https://gerrit.wikimedia.org/r/124901 (owner: Matanya)
[09:11:31] bye bye ms6
[09:11:39] !log DNS update - removing ms6
[09:11:43] Logged the message, Master
[09:15:39] ori: well you can still split them :-)
[09:16:09] ori: I think it would be nice to have Varnish act as a director between apaches and hhvm instances. Would let us load the hhvm cluster gradually and easily fallback to the apache/zend application servers
[09:16:21] hashar: yes, i think you're right
[09:16:34] i know you said it before but i was foolish and didn't listen :P
[09:16:35] we could even make it a beta feature which would set a cookie
[09:16:42] if varnish is able to route based on a cookie
[09:16:50] yes, it is, and yeah, we could
[09:16:52] matanya: what do you mean on 122338 in line 231
[09:16:52] and we will need to vary cache which has various implications
[09:16:59] matanya: what should i align there
[09:17:14] * matanya is looking
[09:17:29] hashar: i think i'll stick with having both interpreters on each apache until our quarterly review on the 15th, since getting it working was one of the targets rob set
[09:17:29] ori: don't blame yourself!!
[09:17:37] but afterwards probably split
[09:17:56] mutante: the arrows
[09:18:08] ori: the main goal was to get hhvm running on beta and I think it is fulfilled.
[09:18:15] matanya: they look aligned to me
[09:18:26] matanya: 230 and 231 are aligned
[09:18:31] ori: now the team can think about the target architecture and how to migrate :-]
[09:18:43] i don't remember what i meant there
[09:18:50] might be just a mistake
[09:18:56] ori: we will also need to investigate all the available hhvm settings and finely tune them. It is going to take a while
[09:19:00] hashar: it's going to be pretty huge
[09:19:03] yeah, i was just about to say
[09:19:04] matanya: ok! i thought maybe you mean that the whole file is 2-spaces :p
[09:19:07] there are a lot more knobs to twiddle
[09:19:30] maybe i did mutante hard to tell, i have a memory leak
[09:20:02] matanya: alright, no worries, thanks
[09:20:15] !log Jenkins unpooling both slave labs using the web interface and killing the Jenkins client running as jenkins-deploy . Will repool so the job can be reregistered properly {{bug|63760}}
[09:20:20] Logged the message, Master
[09:23:41] (CR) Springle: [C: 1] collector definition. CPU and Network collector enable. [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:23:55] (PS2) Rush: collector definition. CPU and Network collector enable. [operations/puppet] - https://gerrit.wikimedia.org/r/125104
[09:24:19] (CR) Rush: [C: 2] collector definition. CPU and Network collector enable. [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:28:13] (CR) Rush: [V: 2] "jenkins where are you?" [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:29:08] !log Jenkins: disabling Gearman client in https://integration.wikimedia.org/ci/configure and reenabling it
[09:29:13] Logged the message, Master
[09:32:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[09:33:13] (CR) Ori.livneh: "modules/graphite/lib/puppet/parser/functions/configparser_format.rb adds a configparser_format() function that you can use to turn a Puppe" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:38:50] (CR) Rush: "checking out the configparser_format function. If it seems good I will follow up with that in a bit." (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:39:23] (PS1) Ori.livneh: Move HHVM extension blacklist below extract($globals) so it isn't simply clobbered [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125108
[09:39:28] !log Zuul processed its backlog. Had to disconnect/reconnect the labs slaves. There is some weird bug occurring :-(
[09:39:34] Logged the message, Master
[09:40:55] (CR) Ori.livneh: collector definition. CPU and Network collector enable. (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/125104 (owner: Rush)
[09:41:47] (CR) Ori.livneh: [C: 2] Move HHVM extension blacklist below extract($globals) so it isn't simply clobbered [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125108 (owner: Ori.livneh)
[09:41:58] !log ori updated /a/common to {{Gerrit|I107179a27}}: Move HHVM extension blacklist below extract($globals) so it isn't simply clobbered
[09:42:04] Logged the message, Master
[09:44:07] !log ori synchronized wmf-config/CommonSettings.php 'I107179a27: Move HHVM extension blacklist below extract($globals) so it isn't simply clobbered'
[09:44:12] Logged the message, Master
[09:50:13] (PS1) BBlack: remove dysfunctional DNS recursor from LVSes [operations/puppet] - https://gerrit.wikimedia.org/r/125113
[09:50:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[09:52:42] (CR) BBlack: [C: 2 V: 2] remove dysfunctional DNS recursor from LVSes [operations/puppet] - https://gerrit.wikimedia.org/r/125113 (owner: BBlack)
[09:53:07] (PS1) Alexandros Kosiaris: Have check_sysctl populated on all hosts [operations/puppet] - https://gerrit.wikimedia.org/r/125116
[09:53:09] (PS1) Alexandros Kosiaris: Set up a check for LVS rp_filter is disabled [operations/puppet] - https://gerrit.wikimedia.org/r/125117
[09:54:34] (CR) jenkins-bot: [V: -1] Set up a check for LVS rp_filter is disabled [operations/puppet] - https://gerrit.wikimedia.org/r/125117 (owner: Alexandros Kosiaris)
[09:56:55] (CR) Aklapper: [C: -1] "01tonythomas: Did you test the patch on a local Bugzilla test instance and did it work?" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/124140 (owner: 01tonythomas)
[09:58:23] (CR) Alexandros Kosiaris: [C: 2] Have check_sysctl populated on all hosts [operations/puppet] - https://gerrit.wikimedia.org/r/125116 (owner: Alexandros Kosiaris)
[09:59:36] (PS1) Springle: Enable all MariaDB block-based joins algorithms [operations/puppet] - https://gerrit.wikimedia.org/r/125120
[10:02:09] (PS2) Alexandros Kosiaris: Set up a check for LVS rp_filter is disabled [operations/puppet] - https://gerrit.wikimedia.org/r/125117
[10:06:11] (PS2) Dzahn: lint role/deployment [operations/puppet] - https://gerrit.wikimedia.org/r/122338
[10:13:00] (CR) Matanya: "on labs i gut:" [operations/puppet] - https://gerrit.wikimedia.org/r/125116 (owner: Alexandros Kosiaris)
[10:13:17] ori: ^ this might interest you
[10:59:15] mutante: can i configure RT in labs ?
[10:59:41] there is already an instance, isn't it?
[10:59:48] * yuvipanda remembers setting up a proxy for one such thing
[11:01:47] where yuvipanda ?
[11:01:59] matanya: ottomata or drdee set it up, IIRC
[11:02:04] don't remember too well :(
[11:02:15] I setup a proxy at rt.wmflabs.org but that no longer seems to be around
[11:02:18] * yuvipanda pokes wikitech
[11:03:36] back
[11:04:06] matanya: aha! project packaging, instance 'rt'. status: SHUTOFF
[11:04:17] matanya: so i guess you can ignore it and assume it doesn't exist
[11:04:39] i don't find any puppet role i can apply
[11:06:24] matanya: what's your wikitech username so I can add you to the project?
[11:06:30] matanya
[11:06:31] matanya.
[11:06:35] haha :)
[11:06:51] :)
[11:06:55] added
[11:07:11] thanks
[11:07:17] :)
[11:07:32] I know nothing of the project / instance, just the fact it exists :)
[11:09:04] the box is dead :)
[11:09:14] i'll wait for mutante he will know, i guess
[11:14:45] matanya: :) ok!
[11:15:05] thanks for the help
[11:18:54] (CR) Dzahn: lint role/deployment (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122338 (owner: Dzahn)
[11:19:03] matanya: what's dead
[11:19:07] (CR) Springle: [C: 2] Enable all MariaDB block-based joins algorithms [operations/puppet] - https://gerrit.wikimedia.org/r/125120 (owner: Springle)
[11:19:24] mutante: i wanted to develop stuff for rt, and test it in labs
[11:19:42] matanya: i don't know about that instance
[11:19:42] but can't find a way to configure a rt role in labs
[11:19:59] matanya: role/rt.pp ?
[11:20:34] mutante: i meant using manage instance
[11:20:37] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:20:37] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:20:37] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:20:37] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
[11:21:04] matanya: it needs to be added to "puppet groups" then
[11:21:13] hold on
[11:21:57] (CR) Alexandros Kosiaris: [C: 2] Set up a check for LVS rp_filter is disabled [operations/puppet] - https://gerrit.wikimedia.org/r/125117 (owner: Alexandros Kosiaris)
[11:23:46] matanya: sooo.. a) create new project or have existing one b) be project admin of that project c) add any puppet class you want to puppet groups d) configure instance and apply
[11:24:11] matanya: existing project?
[11:24:17] yes mutante
[11:24:18] there's definitely a role class specifically for creating an RT labs instance...
[11:24:21] * andrewbogott looks for name
[11:24:28] rt-128 instance in the puppet project
[11:25:28] andrewbogott: role/rt.pp right
[11:25:36] that's for prod I think
[11:25:46] matanya, I can add you to the rt project if you like...
[11:26:02] oh, nm I deleted it apparently :) [11:26:24] yes, yuvipanda added me, but it is down [11:26:26] role::rt::labs [11:26:48] Um… differs from the prod role, possibly for good reasons or possibly for bad [11:27:06] matanya, what's it called? I can't find it [11:27:10] (the rt proj) [11:27:29] also: https://gerrit.wikimedia.org/r/#/c/116064/ [11:27:29] project packaging, instance 'rt' [11:27:36] oh, that's something else [11:27:37] how is it packaging [11:27:50] oh, i was sure it was merged [11:27:56] * andrewbogott is about to vanish again as meeting starts. [11:28:19] i'll delay this effort until that is merged. thanks andrewbogott and mutante [11:28:30] matanya: um… until what's merged? [11:28:50] the change linked above [11:29:02] https://gerrit.wikimedia.org/r/#/c/116064/ [11:29:13] oh, ok, I'm way behind then [11:29:18] anyway, gotta go [11:31:05] i'd have to say to be a real test it should be the identical role [11:31:09] not role:labs [11:31:20] but yea, we gotta run, ttyl matanya [11:31:40] thanks and bye [11:31:57] PROBLEM - RAID on nescio is CRITICAL: Connection refused by host [11:53:07] PROBLEM - RAID on brewster is CRITICAL: CRITICAL: Active: 2, Working: 2, Failed: 1, Spare: 0 [11:53:27] RECOVERY - RAID on nescio is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [11:54:07] ACKNOWLEDGEMENT - RAID on brewster is CRITICAL: CRITICAL: Active: 2, Working: 2, Failed: 1, Spare: 0 alexandros kosiaris well known. 
Brewster is to be decom soon [11:56:17] (03PS1) 10Alexandros Kosiaris: Correct check_sysctl source path [operations/puppet] - 10https://gerrit.wikimedia.org/r/125173 [11:57:27] (03CR) 10Alexandros Kosiaris: [C: 032] Correct check_sysctl source path [operations/puppet] - 10https://gerrit.wikimedia.org/r/125173 (owner: 10Alexandros Kosiaris) [11:57:39] (03CR) 10Alexandros Kosiaris: [V: 032] Correct check_sysctl source path [operations/puppet] - 10https://gerrit.wikimedia.org/r/125173 (owner: 10Alexandros Kosiaris) [11:59:55] (03PS1) 10BBlack: add esams.wmnet to strontium puppet allow_from [operations/puppet] - 10https://gerrit.wikimedia.org/r/125174 [12:03:04] (03CR) 10BBlack: [C: 032 V: 032] add esams.wmnet to strontium puppet allow_from [operations/puppet] - 10https://gerrit.wikimedia.org/r/125174 (owner: 10BBlack) [12:18:23] akosiaris: heh. thanks for that, too :) [12:20:01] ori: ? [12:20:23] the check_sysctl fix [12:20:35] (03CR) 10Matanya: "fixed in: https://gerrit.wikimedia.org/r/125173" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125116 (owner: 10Alexandros Kosiaris) [12:21:27] ah, yeah. still waiting on check on some machines but hopefully this chapter is almost closed :-) [12:29:59] (03PS1) 10Hashar: jenkins: logrotate main file on a daily basis [operations/puppet] - 10https://gerrit.wikimedia.org/r/125178 [12:30:26] akosiaris: if you have some spare cycles I could use https://gerrit.wikimedia.org/r/125178 :D Adjust logrotate for the Jenkins log file :] [12:30:59] (please) [12:31:01] 180 ? [12:31:05] days [12:31:08] instead of 52 weeks [12:31:09] 6 months of logs ? [12:31:23] that already cut it by half, but I am fine with less logs, maybe 60 days :] [12:31:45] you tell me. [12:31:48] how much is one day of logs? [12:31:59] 600MB uncompressed maybe? [12:32:08] ouch [12:32:43] the compressed weekly logs are 140M [12:32:58] so maybe 20 per day [12:33:06] sold! [12:33:22] 3.6G for 6 months. 
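The retention estimate worked out above (140M of compressed logs per week, so roughly 20M per day, so about 3.6G for 180 days) can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and the inputs are the rough figures quoted in the channel, not measurements.

```python
# Rough retention estimate for the Jenkins log, using the figures quoted in
# the conversation. All numbers are approximations, not measured values.
WEEKLY_COMPRESSED_MB = 140   # "the compressed weekly logs are 140M"
RETENTION_DAYS = 180         # retention proposed in the logrotate change


def retention_size_mb(weekly_mb: float, days: int) -> float:
    """Estimate the compressed size kept on disk for `days` daily rotations."""
    daily_mb = weekly_mb / 7          # average compressed size of one day
    return daily_mb * days


daily_mb = WEEKLY_COMPRESSED_MB / 7
total_mb = retention_size_mb(WEEKLY_COMPRESSED_MB, RETENTION_DAYS)
print(f"~{daily_mb:.0f} MB/day, ~{total_mb / 1000:.1f} GB for {RETENTION_DAYS} days")
```

With the same assumptions, the 60-day alternative floated in the discussion would come to about 1.2G.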
[12:33:26] yeah I think I am ok with that [12:33:34] (03PS2) 10Hashar: jenkins: logrotate main file on a daily basis [operations/puppet] - 10https://gerrit.wikimedia.org/r/125178 [12:33:37] and I filled a bug to upstream to reduce the amount of log [12:33:53] the main culprit is the Jenkins Gearman client plugin which has been written by openstack [12:34:01] it is solved upstream \O/ [12:34:32] LGTM but put a mode and ensure parameter [12:35:19] isn't ensure present always there ? [12:36:23] it defaults to present [12:36:24] (03PS3) 10Hashar: jenkins: logrotate main file on a daily basis [operations/puppet] - 10https://gerrit.wikimedia.org/r/125178 [12:36:29] ^^^ \O/ [12:36:36] but I learned to not trust puppet defaults in some case [12:36:43] unless I am the one declaring them [12:38:39] (03CR) 10Alexandros Kosiaris: [C: 032] jenkins: logrotate main file on a daily basis [operations/puppet] - 10https://gerrit.wikimedia.org/r/125178 (owner: 10Hashar) [12:39:16] thanks [12:39:31] running puppet on gallium [12:42:38] !log removed broken pdns_gmetric cronjob on lvs boxes [12:42:43] Logged the message, Master [12:44:04] damn jenkins.log has 30 miillions lines [12:50:05] (03Abandoned) 10Reedy: Non wikipedias to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124591 (owner: 10Reedy) [12:54:03] Reedy: i'm trying to figure out why localistion messages are not working in js (only) and how to poke/purge things [12:54:17] touched the resource / js stuff [12:54:42] did refresh message blobs [12:55:44] I was just about to say refresh message blobs... [12:55:49] i can try again [12:56:55] trying [12:57:54] no success [13:00:48] we put wikidata back on wmf20, so it should have good caches [13:00:57] hmm, i get a timeout every time i post [13:01:12] wmf21 is totally broken for wikibase (test.wikidata).. no messages in php + js [13:01:26] aude: what messages ? [13:01:35] localistion ? 
[13:01:37] all i18n messages from wikibase [13:01:43] core messages are ok [13:01:43] aude: I wonder if the rest of localisation update needs to run first.. [13:01:50] aren;t those in json ? [13:02:01] wmf20 = no, wmf21 yes [13:02:01] Yes [13:02:22] * aude scared to run localisation update [13:02:33] don't want to further break things [13:03:24] No Bryan... [13:03:32] not yet [13:03:43] and i totally wouldn't put wikipedias on wmf21 yet [13:03:59] I'm wondering if there's any point branching wmf22 [13:04:10] anyone has those timeouts too ? [13:04:10] no idea, probably not yet [13:04:13] Running 3 version of mw isn't going to go well [13:04:43] wikivoyage etc is on wmf21 and has a few missing messages from wikibase [13:04:53] e.g. "wikibase-data-item" in the sidebar instead of "Data item" [13:05:12] * aude not putting everything back on wmf20 though... it's only a few messages [13:06:06] wait [13:06:28] it could be varnishes but wikidata looks ok in firefox / lgoged out [13:06:48] huh [13:06:55] looks ok logged in [13:07:34] could be my caches now, maybe and maybe refresh message blobs worked this time [13:08:30] every time the request : [13:08:31] https://he.wikipedia.org/w/api.php?action=query&format=json&prop=extracts%7Cpageimages%7Crevisions%7Cinfo&redirects=true&exintro=true&exsentences=2&explaintext=true&piprop=thumbnail&pithumbsize=300&rvprop=timestamp&inprop=watched&indexpageids=true&titles=%D7%9C%D7%95%D7%A0%D7%93%D7%95%D7%9F [13:08:35] aborts [13:10:52] deleting local storage etc [13:10:56] (sure i did that before) [13:11:52] localisation update ran fine overnight [13:12:03] looked ok, but missed wikibase [13:12:05] I'm tempted to just run it again [13:12:07] and/or scap [13:12:09] i would try again [13:12:11] yes [13:12:11] or both [13:12:24] !log reedy Started scap: l10n cache update for wikidatawiki [13:12:24] localisation still broken in chrome [13:12:31] Logged the message, Master [13:13:29] * aude patiently waits [13:13:54] scap doesn't seem to have 
done anything for the wmf20 localisation cache [13:14:00] odd [13:14:13] wmf21 is taking a while [13:14:15] though should be up-to-date i assume [13:14:31] wmf21 might get wikibase, if it gets found [13:15:12] akosiaris: sorry, I'm looking at puppet and don't follow. Is there currently code to exclude interfaces in there? Or will I need to add that myself? [13:15:47] ./modules/base/templates/check_eth.erb [13:15:52] Reedy: does scap rebuild extension messages? [13:15:53] chain one more regexp [13:17:45] hm, maybe I'm looking at an out-of-date version here [13:18:03] aude: what exactly do you mean? ExtensionMessages-1.XXwmfXX.php ? [13:18:09] mw-update-l10n does it [13:18:10] yes [13:18:23] that might fix missing wikibase messages in wmf21 [13:19:37] still rebuilding wmf21 [13:19:42] ok [13:19:59] looks like scap calls mw-update-l10n [13:20:46] (03PS1) 10Andrew Bogott: Exclude virbr0 from eth interface tests. [operations/puppet] - 10https://gerrit.wikimedia.org/r/125183 [13:20:49] i see wikibase in there for wmf21 [13:21:03] 13:20:23 INFO - Finished mw-update-l10n (duration: 07m 50s) [13:21:14] ok [13:21:42] akosiaris: ^ [13:25:04] (03PS1) 10Ottomata: Adding $deployable_networks variable in network.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/125184 [13:26:25] (03PS1) 10BBlack: remove resolv.conf dependencies on dobson+mchenry(.pmtpa) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125185 [13:27:30] (03PS1) 10Alexandros Kosiaris: Move check_rp_filter_disabled check to role::lvs [operations/puppet] - 10https://gerrit.wikimedia.org/r/125186 [13:27:42] huh, wikidata has messages now [13:27:49] i cleared all the things in my browser [13:29:13] (03CR) 10BBlack: [C: 032 V: 032] remove resolv.conf dependencies on dobson+mchenry(.pmtpa) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125185 (owner: 10BBlack) [13:30:23] but test wikidata still broken [13:31:15] (03CR) 10Alexandros Kosiaris: "Technically correct. 
However I 'd prefer to know if the interface is really needed and why some hosts got it and some don't and try either" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125183 (owner: 10Andrew Bogott) [13:31:39] !log reedy Finished scap: l10n cache update for wikidatawiki (duration: 19m 15s) [13:31:43] Logged the message, Master [13:31:47] (03CR) 10Alexandros Kosiaris: [C: 032] Move check_rp_filter_disabled check to role::lvs [operations/puppet] - 10https://gerrit.wikimedia.org/r/125186 (owner: 10Alexandros Kosiaris) [13:32:07] That's scap done [13:32:47] ok [13:32:56] could be that's what fixed it [13:33:42] so, i suppose then wikidata ok... people need to purge their browser caches [13:34:04] test.wikidata / wmf21 not ok (also includes commons / wikivoyage etc) [13:34:48] l10nupdate running [13:35:44] ok [13:41:25] (03PS1) 10Ottomata: Setting checkout_submodules and gitfat_enabled for test/testrepo for testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/125189 [13:41:39] (03PS2) 10Ottomata: Setting checkout_submodules and gitfat_enabled for test/testrepo for testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/125189 [13:41:46] (03CR) 10Ottomata: [C: 032 V: 032] Setting checkout_submodules and gitfat_enabled for test/testrepo for testing [operations/puppet] - 10https://gerrit.wikimedia.org/r/125189 (owner: 10Ottomata) [13:42:25] akosiaris: about to puppet merge your role::lvs change [13:42:27] is that ok? [13:43:31] sure [13:43:43] thanks :-) [13:43:56] aude: Updated 197 JSON file(s) in '/a/common/php-1.23wmf20/cache/l10n'. 
[13:44:07] ok [13:46:48] still see on wikivoyage [13:47:09] that's wmf21 [13:47:27] we need bd808|BUFFER to do his progress bar magic to l10nupdate [13:47:32] :) [13:48:01] looking at extension messages, i don't see a reason for it not to work this time [13:49:26] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-10 13:49:24+00:00 [13:49:31] Logged the message, Master [13:49:34] alright [13:49:57] (03CR) 10Andrew Bogott: "I've restored virt1001 and 1002 to their original libvirt state, so now the icinga warning is present on all nova-compute hosts." [operations/puppet] - 10https://gerrit.wikimedia.org/r/125183 (owner: 10Andrew Bogott) [13:50:56] 1.23wmf20 done [13:50:59] 1.23wmf21 underway [13:51:36] PROBLEM - check configured eth on virt1001 is CRITICAL: virbr0 reporting no carrier. [13:51:36] PROBLEM - check configured eth on virt1002 is CRITICAL: virbr0 reporting no carrier. [13:51:40] ok [13:59:09] wfMessage( 'wikibase-edit' ); in testwikidata produces something now [14:08:22] ottomata: ignore the lvs rp_filter checks. [14:10:00] ottomata: did you want to start the elastic repartition today? [14:10:03] yes! [14:10:07] i certainly do! [14:10:17] i really really really want to fix this git-deploy submodule problem [14:10:31] submodule support isn't working [14:10:34] i had some work i was in the middle of yesterday that needs that [14:10:40] and its kinda in a funky state now until I fix it [14:10:45] so, i'm going to do that [14:10:52] and then next up is elastic repartition [14:11:05] scap is sure taking long time [14:12:03] akosiaris: i get a lot of 504 [14:12:19] 504 Gateway Time-out nginx/1.1.19 [14:12:51] ottomata: cool. submodules aren't super well supported by git. better then by hg, but annoying [14:17:19] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-10 14:17:16+00:00 [14:17:24] Logged the message, Master [14:19:16] ok... 
w/o js on test.wikidata, we have messages [14:21:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:21:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:21:07] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:21:07] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:21:13] It's still refreshing [14:21:17] ok [14:22:56] people are also still reporting problems with wikidata, even though it works for me now [14:34:24] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201099) [14:44:21] anybody hear have experience with git-deploy and submodules? [14:44:25] here* [14:44:28] ? [14:46:34] (03CR) 1001tonythomas: "I could test that on my local installation the time I prepared the patch, alright. Cant confirm now, as my BZ installation got currupted i" [wikimedia/bugzilla/modifications] - 10https://gerrit.wikimedia.org/r/124140 (owner: 1001tonythomas) [14:49:53] do submodules even work with git deploy? [14:50:14] assume they do now, but before they didn't [14:53:27] .... [14:54:05] huh, test wikidata looks ok now [14:54:09] scap almost done? [14:55:37] it's in Refreshing ResourceLoader caches somewhere [14:55:48] alright [14:55:59] maybe we can put wikidata back on wmf21 when it's done [15:00:07] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 10 15:00:01 UTC 2014 (duration 27m 49s) [15:00:12] Logged the message, Master [15:00:16] aude: ^^ [15:00:20] ok [15:00:41] i suggest put wikidata back on wmf21 [15:01:14] if you want to do that, or i can [15:02:06] hi bd808 [15:03:20] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki back to 1.23wmf21... 
[15:03:25] Logged the message, Master [15:03:29] thanks [15:04:38] * bd808 reads backscroll [15:05:01] wikidata still looks weird in firefox while test.wikidata looks fine [15:05:22] bd808: having issues with messages not available in js but available in php [15:05:31] (among other issues previosuly) [15:05:49] * Reedy waits while aude logs a bug titled "Wikidata looks weird in firefox" [15:05:54] heh [15:06:00] firefox worked before, though [15:06:06] when chrome was broken [15:06:10] now opposite [15:07:19] Both look ok [15:07:44] it's probably browser caching [15:08:27] but i assume the timestamps etc. in resource loader modules should update and force cache refresh [15:09:38] aude: Hmmm… I don't know anything about how the messages get to js land. I do know that there is a "Refreshing ResourceLoader caches" step at the end of the l10nupdate script [15:09:49] (03PS1) 10Hashar: beta: drop pmtpa instances from the natfix subclass [operations/puppet] - 10https://gerrit.wikimedia.org/r/125194 [15:09:56] i did that this morning, again this afternoon [15:10:04] It runs `extensions/WikimediaMaintenance/refreshMessageBlobs.php` [15:10:06] scap should do that so shouldn't need to do again [15:10:23] we can try touching all the wikidata things [15:10:39] (did that earlier in wmf20 and didn't exactly work) [15:10:44] Scap doesn't call that particular maintenance script but maybe it touches files that functionally do the same thing? [15:10:57] hmmm, i can try again [15:11:28] caching + resource loader + localisation is exceedingly frustrating [15:11:57] refreshing blobs [15:12:05] bd808: hey. Sorry still havent found the envy to look at scap on beta :-( [15:12:39] i did everything i could earlier except scap (seems bad idea when no one is around) [15:12:52] Reedy: AFAIK 1.23wmf22 should be safe today. You might poke greg-g and get his thoughts. We got group1 on 1.23wmf21 on Tuesday and the scap bug of using the old version to build ExtensionMessages should be fixed. 
[15:13:00] broken messages is better than possibly breaking more things [15:13:35] hashar: No worries. I'm making progress along the plan I laid out for it. Lots of puppet hacks^Wtricks :) [15:13:51] bd808: we can see how it goes with wikidata on wmf21 (hopefully not break again) [15:14:55] alright wikidata looks ok now [15:15:27] i suppose refresh blobs worked this time [15:17:00] (03PS1) 10Reedy: wikidatawiki back to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125196 [15:17:07] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on beta cluster puppetmaster. iptables -L -t nat now only list eqiad instances:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125194 (owner: 10Hashar) [15:17:09] (03CR) 10Reedy: [C: 032] wikidatawiki back to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125196 (owner: 10Reedy) [15:17:23] (03Merged) 10jenkins-bot: wikidatawiki back to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125196 (owner: 10Reedy) [15:19:02] bd808: I noticed it's still scheduled... And with wikidatawiki back on the correct verison, it would seem we're back on track... [15:19:30] i think we are ok, but i'm asking people to check [15:20:04] We can't run 3 versions of mw, but testwiki could be on a different version [15:21:06] greg-g: Can we deploy all the things? [15:21:34] 00:03:53.964 15:18:31 ERROR - mwversionsinuse failed: [Errno 2] No such file or directory: '/srv/common-local/wikiversions.json' [15:21:34] bouuhh :-( [15:21:47] o_0 [15:22:10] non local file system? [15:22:51] hashar: In beta? [15:23:01] That's me. 
[15:23:09] Damn [15:23:37] I should write something to file a bug automatically whenever that job fails hehe [15:23:37] or maybe make it send emails [15:24:45] hashar: I added a symlink that I think will get it working again [15:25:16] not sure why it points to scap though [15:25:17] That's a side effect of me being half done with putting scap in place there [15:25:20] ahh [15:25:25] +0,5 so :] [15:25:38] it is being rerun https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/2438/console [15:25:46] (03PS1) 10Yuvipanda: toollabs: Add maven to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/125200 [15:25:48] (03PS1) 10Yuvipanda: toollabs: Add expect to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125201 [15:26:13] Coren: ^ trivial package additions? [15:27:03] /usr/local/bin/mwversionsinuse has been a symlink to /srv/scap/bin/mwversionsinuse for quite a while. I changed the scap.cfg late yesterday to start looking at new directories. I didn't think that completely through obviously. [15:27:23] * bd808 shakes fist at interlocking dependencies [15:28:46] bd808: another fun failure is that fawiki does not exist on beta cluster. Might want to reference wikiversions-labs.json or something with -labs [15:28:49] I did remember to add a symlink to keep mw-update-l10n working in beta, but I forgot about mwversionsinuse looking at content in the deploy directory too [15:28:50] see bottom of https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/2438/console [15:29:01] /usr/local/apache/common-local/wikiversions.cdb has no version entry for `fawiki`. [15:29:10] scfc_de: did your patch adding tools.wmflabs.org to /etc/hosts get merged? [15:29:55] bd808: apparently you are not generating a wikiversions-labs.cdb :( [15:29:56] $ ls wikiversio* [15:29:56] wikiversions.cdb wikiversions.json wikiversions-labs.json [15:31:05] yuvipanda: No.
[15:31:08] (03PS2) 10Yuvipanda: toollabs: Add expect to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125201 [15:31:18] scfc_de: :( [15:31:30] scfc_de: any particular reason other than 'have not found someone to pester enough' [15:31:38] yuvipanda: https://gerrit.wikimedia.org/r/#/c/123149/ [15:32:09] bd808: when localisation update runs, does ExtensionMessages get rebuilt? [15:32:13] hashar: I think it reads wikiversions-labs.json but writes as wikiversions.cdb [15:32:28] aude: No, That only happens during scap [15:32:32] oh [15:32:39] so i should have run scap yesterday [15:32:55] bd808: indeed, the cdb keys are http://paste.openstack.org/show/75486/ [15:33:08] (03CR) 10Yuvipanda: "Anyone?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/123149 (owner: 10Tim Landscheidt) [15:33:19] just want to know what we / i did wrong [15:33:23] Did you have a backport that changed extension i18n files? [15:34:00] due to bug in our wmf21 version of wikibase, we put wikibase back on wmf19 (older version that is in wmf20 core) [15:34:02] yuvipanda: I generally don't pester :-). My motivation lies in solving puzzles. [15:34:14] i think that was the problem, that extension messages was then wrong [15:34:36] other question is when we put wikidata back on wmf20 core, why the messages didn't work in js [15:34:49] despite touching things and refreshing message blobs [15:34:58] aude: Ah. Yeah. That would need a scap for sure. Or at least a manual run of mw-update-l10n [15:35:07] This crap is too complicated [15:35:07] noted for next time! [15:35:18] scfc_de: heh :) [15:35:23] scfc_de: I'll pester later today [15:35:27] * aude thinks our users are understanding and patience  [15:35:31] wikidata users [15:35:35] patient* [15:35:55] I really want to get to the point that we don't try to update bits and pieces and just run scap (or the scap replacement) full deploy anytime things change [15:36:05] and then why js messages didn't work? 
or how to really poke them [15:36:09] bd808: +1 [15:37:17] wmf20 core had correct messages in php (non-js) [15:38:21] * bd808 tails https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/2439/console to see if beta will be fixed now [15:38:27] bd808: do you deploy anthing using git-deploy that also uses submodules? [15:38:42] ottomata: I don't. Parsoid does though [15:38:47] yeah ok [15:39:00] who should I ask? [15:39:04] i got it working, but I think I found a big bug [15:39:06] not sure how it works for them [15:39:13] It needs some config in … that one place in puppet :/ [15:39:18] yuvipanda: Re https://gerrit.wikimedia.org/r/#/c/125200, sure about maven vs. maven2? modules/contint/manifests/packages.pp seems to use the latter, maven is maven3, I think. [15:39:34] scfc_de: it is maven3, yeah. maven2 is fairly old [15:39:58] yuvipanda: Just wanted to make sure. [15:40:00] ottomata: It's you, me and Ryan for most understanding of trebuchet. Maybe Ori knows a bit? [15:40:03] scfc_de: :) [15:40:32] (03CR) 10Tim Landscheidt: [C: 031] toollabs: Add maven to dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/125200 (owner: 10Yuvipanda) [15:40:32] bd808: I filed this issue [15:40:32] https://github.com/trebuchet-deploy/trigger/issues/27 [15:40:34] basically [15:40:40] on tin [15:40:50] .git/modules//info/refs doesn't exist [15:40:53] unless you do [15:41:00] cd .git/modules/ && git update-server-info [15:41:05] and, if it doesn't exist [15:41:19] the targets can't clone the submodule from tin [15:41:32] (03PS1) 10Odder: Local assignment of accountcreator flag on ptwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125210 [15:41:37] because via http, they don't know how to list the available refs, and then don't know what commit to clone at [15:41:59] it looks like Ryan has seen this before, because he has a script to fix it [15:42:04] but nothing runs that script automatically [15:42:06] ottomata: Hmm… so a bootstrapping 
problem basically for a new repo with submodules? [15:42:10] yeah [15:42:30] That seems likely. There are several bootstrapping dragons in trebuchet still [15:43:05] The problem is of course that we don't add new things to it often enough [15:43:27] (03PS3) 10Ottomata: Puppetizing Camus cronjob [operations/puppet] - 10https://gerrit.wikimedia.org/r/121546 [15:43:43] The long-term users have been manually poked and prodded over time and mask old and new bugs [15:44:34] At some point when we are using all this kit in beta we should do a wipe-and-start-over test on a periodic basis [15:45:10] I wonder if I should start the branching process [15:45:39] Reedy: I don't see why not. [15:46:13] It sounds like the wikidata problem was related to an extension downgrade without a full scap [15:46:17] quite sure i know what went wrong with localisation update and shouldn't happen again [15:46:22] i say go ahead [15:47:05] (03PS4) 10Ottomata: Puppetizing Camus cronjob [operations/puppet] - 10https://gerrit.wikimedia.org/r/121546 [15:47:11] (03CR) 10Ottomata: [C: 032 V: 032] Puppetizing Camus cronjob [operations/puppet] - 10https://gerrit.wikimedia.org/r/121546 (owner: 10Ottomata) [15:47:15] lesson learned! [15:47:23] * aude will run scap next time [15:48:10] we can break testwiki at least [15:48:40] aude: If I had been paying attention to what you were doing I may have realized that needed to be done too. [15:48:48] it's alright [15:48:51] *may have* [15:49:04] fixed now, not the worse thing to happen [15:49:36] shouldn't happen again and perhaps i know how to fix quicker if it does [15:49:39] phew, ok manybubbles I am ready! [15:49:40] I would say any time an extension is changed other than a trivial bug backport a full scap is prudent [15:49:42] sorry that took so long [15:49:48] ok [15:49:48] elastic1003 is the first target, eh? [15:49:51] noted! [15:49:55] ottomata: np - don't worry - the first step on that will take a while any way [15:49:57] sure! 
[15:50:00] (03CR) 10Tim Landscheidt: [C: 031] "Oh, memories! I think I used expect some time in 1998 or so as a fresh Linux user to script dialing up into a BBS, auth'ing and up- and d" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125201 (owner: 10Yuvipanda) [15:51:15] ok, manybubbles, i'm going to issue the rolling restart commands on elastic1003 [15:51:16] ja? [15:51:34] ottomata: cool. slow and steady [15:51:52] first one, then wait for all the nodes to be gone then check again, etc [15:51:55] i'll watch [15:52:20] ha, it is so hard for me to type elastic10xx instead of analytics10xx [15:52:24] finger memory! [15:52:27] yeah! [15:52:40] i was all like 'wha, where's es???' [15:52:40] hah [15:53:36] ha [15:53:38] k, its moving shards [15:53:40] well, the shards are moving [15:54:07] you can see it in ganglia too [15:54:11] oh? [15:54:27] network? [15:54:35] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:34] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 540931 bytes in 9.721 second response time [15:57:54] <_joe_> wow, is the gitblit drama going to happen again? [15:57:57] manybubbles: how long should this take? [15:58:20] <_joe_> it seems that something tanks it on thurrsday [15:58:27] ottomata: tens of minutes [15:58:31] less than ano hour [15:58:40] you can see it on the network and a bit on the load [15:59:13] ok [16:00:28] _joe_: New branches are cut on Thursday. Does gitblit try to index things? If so I bet that's what causes it to be overworked one day a week. [16:02:53] At $DAYJOB-1 our FishEye server would freak out on every branch cut day because of the way it indexed all the commit history for a branch. [16:04:53] FishEye tex sux [16:05:37] MaxSem: Yeah, but crucible was pretty nice. [16:05:56] ottomata: elastic1003 has less then half its remaining shards. 
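The "wait until the shards have moved, then check again" step discussed above could be automated along these lines. This is a hedged sketch, not the procedure actually run here: the function name and sample payloads are hypothetical, and the field names (`status`, `relocating_shards`, `initializing_shards`, `unassigned_shards`) assume the JSON shape returned by Elasticsearch's `_cluster/health` endpoint. It only says whether the cluster as a whole has settled; whether a particular node is empty would still need a separate check.

```python
import json


def cluster_is_settled(health: dict) -> bool:
    """Return True if a cluster-health response shows a green cluster
    with no shards still relocating, initializing, or unassigned."""
    return (
        health.get("status") == "green"
        and health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


# Hypothetical sample payloads, shaped like a _cluster/health response.
still_moving = json.loads(
    '{"status": "green", "relocating_shards": 4,'
    ' "initializing_shards": 0, "unassigned_shards": 0}'
)
settled = json.loads(
    '{"status": "green", "relocating_shards": 0,'
    ' "initializing_shards": 0, "unassigned_shards": 0}'
)

print(cluster_is_settled(still_moving))
print(cluster_is_settled(settled))
```

In practice the dictionary would come from polling the health endpoint every so often, and the node would only be stopped once the check passes.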
[16:06:14] k [16:06:34] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:10] ottomata: be back in a bit [16:09:34] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 543374 bytes in 9.786 second response time [16:10:12] mutante: what? https://wikitech.wikimedia.org/w/index.php?title=Server_Admin_Log&diff=prev&oldid=109250 [16:10:18] Why again? [16:12:34] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:35] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 543276 bytes in 9.658 second response time [16:13:50] Hey greg-g, FYI we have https://www.mediawiki.org/wiki/Multimedia/Media_Viewer/Release_Plan#Timeline which is a little more concrete now, and includes links to config patches [16:14:11] <_joe_> bd808: that may be the reason [16:14:12] I am actually not sure if I've already mentioned this. [16:14:19] Too many brain cells dying [16:14:22] !log reedy updated /a/common to {{Gerrit|I2cccebdd7}}: wikidatawiki back to 1.23wmf21 [16:14:27] (03PS1) 10Reedy: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125216 [16:14:27] Logged the message, Master [16:15:02] (03CR) 10Reedy: [C: 032] Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125216 (owner: 10Reedy) [16:15:09] (03Merged) 10jenkins-bot: Add/update symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125216 (owner: 10Reedy) [16:15:39] greg-g: around? [16:15:42] !log reedy Started scap: testwiki to 1.23wmf22 and build l10n cache [16:15:47] Logged the message, Master [16:26:19] (03PS1) 10Alexandros Kosiaris: Correct LVS sysctl oid checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/125220 [16:28:27] (03CR) 10BryanDavis: "A few random comments/questions inline. I'm fine with it as is though too." 
(033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/125184 (owner: 10Ottomata) [16:36:44] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:17] (03CR) 10Alexandros Kosiaris: [C: 032] Correct LVS sysctl oid checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/125220 (owner: 10Alexandros Kosiaris) [16:38:44] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 544725 bytes in 9.910 second response time [16:40:27] !log reedy Finished scap: testwiki to 1.23wmf22 and build l10n cache (duration: 24m 45s) [16:40:32] Logged the message, Master [16:42:44] !log reedy updated /a/common to {{Gerrit|Ie72029103}}: Add/update symlinks [16:42:48] Logged the message, Master [16:43:37] (03PS1) 10Reedy: testwiki to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125223 [16:43:40] (03PS1) 10Reedy: Wikipeidas to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125224 [16:43:41] (03PS1) 10Reedy: group0 wikis to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125225 [16:43:50] (03CR) 10Reedy: [C: 032] testwiki to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125223 (owner: 10Reedy) [16:44:00] (03Merged) 10jenkins-bot: testwiki to 1.23wmf22 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125223 (owner: 10Reedy) [16:45:46] ^ testwiki can be tested [16:51:38] (03PS2) 10Reedy: Wikipedias to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/125224 [16:52:08] mh, whenever we log people out of BZ, jenkins gets stalled as it seems [16:52:41] ottomata: that took longer then I though [16:52:44] *thought* [16:52:45] almost done [16:52:47] op! [16:52:49] there it goes [16:52:50] cool [16:53:01] actually is done [16:53:04] yeah [16:53:18] ok, so, shutting down es, ja? 
[16:53:27] on 1003 [16:54:13] ok, doing that [16:54:24] ya [16:54:37] status is still green cool [16:54:39] whole cluster is still healthy while lacking one node [16:54:42] yeah [16:54:44] good [16:54:44] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:29] ok, shutting down elastic1003 for reinstall [16:55:44] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 546115 bytes in 9.905 second response time [16:55:44] !log shutting down elastic1003 for reinstall and reformat [16:55:50] Logged the message, Master [16:56:54] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [16:58:34] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:02] ah, duh, manybubbles, I need to change partman recipe for this first…doing that now [17:03:08] (03PS1) 10Ottomata: Using raid1-1partition.cfg for elastic* nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125226 [17:03:44] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [17:03:45] (03CR) 10Ottomata: [C: 032 V: 032] Using raid1-1partition.cfg for elastic* nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125226 (owner: 10Ottomata) [17:05:44] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:54] PROBLEM - RAID on elastic1003 is CRITICAL: Timeout while attempting connection [17:05:54] PROBLEM - check configured eth on elastic1003 is CRITICAL: Timeout while attempting connection [17:06:04] PROBLEM - SSH on elastic1003 is CRITICAL: Connection timed out [17:06:44] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 546478 bytes in 9.819 second response time [17:06:44] PROBLEM - puppet disabled on elastic1003 is CRITICAL: Timeout while attempting connection [17:06:44] PROBLEM - DPKG on elastic1003 is CRITICAL: Timeout while attempting 
connection [17:06:54] PROBLEM - Disk space on elastic1003 is CRITICAL: Timeout while attempting connection [17:07:07] (CR) Cmcmahon: "Can this be reverted or made more correct? It is causing all post-login redirects to fail in beta labs https://gerrit.wikimedia.org/r/#/" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125062 (owner: Ori.livneh) [17:08:20] ori: ^^ [17:09:44] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:14] PROBLEM - MySQL InnoDB on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:24] PROBLEM - MySQL Idle Transactions on db1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:11:04] RECOVERY - MySQL InnoDB on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [17:11:14] RECOVERY - MySQL Idle Transactions on db1016 is OK: OK longest blocking idle transaction sleeps for 0 seconds [17:14:54] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:05] (CR) Cmcmahon: "this seems to still be happening https://bugzilla.wikimedia.org/show_bug.cgi?id=63780" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125070 (owner: Ori.livneh) [17:17:54] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 545882 bytes in 9.923 second response time [17:21:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:21:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:21:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:21:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:24:44] PROBLEM - NTP on elastic1003 is CRITICAL: NTP CRITICAL: No response from NTP server 
[17:25:59] so, the monitoring sprint didn't care about the constantly alerting/noisy false alert for lvs300x? :P [17:27:14] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:08] manybubbles: ottomata I assume that ^^ is known given the scroll back, right? [17:28:22] greg-g: thanks! [17:28:27] yeah, known [17:28:29] expected [17:28:32] kk [17:28:41] yup its ok, its coming back up now [17:28:43] whole cluster still green, but that node is indeed down [17:28:45] sweet [17:28:54] RECOVERY - SSH on elastic1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [17:29:04] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:29:06] right now [17:29:17] ottomata: with puppet and stuff? [17:29:51] no puppet [17:30:09] greg-g: Saw my question? [17:30:18] oh, sorrry, no [17:30:23] I was on a long call [17:30:46] hoo: "around?" kinda, I'mma gonna go get some food real quick, if that's ok? [17:30:50] hoo: any rush? [17:31:06] No, no need to hurry [17:32:02] naw still need to format [17:32:03] not puppet yet [17:34:27] cool [17:34:39] also good to check that puppet sets up Elasticsearch properly [17:34:45] last time it screwed up some settings [17:34:51] so I'll check it manually after the puppet run [17:39:20] (PS1) Ottomata: Meant to use raid1-30G, not raid1-1partition for elastic nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125234 [17:39:36] rats, manybubbles, i used the wrong partman, i meant that one :/ [17:39:38] need to reinstall [17:39:41] again [17:39:46] k [17:39:49] its ok. [17:39:50] s'ok though, once we get it on the first the others will be easy [17:39:53] the cluster doesn't mind [17:39:57] sure [17:40:17] I think I'll go get some pizza. I'll be back in 30 or so. Any chance of puppet run before then? 
[17:40:20] (CR) Ottomata: [C: 2 V: 2] Meant to use raid1-30G, not raid1-1partition for elastic nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125234 (owner: Ottomata) [17:40:28] naw probably not [17:40:32] i'll wait for puppet run either way [17:40:34] til you are ready [17:42:22] k [17:43:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 543431 bytes in 9.682 second response time [17:45:44] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:44] jdlrobson: yt? [17:50:54] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [17:52:54] PROBLEM - SSH on elastic1003 is CRITICAL: Connection refused [17:55:16] (PS1) Ottomata: Adding howief, jmorgan and msyed to bast1001 [operations/puppet] - https://gerrit.wikimedia.org/r/125237 [17:55:54] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:24] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 73.97 ms [17:57:29] akosiaris: around ? 
[17:58:03] (CR) Ottomata: [C: 2 V: 2] Adding howief, jmorgan and msyed to bast1001 [operations/puppet] - https://gerrit.wikimedia.org/r/125237 (owner: Ottomata) [17:58:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 542644 bytes in 9.682 second response time [18:01:14] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1325 MB (2% inode=86%): [18:03:24] Reedy: I assume you've all the information you need from Multimedia re our release/backport today but let me know if you need anything [18:05:27] (PS1) Yuvipanda: toollabs: Add some i386 compat packages to exec nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125241 [18:05:29] scfc_de: ^ [18:05:35] Coren: ^ merge too [18:06:23] Coren: I hope that syntax will work :) [18:07:07] yuvipanda: I'll test on Toolsbeta later. [18:07:54] PROBLEM - Host elastic1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:09] scfc_de: \o/ [18:08:54] RECOVERY - SSH on elastic1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [18:09:04] RECOVERY - Host elastic1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:11:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:12:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 541642 bytes in 9.847 second response time [18:20:56] Reedy: in case you didn't see it, there's a backport for wmf21 that should go out before mw.org gets wmf22: https://gerrit.wikimedia.org/r/#/q/125213,n,z [18:22:50] ottomata: back - that took longer then expected [18:22:57] cool, just in time! 
[18:23:02] /dev/md2 494G 198M 494G 1% /var/lib/elasticsearch [18:23:39] ottomata: looks good [18:23:43] let me know when you puppet [18:23:48] and I'll check the args [18:23:59] running in 1 min [18:24:37] k its running now [18:24:51] sloooooooooooooooooooow [18:25:38] manybubbles: can you give me a section heading on your Search wikitech page where I can put some shell commands to help automate formatting this? [18:25:58] ottomata: just jam when wherever you like [18:26:03] yeah not sure where! [18:26:25] maybe ust under Administration - filesystem formatting or something? [18:27:44] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:31:08] manybubbles: https://wikitech.wikimedia.org/wiki/Search#Filesystem_Formatting [18:32:29] ottomata: needs to have git-deploy run to pick up plugins [18:32:35] RECOVERY - puppet disabled on elastic1003 is OK: OK [18:32:35] RECOVERY - DPKG on elastic1003 is OK: All packages OK [18:32:36] and Elasticsearch restarted [18:32:42] ok [18:32:44] RECOVERY - Disk space on elastic1003 is OK: DISK OK [18:32:54] RECOVERY - check configured eth on elastic1003 is OK: NRPE: Unable to read output [18:32:54] RECOVERY - RAID on elastic1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:32:58] that'll have to be once each - before pullet if you can ? [18:32:59] manybubbles: can I run that now? [18:33:02] yeah [18:33:05] before puppet? 
[18:33:08] hm, don't thikn I can do that [18:33:16] PROBLEM - Host db1016 is DOWN: PING CRITICAL - Packet loss = 100% [18:33:29] not without multiple puppet commits to separate the elasticsearch stuff out on the first run [18:33:38] shouldn't matter too much, actually [18:33:42] so long as you don't change the ip [18:33:43] !log reedy synchronized php-1.23wmf21/extensions/MultimediaViewer [18:33:44] RECOVERY - Host db1016 is UP: PING OK - Packet loss = 0%, RTA = 344.76 ms [18:33:47] Logged the message, Master [18:34:02] \o/ [18:34:10] and you run git-deploy and restart before you remove the ip from the exclusion list [18:34:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:21] it won't have any shards on it so shooting it is no problem [18:35:14] ah ok [18:35:15] right ok [18:35:26] (CR) Reedy: [C: 2] Wikipedias to 1.23wmf21 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125224 (owner: Reedy) [18:36:04] AaronSchulz: what instance are you using on labs for HHVM work? 
[18:36:45] (PS1) Manybubbles: Lower udp2log maxage [operations/puppet] - https://gerrit.wikimedia.org/r/125247 [18:36:49] (Merged) jenkins-bot: Wikipedias to 1.23wmf21 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125224 (owner: Reedy) [18:37:20] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf21 [18:37:25] Logged the message, Master [18:38:23] (PS2) Reedy: group0 wikis to 1.23wmf22 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125225 [18:39:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 535750 bytes in 9.804 second response time [18:40:11] (CR) Reedy: [C: 2] group0 wikis to 1.23wmf22 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125225 (owner: Reedy) [18:40:18] (Merged) jenkins-bot: group0 wikis to 1.23wmf22 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125225 (owner: Reedy) [18:41:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.23wmf22 [18:41:45] (CR) Chad: "Deja vu? Didn't I write the same patch before basically?" [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [18:41:48] Logged the message, Master [18:42:05] ok manybubbles, deployed plugins. I had to accept the new salt key too [18:42:07] !log reedy synchronized docroot and w [18:42:07] but, done. [18:42:11] (PS2) Reedy: First pilot site for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125031 (owner: MarkTraceur) [18:42:12] Logged the message, Master [18:42:16] ottomata: cool! [18:42:22] (CR) Reedy: [C: 2] First pilot site for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125031 (owner: MarkTraceur) [18:42:26] preilly: there is http://en.wikipedia.beta-hhvm.wmflabs.org/wiki/Special:SpecialPages ... 
I just use my own VM atm though [18:42:41] ottomata: seems to be missing a plugin [18:42:52] (Merged) jenkins-bot: First pilot site for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125031 (owner: MarkTraceur) [18:42:59] AaronSchulz: okay cool [18:43:00] oh i didn't restart [18:43:03] i shoudl do that, ja? [18:43:09] elasticsearch there [18:43:16] manybubbles: ^ [18:43:20] yeah! [18:43:42] done [18:43:50] !log reedy synchronized database lists files: Enable MediaViewer on mediawikiwiki [18:43:53] (CR) Manybubbles: "I have to admit I didn't search through previous patches before doing this." [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [18:43:54] Logged the message, Master [18:44:24] !log reedy synchronized wmf-config/InitialiseSettings.php 'touch' [18:44:29] Logged the message, Master [18:44:48] (CR) Ottomata: "I think the policy is 90 days. Should this be maxage 90?" [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [18:44:54] manybubbles: how's elastic1003 look now? [18:45:34] RECOVERY - NTP on elastic1003 is OK: NTP OK: Offset -0.01913952827 secs [18:45:35] (CR) Manybubbles: "I'm happy with 90. I just wanted to pick a number < 3 months including a potential one day lag for logrotate to run." [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [18:46:04] ottomata: users of he.wiki (and me among them) complain about timeouts and slowness throughout the day [18:46:17] anything known? 
[18:46:19] Reedy: Issues with the StripeButtons file not loading it looks like [18:46:34] i got 504 myself a few times today [18:46:57] For anon users [18:47:01] Prolly just caching issue [18:47:31] Ah, working [18:47:34] matanya: not htat I know of, there are a bunch of strange warnings about lvs* nodes in icinga, but I know that akosiaris was messing with monitoring of those earlier today [18:47:44] so its probably just some new monitoring with kinks [18:47:52] hmm [18:48:05] that wouldn't cuase 504's i guess [18:48:23] :/s/cuase/cause [18:48:32] manybubbles: can I reenabled elastic1003 for the cluster? [18:48:35] so it starts moving shards? [18:49:36] ottomata: yeah, looks great! [18:49:47] AaronSchulz: can we make a HHVM project on labs so that we can work together on testing? [18:50:12] ottomata: you can also start moving shards off the next one if you like - you don't have to wait for the all the moves to stop [18:50:12] oh ha, we should install jq on all nodes, eh? [18:50:13] manybubbles: ? 
[18:50:17] oh ok [18:50:19] will do that [18:50:20] elastic1004 [18:50:22] ottomata: yeah, jq good [18:51:01] sweet [18:52:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:18] (PS1) Ottomata: Installing jq on all elasticsearch server nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125254 [18:53:03] (CR) Manybubbles: [C: 1] Installing jq on all elasticsearch server nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125254 (owner: Ottomata) [18:53:23] (CR) Ottomata: [C: 2 V: 2] Installing jq on all elasticsearch server nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125254 (owner: Ottomata) [18:53:26] ottomata: elasticsearch is more aggressive at moving stuff off of the node that you are forbidding them moving stuff back to the new node [18:53:37] so you should give it a rest after doing a couple [18:53:43] ah, hm, ok [18:53:43] but if you do 2-3 a day it should be ok [18:53:47] alright cool [18:54:17] if you do them at "night" (UTC's night) then you don't have to let them equal out as much [18:54:21] rather, it is less important [18:54:46] it makes load to move things around but not too much [18:54:57] (CR) Hashar: "Weak +1, I got two nitpicks that you can safely discard =]" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122338 (owner: Dzahn) [18:55:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 535393 bytes in 9.641 second response time [18:55:40] ottomata: also, the /_cat api might replace some of the need for jq [18:55:43] some of it, I think [18:55:56] hm, ok [18:56:26] if I were me, I'd just install jq everywhere, tis very useful [18:56:34] awk is everywhere, why not jq? 
:p [18:59:20] :) [19:02:03] (PS1) MarkTraceur: Enable survey on MediaWiki.org for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125258 [19:02:29] (CR) Hashar: "Isn't it useful to keep six months of data to find out possible performance regressions that we might take time to notice? Just wondering" [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [19:03:06] Reedy: If you've time/interest to deploy https://gerrit.wikimedia.org/r/125258 I wouldn't stop you, but it's not necessary, we can SWAT it out [19:03:49] (CR) Manybubbles: [C: -1] "Sure! If we can separate the files with pii from the ones without then I'm all for it." [operations/puppet] - https://gerrit.wikimedia.org/r/125247 (owner: Manybubbles) [19:04:21] marktraceur: There's still time in the window.... [19:04:30] Only barrier is interest, then :) [19:08:11] too many metaphors [19:08:41] Nemo_bis: Metaphors are like dollar bills, you can never have too many [19:09:05] Yes, I'm sure Bernanke would agreee [19:09:09] And similes are like people who don't understand what metaphors are, there are way too many in the world [19:09:59] <^demon> Dollar bills: https://www.youtube.com/watch?v=iR6oYX1D-0w [19:10:07] Dolla dolla bill y'all [19:10:35] (PS2) Hashar: role::beta::uploadservice to allow port 80 [operations/puppet] - https://gerrit.wikimedia.org/r/122786 [19:10:48] (CR) Hashar: [C: 1 V: 2] "rebased." 
[operations/puppet] - https://gerrit.wikimedia.org/r/122786 (owner: Hashar) [19:14:47] (PS1) Ori.livneh: Work around bug 63780 by specifying a siteParamsCallback [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125264 [19:15:35] (CR) Ori.livneh: [C: 2] Work around bug 63780 by specifying a siteParamsCallback [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125264 (owner: Ori.livneh) [19:15:39] (CR) Reedy: [C: 2] Enable survey on MediaWiki.org for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125258 (owner: MarkTraceur) [19:15:46] !log ori updated /a/common to {{Gerrit|Ia79b1b848}}: Work around bug 63780 by specifying a siteParamsCallback [19:15:51] Logged the message, Master [19:16:28] !log ori synchronized wmf-config/InitialiseSettings-labs.php 'Ia79b1b848: Work around bug 63780 by specifying a siteParamsCallback' [19:16:32] Logged the message, Master [19:16:47] (Merged) jenkins-bot: Enable survey on MediaWiki.org for Media Viewer [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125258 (owner: MarkTraceur) [19:17:17] chrismcmahon: fixed [19:17:32] ori: awesome, thanks, gerrit just told me :-) [19:17:41] chrismcmahon: sorry bout that [19:19:08] ori: that's what beta is for :-) although when something central breaks like that late on Wed it makes me a bit nervous, because late Wednesdays is usually all of the hairy merges that show up as bugs on Thursday morning. no big deal though, thanks for fixing it. 
[19:19:26] !log reedy synchronized wmf-config/InitialiseSettings.php 'I5501078cee871fb9df03e085547b7a047ef5bd7e' [19:19:31] Logged the message, Master [19:20:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:27] ori, any thoughts on the vagrant patch for user [19:20:45] yurik2: haven't had a chance at all, sorry -- but i'll look today [19:21:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 535807 bytes in 9.577 second response time [19:25:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:34] manybubbles: elastic1004 shards are off [19:28:37] proceeding [19:28:41] (PS1) MaxSem: Revoke my key, I'm relocating to SF [operations/puppet] - https://gerrit.wikimedia.org/r/125267 [19:28:49] ottomata: concur [19:29:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 535033 bytes in 9.860 second response time [19:29:55] !log reinstalling elastic1004 [19:30:00] Logged the message, Master [19:31:54] PROBLEM - Host elastic1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:36:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [19:37:04] RECOVERY - Host elastic1004 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [19:37:32] good luck MaxSem [19:37:45] :) [19:39:04] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [19:39:04] PROBLEM - puppet disabled on elastic1004 is CRITICAL: Connection refused by host [19:39:14] PROBLEM - check configured eth on elastic1004 is CRITICAL: Connection refused by host [19:39:14] PROBLEM - SSH on elastic1004 is CRITICAL: Connection refused [19:39:15] PROBLEM - DPKG on elastic1004 is CRITICAL: Connection refused by host [19:39:24] PROBLEM - RAID on elastic1004 is CRITICAL: Connection refused by host [19:40:04] PROBLEM - Disk space on elastic1004 is CRITICAL: 
Connection refused by host [19:47:15] ori-afk, thx, let me know. That patch has one problem - it requires Vagrantfile modifications. I would much rather do it some other way, like "vagrant set-user ori" to store it [19:51:35] PROBLEM - NTP on elastic1004 is CRITICAL: NTP CRITICAL: No response from NTP server [19:56:14] RECOVERY - SSH on elastic1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.3 (protocol 2.0) [19:56:24] (PS9) BryanDavis: [WIP] Configure scap master and clients in beta [operations/puppet] - https://gerrit.wikimedia.org/r/123674 [19:59:45] (PS2) Ottomata: Adding $deployable_networks variable in network.pp [operations/puppet] - https://gerrit.wikimedia.org/r/125184 [20:00:12] (PS3) Ottomata: Adding $deployable_networks variable in network.pp [operations/puppet] - https://gerrit.wikimedia.org/r/125184 [20:01:13] (CR) jenkins-bot: [V: -1] Adding $deployable_networks variable in network.pp [operations/puppet] - https://gerrit.wikimedia.org/r/125184 (owner: Ottomata) [20:05:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:06] manybubbles: running puppet on elastic1004 [20:07:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 440521 bytes in 9.614 second response time [20:07:22] ottomata: cool - let me know when you think it is ready and I'll check it [20:10:04] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:12:21] (PS1) Ottomata: Allowing wikimetrics $debug parameter to be set from labsconsole [operations/puppet] - https://gerrit.wikimedia.org/r/125274 [20:13:03] (PS2) Ottomata: Allowing wikimetrics $debug parameter to be set from labsconsole [operations/puppet] - https://gerrit.wikimedia.org/r/125274 [20:13:56] (CR) Ottomata: [C: 2 V: 2] Allowing wikimetrics $debug parameter to be set from labsconsole [operations/puppet] - https://gerrit.wikimedia.org/r/125274 (owner: Ottomata) [20:15:04] RECOVERY - puppet disabled on elastic1004 is OK: OK [20:15:14] RECOVERY - check configured eth on elastic1004 is OK: NRPE: Unable to read output [20:15:15] RECOVERY - DPKG on elastic1004 is OK: All packages OK [20:15:24] RECOVERY - RAID on elastic1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:16:04] RECOVERY - Disk space on elastic1004 is OK: DISK OK [20:19:19] manybubbles: check it [20:19:57] ottomata: fuck.... java version [20:20:08] OH POO [20:20:10] i did not check that [20:20:17] wait [20:20:19] that's right, no? [20:20:21] everything elase is right [20:20:23] root@elastic1004:~# java -version [20:20:23] java version "1.7.0_51" [20:20:24] ? [20:20:28] _51 is broken [20:20:32] OH [20:20:41] _25 [20:20:47] greg-g: hoo added an item to the timeline in the meta draft version of the post https://meta.wikimedia.org/w/index.php?title=Wikimedia_Blog%2FDrafts%2FHeartbleed&diff=8122624&oldid=8122344 [20:21:03] i think this means that bugzilla users were actually logged out ten hours earlier [20:21:11] * 22:33 hoo: Logged out all Bugzilla users by deleting all session cookie data from mysql [20:21:12] * 08:42 mutante: forcing Bugzilla logout for all users [20:21:15] puppet should be doing that, no? 
[20:21:36] ottomata: I don't think we ever forced puppet to do that [20:21:37] can we? [20:21:45] will port that back to https://blog.wikimedia.org/2014/04/10/wikimedias-response-to-the-heartbleed-security-vulnerability/ if it's fine with everyone [20:21:49] yes [20:21:57] think so hang on [20:22:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:22:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:22:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:22:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:23:10] Haeb: Yeah, I did that... I don't even know why people chose to do that again (and nobody answers me) [20:23:10] hmm, manybubbles [20:23:22] elastic1005 has [20:23:22] Version: 7u25-2.3.10-1ubuntu0.12.04.2 [20:23:35] available from apt are [20:23:36] Version: 7u51-2.4.4-0ubuntu0.12.04.2 [20:23:42] Version: 7~u3-2.1.1~pre1-1ubuntu2 [20:23:57] where'd we get 7u25 from? [20:25:23] ottomata: must be old. u25 is the right one for Elasticsearch [20:25:38] yeha, but where did we get it? [20:25:40] how did we install it? [20:25:47] apt-get, I presume [20:25:53] you did it every time, I thikn [20:25:55] i don't see in apt now htough [20:26:14] it might have been removed for being too old? [20:26:16] hm, ok, i'm going to grab it from one of the nodes and add it to our apt [20:26:20] yeah mabye, why u3 though? [20:26:23] that's available [20:26:25] what is the 7~u3 one? [20:26:41] that is 7u3? too old for us I think [20:27:04] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [20:27:10] we know [20:27:29] ottomata: 1003 also has the unstable version of java.... 
[20:27:59] ja will fix [20:28:10] yeah i stopped there [20:28:34] RECOVERY - NTP on elastic1004 is OK: NTP OK: Offset -0.008090019226 secs [20:31:50] hi all :) don't want to interrupt but i've got a question about the education extension from mw.org: can it be painlessly disabled on wiki if no editors use it? hope i'm asking this in the right channel. thanks! [20:32:01] ok manybubbles, check now [20:32:03] better? [20:32:43] ottomata: looks better. did you bounce Elasticsearch? [20:32:44] AnnaKoval: do we know no editors use it? [20:33:05] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:33:21] i started it on es04 [20:33:22] yeah [20:33:27] i had it stopped while I reinstalled java [20:33:35] manybubbles: can I do the same on es03 [20:33:39] should take just a few seconds [20:34:00] ebernhardson that's what the poster of the question claimed. i don't know that to be true myself, though. [20:34:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:31] ottomata: I'd let the shards go back to elastic1004 first [20:35:14] (CR) BryanDavis: "This is looking better all the time!" [operations/puppet] - https://gerrit.wikimedia.org/r/125184 (owner: Ottomata) [20:35:15] AnnaKoval: my gut feeling is that the education programs team does use it, either way to be able to disable the extension would probably need to collect information like: 1) who uses it 2) how often they use it 3) effects it has on users not using it while enabled [20:35:47] ok, so start moving shards to 1004, wait til finished, then reinstalla nd bounce on 1003? [20:35:48] manybubbles: ? 
[20:36:06] ottomata: yes on shards back to 1004 [20:36:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 492104 bytes in 9.432 second response time [20:36:16] I'll write up instructions for a faster/safer bounch [20:36:18] bounc [20:36:20] bounce [20:36:23] to use for 1003 [20:36:41] ok [20:36:49] starting to move shards to 1004... [20:38:03] hm [20:38:47] ebernhardson, thanks. is that up to the inquiring wiki to collect that info. do i need to tell them how to do so, and if yes, would you please tell me how, too? we can take this to a private chat if it's distracting to other work being discussed here. :) [20:41:35] AnnaKoval: Hmm, actually collecting that information is non-trivial :) 1 & 2 you probably have to ask someone from analytics. Alternatively extensions have been disabled before by anouncing them loudly in the wiki's common spaces and waiting for people to say they do use it. For the third part that should be the community, basically 3 is "in what way is EducationProgram affecting users such that it needs to be turned off" [20:41:41] Haeb: hoo that sounds familiar, but not sure why mutante did it then at 8:xx [20:42:01] greg-g: Jenkins was stalled two times due to that... messy :/ [20:42:14] ottomata: instruction https://wikitech.wikimedia.org/w/index.php?title=Search&diff=109352&oldid=109341 [20:42:27] hoo: gotcha, when would you say we did it then? :) [20:43:09] hoo: probably just miscommunication due to opsen hackathon right now, I guess? [20:43:14] (re why no answers) [20:43:33] yes, they are sleeping [20:43:38] all at UTC+3 [20:43:45] greg-g: Might be... mind adding both things to the blog post? 
[20:44:00] I wondered whether mutante exchanged the ssl cert, but it doesn't seem like that [20:44:02] the bugzilla should be there [20:44:06] ottomata: https://gist.github.com/nik9000/10421441 [20:44:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:45:23] (dang gitblit) [20:45:38] matanya: yeah, best I can tell is it happened twice :/ [20:45:55] i felt only once [20:46:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 493442 bytes in 9.967 second response time [20:46:20] you can just mention it happend, without many details [20:46:37] so many 504's! [20:47:14] matanya: well, we/I wanted a nice awesome timeline showing how good we were :) [20:47:31] we are, even without a timeline [20:47:34] ebernhardson. thanks. i'll relay the info. :) [20:47:47] matanya: :) :) [20:47:53] With the correct timeline we would be even better than we are now :P [20:47:55] it was estimated it will take up to two weeks to do the same at my $day [20:48:02] _job [20:48:05] wow [20:48:12] what would the slow down be? [20:48:15] if you can say [20:48:42] clients need to replace our application [20:48:54] matanya: do you rememeber if the logout you felt was the one on 4/9 at 22:33 or on 4/10 at 08:42? [20:49:01] that makes sense [20:49:08] i can check my log [20:49:16] ty [20:49:20] UTC time? [20:49:22] yeah [20:49:31] greg-g: Look at the server admin looks... each time after jenkins went mental (we should fix that...) [20:49:52] I only see the latter [20:49:52] hoo: :) touche [20:50:01] ok, now, fight! 
[20:50:04] :P [20:50:21] might be i was logged out already on the first one [20:50:33] oh, i see why [20:50:47] i had a kernel upgrade when the first occoured [20:51:02] so i can only say about the second one [20:51:06] dangit [20:51:36] new shiny 3.14 kernel reboot [20:51:55] fancy, I'm still on 3.13-1-amd64 [20:52:11] I'm still on 3.12 radeon breakage, ft [20:52:12] w [20:52:30] i like being bleeding edge on the desktop/laptop [20:53:18] Yep, usually I do the same... probably going to jump on 3.14 once someone build it in fedoras buildsystem :P To lazy to compile it myself atm [20:53:32] compiling kernels is so 1990s [20:54:11] :D [20:54:28] using arch, distro does it all for me :) [20:54:48] greg-g: how can i get a dump of he.wiki into betalabs? [20:54:52] And I'm not even sure 3.14 will boot for me... I don't really get enough of the graphic driver stuff to know wheter a commit fixes my setup or not [20:55:05] I upgrade from Apple Store. works all the time [20:55:28] matanya: I'd talk with apergos about that [20:55:42] tomorrow, then [20:55:47] hashar: also your troll daemon, I guess? :D [20:56:04] that is triggered by zuul [20:56:51] phew, sorry manybubbles, lost internet [20:57:00] ottomata: yikes! [20:57:02] sok [20:57:09] so, did you get my messages? [20:57:17] https://wikitech.wikimedia.org/w/index.php?title=Search&diff=109352&oldid=109341 [20:57:19] hoo: that daemon is not fully ported to english though. I am training its neural network :] [20:57:21] https://gist.github.com/nik9000/10421441 [20:57:26] gist was the last thing i got [20:57:30] that is it [20:57:35] anyway sleeping. have fun folks. [20:57:40] what's that showing, how many shards? [20:57:56] yeah [20:57:59] g'night hashar [20:58:20] so might actually be better to just do the exclude thing from 1003 before the restart [20:58:24] because there are so few shards on it [21:00:49] ok [21:00:51] can I do that now? [21:00:59] greg-g: wikis getting wmf21 today ? [21:01:05] manybubbles: ? 
[21:01:05] yeah [21:01:15] oh, sorry, yeah [21:01:19] ottomata: ^^ [21:01:23] https://en.wikipedia.org/wiki/Special:Version [21:01:36] it won't hurt anything to push ask it to move the shards away from 1003 [21:01:51] ok [21:01:56] oh, already got [21:02:16] ok, manybubbles its going [21:02:20] moving off of 1003 [21:02:31] ottomata: indeed it is [21:02:40] matanya: happened at 18:37 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf21 [21:02:43] greg-g, hoo: updated the post [21:02:50] \o/ :) [21:03:06] now it's just on there twice, we'll see if it confuses any one :) [21:03:18] and when do you cut the next branch greg-g ? [21:03:32] ottomata: so after that is restarted and shards start going back to it we have a choice: stop for the day or do another one [21:03:45] stop for the day! [21:03:46] anywhere from 30 minutes to 2 hours before that [21:03:46] the choice is influenced by free disk and load [21:03:46] :) [21:03:48] :) [21:03:59] and by my parent's desire for me to make some bread for dinner [21:04:00] matanya: it's kind of "whenever the person doing the deploy does it" [21:04:04] fair [21:04:05] matanya: another thing to automate :) [21:04:12] on day [21:04:16] *one [21:04:25] one day soon I hope [21:04:38] ottomata: cool. when you want to start again (whenever!) just ping me and I'll verify after puppet runs [21:05:01] I just need to chat with Antoine and write the job. We have scripts that do all the hard parts already. [21:05:05] so wmf23 was cut already ? [21:05:12] cut and deployed [21:05:20] wmf22 [21:05:24] 23 is next week [21:05:34] eer 23 [21:05:36] Oh, yes sorry. https://www.mediawiki.org/wiki/Special:Version [21:05:39] getting close to 1.24wmf1..... [21:06:02] lost the road map :) [21:06:04] (there should be an annoucement from M&M re 1.23 RCs soon) [21:06:10] err, 22 [21:06:32] * aude panics to get all the stuff into core [21:06:35] E_TOOMANY20SOMETHINGNUMBERS [21:06:41] so 22 is on group 0 ? 
[21:06:44] yeah [21:07:01] * greg-g checks the roadmap page [21:07:09] heh, out of date [21:07:15] * greg-g edits [21:07:40] And 'peidias are on 1.23wmf21 after it looked so grim last Friday :) [21:08:10] oh, ok. i'm not the one to blame then :) [21:08:38] (another thing that should be automated) [21:08:57] The cause of our troubles with it turned out to be a really old bug in scap that we mostly never noticed [21:09:08] ^ every trouble, ever! :) [21:09:27] Or more accurately that we noticed but usually went away beofre we figured out the problem [21:09:28] sound like scap is the source of all trouble in the world [21:09:43] Well new deploys are, yes. [21:09:54] If we never changed the code we'd never have new problems [21:09:59] Just the same old ones [21:10:22] ok but manybubbles, i should wait til everything is off of 1003, right? [21:10:42] ottomata: may as well given how few things are on there. [21:10:44] do you have to go? [21:10:52] because I can bounce it if you need to [21:10:57] so long as you install java [21:11:01] matanya: But it's a lot more fun to have new problems every week. :) [21:11:04] I don't know the apt invocations for forcing that version [21:11:16] bd808: daily is better :P [21:11:35] we're almost there (at daily) ;) [21:11:53] ok, https://www.mediawiki.org/wiki/MediaWiki_1.23/Roadmap#Schedule_for_the_deployments should be all good [21:12:39] wow network https://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [21:13:27] greg-g: biggest thing for 22 is php>json stuff? 
[21:13:28] oh, that's just ES [21:13:44] Nemo_bis: ottomata and manybubbles are playing around [21:13:48] matanya: plus all of https://www.mediawiki.org/wiki/MediaWiki_1.23/wmf22 [21:14:09] yeah I saw [21:14:14] Nemo_bis: yeah, wouldn't worry, a new machine/node was added, it is probably them re-balancing (probably all internal traffic) [21:14:17] Nemo_bis: yeah, we're moving shards around [21:14:35] Nemo_bis: unless search is slow for you it should be ok [21:14:47] but that is what it looks like when we do rolling restarts, yeah [21:14:49] manybubbles: oh yeah puppet [21:14:50] on that now [21:14:52] no i'm around [21:14:56] just intermittently, so its [21:14:57] ok [21:15:02] search is fine, edits give 504 every now and then [21:15:13] no additional load from interwiki search it would seem [21:15:15] !log Disabled beta update Jenkins job (https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/) so that scap testing can happen in beta. [21:15:20] Logged the message, Master [21:15:55] neat, nice drop in 5xxs: https://gdash.wikimedia.org/dashboards/reqerror/deploys [21:18:35] matanya: thats not nice:( [21:18:41] Nemo_bis: yeah, I'm happy about that! [21:19:17] manybubbles: what can i do? it happens :/ [21:19:35] so, getting other reports of slowness [21:19:38] eg on meta [21:19:56] manybubbles: i'm unsure if I downgrade java now if apt will bounce elasticsearch on its own [21:20:07] hm, i guess let's wait and find out [21:20:11] if it doesn't bounce [21:20:29] then I can merge this puppet change now and the next reinstalls will get it [21:20:44] if it does bounce, then we'll probably want to wait til we're done the reformatting [21:21:42] (PS1) Ottomata: Installing java version 7u25-2.3.10-1ubuntu0.12.04.2 for elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/125329 [21:21:43] ottomata: all the other servers have the right version [21:21:57] ottomata: it shouldn't bounce anything.
I mean, the deb isn't build to bounce it [21:22:08] 8 more shards [21:24:19] more slowness reports, where is my timing data? https://gdash.wikimedia.org/dashboards/frontend/ [21:24:30] 5-6s every edit [21:24:54] oh right right [21:24:56] of couse [21:25:00] all others have the right version [21:26:14] (CR) Ottomata: [C: 2 V: 2] Installing java version 7u25-2.3.10-1ubuntu0.12.04.2 for elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/125329 (owner: Ottomata) [21:26:51] oh great, just in time manybubbles, 1003 is done [21:26:52] ja? [21:27:17] ottomata: sure [21:27:36] go ahead and do it! [21:27:48] k running puppet, i wanna see puppet do it :) [21:27:50] i hope [21:28:31] coool [21:28:32] notice: /Stage[main]/Elasticsearch::Packages/Package[openjdk-7-jdk]/ensure: ensure changed '7u51-2.4.4-0ubuntu0.12.04.2' to '7u25-2.3.10-1ubuntu0.12.04.2' [21:28:39] did it [21:28:41] bouncing es [21:28:43] yup! [21:28:53] ok, how's it look on 1003 now? [21:29:13] good [21:29:15] ok! [21:29:20] so we can move shards back to it now? [21:29:23] yup! [21:29:26] ok [21:29:35] great done [21:30:03] yup [21:30:07] ok, go make bread [21:30:09] ok! [21:30:17] great, will do more of these tomorrow [21:30:21] we are getting it down pat [21:31:10] ok laterrs! [21:32:10] (PS1) Manybubbles: Make Elasticsearch more reliable in beta [operations/puppet] - https://gerrit.wikimedia.org/r/125331 [21:33:15] (CR) Manybubbles: "Now that the "solr" project is gone in labs I don't have an easy way to test my puppet changes so this was done without any real testing.." [operations/puppet] - https://gerrit.wikimedia.org/r/125331 (owner: Manybubbles) [21:35:13] Nemo_bis: see how the network usage has dropped off? we're moving shards back to the two nodes we repartitioned today and it's keeping it slow.
when we evicted the shards from those nodes it got to fan them out to all the other nodes so it went much faster [21:35:29] :) [21:35:36] it'll take quite a while for the shards to be balanced again [21:35:42] sometime tonight it'll calm down [21:50:54] !log VisualEditor throws uncaught error on load for 1.23wmf21 wikis (bug 63791) [21:50:59] Logged the message, Master [22:00:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:13] (PS1) Gergő Tisza: Add Commons favicon to filerepo setup [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125335 [22:02:25] (CR) MarkTraceur: [C: 1] "Will add to deploy calendar for SWAT today" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125335 (owner: Gergő Tisza) [22:03:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 487141 bytes in 9.592 second response time [22:03:25] * marktraceur adds SWAT patch [22:10:07] !log krinkle synchronized php-1.23wmf21/resources/oojs-ui/oojs-ui.js 'touch' [22:10:12] Logged the message, Master [22:10:24] !log krinkle synchronized php-1.23wmf21/resources/startup.js 'touch' [22:10:28] Logged the message, Master [22:10:37] !log krinkle synchronized php-1.23wmf21/extensions/VisualEditor/lib/ve/lib/oojs-ui/oojs-ui.js 'touch' [22:10:42] Logged the message, Master [22:12:52] !log krinkle synchronized php-1.23wmf21/extensions/VisualEditor/lib/ve/modules/ve/ui/ve.ui.Toolbar.js 'touch' [22:12:56] Logged the message, Master [22:15:14] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:14] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 486549 bytes in 9.514 second response time [22:35:27] !log krinkle synchronized php-1.23wmf21/extensions/VisualEditor/modules/ve-mw/ui/tools/ve.ui.MWReferenceDialogTool.js 'touch' [22:35:32] Logged the message, Master [22:47:09] (PS1) Bsitu: Update Flow
cache version to 4.1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125340 [22:53:00] !log krinkle synchronized php-1.23wmf21/extensions/VisualEditor/modules/ve-mw/ui/tools/ 'touch *.js' [22:53:03] Logged the message, Master [22:53:52] ryasmeen: Should be live within 5 minutes [22:54:13] okay thanks Krinkle [22:55:25] mwalker, ebernhardson, RoanKattouw_away: I'm going to be away from my computer for the next 15 mins. If one of you would like to do SWAT, cool. If not, I would be happy to do it at 4:15. [22:58:28] I can do it [22:59:34] (CR) Mwalker: [C: 2] Add Commons favicon to filerepo setup [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125335 (owner: Gergő Tisza) [23:00:51] marktraceur, I was just looking at ^; but question: why are we adding the commons favicon when the DB != commonswiki? [23:01:07] (CR) Mwalker: [C: 2] Update Flow cache version to 4.1 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/125340 (owner: Bsitu) [23:01:53] Uhhh [23:01:55] * marktraceur looks. [23:02:15] oh; I get it; because we're declaring a link to commons [23:02:17] Yeah [23:02:23] Sorry, was a bit slow on that draw [23:03:27] !log sync-common for {{gerrit|125340}} and {{gerrit|125335}} [23:03:32] Logged the message, Master [23:03:42] bsitu, your change has been pushed [23:04:54] marktraceur, I don't know if you can tell if your change has been pushed; but it has [23:05:01] mwalker: thx [23:05:06] Yeah, I should be able to...not seeing it yet probly because caching [23:05:58] https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&meta=filerepoinfo doesn't reflect it, but...not sure how long I will need to wait. [23:07:05] I may also need to touch a different file [23:09:20] !log mwalker synchronized wmf-config/InitialiseSettings.php 'touched to see if that pushes changes to FileBackend.php' [23:09:25] Logged the message, Master [23:10:05] hmm...
marktraceur not sure why the change isn't taking [23:10:17] It may just be API caching? [23:10:24] This is a long-cached sort of thing [23:10:25] I mean [23:10:30] Repo config doesn't change very often [23:10:33] we dont typically cache API calls [23:10:47] because we have no way of invalidating the API cache [23:10:48] mwalker: We do when it's one wiki talking to another wiki's API [23:10:51] Wait...no. [23:10:51] NVM [23:10:59] Cannot brain, have dumbs [23:11:25] Yeah then I dunno [23:11:35] ori-afk, when you get back; Mark had a change that touched FileBackend.php that doesn't seem to be reflected [23:11:47] so I'm wondering if the configuration cache is simply not rengenerating itself [23:12:05] but touching InitializeSettings.php doesn't seem to change anything [23:12:12] so, I'm out of ideas [23:12:22] Helpful :) [23:12:36] mwalker: Touching FileBackend? Not sure that would do anything different [23:13:05] what do you mean, doesn't seem to be reflected? [23:13:10] nah; because it already has a timestamp of 2302 from when I pulled it down from git [23:13:13] ori-afk: https://www.mediawiki.org/w/api.php?action=query&format=jsonfm&meta=filerepoinfo doesn't show the change [23:13:19] * greg-g secretly really happy he wasn't pinged during SWAT [23:13:25] Haha [23:13:50] 'tis true, and not so secret, I want it go roll without hitches. [23:13:51] FileBackend? You mean something else? [23:13:54] marktraceur: what's the patch that you expected to change? [23:14:15] ori-afk: https://gerrit.wikimedia.org/r/125335 [23:14:24] ah; it may be that sync-common doesn't do what I think it does [23:14:30] Oh, no [23:14:35] sync-file probably [23:14:36] because the random API server that I have doesn't have the change [23:15:08] so; I can sync file; but I swore there was a command that synced all the common stuff [23:15:10] But then you...did sync-file. 
Oh, but only for IS.php [23:15:10] ah, filebackend.php (lower case), that makes more sense [23:15:19] heh [23:15:28] mwalker: itym scap? :P [23:15:50] just run scap [23:16:00] mwalker: sync-common pulls from tin, when you're on an apache [23:16:05] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Trying_tin.27s_code_on_testwiki [23:16:06] !log mwalker synchronized wmf-config/filebackend.php [23:16:11] Logged the message, Master [23:16:13] bd808: a testament to your complaint that people are allergic ^ :P [23:16:17] There we go [23:16:18] Ta, mwalker [23:16:31] There used to be a "sycn-common-all" script but ti was really just an alias for "scap" [23:16:48] soon, soon all of 'em will just be aliases to scap ;) [23:17:01] * marktraceur fills in for greg-g [23:17:03] which will be an alias to git-deploy ;) [23:17:03] MUAHAHAHAHAHA [23:17:25] * ^demon|away will fork the scripts so he can use the old-and-busted, JUST TO TROLL bd808 [23:17:39] I typically try to avoid scap... is that no longer required? [23:17:47] is it intelligent to know that I only changed config files? [23:17:55] and not try and break everything i18n related [23:18:09] I wasn't aware scap *had* to try in order to break i18n [23:18:27] scap is the only thing that fixes i18n [23:19:00] scap is awesome and if you're not releasing a new branch pretty fast anymore (~10 minutes) [23:19:19] >.> [23:19:20] Plus colors! and progress bars! [23:19:26] as opposed to 10 seconds to sync a directory [23:19:47] rsync :/ [23:19:50] * ^demon|away wasn't aware people viewed scap as so broken until very recently. 
[23:19:54] working on it [23:20:21] ^demon|away, I guess I dont think of it as broken; just a big blunt hammer [23:20:23] The "we like scap" cabal is pretty exclusive [23:20:24] ^demon|away: Not broken as much as slow [23:20:48] (PS2) Tim Landscheidt: Tools: Add some i386 compat packages to exec nodes [operations/puppet] - https://gerrit.wikimedia.org/r/125241 (owner: Yuvipanda) [23:23:54] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:23:54] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:23:54] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:23:54] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:24:38] bd808, I suspect the 'we like puppet cabal' is even more exclusive [23:24:48] And how. [23:25:16] I like it better than shell scripts to manage server configs [23:25:47] <^demon|away> Manage with memory! [23:25:48] I probably wouldn't invite it out for a dinner and drinks though [23:25:52] <^demon|away> "How did we set this up again?" [23:25:59] <^demon|away> "I think we installed these couple of packages" [23:26:05] <^demon|away> "And then put the data here....or maybe it was here" [23:26:18] ^ to be fair... we still have all those problems, even with puppet [23:26:23] even on puppetized servers [23:26:37] <^demon|away> Hehe [23:26:42] "is it in /a/ or /srv/ or /etc/ or /usr/local/bin..."
[23:26:54] Lol [23:27:04] mwalker: So we're seeing an issue in l10n, go figure [23:27:05] https://dumpling-attachments-storage.s3.amazonaws.com/wikimedia/attachments/720d7f561bf3fba13dccc7592f0a3e92/11263/download-message-missing.png?AWSAccessKeyId=ASIAIWG6SUMJ2LZJDKFQ&Expires=1397174191&Signature=LTNlxG2bFBzOopABp2MR93AAj2Y%3D&response-content-type=image%2Fpng&x-amz-security-token=AQoDYXdzEOD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEa0ANYG0cKGXtcFBLUiA4RNqRvWFYdeZGA4otnnrRQW8MjCpj3FqeOlgZSHwwtnR5XdWF7eU3aVMOezSJtwTx6kfVtyhIsT [23:27:12] AGH [23:27:14] http://ur1.ca/h1xdk [23:27:15] Better [23:27:18] WHAT [23:27:26] The funny thing is, tgr sees the issue but I don't [23:27:34] s3? [23:27:41] i see one issue [23:27:42] namely: [23:27:43] greg-g: Mingle is a bitch about attachments [23:27:46] <^demon|away> mwalker: Actually, when I setup my new laptop I ended up swapping my "primary development directory where I clone basically everything to work from" [23:27:48] marktraceur: ah [23:27:50] "x-amz-security-token=AQoDYXdzEOD//////////wEa0ANYG0cKGXtcFBLUiA4RNqRvWFYdeZGA4otnnrRQW8MjCpj3FqeOlgZSHwwtnR5XdWF7eU3aVMOezSJtwTx6kfVtyhIsT" [23:27:51] <^demon|away> Changed from /www to /a [23:27:59] <^demon|away> And now I'm having trouble retraining my muscle memory. [23:28:07] Sigh. [23:28:09] ori_: haha [23:28:14] oh mingle [23:28:29] it's very handy to have all the keys in the url like that [23:28:49] Security! [23:28:51] All the security! [23:28:53] Anyway [23:29:03] ^demon|away, I have a /development directory that then gets symlinked and nfs shared everywhere I need it -- not sure it's any better... [23:29:05] marktraceur: i only see the error when logged in [23:29:07] mwalker, etc., any thoughts on the i18n issue or should we just wait? [23:29:12] anyway, joking aside, is that your personal s3, or something work related? and if the latter, can you change the keys, like, now? 
[23:29:15] Well, I'm logged in [23:29:16] marktraceur, I am more than happy to scap [23:29:24] um, no [23:29:28] ori_: I cannot change the keys, it's the Mingle instance [23:29:35] For us [23:29:38] i see it logged out, things just break less bad [23:29:40] Hosted on Thoughtworks [23:29:49] Oh. [23:30:08] are those keys even present in the versions of mw on the cluster? [23:30:16] Wait. [23:30:20] so that's enwiki... /me looks in the i18n files [23:30:34] OK, I see it logged out [23:30:44] I suspect the issue may be dependencies for modules or something [23:31:01] (CR) Tim Landscheidt: [C: -1] "a) The general concept of specifying architecture works." [operations/puppet] - https://gerrit.wikimedia.org/r/125241 (owner: Yuvipanda) [23:31:23] ori_: I'm not even sure the keys there are supposed to be private? It's used to access images that are *supposed* to be public. [23:31:29] the keys are missing in the 1.23wmf21 branch as far as i can see [23:32:21] tgr: Maybe got lost in the json migration or something? [23:32:27] so the bug itself is not so mysterious, but it only showing up for some people is [23:32:43] easy culprit, but not likely honestly [23:32:49] (json switchover) [23:33:01] marktraceur, so... should I revert the config change? [23:33:09] mwalker: I doubt the config change caused this [23:33:12] have you... run scap? [23:33:15] Hahaha [23:33:23] greg-g, no; I've not yet run scap [23:33:37] if the keys aren't present in 1.21 it wouldnt help [23:33:38] Now...may be the time [23:33:40] if the l10n was built without this enabled it might be the solution [23:33:53] ok /me breaks out the hammer [23:34:14] !log mwalker Started scap: Attempting to regenerate i18n keys for multimediaviewer [23:34:14] i retract, the messages are there [23:34:19] Logged the message, Master [23:34:27] well then... :) [23:34:28] i got confused by ack not searching json files [23:35:51] bd808, scap is much faster!
this is impressive [23:35:59] it's actually tolerable to use now [23:36:00] :) [23:36:12] And it tells you what it's doing! [23:36:37] Removing ~200 pmtpa hosts didn't hurt performance :) [23:36:54] maybe we should try removing the eqiad ones too then [23:36:56] hehe [23:37:05] ulsfo all the way [23:37:24] marktraceur: it doesn't sync to ulsfo ;) [23:37:33] those are just caches, no apaches [23:37:42] We'll likely need to fix that before we retire eqiad. :P [23:37:45] Changing tin to host everything on ssd would help a lot I bet [23:37:47] !log mwalker Finished scap: Attempting to regenerate i18n keys for multimediaviewer (duration: 03m 33s) [23:37:53] Logged the message, Master [23:37:55] look at that! 3 minutes! [23:38:12] * mwalker remembers when it took 2 hours [23:38:18] tgr, try it now [23:38:26] Issue still there... [23:38:35] marktraceur, are you logged in? [23:38:51] (if not it could be caching) [23:39:01] Probably is caching then [23:39:07] I only ever saw the issue logged out [23:39:07] marktraceur: example url? [23:39:12] https://en.wikipedia.org/wiki/Kickboxing?debug=true#mediaviewer/File:High%20kick%20block.jpg [23:39:47] I don't see any issues, what would be wrong if it were? [23:39:55] i still see the error, even after clearing browser cache [23:39:57] Click on "use this file" in the lower right [23:40:08] bah, still there [23:40:12] Yarp [23:40:38] so... it's in the 18n file for 1.23wmf21 on tin [23:40:45] is ExtensionMessage-1.23mwf21 correct? [23:40:52] :) [23:41:06] where does that file get placed? [23:41:17] "on the servers" is all I know [23:42:35] there's no such file named ExtensionMessage* in the php-1.23wmf21 directories [23:42:46] bd808, do you know where ^ ends up? 
[23:43:01] It's in wmf-config [23:43:06] Hah [23:43:21] /a/common/wmf-config/ExtensionMessages-1.23wmf22.php [23:44:13] ok; so multimediaviewer is in wgExtensionMessagesFiles [23:44:24] pointing to "$IP/extensions/MultimediaViewer/MultimediaViewer.i18n.php" [23:44:31] But it's JSONy now [23:44:43] So.... [23:46:01] your extension does set wgMessagesDirs... [23:47:10] interesting; ExtensionMessages-1.23wmf21 and wmf22 are different [23:47:14] for this extension [23:47:23] I would expect so? [23:47:26] bd808, do you know who /how these files get generates [23:47:30] We added things in wmf22 IIRC [23:47:54] marktraceur, yes; but with respect to your extension... wmf21 does not have your i18n directory listed, only the i18n file [23:48:00] wmf22 has both [23:48:06] mwalker: sudo -u apache $BINDIR/mwscript mergeMessageFileList.php [23:48:13] Weeeeird. [23:48:28] Happens in the middle of mw-update-l10n [23:48:39] which happens in the middle of scap [23:48:49] scaaaaap. [23:49:22] I must specify a wiki... does it matter which one? [23:49:30] It "shouldn't [23:49:34] " [23:49:41] (PS1) Dr0ptp4kt: Only tag 470-07 if going through proxy. [operations/puppet] - https://gerrit.wikimedia.org/r/125347 [23:49:48] (but in practice it does) [23:50:11] greg-g: I think I fixed that [23:50:19] :) [23:50:31] as long as wmf-config/extension-list is up to date [23:50:57] nope... it still doesn't generate a listing for multimediaviewers 1.23wmf21 i18n dir [23:52:22] and I just double checked... php-1.23wmf21/extensions/MultimediaViewer/MultimediaViewer.php has a line that says $wgMessagesDirs['MultimediaViewer'] = __DIR__ . '/i18n'; [23:52:36] btw, in 10 minutes, you're gonna need to revert the config change and wait til monday [23:52:38] bblack, when you have a minute during normal hours, would you please review and, if appropriate, +2 https://gerrit.wikimedia.org/r/#/c/125347/ ?
[23:53:06] marktraceur: THE FINAL COUNTDOWN [23:53:15] greg-g, I'm not sure this is actually related to the config change... but understood [23:53:17] greg-g: The config change didn't cause this [23:53:29] fine, make things harder [23:53:33] I'd rather have a weird l10n situation than turn things off [23:53:43] marktraceur: just on mw.org still, yes? [23:53:45] Especially since it only happens to logged-in users who need to follow a link for this to manifest [23:53:50] The bug is on enwiki [23:53:51] (mw and testwikis, that is) [23:53:57] er logged-out [23:53:58] :/ [23:54:13] !log Enabled beta update Jenkins job (https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/) [23:54:18] Logged the message, Master [23:54:32] oh, on enwiki, but a user can't get there by clicking only? [23:54:52] not as bad, ok [23:57:20] oh... I may be dumb ... mergeMessageFileList outputs something to stdout that I probably need to pipe to a file... [23:58:10] Gosh, mwalker. [23:58:14] WTF etc. [23:58:45] ok greg; I'm going to backup the 1.23wmf21 extension messages file; replace it with the new generated one; and run scap [23:59:04] does that sound ok? [23:59:19] or would you rather me leave this as is; send an email to wikitech and us deal with this on monday? [23:59:37] Did you generate a file with better contents? Because scap runs that command each time. [23:59:56] So if you found a way to make it "different" thats a scap bug