[00:00:05] The cross browser results are interesting. The beta/prod results are a red herring. There is nothing that can be correlated there. [00:00:35] The cross browser results say to me that the ve devs are running chrome. :) [00:00:57] yes, that is what they are saying in fact [00:01:15] RECOVERY - Host baham.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 52.22 ms [00:03:44] (03PS2) 10Ori.livneh: beta: switch to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159431 [00:19:31] (03PS3) 10Ori.livneh: beta: switch to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159431 [00:21:55] (03PS1) 10Dzahn: terbium-remove trailing comma from domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159435 [00:22:10] (03CR) 10Ori.livneh: [C: 032] "PS3 applied on labs, did the right thing." [puppet] - 10https://gerrit.wikimedia.org/r/159431 (owner: 10Ori.livneh) [00:39:09] (03CR) 10Dzahn: "no, still compiles as "identical".. what the .." [puppet] - 10https://gerrit.wikimedia.org/r/159435 (owner: 10Dzahn) [00:55:20] (03PS1) 10Dzahn: replace mexia with baham in misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/159437 [00:59:09] (03PS2) 10Dzahn: decom mexia [puppet] - 10https://gerrit.wikimedia.org/r/159437 [01:03:00] (03PS1) 10Dzahn: remove pmtpa subnets from install-server [puppet] - 10https://gerrit.wikimedia.org/r/159438 [01:12:34] (03PS1) 10Dzahn: remove 10.4.16.0/24 and host scs-c1-pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/159439 [01:14:56] (03PS1) 10Dzahn: dhcp - delete remaining Tampa db's and es's [puppet] - 10https://gerrit.wikimedia.org/r/159440 [01:19:03] (03PS1) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:21:46] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [01:32:46] (03PS1) 10Dzahn: decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 [01:34:38] (03CR) 10Dzahn: [C: 04-1] "who knows 
about this part?" [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [01:36:08] (03PS2) 10Dzahn: decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 [01:41:48] (03CR) 10Dzahn: "how come this is unmerged but the related bug is resolved?" [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [01:43:41] (03PS1) 10Rush: phab tools for migrating content [puppet] - 10https://gerrit.wikimedia.org/r/159443 [01:45:12] (03CR) 10Rush: [C: 032] phab tools for migrating content [puppet] - 10https://gerrit.wikimedia.org/r/159443 (owner: 10Rush) [01:45:29] (03Abandoned) 10Dzahn: terbium-remove trailing comma from domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159435 (owner: 10Dzahn) [01:48:53] (03PS1) 10Rush: phab path fix for migration tools [puppet] - 10https://gerrit.wikimedia.org/r/159444 [01:53:19] (03CR) 10Rush: [C: 032] phab path fix for migration tools [puppet] - 10https://gerrit.wikimedia.org/r/159444 (owner: 10Rush) [01:53:22] (03PS1) 10Rush: standard no-exim for iridium [puppet] - 10https://gerrit.wikimedia.org/r/159445 [01:54:37] (03CR) 10Rush: [C: 032] standard no-exim for iridium [puppet] - 10https://gerrit.wikimedia.org/r/159445 (owner: 10Rush) [01:56:15] (03PS2) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 [01:56:33] (03PS1) 10Rush: remove git-core from phab as dupe w/ standard [puppet] - 10https://gerrit.wikimedia.org/r/159446 [01:57:28] (03CR) 10Rush: [C: 032] remove git-core from phab as dupe w/ standard [puppet] - 10https://gerrit.wikimedia.org/r/159446 (owner: 10Rush) [01:58:28] (03PS3) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 [01:58:53] (03CR) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [02:11:25] PROBLEM - Disk space on virt0 is CRITICAL: DISK 
CRITICAL - free space: /a 3265 MB (3% inode=99%): [02:11:28] (03PS1) 10Rush: phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 [02:12:12] (03CR) 10Dzahn: [C: 032] contint: switch localvhost to apache::conf [puppet] - 10https://gerrit.wikimedia.org/r/155707 (https://bugzilla.wikimedia.org/68256) (owner: 10Hashar) [02:12:26] (03PS2) 10Rush: phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 [02:12:30] (03CR) 10Rush: [C: 032] phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 (owner: 10Rush) [02:13:36] (03CR) 10Rush: [V: 032] phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 (owner: 10Rush) [02:13:42] ori: what's mw1201? [02:13:52] (and is there somewhere i can look this up myself, instead of always having to bug someone?) [02:15:07] mutante: about? [02:15:39] chasemp: yes [02:15:39] jackmcbarn: it's a standard application server. you can determine this by looking it up in operations/puppet.git:manifests/site.pp (though you may have to recurse into included files), but there's an easier way that works most of the time -- go to http://ganglia.wikimedia.org/, click 'search', and type the name of the host [02:15:51] can you take a peek at puppet run on iridium [02:15:54] jackmcbarn: it should autocomplete with the cluster name [02:15:56] some salt / trebuchet messages [02:16:03] ori: does standard application server means it serves web requests? [02:16:04] ...I don't get why they would show up here [02:16:10] jackmcbarn: yes [02:16:13] "Warning: /Stage[main]/Role::Trebuchet/Salt::Grain[trebuchet_master]/Exec[/usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet" [02:16:18] on...iridium? [02:16:34] almost out of time for tonight, not sure what the deal is [02:16:38] chasemp: since you just added "standard" ? 
[02:16:48] yes that did it standard-noexim [02:17:41] jackmcbarn: if you actually pull up the node in ganglia, you can click "host overview", which will give you the kernel version and uptime, which is enough to determine if it's an HHVM box or not. (it isn't; it's running precise / php5) [02:18:04] chasemp: looking .. [02:18:22] !log started salt-minion on iridium [02:18:33] Logged the message, Master [02:18:39] thanks! [02:18:41] is that the whole deal? come on...:) [02:18:53] chasemp: i think that warning is just the side effect, the actual one is [02:18:56] Error: Could not start Service[salt-minion]: Execution of '/sbin/start salt-minion' returned 1: [02:19:00] Error: /Stage[main]/Salt::Minion/Service[salt-minion]/ensure: change from stopped to running failed: Could not start Service[salt-minion]: Execution of '/sbin/start salt-minion' returned 1: [02:19:08] chasemp: no, interestingly i can start it but puppet cant? [02:19:22] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 4192 MB (3% inode=99%): [02:19:23] is it still running? [02:19:25] hmm.. i saw some "fixup salt minion" changes earlier? [02:19:26] I guess [02:19:38] no [02:19:49] service salt-minion start [02:19:50] start: Job failed to start [02:19:52] so it's crashing. is this machine newly-reimaged? [02:19:57] /etc/init.d/salt-minion start * Starting salt minion control daemon salt-minion [ OK ] [02:20:02] ori: are you talking to me / mutante? [02:20:06] init.d says OK. upstart says FAIL :) [02:20:06] either one :) [02:20:15] it's not a new image, but it didn't have standard before [02:20:27] as I didn't wnat the monitoring noise from ppl as it was idle / in setup [02:20:35] but now w/ standard salt seems unhappy [02:20:42] let me look, just a sec [02:20:49] i won't touch anything [02:21:18] [ERROR ] This master address: 'salt' was previously resolvable but now fails to resolve! 
The previously resolved ip addr will continue to be used [02:21:18] [WARNING ] Master hostname: salt not found. Retrying in 30 seconds [02:21:33] /var/log/upstart/salt-minion.log [02:21:42] hmmmm this box got reimaged at one point and moved ip's [02:21:46] long ago [02:22:44] ok if i retry restarting salt? [02:23:01] if i try restarting the salt-minion service on iridium, i mean [02:23:12] please do [02:23:48] hmm, the master is supposedly set to palladium.eqiad.wmnet though [02:23:56] ah, the message is a red herring [02:23:57] /etc/salt/minion [02:24:02] if you try starting it manually you get: [02:24:06] [CRITICAL] The Salt Master has rejected this minion's public key! [02:24:06] To repair this issue, delete the public key for this minion on the Salt Master and restart this minion. [02:24:06] Or restart the Salt Master in open mode to clean out the keys. The Salt Minion will now exit. [02:24:10] which is the standard issue with reimages [02:24:23] ah so back when it must not have had the salt key removed [02:24:30] i can fix this if you like [02:24:33] and it's been lying in wait [02:24:43] if you have time, great [02:26:01] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:29] (03CR) 10Dzahn: [C: 032] contint: migrate localvhost to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/155708 (owner: 10Hashar) [02:26:30] sure [02:26:35] this is the fix, btw: https://dpaste.de/8S3w/raw [02:26:50] excerpted from http://etherpad.wikimedia.org/p/app-server-upgrade , which is a pad _joe._ and i used for note-taking [02:27:05] got it, nice [02:28:24] !log updated salt key for iridium and restarted salt-minion [02:28:30] Logged the message, Master [02:28:54] chasemp, mutante: salt-minion is up now, so the puppet run should be ok (well, or fail for a different reason :)) [02:28:54] sweet, in business now, thanks gents [02:28:59] np! 
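
Note on the triage above: the "Master hostname: salt not found" warning was a red herring, and the rejected public key was the real cause. As a minimal sketch only (not a tool anyone in the channel used), the same prioritisation could be written as a small classifier. The function name `diagnose_minion_log` and the returned labels are hypothetical, and it matches only the two message fragments quoted in the log.

```python
def diagnose_minion_log(log_text: str) -> str:
    """Classify salt-minion output using only the messages quoted in the log.

    Labels are illustrative: 'rejected-key', 'master-unresolvable', 'unknown'.
    """
    lowered = log_text.lower()
    # A rejected key makes the minion exit, so it takes priority over
    # the hostname warning, which may appear alongside it.
    if "rejected this minion's public key" in lowered:
        return "rejected-key"
    if "master hostname" in lowered and "not found" in lowered:
        return "master-unresolvable"
    return "unknown"
```

When the result is `rejected-key`, the remedy is the one the Salt message itself states: delete the minion's public key on the master and restart the minion.
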
[02:29:02] yeah looks good [02:29:13] ori: cool [02:29:49] it didnt appear like a reinstall but makes sense when standard is added freshly [02:31:04] (03CR) 10Dzahn: "ran puppet on integration-slave1006-trusty and also on integration-slave1003 - no related issues seen just missing packages php5-parsekit " [puppet] - 10https://gerrit.wikimedia.org/r/155707 (https://bugzilla.wikimedia.org/68256) (owner: 10Hashar) [02:32:05] yeah, looks like: [02:32:21] standard includes base, base includes role::salt::minions [02:32:41] (03CR) 10Dzahn: "root@integration-slave1003:/etc/apache2/sites-enabled# ls" [puppet] - 10https://gerrit.wikimedia.org/r/155708 (owner: 10Hashar) [02:33:01] why's the beta cluster broken? [02:33:11] uhoh [02:33:12] * ori looks [02:33:27] (03PS5) 10Dzahn: contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 [02:34:17] (03CR) 10Dzahn: [C: 031] "the localvhost stuff has been merged separately, rebased now what is left here" [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [02:36:00] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-10 02:36:00+00:00 [02:36:05] Logged the message, Master [02:41:25] (03CR) 10Ori.livneh: [C: 031] contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [02:41:28] whenever i forward stuff to OTRS i get the reply instead of the actual requestor... [02:41:36] ori: :) [02:42:22] jackmcbarn: one of the app servers lost part of its mediawiki deployment dir because of a change i made earlier; i'm re-populating it. should be a couple of mins. [02:42:38] kk [02:43:12] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:46:51] (03PS2) 10Dzahn: put racktables behind misc. varnish [puppet] - 10https://gerrit.wikimedia.org/r/154980 [02:49:36] (03CR) 10Dzahn: [C: 032] put racktables behind misc. 
varnish [puppet] - 10https://gerrit.wikimedia.org/r/154980 (owner: 10Dzahn) [02:52:28] (03PS1) 10Dzahn: swich racktables to misc varnish cluster [dns] - 10https://gerrit.wikimedia.org/r/159448 [02:53:48] (03CR) 10Dzahn: [C: 032] swich racktables to misc varnish cluster [dns] - 10https://gerrit.wikimedia.org/r/159448 (owner: 10Dzahn) [03:01:06] RECOVERY - Disk space on virt0 is OK: DISK OK [03:08:00] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-10 03:07:59+00:00 [03:08:06] Logged the message, Master [03:22:37] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [03:45:15] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Epic puppet fail [04:04:25] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [04:17:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 10 04:17:36 UTC 2014 (duration 17m 35s) [04:17:45] Logged the message, Master [04:21:13] mutante: did you get the otrs ticket fixed? [05:23:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [06:01:02] mutante: re forwarding to otrs: try setting reply-to? [06:01:40] i doubt that would work [06:01:49] but feel free to test and we'll see [06:24:17] (03CR) 10Chmarkine: [C: 031] webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [06:28:07] i have a reproducible zend fatal [06:28:10] (not hhvm) [06:28:50] commons. 
seems to be any category page [06:28:56] PHP fatal error in /srv/mediawiki/php-1.24wmf20/extensions/ConfirmEdit/ConfirmEdit.php line 218: [06:28:56] Cannot redeclare confirmEditSetup() (previously declared in /usr/local/apache/common-local/php-1.24wmf20/extensions/ConfirmEdit/ConfirmEdit.php:207) [06:28:58] even logged out [06:29:07] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:16] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:25] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:26] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:30] ok, not any [06:29:36] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:37] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:16] i actually got 2 different fatals. only one is reproducible [06:32:31] try https://commons.wikimedia.org/wiki/Category:Wikimania_2012 [06:37:25] <_joe_> jeremyb: I get the page perfectly rendered [06:37:32] <_joe_> but, I'm on hhvm I guess [06:37:46] <_joe_> no, not on commons [06:38:16] <_joe_> jeremyb: also, /usr/local/apache/ shouldn't be there [06:39:08] what do you mean shouldn't be there? [06:39:34] <_joe_> that we use /srv/mediawiki as a docroot [06:39:40] how can i tell from exception in logstash what page it was on? [06:39:43] huh [06:40:05] <_joe_> jeremyb: no clue, php is bad at that in genera [06:40:41] and what is testwiki? 1017? [06:40:49] <_joe_> yes [06:41:23] <_joe_> it's also in the hhvm appservers pool [06:43:13] <_joe_> I was looking at fatals on losgstash, they're not so terrible [06:43:28] right. just want to be sure e.g. 
1042 is the normal zend pool [06:43:39] <_joe_> ok so, /usr/local/apache/common-local [06:43:45] (03CR) 10JanZerebecki: [C: 031] let NDAed people login on servermon [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [06:43:48] <_joe_> is the real dir where the code is [06:45:06] huh [06:45:11] now it's magically fixed [06:45:30] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:45:36] <_joe_> what was giving that error? [06:45:40] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:45] the link above [06:45:46] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:45:55] but maybe i pasted the wrong stack [06:45:56] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:06] <_joe_> It was ok for me when I tried [06:46:08] i wonder if this was an error that happens only during l10n changes? [06:46:15] was consistent for me [06:46:26] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:55] or alternatively is related to a puppet run? [06:47:06] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:48:19] <_joe_> nah [06:48:23] <_joe_> no way [06:48:35] <_joe_> puppet is a no-op on appservers in my absence these days [06:48:36] <_joe_> :) [06:49:01] <_joe_> meaning - it's me the one screwing that up :P [06:50:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One minor comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [06:53:48] (03PS35) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [06:53:55] _joe_: https://logstash.wikimedia.org/#dashboard/temp/z8CGaXhuTIKN1J0kxUoVeg [06:55:28] i don't think that's a fluke... [06:56:36] (i already poked #mediawiki-i18n, no response yet) [06:58:20] i was hoping timing for earliest instances lined up with something in SAL. but no such luck [07:00:49] was still happening 8 mins ago [07:02:20] <_joe_> mmmmh [07:03:08] <_joe_> maybe I don't see that since I use english? [07:03:28] english what? [07:03:31] i use english too [07:04:42] <_joe_> what is funny (?) is that this happened out of blue far from any release [07:05:22] <_joe_> maybe some memcached expiration is involved; alas, I don't know much about the php code itself [07:05:34] jeremyb: and what makes you think the error is i18n related? [07:06:54] And what's the second fatal? [07:07:01] And what are the actual URLs for each? [07:07:09] Impossible to comment like this, please file bugs [07:08:29] <_joe_> I should start studying the mediawiki code better [07:09:11] (03PS1) 10Spage: Set wgContentHandlerDB true for enwiki-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159457 (https://bugzilla.wikimedia.org/49193) [07:11:16] grrrr, bad wifi is bad [07:11:33] Nemo_bis: i don't remember the url for the one i only got once [07:11:38] oh, i do actually [07:11:45] [[commons:category:foo]] [07:11:53] but that wasn't i18n. 
it was confirmedit [07:12:11] (03CR) 10JanZerebecki: "naggen2 currently expects the puppet master database to be a mysql one and searches the config file of the puppet master for the credentia" [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:12:17] and https://commons.wikimedia.org/wiki/Category:Wikimania_2012 was [07:12:18] PHP fatal error in /usr/local/apache/common-local/php-1.24wmf20/languages/classes/LanguageKk.php line 24: [07:12:18] require_once() [function.require-once]: Cannot redeclare class languageconverter [07:12:25] Nemo_bis: [07:13:01] i think i consistently got Kk. but language code varies in logstash [07:14:09] (03CR) 10Giuseppe Lavagetto: "Please do not change naggen2 to add functionalities; it's thought for simplicity and speed in prod. you may want to adapt it a little, or " [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:15:59] jeremyb: great, you now have enough info to file a bug :) [07:16:23] (03CR) 10Spage: "It would be good to deploy this a week or two before it's the default on enwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159457 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [07:16:31] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Also, I don't see any reason to rename classes. Icinga::monitor::service makes much more sense than icinga::service." [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:17:21] Nemo_bis: but there's a moderately well defined start time for this exception. it doesn't line up with SAL [07:18:37] (00:55 UTC today) [07:20:04] but? [07:21:40] so why did it start when it did? :) [07:22:18] (03CR) 10Giuseppe Lavagetto: [C: 031] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [07:23:30] _joe_: so, what about your objection to /usr/local/apache ? 
[07:24:03] <_joe_> jeremyb: none, actually [07:24:27] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [07:27:15] (03CR) 10JanZerebecki: "I would expect naggen2 to currently use an sql adapter, so the only suggestion from me that involved changing it was for it to try if sqli" [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:27:24] (03CR) 10Filippo Giunchedi: [C: 031] contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [07:29:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] sudoers.erb - deprecated variable access [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [07:31:08] sigh, I'm reverting that [07:33:23] cronspam some? :) [07:33:32] indeedly [07:33:40] (03PS1) 10Filippo Giunchedi: Revert "sudoers.erb - deprecated variable access" [puppet] - 10https://gerrit.wikimedia.org/r/159458 [07:34:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "sudoers.erb - deprecated variable access" [puppet] - 10https://gerrit.wikimedia.org/r/159458 (owner: 10Filippo Giunchedi) [07:35:21] should be over shortly, apologies for the noise [07:35:39] np, at least it wasn't piped into an alarm clock :) [07:38:26] hehe perhaps there's a simple way to check sudo syntax after puppet has expanded the template [07:39:53] so, ummmmm, doesn't it take 30 mins for the next puppet run? 
[07:40:03] this is going to keep going a while [07:40:13] (03CR) 10Filippo Giunchedi: "btw this didn't work, reverted in https://gerrit.wikimedia.org/r/#/c/159458/" [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [07:41:36] heh good question, depending if puppet reads the catalog before or after the splay period [07:43:22] <_joe_> 20 [07:43:31] <_joe_> it's 20 mins [07:44:14] oh [07:44:27] so, 10 more :) [07:46:28] (03CR) 10Filippo Giunchedi: [C: 031] StrictTransportSecurity for lists.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: 10Dzahn) [07:51:41] <_joe_> this was very serious btw [07:52:08] <_joe_> how come we don't check files before putting them on the server? [07:53:05] well it passed lint [07:54:57] but i guess lint couldn't do template output [07:55:03] it would need to know input [07:55:14] jeremyb: pcc is the tool to check with [07:55:30] otoh, this is something that maybe could be done with rspec-puppet [07:56:02] (but rspec-puppet is on the chopping block?) [07:56:31] jeremyb: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/ [08:00:52] <_joe_> no. We need to have a system that checks the file _on_server_ and maybe removes it if it's bogus [08:00:55] <_joe_> making puppet fail [08:01:02] <_joe_> I was pretty sure we had that [08:01:41] (03CR) 10Alexandros Kosiaris: sudoers.erb - deprecated variable access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [08:01:42] that is even better [08:01:46] i don't understand how you would make puppet fail? 
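
Note on the idea above of checking the rendered file on the server and refusing to install it when it is bogus: a minimal sketch, assuming the check is an external command that takes the candidate path as its last argument and exits non-zero on failure. The function name `install_if_valid` and its parameters are hypothetical. Ownership and file mode (sudoers files normally need restrictive permissions) are deliberately not handled here.

```python
import os
import subprocess
import tempfile
from typing import Sequence


def install_if_valid(rendered: str, dest: str, check_cmd: Sequence[str]) -> bool:
    """Install `rendered` at `dest` only if `check_cmd <candidate>` exits 0.

    The candidate is written next to `dest` so the final rename stays on
    one filesystem. On failure the existing file is left untouched.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, candidate = tempfile.mkstemp(dir=dest_dir, prefix=".candidate-")
    installed = False
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(rendered)
        # Run the validator against the candidate, never the live file.
        result = subprocess.run(list(check_cmd) + [candidate], capture_output=True)
        if result.returncode != 0:
            return False
        os.replace(candidate, dest)  # atomic swap into place
        installed = True
        return True
    finally:
        # Clean up the candidate if it was not moved into place.
        if not installed and os.path.exists(candidate):
            os.unlink(candidate)
```

For a sudoers file the validator could be `["visudo", "-c", "-f"]`, assuming `visudo` is on the PATH. A `False` return is the point where a caller would report the failure instead of shipping the broken file.
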
[08:02:02] re lint: ohhhh, it's not lint, it's just documentation generation job [08:02:27] still i think in this case unit test is the best fix [08:03:11] akosiaris: heh, possibly numbing effect from many other similar reviews :( [08:03:56] it's not an error in a specific instance only, but a general error that would effect the most basic test case [08:04:16] (03PS1) 10Matanya: sudoers: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159460 [08:04:19] godog: heh. yeah. I generally runs these through the puppet compiler just to be sure [08:04:22] akosiaris: ^ [08:04:43] and you can please merge the pybal one too, if you wish :) [08:05:11] <_joe_> the pybal one? [08:05:22] <_joe_> can you wait for me to take a break? [08:05:23] <_joe_> :) [08:05:55] _joe_: :) https://gerrit.wikimedia.org/r/158086 [08:06:02] akosiaris: very true [08:06:17] once you remove it from fenari the world would be a better place [08:06:43] (03CR) 10Alexandros Kosiaris: [C: 032] pybal: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/158086 (owner: 10Matanya) [08:07:11] _joe_: I reviewed it.. and ran it through compiler, seems a noop [08:07:26] <_joe_> akosiaris: I was joking [08:07:35] k [08:08:00] <_joe_> btw, this makes me think of the "moving pybal configs" I should set myself to. [08:08:36] _joe_: what does not work on servermon btw ? [08:09:48] seemed like the per host packagelist worked ok to me [08:09:53] <_joe_> akosiaris: yesterday the "available package updates" were showing nothing [08:10:00] <_joe_> now it works, wft [08:10:05] <_joe_> *wtf [08:10:44] if you checked right after I send my email, I submitted an extra minor change that might have fixed it [08:10:52] <_joe_> I think so [08:10:59] <_joe_> very very nice btw [08:11:09] thanks. 
I hope it will help us :-) [08:13:18] (03PS1) 10Matanya: limn: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159461 [08:23:19] (03CR) 10Alexandros Kosiaris: [C: 032] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [08:38:58] PROBLEM - check google safe browsing for wikiversity.org on google is CRITICAL: Connection timed out [08:39:18] PROBLEM - check google safe browsing for wikimedia.org on google is CRITICAL: Connection timed out [08:40:19] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: Connection timed out [08:40:58] RECOVERY - check google safe browsing for wikiversity.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3925 bytes in 0.327 second response time [08:41:09] RECOVERY - check google safe browsing for wikimedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 4268 bytes in 0.346 second response time [08:41:49] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:41:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:42:12] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3918 bytes in 0.086 second response time [08:46:06] (03PS1) 10Matanya: rsync: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159462 [08:47:23] <_joe_> wat? [08:47:58] <_joe_> ^^ [08:48:22] what is it _joe_ ? 
[08:48:37] <_joe_> the hoarding of alerts [08:48:42] <_joe_> but it was a spike [08:48:48] <_joe_> that we should investigate [08:48:59] <_joe_> but unluckily I have no time no [08:49:03] <_joe_> *now [08:52:50] (03PS2) 10Filippo Giunchedi: swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 [08:52:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [08:56:09] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:22] (03PS1) 10Matanya: salt: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159463 [08:58:09] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:35] (03CR) 10Alexandros Kosiaris: "Pointing out that connecting to http://racktables.wikimedia.org does not redirect to the HTTPS version of the service." [puppet] - 10https://gerrit.wikimedia.org/r/154980 (owner: 10Dzahn) [08:59:10] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [08:59:10] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [08:59:20] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:35] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [08:59:35] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:39] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:50] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [08:59:50] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:59:59] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [09:00:26] hmm ? 
[09:00:29] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.028 second response time [09:01:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [09:01:19] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [09:01:21] looks like the imagescalers? looking [09:02:41] whoa. re last 10ish mins in ganglia [09:02:45] https://ganglia.wikimedia.org/latest/?c=Image%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [09:02:57] yup [09:03:13] cascade from the earlier nagios alert for swift? [09:03:40] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [09:03:40] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:40] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:50] and there we go [09:04:54] godog: OOM showed up and killed convert [09:05:07] ETOOMANYCONVERSIONS ? [09:05:20] huh [09:05:36] this OOM, he is a killer! 
[09:05:39] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 5.020 second response time [09:05:49] akosiaris: possible, all at the same time and cascading [09:06:01] so actually swift invokes imagescaler which fetches full-size original from swift [09:06:02] i think [09:06:32] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:39] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.012 second response time [09:07:39] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:29] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 6.435 second response time [09:08:59] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:20] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.017 second response time [09:09:59] PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:59] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:11] high cpu across swift frontends, still looking [09:10:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:30] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:49] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.026 second response time [09:11:58] couly be related to my recent change to rsyslog, rolling restarting frontends [09:12:03] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.004 second response time [09:12:07] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:12:08] <_joe_> godog: also seems high output network traffic [09:12:09] PROBLEM - SSH on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:12:09] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:09] PROBLEM - check configured eth on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:23] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.025 second response time [09:12:54] RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.011 second response time [09:12:54] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.021 second response time [09:12:54] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.134 second response time [09:13:04] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 281 seconds ago with 0 failures [09:13:04] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:13:04] RECOVERY - check configured eth on ms-fe1001 is OK: NRPE: Unable to read output [09:13:05] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [09:13:24] !log rolling restart swift-proxy on ms-fe1* [09:13:29] Logged the message, Master [09:14:06] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.585 second response time [09:14:07] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.308 second response time [09:14:14] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [09:14:14] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66797 bytes in 0.511 second response time [09:14:24] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
400 bytes in 0.047 second response time [09:14:34] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [09:14:47] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.082 second response time [09:14:48] <_joe_> I didn't see any alarm on rendering [09:14:51] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.555 second response time [09:14:55] <_joe_> are we just getting the recovery? [09:14:57] _joe_: there were [09:14:59] <_joe_> WTF? [09:15:34] <_joe_> ok sorry I was out and ran back home when I got paged [09:15:34] [08:59:50] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [09:15:39] <_joe_> yes seen now [09:15:52] _joe_: it is flapping [09:15:54] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [09:16:28] <_joe_> akosiaris: it seems stable now [09:17:07] _joe_: OOM showed up for the 3rd time on all imagescalers just a minute ago [09:17:13] killing convert [09:17:30] <_joe_> akosiaris: oom is quite regular there [09:17:44] <_joe_> as far as I remember from the last time I checked [09:18:40] <_joe_> so, the problem here could be some batch upload of a lot of images, which resulted in high load on the image scalers, and maybe us increasing the number of workers on imagescalers did more harm than good [09:19:18] <_joe_> http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Image%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false shows how imagescalers were basically completely down during that phase [09:20:45] <_joe_> http://gdash.wikimedia.org/dashboards/reqerror/ is interesting [09:21:00] I'm also trying to understand if swift and imagescaler stuff were related, imagescalers not responding seem a bit earlier [09:21:05] <_joe_> basically all our 500s are for images - I guess most are for the
wrong arguments [09:21:25] <_joe_> godog: I think swift is a consequence of imagescalers maybe [09:21:55] not sure, there was also a puppet change for swift logging I pushed around that time [09:21:59] <_joe_> or, they may be unrelated and the swift issue may be related to the access log split? [09:22:05] probably [09:22:46] <_joe_> now, let's see if the next time I walk out of the door the alarm comes back [09:22:59] <_joe_> if that is the case, we probably found "root cause" [09:23:11] I can't understand though how the rsyslog change would do that [09:23:18] <_joe_> akosiaris: me neither [09:23:28] <_joe_> hence I guess this is related to the imagescalers [09:24:07] there was a problem with swift and syslog a while back, but while I checked it seemed fixed by the new swift version to which we upgraded [09:24:15] <_joe_> so, swift frontends do call the imagescalers, right? [09:24:33] <_joe_> or is that done by the backend? [09:24:44] that's correct, the frontends [09:24:44] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [09:24:49] <_joe_> ok so [09:25:01] <_joe_> if the frontends were busy polling the scalers [09:25:06] <_joe_> which were not responding [09:25:22] <_joe_> that seems like a classic domino effect [09:25:59] <_joe_> http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1002.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1410341146&g=network_report&z=large&c=Swift%20eqiad [09:26:14] <_joe_> NO network traffic between 8:57 and 9:07 [09:26:23] <_joe_> meaning the scalers didn't respond [09:26:26] <_joe_> basically [09:26:52] <_joe_> what we need now is to understand what made the imagescalers choke [09:27:13] <_joe_> godog: you wanted an incident report to write? 
you've been served :P [09:27:34] thanks I'm thrilled already [09:29:07] what I'm trying to understand is what could have killed the image scalers like that, basically not even gmond could report anything [09:29:39] <_joe_> some large image upload [09:30:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:30:04] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:30:22] <_joe_> or maybe a new "wikilinks is free" promotion [09:30:37] <_joe_> bbiab [09:52:01] mark: did you see https://rt.wikimedia.org/Ticket/Display.html?id=8316 ? [09:56:40] matanya: yes, he has replied already on peering@ [09:56:55] I don't have access there, thanks akosiaris [10:01:37] <_joe_> ugh, gerrit sucks [10:01:56] <_joe_> I have no way to rebase the current tree over the debian one [10:04:44] <_joe_> it's a damn git basic functionality [10:23:42] _joe_: you could force push... [10:24:02] not saying you *should*. idk [10:26:36] <_joe_> jeremyb: according to docs, gerrit should refuse it [10:30:16] depends on the repo and on who's doing the pushing [10:30:30] don't worry about that so much. 
worry more about whether you should do it :) [10:31:07] <_joe_> jeremyb: it would make sense, though I'd probably be better off just porting some of our changes to the debian package [11:05:36] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:45] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [11:10:50] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.012 second response time [11:11:09] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [11:21:30] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [11:23:25] _joe_: if the porting to debian doesn't take care of everything: why not merge instead of rebase? [11:25:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [11:32:06] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [11:32:06] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [11:33:17] <_joe_> jzerebecki: it's equally painful [11:33:44] <_joe_> and I'd like to stop maintaining our version of hhvm packages [11:37:24] (03CR) 10Alexandros Kosiaris: "I'd like us to discuss this thoroughly. The reason is that this tool provides insight into the infrastructure that NDAed people could not " [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [11:38:46] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [11:47:38] any prod box, puppet --version. please [11:49:58] _joe_: godog: ping? 
[11:49:58] jeremyb: ping detected, please leave a message! [11:50:02] hah [11:50:53] actually, make it a precise box (12.04) please [11:55:13] or maybe andre__ could help? [11:56:27] jeremyb: 3.4.3 [11:56:33] with? [11:56:34] thanks [11:56:38] what do you want it for ? [11:56:39] with that [11:56:51] no, I don't go on prod boxes. [11:56:52] * akosiaris just curious :-) [11:56:55] I just read bugs. [11:57:08] andre__: well puppet's running on kaulen? [11:57:13] you could ask kaulen? :) [11:57:14] moot now [11:57:35] jeremyb, how? [11:57:35] jeremyb: btw.. same version in labs [11:57:41] akosiaris: i was getting an error running puppet with your config which I thought may be because puppet is too new [11:57:45] right, same version i was on [11:57:57] jeremyb, I think you misunderstand what I do and what I don't do :) [11:57:58] andre__: you don't have shell on kaulen? [11:58:03] jeremyb, no, what for? [11:58:10] * jeremyb is confused [11:58:25] andre__ <= bugmeister [11:58:30] correct :) [11:58:32] i know :) [11:58:38] and I good one at that :-) [11:58:46] damn puppet runs with long running execs [11:58:47] s/I/a/ [11:59:01] hmm. 
:) [11:59:55] ori: beta cluster is still broken [12:00:33] jackmcbarn: -> #wikimedia-qa [12:00:44] and it's not still broken [12:00:53] it's broken periodically [12:01:42] jeremyb: record your self and play when needed :) [12:02:30] <_joe_> sorry I'm at lunch ATM [12:02:46] _joe_: np, got what i wanted [12:02:50] <_joe_> :) [12:03:13] (03PS2) 10Alexandros Kosiaris: Purge all backup::client related packages/confs [puppet] - 10https://gerrit.wikimedia.org/r/159280 [12:03:56] (03CR) 10jenkins-bot: [V: 04-1] Purge all backup::client related packages/confs [puppet] - 10https://gerrit.wikimedia.org/r/159280 (owner: 10Alexandros Kosiaris) [12:08:14] (03PS3) 10Alexandros Kosiaris: Purge all backup::client related packages/confs [puppet] - 10https://gerrit.wikimedia.org/r/159280 [12:09:19] (03CR) 10Alexandros Kosiaris: "Does it make sense to also update network.pp:191 ?" [puppet] - 10https://gerrit.wikimedia.org/r/159439 (owner: 10Dzahn) [12:11:33] (03CR) 10Alexandros Kosiaris: [C: 032] remove pmtpa subnets from install-server [puppet] - 10https://gerrit.wikimedia.org/r/159438 (owner: 10Dzahn) [12:16:57] (03PS1) 10Yuvipanda: icinga: Fix typo in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159470 [12:21:57] fun, puppet run failed because apt-get update failed [12:22:05] rinse, repeat [12:22:15] (03CR) 10Alexandros Kosiaris: "https://github.com/wikimedia/operations-puppet/commit/d6571bb1d4c866917396ecdead6068f9dd4a98ae was not really helpful." [puppet] - 10https://gerrit.wikimedia.org/r/159384 (owner: 10Dzahn) [12:34:12] (03CR) 10Alexandros Kosiaris: [C: 032] Purge all backup::client related packages/confs [puppet] - 10https://gerrit.wikimedia.org/r/159280 (owner: 10Alexandros Kosiaris) [13:00:05] K4-713: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T1300). 
[13:01:18] (03PS1) 10Yuvipanda: icinga: Add graphite_series_threshold check [puppet] - 10https://gerrit.wikimedia.org/r/159473 [13:01:44] mutante: ^ adds a new type of graphite checks [13:01:46] *check [13:09:19] (03CR) 10Ottomata: "NDA is fine with me, but do people other than opsen and/or wikimedia employees want it? If not, why open this up to the larger group?" [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [13:14:14] ottomata: i do [13:16:34] matanya! good example! [13:16:36] :) [13:16:56] :) [13:19:56] btw, this got me thinking. Are there ndaers that have nothing to do with ops ? [13:22:32] (03CR) 10Alexandros Kosiaris: [C: 032] add esams.wmnet to search in resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/159391 (owner: 10Dzahn) [13:24:41] akosiaris: is the NDA ldap group the folks that sign the Volunteer_NDA? [13:24:49] if so then, probably [13:25:05] I know of at least one volunteer who is about to get access to hadoop stuff, for research purposes [13:26:16] (03PS2) 10Alexandros Kosiaris: Remove all already absent backup::server crons/schedules [puppet] - 10https://gerrit.wikimedia.org/r/159281 [13:26:30] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [13:28:05] ottomata: hmmm weird... [13:28:31] perhaps mapping the nda straight to an LDAP group was not granular enough [13:28:47] akosiaris: talk to apergos :) RT 6293 [13:29:34] jeremyb: ah yes... that saga [13:29:42] hah [13:30:54] (03CR) 10Alexandros Kosiaris: [C: 032] Remove all already absent backup::server crons/schedules [puppet] - 10https://gerrit.wikimedia.org/r/159281 (owner: 10Alexandros Kosiaris) [13:44:56] (03CR) 10Alexandros Kosiaris: [C: 032] "The principle is fine. Ottomata's comment about .join(' ') is also valid." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/153801 (owner: 10Ottomata) [13:46:48] i'm missing something here [13:46:52] thanks akosiaris, I may try that out today... 
:) [13:47:42] in modules/admin/manifests/user.pp there is a call to the function validate_ensure, that is declared in modules/wmflib/lib/puppet/parser/functions/ [13:48:07] but how the admin module knows to read the wmflib module to find it ? [13:48:16] akosiaris: how are those backups on tridge doing? :-) [13:48:58] jeremyb: nda, the issue that would not die [13:54:42] apergos: moving. Most of amanda should be purged of the cluster. See https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/puppet+branch:production+topic:amanda_removal,n,z There are a couple more to merge https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:amanda_removal,n,z and wait them out and I 'll be done [13:54:46] mark are you comfortable with me deploying the verp exim config https://gerrit.wikimedia.org/r/#/c/155753/ today? [13:55:02] as long as you do it very very carefully and test the config before it becomes active [13:55:16] like, get a generated config file, feed it to exim, monitor it through scenarios [13:55:20] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:55:29] that router should not be active in production yet so you should verify that [13:55:29] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected [13:55:45] feed it to production exim? or to a test exim instance? [13:55:51] whichever works [13:55:57] ok. [13:56:01] if you have a generated config file you can ask exim to test on it using exim -C [13:56:06] yeah [13:56:16] and feed it example messages or addresses or whatever [13:56:31] ^d, because we have debug on on wikitech we're seeing a (probably inconsequential) Lucene warning. Interested? 
https://bug-attachment.wikimedia.org/attachment.cgi?id=16385 [13:56:37] as currently implemented the router would be active but hopefully not hit by any mail due to the localpart pattern match [13:56:38] akosiaris: I was thinking rather of /data on tridge :-) [13:56:49] <^d> andrewbogott: godog pinged me about it elsewhere. That's already fixed in master. [13:56:52] the puppet classes going away is exciting though [13:56:57] <^d> (Also, we shouldn't be hitting MWSearch *at all* [13:56:58] I wondered if we should comment the router out for the production realm short term [13:57:15] ^d: also… that sounds like something I should care about? [13:57:24] <^d> No, harmless. [13:58:06] ok… in that case, is it too late to swat https://gerrit.wikimedia.org/r/#/c/158390/ ? [13:58:37] <^d> Not too late, swat's not for another hour :) [13:59:12] perfect! [13:59:46] apergos: well if amanda does not get purged it will recreate /data/amanda which I 'd rather avoid. But once it is purged, deleting /data/amanda manually is easy. I also started going through the rest of the stuff. Will create an etherpad and will pair with you [14:01:39] sweet! [14:02:50] Jeff_Green: we should comment that ? [14:03:00] I think its almost == silent ! [14:03:58] tonythomas: I can go either way on it [14:05:37] ok. so this generating config file == the exim4.conf, after this router is added right ? [14:06:26] yes. we need to have puppet generate the config, and confirm that it's sane before restarting prod exim on the new config [14:06:28] andrewbogott: dns question, how can i point in labs a dns name to its 443 port instead of 5080 it is using now ? [14:06:46] bad way to put the question [14:07:17] I need to check whether puppet restarts exim automatically, if not I can let puppet generate the config then syntax-check it before manually restarting exim [14:07:43] Jeff_Green: ok. hope everything go well :) [14:07:45] matanya: dns doesn't specify port at all, does it? 
[14:08:02] revisited: i have openmeetings.wmflabs.org pointing to port 80 [14:08:11] tonythomas: I need to help one of the FR folks with a blocker, I can work with you on this in about 30min [14:08:31] i want it to go directly to openmeetings.wmflabs.org:5080 [14:08:42] without the user need to enter it [14:09:01] Jeff_Green: ok. will wait, and I will try to test the same in one of our labs instance -- where I have root privs [14:09:24] If you're using a proxy, you can specify a port for a given proxy name. [14:09:34] I haven't tested it much, but it should work. [14:10:15] andrewbogott: if you have a moment later, i would love if you can show me how [14:10:28] <^d> andrewbogott: Got you on the list: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=126275&oldid=126236 [14:12:35] ^d, thanks [14:13:25] matanya: If you click 'manage web proxies' on the sidebar it should be pretty straightforward. You'll probably need to remove whatever DNS setup you had previously if you want to reuse the name. [14:13:54] ah, now i understand [14:13:58] thanks a lot [14:14:40] hope it works! [14:14:56] be sure i'll poke you if not :) [14:24:33] 16:07:15 I need to check whether puppet restarts exim automatically, if not I can let puppet generate the config then syntax-check it before manually restarting exim [14:24:40] Jeff_Green: that's a wrong assumption [14:24:53] the running exim daemon may not pick up the new config, but it spawns new children all the time that do [14:25:33] such as... on every received message [14:27:41] they reread the main config every child load? that's nuts [14:27:47] but ok, duly noted [14:27:53] how else do they get the configuration? [14:28:35] cached at the parent process? [14:28:48] and then transferred to the child? 
[14:28:55] possible I suppose ;) [14:29:41] sure seems saner than having the parent and child running with mismatched configs [14:32:21] <_joe_> mark: that's what apache does (pass configs from the master to the child, using fork) [14:32:53] fair enough [14:32:56] in any case, exim doesn't [14:32:57] i think that's what postfix does too [14:33:15] well that does complicate config testing :-) [14:37:35] (03CR) 10Manybubbles: [C: 031] Remove debugging stuff from wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158390 (owner: 10Reedy) [14:38:10] <_joe_> Jeff_Green: postfix is a wee bit more complicated [14:38:32] <_joe_> (and I prefer it to exim, but that's just me) [14:38:37] in what sense? [14:39:02] i find postfix much more intuitive, but I'm pretty sure it's all dependent on your personal thinking style [14:39:10] <_joe_> Jeff_Green: postfix is a collection of daemons, each of which does only part of the work [14:39:15] right [14:39:29] but in re. the main config, afaik that's handled like apache [14:39:43] <_joe_> never looked into that directly [14:40:11] <_joe_> as in read the code and straced the processes [14:40:12] afaik the main config is read on a reload or restart, like apache [14:40:15] yeah [14:41:23] I can't think of a reason it would be good not to have coordinated control over when the main config becomes active [14:51:09] I'll SWAT today [14:51:24] andrewbogott: ping for SWAT in about 9 minutes [14:51:29] thx [14:52:02] <^d> anomie: If you're volunteering. I put it on the list for andrewbogott so I'm more than happy to do it myself :) [14:52:21] ^d: Well, if you really want to you can do it. [15:00:04] manybubbles, anomie, ^d, marktraceur, andrewbogott: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T1500). Please do the needful. [15:00:14] ^d: So are you doing it or am I? 
[15:00:56] <^d> I've got it :) [15:01:30] (CR) Chad: [C: 2] Remove debugging stuff from wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158390 (owner: Reedy) [15:01:40] (Merged) jenkins-bot: Remove debugging stuff from wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158390 (owner: Reedy) [15:02:09] !log demon Synchronized wmf-config/wikitech.php: no-op (duration: 00m 06s) [15:02:14] Logged the message, Master [15:02:22] <^d> andrewbogott: You're live, feel free to pull it to wikitech [15:02:31] thanks, syncing... [15:05:06] hm, sync is taking forever. I hope it's not doing something interesting [15:09:05] andrewbogott: Ummm... it may be. Files are syncing to /srv/mediawiki now. That was changed recently from /usr/local/apache/common-local [15:09:32] does that mean I'm going to get two copies of everything? [15:09:42] It very well may. [15:09:52] ok, I'll keep an eye on df [15:09:55] so far so good [15:10:00] will it clean up after the move? [15:10:28] Ori has been working on cleaning up the on-disk layout and wikitech is probably not on his radar for ensuring things are correct. [15:11:02] * bd808 goes to look at a random mw box to see how things are set up today [15:11:21] moving things to /srv should be fine, it's the same volume as /usr [15:11:43] woo fatal error [15:11:50] so much for merging one little fix :( [15:11:54] On mw1039.eqiad.wmnet /srv/mediawiki is a symlink to /usr/local/apache/common-local [15:12:12] yeah, symlinked for me too [15:12:12] But scap/sync-common is targeting /srv/mediawiki [15:12:22] ok cool [15:12:31] but, broken! Got a minute to help me troubleshoot? [15:12:42] Sure. What's the error? [15:12:50] see for yourself :( [15:13:01] manybubbles: around? [15:13:02] "Fatal exception of type MWException" is not very helpful MW [15:13:11] andrewbogott: Other log output?
[15:13:12] bd808: Because we just turned off debugging ;) [15:13:16] aude: yeah - I'm waiting for our meeting to start [15:13:18] yeah :( [15:13:22] lol [15:13:30] manybubbles: we got sidetracked [15:13:40] "" is not a valid magic word for "smwdoc" [15:13:40] we can join now, if good for you [15:13:45] aude: no problem - I can wait [15:13:50] sure now is cool [15:13:52] bad l10n cache then [15:13:55] ok :) [15:14:28] bd808: is that something I should fix locally? [15:15:18] "labswiki": "php-1.24wmf20" -- Reedy was that on purpose? [15:15:34] Jumping from wmf15 to wmf20? [15:15:36] Not really [15:15:42] I guess it's because it's in the "right" dblist [15:15:45] oh… I had scheduled that for tomorrow AM [15:16:03] The branch upgrade, I mean. [15:16:10] Maybe we should change that back and see if it magically fixes wikitech? [15:17:11] yes please? [15:17:41] ^d: You want to do the honors? [15:17:51] <^d> huh wha? [15:17:56] * ^d wasn't watching [15:18:16] labswiki got bumped from wmf15 to wmf20 in wikiversions. We'd like to move it back [15:18:40] <^d> - "labswiki": "php-1.24wmf20", [15:18:40] <^d> + "labswiki": "php-1.24wmf15", [15:18:43] <^d> Look good? [15:18:56] yeah that looks right [15:19:05] !log demon updated /a/common to {{Gerrit|I158e7c685}}: Remove Wikipedia:Teahouse/Questions/Flow_test from enwiki Flow pages [15:19:10] Meanwhile… can someone explain 'got bumped'? [15:19:19] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: labswiki back to wmf15 [15:19:39] andrewbogott: To switch multiple wikis we use dblists [15:19:40] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:19:41] andrewbogott: On Tuesdays everything that isn't a 'pedia jumps to the newest branch [15:19:41] (PS1) Chad: labswiki back to 1.24wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/159479 [15:19:50] bd808: Ah, ok.
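The "got bumped" explanation above (wikis are switched in bulk via dblists, and on Tuesdays everything that isn't a 'pedia moves to the newest branch) can be sketched roughly as follows. This is an illustration only, assuming wikiversions behaves like a plain mapping from wiki name to branch directory, as in the pasted diff; the function name and the sample entries other than labswiki are hypothetical and this is not the real deployment tooling:

```python
def bump_non_wikipedias(wikiversions, wikipedia_dblist, new_branch):
    """Return a copy of wikiversions with every wiki outside the
    wikipedia dblist moved to new_branch."""
    wikipedias = set(wikipedia_dblist)
    return {
        wiki: (branch if wiki in wikipedias else new_branch)
        for wiki, branch in wikiversions.items()
    }


# Hypothetical sample data; only the labswiki branches mirror the log.
before = {
    "enwiki": "php-1.24wmf19",
    "test2wiki": "php-1.24wmf19",
    "labswiki": "php-1.24wmf15",
}
after = bump_non_wikipedias(before, ["enwiki"], "php-1.24wmf20")
print(after["labswiki"])  # php-1.24wmf20
```

Any wiki that is not listed in the excluded dblist rides along with the bulk switch, which is how a wiki that was meant to be upgraded separately can end up on the new branch early.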
[15:19:53] (CR) Chad: [C: 2] labswiki back to 1.24wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/159479 (owner: Chad) [15:19:58] (Merged) jenkins-bot: labswiki back to 1.24wmf15 [mediawiki-config] - https://gerrit.wikimedia.org/r/159479 (owner: Chad) [15:19:58] For the Tuesday deploy I use all to switch everything, and then use the wikipedia one to put the pedias back [15:20:00] But in that case wouldn't the l10n cache have been up to date as well? [15:20:01] <^d> ^ already live, pushing through [15:20:18] andrewbogott: sync-common again please [15:20:45] Looks better. [15:20:54] andrewbogott: Yes, l10n cache should be up to date. "Should be" [15:21:06] OK, so, since we just accidentally did what I was hoping to do on purpose tomorrow… what have we learned? [15:21:30] I think we learned that l10n cache isn't normally getting all the things wikitech needs [15:21:44] But I'm not sure how wmf15 did? [15:22:01] Oh. I know. We rebuilt l10n with wikitech being the only wiki on wmf15 :( [15:22:22] But normally it will be test2wiki or something for the new branch [15:22:40] Reedy: Isn't there some place where we add in all the possible extensions for the l10n build? [15:22:57] Ah, when you build l10n you target a particular install? [15:23:04] Yeah [15:23:12] It uses the first wiki on that version IIRC [15:23:32] andrewbogott: Sort of. l10n is wacky and can only be built by starting a wiki and asking it for the l10n strings. [15:23:44] Though... [15:23:44] / Do not attempt to load SMW for l10n in beta.
[15:23:44] if ( $wmfRealm != 'labs' ) { [15:23:50] It should be there [15:24:46] there's more there than just smw [15:25:13] Right [15:25:29] But that was just the reason for excluding it from labs [15:27:29] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [15:29:47] (PS2) Alexandros Kosiaris: Remove the backup::client class [puppet] - https://gerrit.wikimedia.org/r/159282 [15:31:21] * YuviPanda vaguely pokes chasemp with https://gerrit.wikimedia.org/r/#/c/159473/ [15:35:46] (CR) Alexandros Kosiaris: [C: 2] Remove the backup::client class [puppet] - https://gerrit.wikimedia.org/r/159282 (owner: Alexandros Kosiaris) [15:36:26] Reedy, bd808, still investigating? [15:37:00] RECOVERY - puppet last run on virt1005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:37:39] akosiaris: if you're +2'ing - mind looking at a quick apache patch in puppet? :) [15:37:54] JohnLewis: You might be better asking _joe_ [15:38:07] He's doing an apache deploy in the morning all being well [15:38:29] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:38:29] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:38:34] Reedy: kk' it's the ve.wikimedia one :p [15:38:52] andrewbogott: Sorry got distracted by another ping. Looking again now.
[15:39:25] JohnLewis: shoot, worst case scenario I'll route it to someone else :-) [15:39:45] akosiaris: https://gerrit.wikimedia.org/r/#/c/159356/ :p [15:40:03] apergos ^ that's what I poked you about before I saw someone else doing stuff :) [15:40:56] andrewbogott: [15:40:57] reedy@tin:/a/common$ grep -c smw php-1.24wmf20/cache/l10n/upstream/l10n_cache-en.cdb.json [15:40:57] 0 [15:40:57] reedy@tin:/a/common$ grep -c smw php-1.24wmf15/cache/l10n/upstream/l10n_cache-en.cdb.json [15:40:58] 277 [15:41:00] There's obviously some difference [15:41:22] andrewbogott: Sam is right, the wgExtensionEntryPointListFiles entry should be pulling in the right things... but maybe we missed something? [15:42:26] * bd808 looks at extension-list-wikitech [15:43:28] (PS1) Reedy: Add extension-list-wikitech to noc [mediawiki-config] - https://gerrit.wikimedia.org/r/159484 [15:43:40] (CR) Reedy: [C: 2] Add extension-list-wikitech to noc [mediawiki-config] - https://gerrit.wikimedia.org/r/159484 (owner: Reedy) [15:43:49] (Merged) jenkins-bot: Add extension-list-wikitech to noc [mediawiki-config] - https://gerrit.wikimedia.org/r/159484 (owner: Reedy) [15:44:10] !log reedy Synchronized docroot and w: (no message) (duration: 00m 15s) [15:44:14] JohnLewis: I don't want to push the redir because I won't be here too much longer to babysit it tonight [15:44:15] Logged the message, Master [15:44:36] apergos: yeah; it's alright :) [15:44:51] but I will look at it at least [15:44:52] akosiaris or _joe_ might pick it up so [15:45:09] they are both getting near the end of their workdays too.. though maybe they start later [15:45:12] same-ish tz [15:45:21] I'll do it now [15:45:26] <^d> ottomata: Did you get a chance to get started on getting the traffic replay ready?
[15:45:54] oh shoot, needs manual rebase [15:46:04] should be trivial to rebase [15:46:05] andrewbogott, Reedy: extension-list-wikitech has all of the entrypoints that are used in wmf-config/wikitech.php [15:46:20] yeah, I just hate doing it [15:46:29] ^d, no, not yet, had an interview just now... [15:46:34] (CR) ArielGlenn: [C: 1] Don't redirect vewikimedia [puppet] - https://gerrit.wikimedia.org/r/159356 (https://bugzilla.wikimedia.org/70579) (owner: John F. Lewis) [15:47:05] andrewbogott, Reedy: But there was at least one full scap yesterday so l10n should have been up to date in wmf20 [15:47:18] <^d> ottomata: ack. when did you want to try and pair up on that? [15:47:23] Since it wasn't, there is something that we are missing [15:47:36] And the test wiki that is used for regenerating pulls in extension-list-wikitech? [15:48:52] hummmmmhmmhmh, ^d, gimme a few minutes to write up my jobvite thing, then let's do it...10 minutes maybe? at least let's sync up and make a plan then [15:48:57] SoS meeting is in 1.5 hours too [15:49:11] JohnLewis: ve.wikimedia.org is already in wikimedia.conf [15:49:26] <^d> ottomata: Ok do your jobvite thing, in 10 we'll at least chat some more :) [15:49:34] jobvite ? [15:49:42] not greenhouse ? [15:49:59] akosiaris: Reedy told me to add it there; probably should have checked first actually... [15:50:12] xD [15:50:18] akosiaris: dunno, i was told to use jobvite [15:50:23] <^d> akosiaris: Might be a legacy app that was already underway? [15:50:24] haven't used greenhouse at all yet [15:50:25] I sorta (hoped) it wasn't still in there if we were redirecting it offsite [15:50:37] * ^d doesn't know [15:50:38] ottomata: ^d yeah probably [15:51:51] (PS3) Alexandros Kosiaris: Don't redirect vewikimedia [puppet] - https://gerrit.wikimedia.org/r/159356 (https://bugzilla.wikimedia.org/70579) (owner: John F.
Lewis) [15:52:22] (PS1) Manybubbles: Further throttle Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/159485 [15:53:22] andrewbogott: The way it *should* work is that CommonSettings adds to wgExtensionEntryPointListFiles (which it does on line 2654) and then the mergeMessageFileList.php maintenance script adds those extension entry points during the l10n cache creation. [15:54:25] ^d: I'm actually around for today's deploy! [15:55:06] <^d> Wheee [15:55:16] <^d> I forgot about it until now :p [15:56:12] (CR) Alexandros Kosiaris: [C: 2] Don't redirect vewikimedia [puppet] - https://gerrit.wikimedia.org/r/159356 (https://bugzilla.wikimedia.org/70579) (owner: John F. Lewis) [15:56:36] (PS1) Chad: nlwiki gets Cirrus as primary [mediawiki-config] - https://gerrit.wikimedia.org/r/159486 [15:59:23] (PS2) Ottomata: limn: qualify vars [puppet] - https://gerrit.wikimedia.org/r/159461 (owner: Matanya) [15:59:28] (CR) Ottomata: [C: 2 V: 2] limn: qualify vars [puppet] - https://gerrit.wikimedia.org/r/159461 (owner: Matanya) [15:59:57] ^d: do you want me to make the config change? [16:00:05] manybubbles, ^d: Dear anthropoid, the time has come. Please deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T1600). [16:00:18] <^d> manybubbles: I just did ^ [16:00:40] (CR) Manybubbles: [C: 2] nlwiki gets Cirrus as primary [mediawiki-config] - https://gerrit.wikimedia.org/r/159486 (owner: Chad) [16:01:49] (Merged) jenkins-bot: nlwiki gets Cirrus as primary [mediawiki-config] - https://gerrit.wikimedia.org/r/159486 (owner: Chad) [16:01:56] bd808: sure looks to me like those extensions are included unconditionally... [16:02:31] Yeah. I'm looking at some of the generated files now to see if I can spot anything weird [16:02:33] ^d: just did a performance test for it - looks fine [16:03:11] <^d> Alrighty, here we go then [16:03:13] JohnLewis: done.
ve.wikimedia.org no longer redirects anywhere [16:03:37] akosiaris: kk [16:03:39] !log demon Synchronized wmf-config/InitialiseSettings.php: nlwiki cirrus (duration: 00m 04s) [16:03:45] Logged the message, Master [16:04:59] (03PS1) 10Jeremyb: new path for beta cluster CommonSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/159487 [16:06:03] <^d> manybubbles: Looking good. [16:06:03] andrewbogott: There is no key for SemanticMediaWiki in ExtensionMessages-1.24wmf20.php but there is in ExtensionMessages-1.24wmf15.php [16:06:07] ^d, ok. so! [16:06:13] andrewbogott: So ... need to dig more [16:06:16] ^d: indeed [16:06:34] ^d: cool! I'm going to step out for a bit then. [16:06:55] (03PS1) 10Jgreen: add SPF record for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/159489 [16:07:00] andrewbogott: But it may have something to do with "enableSemantics('wikitech');" in wmf-config/wikitech.php [16:07:23] ^d: oh - I think I've found a bug in lucene to do with not cleaning up deleted docs - it is hurting us but it isn't clear how much. or if it is a bug. [16:07:26] If so then that totally doesn't do what I thought it did [16:07:48] <^d> ottomata: So, the plan is to pull the traffic logs from analytics1004? [16:07:55] <^d> And then rewrite them to ES queries. [16:08:28] still working on wikitech? [16:08:33] andrewbogott: i guess? [16:08:49] jeremyb: um... [16:08:53] what are you seeing? [16:09:01] We're discussing it but it should be up and working in the meantime [16:09:07] console output and "configure" (for instances) are both broken [16:09:15] * andrewbogott looks [16:09:44] um, well, i think so? we need to find a shard that is on elastic1016 [16:09:46] jeremyb: most likely I broke your openstack token by fussing with the cache :( Try log out and in [16:09:51] a large enwiki shard [16:09:51] i guess [16:09:59] oh [16:09:59] jeremyb: btw, you saw my comment on your bug about morebots? 
[16:10:00] and then, find a different node where that shard also is [16:10:05] no [16:10:08] that way we can target those two nodes to compare [16:10:25] jeremyb: It said, in essence, "Should be fixed, please verify and then unhack the bot" [16:10:40] since I presume you made a local fork of the adminbot code or something? [16:10:43] i see now [16:10:46] i did [16:10:57] essentially cp -av to home dir [16:11:07] ah, ^d, yeah so, the script many bubbles has on an04 uses cirrus search [16:11:14] jeremyb: I /really/ need to fix wikitech so that it says something meaningful when your OpenStack token expires. Right now the design just trusts in the fact that OS tokens have a shorter life than wikitech login tokens. [16:11:15] which, i htink won't let us target a node specifically [16:11:22] If they get out of sync then all kinds of dumb things happen [16:11:23] <^d> ottomata: We can move an enwiki shard to it if we need to. [16:11:27] hah [16:11:32] we need to write something that replays the searches via elasticsearch itself...or we don't really care if they are real searchers [16:11:37] Jeff_Green: where was that script you were talking about? [16:11:47] that loaded up elasticsearch via its http API? [16:12:23] Or I could just add a site notice that says "Welcome to wikitech! Have you tried turning it off and on again?" [16:12:32] (03PS1) 10Giuseppe Lavagetto: mediawiki: add HHVM proxy rules in main.conf [puppet] - 10https://gerrit.wikimedia.org/r/159490 [16:12:38] sooooo much beter on log out+in [16:13:13] ^d, there is an enwiki shard there already [16:13:14] fwiw it predates elasticsearch, iirc we used it for lucene rebalancing [16:13:18] oh [16:13:22] hmmmm [16:13:26] sec, I'll post it somewhere [16:13:32] ok, well, ^d, we need a way to hammer s certain shard via the elasticsearch API [16:13:36] sure [16:13:42] maybe it'll be useful [16:13:44] <^d> Lemme start an etherpad. [16:13:46] <^d> I have an idea. 
[16:13:48] k [16:13:54] not via Cirrus...because i think many bubbles said that Cirrus wouldn't allow us to set the _node preference [16:13:55] andrewbogott: bah. It is the enableSemantics() call. SMW is tricky and doesn't expose it's l10n files until that method is called. (cc Reedy) [16:14:00] it's basically the same core as apache_fast_test --it's a threaded web-hitter [16:14:17] * andrewbogott shakes fist at smw devs [16:14:18] again [16:14:20] bd808: fu--------- [16:14:24] They like to be different [16:14:35] Reedy: It happens in includes/SMW_Setup.php [16:15:02] I wonder if master has "fixed" this [16:15:55] I'd suspect not [16:17:06] <^d> ottomata: http://etherpad.wikimedia.org/p/elastic-single-node-testing [16:17:23] Reedy: Actually they have \o/ -- https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/79cbda3ad82af76f0bdf055a315bfd948fc90573/SemanticMediaWiki.php#L121-L123 [16:17:36] Wow [16:17:38] Backport? [16:17:52] For an interesting reason -- // Because of MW 1.19 we need to register message files here [16:18:36] Hmm [16:18:48] With the same array keys... Having it done twice shouldn't be a big issue? [16:19:16] No, it should be fine [16:19:51] ^d, if we know the query already, why do we need to do step 2? [16:19:53] I think we just need the wgExtensionMessagesFiles lines from includes/SMW_Setup.php to be outside the enableSemantics() call [16:20:09] can't we just parse the queries out and do step 5 via elasticsearch API witih _node preference set? [16:20:23] <^d> ottomata: The logs we have aren't ES logs, they're Special:Search logs I thought. [16:20:34] <^d> If I'm wrong, then we can skip (2) [16:21:00] ja, they are queries against lucene search, or whatever [16:21:03] but we have the query strings [16:21:13] so we can just parse the logs, and make an elasticssearch HTTP request out of them [16:21:52] ottomata: https://github.com/j6r33n/hacking/tree/master/qa_scripts [16:25:32] hm, ^d, do you know how to get a node id? 
[16:25:40] what is 1016's elasticsearch node id? [16:26:43] <^d> I think there's a way to coax it out of /_cat/nodes/ [16:27:14] ottomata: i can hack that script to suit your needs, but iirc what it does it take a giant list of search hits and rerun them directly against the API ip/port of a search node [16:27:52] and it's main point was to show differences in the responses of the individual nodes [16:27:58] bd808: that function call can be added to the wmf config, right? Or do we need to backport the fix? [16:28:27] andrewbogott: We need to backport a change to the l10n loading. Working on it now. [16:28:28] <^d> ottomata: `curl -s localhost:9200/_cat/nodes?v\&h=id,host` [16:28:31] Jeff_Green, you parse the webrequest logs to get the search hits? or the lucene search logs? [16:28:38] 'k thanks [16:28:47] <^d> ottomata: So 1016 is "fDeo" [16:29:07] ottomata: I'm not sure how I did it at the time, but the end result was search requests formatted as they would be made by the webserver to the search API [16:29:17] hmm, ^d, weird [16:29:19] that doesn't work either [16:29:33] ElasticsearchIllegalArgumentException[No data node with id[fDeo] found] [16:29:52] <^d> Grrr :\ [16:30:15] hm, ok, Jeff_Green, so if we figure out how to get search query strings and format them into elasticsaerch http requests...can you modify this for us (oh perlmaster....) to do the reqs? [16:30:35] absolutely [16:31:04] coooOOOl [16:32:47] (03CR) 10Legoktm: "Can we just do this everywhere on labs?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159457 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [16:34:04] andrewbogott: The version of SMW we run is OMG old. Is fixing that still on the list of things you want to do? [16:34:40] something something composer [16:34:46] * bd808 nods [16:34:46] ^d, you trying what I'm trying? 
(in etherpad) [16:34:48] but, yes, eventually [16:35:44] i don't think anyone has changed them since Ryan_Lane chose them ;) [16:35:47] "known working" etc [16:36:20] can it be replaced with wikidata yet? :( [16:36:49] Has anyone actually spec'd that out? [16:36:52] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Epic puppet fail [16:37:45] <^d> ottomata: Yeah same result. [16:39:02] Reedy, andrewbogott: I *think* this patch will fix the problem -- https://gerrit.wikimedia.org/r/#/c/159493/ [16:40:27] seems like it oughta... [16:40:29] * andrewbogott cringes [16:41:16] ^d, i think that is a short ID [16:41:21] not sure how to get the long one [16:42:18] <^d> Silly _cat [16:42:32] e.g. [16:42:33] curl 'localhost:9200/_cat/master?v' [16:42:35] gives a long id [16:43:13] <^d> fDeoGVcgSbmV88k-Nd4Kpg [16:43:25] howd you do it!? [16:43:27] <^d> From `curl localhost:9200/_nodes/elastic1016?pretty=true` [16:43:46] that works, cooool [16:44:22] cool [16:44:22] curl -XGET 'http://localhost:9200/enwiki_content/_search?preference=_only_node:fDeoGVcgSbmV88k-Nd4Kpg&q=Sicculu' | jq ._shards [16:44:24] and that only hit one shard [16:44:25] perfect [16:44:41] <^d> \o/ [16:44:53] ok, ^d, can you find another node that has shard enwiki_content_1407944746 [16:44:54] ? [16:45:13] (03CR) 10Aaron Schulz: [C: 031] Further throttle Cirrus template update jobs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159485 (owner: 10Manybubbles) [16:45:21] and, is q= enough? or does cirrus do cooler things? [16:46:11] (03CR) 10BryanDavis: [C: 04-1] new path for beta cluster CommonSettings.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159487 (owner: 10Jeremyb) [16:47:16] <^d> ottomata: Any node except 06, 11, 13. [16:47:42] ? i thought there were only 3 replicas of each shard [16:47:58] <^d> enwiki_content_1407944746 is the index. [16:48:01] oh. [16:48:08] <^d> The one on 1016 is shard 0. 
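Note on the node-pinned search above: a minimal Python sketch of the two steps that worked in the channel, expanding the short `_cat/nodes` id to the full node id and then building a `_search` URL with `preference=_only_node:<id>`. The function names are hypothetical. The shape assumed for the `/_nodes/<name>` response (a top-level "nodes" object keyed by full node id) is an assumption and should be checked against the cluster's actual output.

```python
from urllib.parse import urlencode


def full_node_id(nodes_info, short_id):
    """Expand a short id (as printed by _cat/nodes) to the full node id.

    nodes_info is the parsed JSON from /_nodes/<name>; it is assumed to
    look like {"nodes": {"<full id>": {...}}}.
    """
    matches = [nid for nid in nodes_info.get("nodes", {}) if nid.startswith(short_id)]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one node matching {short_id!r}, got {matches}")
    return matches[0]


def node_pinned_search_url(host, index, node_id, query, port=9200):
    """Build a _search URL that asks Elasticsearch to use only one node."""
    # safe=":" keeps the colon in "_only_node:<id>" unescaped, as in the curl above.
    params = urlencode({"preference": f"_only_node:{node_id}", "q": query}, safe=":")
    return f"http://{host}:{port}/{index}/_search?{params}"


if __name__ == "__main__":
    info = {"nodes": {"fDeoGVcgSbmV88k-Nd4Kpg": {"name": "elastic1016"}}}
    node = full_node_id(info, "fDeo")
    print(node_pinned_search_url("localhost", "enwiki_content", node, "Sicculu"))
```

Raising on zero or multiple matches is deliberate: passing an id that matches no data node is what produced the "No data node with id[fDeo] found" error earlier.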
[16:48:19] ok [16:48:21] yeah, that's what we want [16:48:26] another node with that index shard 0 [16:48:30] (03CR) 10Jeremyb: new path for beta cluster CommonSettings.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159487 (owner: 10Jeremyb) [16:48:43] <^d> ottomata: In that case, 12 or 05. [16:54:17] (03CR) 10BryanDavis: new path for beta cluster CommonSettings.php (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/159487 (owner: 10Jeremyb) [16:56:13] (03PS1) 10Giuseppe Lavagetto: pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 [16:56:24] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:56:57] (03CR) 10jenkins-bot: [V: 04-1] pybal: serve the virtualhost with pybal lb files with a dedicated vhost [puppet] - 10https://gerrit.wikimedia.org/r/159495 (owner: 10Giuseppe Lavagetto) [17:00:07] Any opsen about? [17:00:13] Need a few apaches restarting/gracefulling [17:02:20] (03PS1) 10Cmjohnson: adding cname for civicrm2-gr.frdev.wikimedia.org rt8330 [dns] - 10https://gerrit.wikimedia.org/r/159497 [17:04:33] 10.64.16.116, 10.64.16.106, 10.64.16.101, 10.64.16.96, 10.64.16.94, 10.64.16.126 [17:05:40] ok, ^d cool, hm [17:05:51] so, 'im looking at Jeff's script, and at Nik's...I *think* nik's will be easier to adapt [17:06:02] it does basically the same thing, except Jeff's looks more more Mediawiki specific [17:09:13] <^d> ottomata: Sounds like a plan. 
[17:09:33] RECOVERY - HTTP error ratio anomaly detection on labmon1001 is OK: OK: No anomaly detected [17:09:33] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:10:21] !log Restarted logstash on logstash1001 [17:10:27] Logged the message, Master [17:11:36] (03CR) 10RobH: [C: 032] "chatted with mark about this in irc, its ok so merging" [dns] - 10https://gerrit.wikimedia.org/r/159188 (owner: 10RobH) [17:14:30] Reedy: did you get your restarts? [17:15:01] andrewbogott: not yet [17:15:04] mw1126, mw1116, mw1122, mw1146, mw1121, mw1136, mw1114, mw1068 reporting the wikibase fatal [17:15:17] Reedy, ok, stay tuned... [17:15:21] thanks [17:15:50] Reedy: are those the same hosts that bd808 just rattled off? [17:16:11] I'm guessing so [17:17:10] ok, graceful'd mw1126. Did that get the desired effect? [17:17:28] aka 10.64.16.106 [17:18:31] Reedy: ^ ? [17:19:09] I'm still seeing mw1126 report the error in logstash, but graceful might take a bit to drain I guess [17:19:57] Shall I graceful the others anyway, or can we wait and verify that it helps? [17:20:39] andrewbogott: It's looking better now. I'd say proceed [17:20:59] ok [17:21:08] A graceful should fix the apc issue... So I'd suggest going on too [17:21:50] (03PS1) 10RobH: setting dns for server uranium [dns] - 10https://gerrit.wikimedia.org/r/159500 [17:24:03] andrewbogott: yup, looks like the total errors are dropping quite a bit [17:24:13] lemme know if I missed any [17:24:22] (03CR) 10RobH: [C: 032] setting dns for server uranium [dns] - 10https://gerrit.wikimedia.org/r/159500 (owner: 10RobH) [17:24:51] Down to 9/1000 lines being those errors [17:25:03] Getting a fucktonne of "Fatal error: Cannot use object of type stdClass as array" [17:25:05] But they're not a new error [17:25:15] andrewbogott: All gone. 
LGTM :) [17:25:20] cool [17:25:35] Thanks [17:25:43] !log mw1126, mw1116, mw1122, mw1146, mw1121, mw1136, mw1114, mw1068 have been gracefulled [17:25:50] Logged the message, Master [17:28:06] (03CR) 10Greg Grossmeier: "Can this be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [17:28:32] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [17:35:35] !log rebuilding cirrus index for test2wiki to test some performance enhancements don't break anything. test2wiki is too small to see any gain from the enhancements though. [17:35:40] Logged the message, Master [17:41:04] (03CR) 10RobH: "I'm not sure I want to delay this for that, since updating EQIAD will certainly be back seat to pushing new servers live in CODFW. (I jus" [puppet] - 10https://gerrit.wikimedia.org/r/159167 (owner: 10RobH) [17:42:05] (03CR) 10RobH: "i do agree with the comments inline to pull codfw out entirely for now." [puppet] - 10https://gerrit.wikimedia.org/r/159167 (owner: 10RobH) [17:43:47] (03CR) 10RobH: [C: 032] setting install params for db2001-2031 [puppet] - 10https://gerrit.wikimedia.org/r/159200 (owner: 10RobH) [17:44:41] ^d: https://gist.github.com/ottomata/fb03fd03267aa0eb1767 [17:45:10] i think we can just set host and node_id, and then run that through a bunch of lucene logs [17:45:16] and time how long it takes to get through it all! 
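Note on the replay plan above (set host and node_id, push logged queries through, time the whole run): a rough Python sketch of that idea. It is not the script from the gist. The input format (one query string per line) is an assumption, and `replay`, `fetch` and `clock` are hypothetical names; `fetch` is injectable so the loop can be exercised without a live cluster.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def replay(queries, host, node_id, index="enwiki_content", fetch=None, clock=time.monotonic):
    """Send each query to one Elasticsearch node; return (count, elapsed_seconds)."""
    if fetch is None:
        # Default: a plain HTTP GET, discarding the body after reading it.
        fetch = lambda url: urlopen(url, timeout=10).read()
    start = clock()
    sent = 0
    for line in queries:
        query = line.strip()
        if not query:
            continue  # skip blank lines in the (assumed) one-query-per-line input
        params = urlencode({"preference": f"_only_node:{node_id}", "q": query}, safe=":")
        fetch(f"http://{host}:9200/{index}/_search?{params}")
        sent += 1
    return sent, clock() - start


if __name__ == "__main__":
    seen = []
    count, elapsed = replay(["Sicculu\n", "\n"], "localhost", "fDeoGVcgSbmV88k-Nd4Kpg", fetch=seen.append)
    print(count, seen)
```

Running the same query list against two nodes that hold the same shard, one after the other, would give two elapsed times to compare.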
[17:45:45] (03PS1) 10RobH: setting install params for server uranium [puppet] - 10https://gerrit.wikimedia.org/r/159505 [17:46:01] <^d> Hmmm [17:52:06] (03PS2) 10RobH: setting install params for server uranium [puppet] - 10https://gerrit.wikimedia.org/r/159505 [17:57:06] (03CR) 10RobH: [C: 032] setting install params for server uranium [puppet] - 10https://gerrit.wikimedia.org/r/159505 (owner: 10RobH) [18:00:05] yurik: Respected human, time to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T1800). Please do the needful. [18:00:34] !log cirrus index rebuild for test2wiki went well - doing the rest of group0 [18:00:41] Logged the message, Master [18:04:47] mutante: poke when around? [18:22:43] !log yurik Synchronized php-1.24wmf19/extensions/ZeroBanner: (no message) (duration: 01m 11s) [18:22:49] Logged the message, Master [18:26:11] !log yurik Synchronized php-1.24wmf20/extensions/ZeroBanner: (no message) (duration: 01m 09s) [18:26:17] Logged the message, Master [18:27:13] going to graceful apaches to see if it helps w/ [18:28:05] Reedy, deployed the +repage, but no luck. Need to research further :( [18:28:27] actually, no. [18:28:48] Reedy, let me know if you can think of an easy backend way to convert svg->gif and return gif's output directly [18:29:48] Why a gif? [18:30:16] seemingly convert file.svg file.gif should work [18:39:33] (03PS1) 10Yurik: Custom rights on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159517 [18:40:05] MaxSem or Reedy, could you briefly look at ^ before i push it out [18:40:18] just a sanity check :) [18:40:44] saaaanity? yer asking me about sanity? [18:41:10] sorry, and oh, forgot to unset local var, one sec [18:42:20] (03PS2) 10Yurik: Custom rights on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159517 [18:42:23] MaxSem, ^ [18:42:54] does it make sense, can i push it out? [18:43:50] Reedy, re why gif - that's the safest bet. 
Images are used only for the oldest, no JS phones, hence i don't want to require png [18:44:19] yurikR, + for arrays teh suxxx [18:44:21] ori: "# require mediawiki::users::mwdeploy -- temp. removed for ::mediawiki refactor -- OL" <- can I replace? [18:44:53] I'm pretty sure that you'll just overwrite sysop perms with user ones [18:45:10] MaxSem, i copied code from the above - see foreach ( $groupOverrides2 [18:45:57] ori: Also, I note that mwdeploy is not in the 'deployment' group and hence can't run mwscript. Is that a good thing or a bad thing? [18:46:11] MaxSem, The + operator returns the right-hand array appended to the left-hand array; for keys that exist in both arrays, the elements from the left-hand array will be used, and the matching elements from the right-hand array will be ignored. [18:46:29] bleh [18:47:06] this is the "standard" way of doing things in php... through ones' orifice [18:47:36] MaxSem, need to depl, please +2 or at least +1, i'll take the blame ) [18:47:52] or just tell me if you see something wrong :) [18:48:41] (03CR) 10MaxSem: [C: 031] Custom rights on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159517 (owner: 10Yurik) [18:48:55] (03CR) 10Yurik: [C: 032] Custom rights on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159517 (owner: 10Yurik) [18:48:59] (03Merged) 10jenkins-bot: Custom rights on zerowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159517 (owner: 10Yurik) [18:50:39] (03PS1) 10Dduvall: beta: add deployment-mediawiki03 to scap targets [puppet] - 10https://gerrit.wikimedia.org/r/159520 [18:51:15] !log yurik Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 05s) [18:51:20] Logged the message, Master [18:51:47] !log yurik CommonSettings.php - zerowiki perm changes [18:51:51] Logged the message, Master [18:55:13] (03PS2) 10Dduvall: beta: add deployment-mediawiki03 to scap targets [puppet] - 10https://gerrit.wikimedia.org/r/159520 [18:56:40] (03PS3) 
10BryanDavis: beta: add deployment-mediawiki03 to scap targets [puppet] - 10https://gerrit.wikimedia.org/r/159520 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [18:56:56] (03CR) 10BryanDavis: [C: 031] beta: add deployment-mediawiki03 to scap targets [puppet] - 10https://gerrit.wikimedia.org/r/159520 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [18:57:45] bd808: ah, thanks for fixing that. i'm still a little confused about when to use Bug: and when to just mention it in the commit message [18:58:17] marxarelli: Just use Bug: XXXX problem solved [18:58:36] bd808: will do. can there be multiple? [18:58:57] Yeah, one per line [18:59:20] bd808: got it [18:59:22] There is a bot that reads those and links the patch to the bug if they are in that format [19:00:31] bd808: I need to run a cron to drain the jobqueue on wikitech which, presumably, uses mwscript. Is there a standard user setup I should use for that? [19:00:48] Until now that cron was run by mwdeploy, but at present mwdeploy doesn't have the right privs to use mwscript [19:02:36] andrewbogott, bd808 : any idea why applying https://gerrit.wikimedia.org/r/#/c/155753/ patch to my role::puppet::self enabled test lab instance and giving sudo puppet-apply does not have the exim configurations change applied as per the patch ? [19:02:41] andrewbogott: mwscript ends up running php as the apache user. mwdeploy *should* be able to sudo as apache I think generally [19:03:35] tonythomas: not offhand. I don't ever use puppet-apply, don't really know how to use it properly. [19:04:23] andrewbogott: ok. any thoughts on who I should be pinging ? [19:05:17] tonythomas: didn't you already discuss this with yuvi at length? It sounded like you just had a broken puppet config [19:05:27] but you weren't using puppet-apply then, were you? [19:06:11] oh. 
that was a typo :o, actually - sudo puppet agent -tv [19:06:15] bd808: https://dpaste.de/6FVh <- some jobs run, some don't [19:06:57] tonythomas: in that case… you're probably doing things right but your puppet code is wrong :( It's hard to offer any general suggestions other than 'debug' [19:07:03] yuvi could break into a lot of the issues, but still we are not able to find the patch results effecting in /etc/exim4/ [19:07:13] Sometimes you can put something intentionally broken in the class of interest to make sure it's actually traversed [19:07:26] andrewbogott: What is the ownership of /tmp/mw-runJobs-backoffs.json ? [19:07:43] -rw-r--r-- 1 apache apache 30 Sep 10 18:43 /tmp/mw-runJobs-backoffs.json [19:07:43] andrewbogott: ok. Will try. :) [19:08:27] andrewbogott: Which is what I would expect. So mwscript isn't sudoing to apache? Or ... your doing something else to run the jobs? [19:08:55] bd808: I believe the complete commandline I'm using is in that paste. In theory that paste is replicating a cron [19:08:59] (and seems to be failing in the same way) [19:10:45] hmm... yeah I see it now. So mwscript apparently only does the sudo to apache for members of 'sudo|wikidev|root' groups. Try cahcing your command to start sudo -u apache ... [19:10:56] *changing [19:12:26] bd808: that works of course. I can move the cron to the apache crontab -- my question is more about style/convention than about how to make it work [19:12:43] maybe there is no convention, if running a cron on mediawiki is weird [19:13:17] andrewbogott: all php should run as the apache user. It is the least privileged user on the box [19:13:39] fair enough [19:15:44] (03CR) 10Cmjohnson: [C: 032] adding cname for civicrm2-gr.frdev.wikimedia.org rt8330 [dns] - 10https://gerrit.wikimedia.org/r/159497 (owner: 10Cmjohnson) [19:16:58] (03PS1) 10Andrew Bogott: Run mw maintenance jobs as 'apache' rather than mwdeploy. 
[puppet] - 10https://gerrit.wikimedia.org/r/159532 [19:17:11] bd808: ^ [19:20:52] andrewbogott: Do you need to change the ensure=>absent jobs there? [19:21:26] Or did you make manual changes you want puppet to nuke? [19:21:35] s/need/mean/ [19:21:46] bd808: That patch does three things -- it removes the old ensure->absent bits (since they're obsolete) it adds new ensure->absent sections (to remove the mwdeploy crontab) and it adds the new desired apache crontabs. [19:21:54] gerrit diff conflates the first two [19:22:02] ah. ok [19:22:11] +1 for user=>apache :) [19:23:04] (03CR) 10BryanDavis: [C: 031] "LGTM but I don't have access to the host to check the current jobs that are being removed." [puppet] - 10https://gerrit.wikimedia.org/r/159532 (owner: 10Andrew Bogott) [19:24:44] (03CR) 10Andrew Bogott: [C: 032] Run mw maintenance jobs as 'apache' rather than mwdeploy. [puppet] - 10https://gerrit.wikimedia.org/r/159532 (owner: 10Andrew Bogott) [19:28:34] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [19:30:22] Reedy, hi .. wanted to check if https://gerrit.wikimedia.org/r/#/c/157177/ has been deployed. [19:30:47] subbu: It was [19:30:55] ok, thanks. [19:37:11] grr. re. the civicrm alert from watchmouse, I am aware and will fix ASAP [19:51:57] !log puppet disabled on carbon (install server) for a livehack test of config setting [19:52:03] Logged the message, RobH [19:52:29] !log Created Echo tables on extension1 for cawikimedia [19:52:33] Logged the message, Master [19:54:55] Got an error on Commons: (Cannot contact the database server: Too many connections (10.64.32.29)) [19:55:35] And other hits are just very slow. [19:55:48] hmm [19:55:51] just an s4 slave [19:56:09] It's not every hit. I'll keep an eye out though. [19:58:24] JohnLewis: https://ve.wikimedia.org/wiki/P%C3%A1gina_principal [19:59:01] Is it fishbowl? [19:59:03] Should it be? [19:59:22] Seems I already have an account on it... 
[19:59:49] Reedy: I believe they asked for it to be a fishbowl wiki indirectly [20:00:00] https://ve.wikimedia.org/w/index.php?title=Especial:ListaUsuarios&offset=&limit=500 [20:00:05] gwicke, subbu, cscott: Dear anthropoid, the time has come. Please deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T2000). [20:00:18] chasemp: i looked at the admin module, it calls validate_ensure which comes from modules/wmflib/lib/puppet/parser/functions/validate_ensure.rb but i can't see how the that function is called [20:00:30] i took it into a VM i have. and puppet fails for not finding the .rb although i added it [20:00:44] can you please point me in the right direction ? [20:00:58] I don't think it does...or can't think of where it would [20:01:02] so that's odd [20:01:27] chasemp: line 56 of user.pp [20:02:22] ah [20:02:29] I think that is meant to be [20:02:42] validate_ensure($ensure, 'present|absent') [20:02:43] kind of thing [20:03:03] or [20:03:18] yeah I think this was done by _not me_ [20:03:22] and it used to be that [20:03:32] that is what the docs say, but i can't make my VM use the function. [20:04:08] submodule? [20:04:21] for wmflib ori would be a good person to ask [20:04:28] they have been doing a lot there [20:04:38] chasemp: i get: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function validate_ensure at /etc/puppet/modules/admin/manifests/user.pp:57 [20:05:34] youcan do 'file modules/wmflib/lib/puppet/parser/functions/validate_ensure.rb' [20:05:36] and it exists right? 
[20:05:56] yes [20:06:12] and i also copied it into the admin module lib dir just in case [20:06:32] and how would puppet know to look in that directory anyway [20:06:49] puppet sucks in that any function defined in any module is "global" [20:06:52] figure that out [20:07:04] in theory you can call it defined in wmflib as if it were native [20:07:25] but this seems more puppet namespace / reference oriented than admin module and I'm just not sure man [20:07:29] that makes my question even stronger [20:07:45] (03CR) 10RobH: [C: 04-1] "dont commit this until chris is onsite and shuts down that device." [puppet] - 10https://gerrit.wikimedia.org/r/159439 (owner: 10Dzahn) [20:07:48] i have it locally in the module! :) [20:07:59] but thanks for the help anyway [20:08:04] * matanya looks for ori [20:11:36] I guess _joe_ and akosiaris would be able to answer as well of they were here [20:11:41] greg-g, not deploying parsoid today. so, if anyone is waiting on us, they can deploy. [20:20:38] grrrit-wm: wah, puppet alerting about betalabs puppet failures! [20:20:41] I see lots of 'em http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1409953203.01&target=deployment-prep.*.puppetagent.failed_events.value [20:21:08] err [20:21:10] greg-g: ^ [20:21:11] not grrrit-wm [20:42:05] YuviPanda: jeremyb was trying to do some puppet stuff and may have made that sadness [20:43:05] YuviPanda: yes certainly. but good to see a graph [20:43:28] the recent sadness at least [20:43:39] the earlier sadness problem corresponds to puppetmaster off [20:44:27] heh, the fact that you got alerted is the nice part [20:44:41] did anyone? [20:44:54] i think we found out about last night by people complaining [20:45:08] 13:22 < YuviPanda> grrrit-wm: wah, puppet alerting about betalabs puppet failures! [20:45:12] i thought this [20:45:22] via graphite check [20:45:32] maybe that was a request to have grrrit-wm say something? 
:) [20:46:01] I got "** PROBLEM alert - labmon1001/Monitor for puppet failures on beta labs is CRITICAL **" at Wed, 10 Sep 2014 19:30:24 +0000 [20:46:08] jeremyb: no, it's actually in Icinga [20:46:19] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labmon1001&service=Monitor+for+puppet+failures+on+beta+labs [20:46:24] cool [20:46:32] yea, it's pretty new [20:47:01] mutante: it only noticed because all the machines went down [20:47:05] mutante: https://gerrit.wikimedia.org/r/#/c/159470/ is trivial patch [20:47:22] mutante: https://gerrit.wikimedia.org/r/#/c/159473/ actually lets us check properly [20:48:22] YuviPanda: right, just saw it a few min ago [20:48:28] coool [20:48:36] (03PS2) 10Dzahn: icinga: Fix typo in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159470 (owner: 10Yuvipanda) [20:49:23] (03CR) 10Dzahn: [C: 032] icinga: Fix typo in check_graphite [puppet] - 10https://gerrit.wikimedia.org/r/159470 (owner: 10Yuvipanda) [20:49:48] it says a lot of service is not scheduled to be checked [20:50:36] those could be the passive ones [20:50:50] like puppet freshness [20:51:03] wait, that got removed the other day i think [20:52:00] "Wikimedia Labs / deployment-prep (beta): Beta Cluster api.php, index.php, load.php return 404 " [20:52:22] ori: ^ that was last night though? [20:55:59] mutante: look at the other patch too? :) [20:56:03] I checked it locally [21:00:58] (03PS2) 10Dzahn: icinga: Add graphite_series_threshold check [puppet] - 10https://gerrit.wikimedia.org/r/159473 (owner: 10Yuvipanda) [21:01:56] (03CR) 10Dzahn: [C: 032] icinga: Add graphite_series_threshold check [puppet] - 10https://gerrit.wikimedia.org/r/159473 (owner: 10Yuvipanda) [21:02:35] mutante: w00t [21:02:37] YuviPanda: ok! 
[21:02:44] you know,you have "trends" for this service too [21:02:46] https://icinga.wikimedia.org/cgi-bin/icinga/trends.cgi?host=labmon1001&service=Monitor+for+puppet+failures+on+beta+labs [21:02:52] just a bit slow to load [21:02:54] Nice [21:03:20] Next I should fix the puppet failure checker [21:04:43] (03PS2) 10Dzahn: add esams.wmnet to search in resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/159391 [21:04:59] YuviPanda: it's creating the new command on neon right now... [21:06:44] mutante: cool [21:06:52] If it fails should have a different message [21:07:08] (03CR) 10Dzahn: [C: 032] "let's you ssh to hosts in esams.wmnet without having to type the FQDN from ops bastion" [puppet] - 10https://gerrit.wikimedia.org/r/159391 (owner: 10Dzahn) [21:08:36] YuviPanda: yep,icinga no warnings or errors [21:08:59] (in the config check after puppet ran) [21:10:49] (03CR) 10Dzahn: "here this worked just fine, so it's a mystery to me why it doesn't on terbium." [puppet] - 10https://gerrit.wikimedia.org/r/159391 (owner: 10Dzahn) [21:19:32] (03CR) 10Dzahn: [C: 032] "virt0 and this group is already gone from ganglia web ui, this must be just a remnant that causes entries in error log" [puppet] - 10https://gerrit.wikimedia.org/r/159390 (owner: 10Dzahn) [21:22:23] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:59] mutante: cool. 
I'll mess it up tomorrow and see if something turns up :) [21:23:06] Off to bed now [21:28:50] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [21:40:43] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:44:25] * [new branch] sandbox/jeremyb/betacluster-2014-09-10 -> origin/sandbox/jeremyb/betacluster-2014-09-10 [21:44:29] heh [21:49:51] (03PS1) 10Dzahn: delete "check_bad_apaches" monitoring [puppet] - 10https://gerrit.wikimedia.org/r/159619 [21:51:13] (03CR) 10Dzahn: [C: 031] "/usr/lib/nagios/plugins/check_bad_apaches" [puppet] - 10https://gerrit.wikimedia.org/r/159619 (owner: 10Dzahn) [21:52:56] CRIT: deployment-prep.deployment-videoscaler01.puppetagent.failed_events.value [21:53:06] (puppet fail on beta on videoscaler) [21:55:55] ACKNOWLEDGEMENT - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC daniel_zahn HHVM jobrunner [21:56:08] ACKNOWLEDGEMENT - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC daniel_zahn HHVM [21:58:57] mutante or anyone: some responses from bits.wikimedia.org have an "Access-Control-Allow-Origin: *" header, and others don't (and so trigger CORS failures). [21:59:05] ACKNOWLEDGEMENT - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn probably needs https://dpaste.de/8S3w/raw [21:59:52] It seems the responses that have an X-Cache miss header lack the Access-Control-Allow-Origin header [22:03:49] spagewmf: forwarded the message, can we have some example URLs? 
[22:04:33] mutante, spagewmf: ResourceLoader::sendResponseHeaders only sets CORS for responses that contain exclusively CSS [22:05:32] see [22:07:05] <_joe_> btw, caching has no role in setting CORS for bits [22:07:19] <_joe_> so I'd mostly exclude caching issues [22:07:28] <_joe_> s/caching/varnish/ [22:08:30] (03PS1) 10Aaron Schulz: Show 4 instead of 8 lines on the file backend graphs [puppet] - 10https://gerrit.wikimedia.org/r/159621 [22:12:00] (03PS2) 10Ori.livneh: Show 4 instead of 8 lines on the file backend graphs [puppet] - 10https://gerrit.wikimedia.org/r/159621 (owner: 10Aaron Schulz) [22:12:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Show 4 instead of 8 lines on the file backend graphs [puppet] - 10https://gerrit.wikimedia.org/r/159621 (owner: 10Aaron Schulz) [22:16:14] mutante: thanks! https://bugzilla.wikimedia.org/show_bug.cgi?id=70681#c4 *perhaps* if X-Cache reports a miss, there's no Access-Control-Allow-Origin header [22:19:45] spagewmf: something weird is definitely going on [22:20:05] _joe_: https://dpaste.de/GrvG/raw [22:20:20] ori these are ttf/woff font requests. _joe_ might be something else, it just seems when people retry the problem is gone. [22:22:46] cp1057 hit for one request, then a miss six seconds later [22:23:04] <_joe_> ori: the miss is _not_ so strange [22:23:32] <_joe_> oh sorry, the second one is the latest [22:23:38] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:23:40] <_joe_> ok so it's kinda strange [22:24:04] i just had another miss from cp1057 [22:25:29] <_joe_> ori: well in bits we probably do evictions I guess [22:25:43] <_joe_> but - is this related to CORS? 
[22:26:05] <_joe_> because AFAICS I never receive it [22:26:21] <_joe_> my best guess is that some backends do respond with it, and some don't [22:28:26] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail [22:29:30] _joe_: you mean you never get "Access-Control-Allow-Origin: *" ? I think that means e.g. https://en.wikipedia.org/wiki/Wikipedia_talk:Flow/Developer_test_page will not have the WikiFont-glyphs , which is bug 70681 [22:31:09] <_joe_> spagewmf: ok, but the problem is on the backend [22:31:13] <_joe_> not in varnish [22:32:02] <_joe_> the php backend should return the CORS header, and it does not [22:34:46] _joe_: PHP backend? Aren't we talking about static files being served? [22:35:00] cp4003 has the header, cp4002 does not [22:35:14] try this from the cluster: [22:35:17] Oooh you know what [22:35:18] curl -I -H 'host: bits.wikimedia.org' 'cp4003/static-1.24wmf19/extensions/Flow/modules/new/fonts/WikiFont-Glyphs.woff?2014-08-28T18:13:20Z' [22:35:25] and the same, for cp4002 [22:35:28] I bet you this regressed when bits was merged into the general pool [22:35:39] i have the varnishlog entries for those reqs too, but they don't show anything interesting [22:35:40] I wonder if the Apache configs were ported correctly to set CORS headers for static files on bits [22:35:42] RoanKattouw: hm? how come? [22:35:43] <_joe_> ori: that means someone changed something today [22:35:55] <_joe_> ori: were they both misses? 
[22:36:12] <_joe_> RoanKattouw: I don't see how, given apache configs were identical [22:36:16] <_joe_> and the code as well [22:36:17] Hmm OK [22:36:21] * RoanKattouw was just guessing [22:36:27] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Epic puppet fail [22:36:28] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Epic puppet fail [22:36:36] In any case if the backends are serving things wrong, that sounds like an Apache config issue [22:36:46] cp4003 is consistently a hit, cp4002 is hit/miss [22:36:52] <_joe_> RoanKattouw: and we changed nothing. [22:36:52] Which, I suppose more interesting things have happened in that area more recently, like porting to the new Apache version [22:37:08] Oh this is like a very recent regression? [22:37:11] RoanKattouw, ori, _joe_ : the reports of bad WikiFont glyphs started today, AFAIK non earlier [22:37:12] <_joe_> ori: so, old cached content has, the new one hasn't [22:37:21] <_joe_> it _is_ [22:37:27] <_joe_> RoanKattouw: it's pretty new [22:37:30] Right, OK [22:37:48] Sorry, I was assuming that maybe this was an older problem, but I guess it's not [22:37:54] <_joe_> and luckily for us my last apache change (which was a no-op btw) was on monday [22:37:58] <_joe_> mmmmh wait [22:38:04] My ears perked up because I was the one that pushed through the CORS on bits for font files change in a previous life [22:38:04] <_joe_> ori: lemme check one thing [22:38:07] hahaha [22:40:07] curling the apache directly doesn't get me the cors headers [22:40:23] <_joe_> ori: exactly [22:40:23] well, *an apache [22:40:29] so where did it come from? [22:40:30] <_joe_> ori: I tried all [22:40:35] maybe a code change [22:40:36] <_joe_> it came from apache [22:40:37] in flow? [22:40:38] <_joe_> sure as hell [22:40:42] ebernhardson: ^^ ? 
[22:40:55] <_joe_> varnish has _no_ cors-setting routine for bits [22:40:58] <_joe_> that I can see [22:41:22] <_joe_> bt then again, you may ask someone in a time zone where it's not 1 AM [22:41:30] <_joe_> and he may be more sure [22:41:32] <_joe_> :) [22:41:36] <_joe_> he/she [22:42:21] _joe_: thanks for puzzling on this. I don't think Flow has made any changes. Our CSS references the font, the browser makes a request [22:43:21] ori: sorry, whats the question? reading up [22:43:34] maybe sendResponseHeaders is short-circuited depending on whether something was in memcached or not [22:44:08] ebernhardson: bug 70681. Sometimes the WikiFont-Glyph doesn't load, seems to be no Access-Control-Allow-Origin: * in the bits response [22:44:14] <_joe_> It's so funny to debug multi-tiered web apps [22:44:35] <_joe_> you basically need an ops, a dba, a couple of devs and some ritual sacrifice [22:46:02] ebernhardson: some varnishes are serving WikiFont-Glyphs.woff with CORS header, others are not. it appears not to have anything to do with varnish, but just a reflection of when they were cached. so we suspect something made apache behave differently at some point today. [22:46:11] actually, this isn't an RL request at all, is it? [22:46:16] it's just a file on disk [22:46:17] also Flow didn't deploy any new code since Thursday [22:46:20] ori: i'm certain flow doesn't directly set CORS anywhere. spage should have it right about [22:47:32] ori: yes it should just be a file on disk, referenced from css as url(... [22:47:35] ) [22:48:29] ori: Yeah it's a totally static file [22:49:07] Only reason I know about it is because I was the one that got ops to put those CORS headers in [22:49:09] years ago [22:50:01] modules/mediawiki/files/apache/modules/expires.conf adds Access-Control-Allow-Origin "for static content" [22:51:22] could it be the heck of symlinks affecting ? [22:52:22] <_joe_> RoanKattouw: ops? 
[22:52:52] !log labstore1003 - (earlier) revoked salt and puppet key and signed new after hostname fix - same salt-minion puppet errors that happen after reinstalls [22:52:57] Logged the message, Master [22:53:51] _joe_: Operations people [22:53:53] <_joe_> ori: found it [22:53:57] Your predecessors circa 2011 [22:54:09] <_joe_> RoanKattouw: yes sorry, I wasn't sure we set that via apache [22:54:13] <_joe_> but I found the bug [22:54:22] Oh? [22:54:27] <_joe_> and it's .... drumroll [22:54:32] <_joe_> mine fault and ori's :P [22:54:37] <_joe_> give me 2 mins [22:56:32] yay [22:58:17] (03PS1) 10Giuseppe Lavagetto: mediawiki: change config of expires module [puppet] - 10https://gerrit.wikimedia.org/r/159627 [22:58:20] <_joe_> I think the problem is this ^^ [22:58:28] <_joe_> ori: care to take a look? [22:58:34] * ori looks [22:59:05] (03CR) 10Ori.livneh: [C: 031] mediawiki: change config of expires module [puppet] - 10https://gerrit.wikimedia.org/r/159627 (owner: 10Giuseppe Lavagetto) [22:59:18] if you care to, add Bug: 70681 to commit message [22:59:21] Certainly needs doing whether it's the problem or not ;) [22:59:26] <_joe_> spagewmf: yes [22:59:34] I'll do the swat [22:59:36] +1 :) i saw ori change the path elsewhere [22:59:38] <_joe_> adding and deploying [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140910T2300). [23:00:07] so it is '15:51 < spagewmf> could it be the heck of symlinks affecting ? [23:00:28] MaxSem: Thanks. [23:00:39] alrighty, no other pretenders! 
[23:00:40] (03PS2) 10Giuseppe Lavagetto: mediawiki: change config of expires module [puppet] - 10https://gerrit.wikimedia.org/r/159627 [23:01:42] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: change config of expires module [puppet] - 10https://gerrit.wikimedia.org/r/159627 (owner: 10Giuseppe Lavagetto) [23:02:44] Coren: fixing puppet run on labstore1003 broke puppet run on tin - Duplicate declaration: Sshkey[labstore1003] is already declared , haha [23:03:10] <_joe_> spagewmf: yes and no [23:03:28] <_joe_> spagewmf: we moved everything under /srv/mediawiki/docroot (still a symlink) [23:04:21] _joe_ , ori et al, molto grazie ! The pigeon for the ritual sacrifice gets to live another day. [23:04:36] <_joe_> ok change confirmed to work [23:04:44] <_joe_> on testwiki [23:04:59] <_joe_> so, I'd wait ~ 20 mins and run an apache-graceful-all [23:05:14] <_joe_> I won't force a puppet tagged run for this [23:05:24] <_joe_> but your opinion may differ [23:05:28] <_joe_> if so, let me know [23:06:30] <_joe_> https://dpaste.de/7h3P/raw before and after [23:07:04] <_joe_> so ewww that's why content was expiring so fast [23:07:10] * _joe_ facedesks [23:07:45] <_joe_> how serious was the bug for users? [23:08:01] _joe_: within the hour is fine by me. Who can do an apache-graceful-all so you can sleep? [23:08:21] <_joe_> spagewmf: any opsen [23:08:35] <_joe_> but I may still be around [23:08:55] _joe_: a few people have commented "Whoa, weird looking glyphs" https://i.imgur.com/azFlGqz.png , but Flow isn't in wide use. [23:09:10] <_joe_> :/ [23:09:27] enwiki looked worse than that for me [23:09:44] Reedy: on a Flow page? or something else? [23:09:50] Yup, flow [23:09:57] <_joe_> Reedy: dfferent caches [23:10:05] loads of missing graphics and such [23:10:30] <_joe_> we _need_ a unit testing framework for apache [23:11:14] <_joe_> Reedy: but everywhere or just on flow? [23:11:24] just flow stuff, I think [23:11:43] Reedy: which browser? 
the WikiFont glyphs are in the Private Use Area, some browsers show Unicode ##s, others show funny icons. I think only Flow is using the font, plus some mobile apps. [23:12:04] Chrome 39 [23:12:34] do you guys make it failover to images in case of no support? [23:13:01] spagewmf: want a screenshot? [23:13:27] Reedy: sure, or attach to bug 70681. Thanks [23:15:09] spagewmf: https://bug-attachment.wikimedia.org/attachment.cgi?id=16435 [23:19:35] !log maxsem Synchronized php-1.24wmf10/resources/: https://gerrit.wikimedia.org/r/#/c/159513/ (duration: 00m 05s) [23:19:41] Logged the message, Master [23:19:53] James_F|Away, ^ [23:20:08] <_joe_> MaxSem: is it ok to do an apache graceful in ~ 5 mins? [23:20:18] I guess [23:20:21] <_joe_> or are you still deploying? [23:20:33] you're not gonna interfere with rsync, right? [23:20:55] also, I'll be done in 5 mins unless something breaks [23:21:14] !log maxsem Synchronized php-1.24wmf19/extensions/CentralAuth/: (no message) (duration: 00m 04s) [23:21:18] Logged the message, Master [23:21:24] !log maxsem Synchronized php-1.24wmf20/extensions/CentralAuth/: (no message) (duration: 00m 03s) [23:21:28] Logged the message, Master [23:21:31] Reedy, ^^^ [23:22:25] !log maxsem Synchronized php-1.24wmf19/includes/specialpage/SpecialPageFactory.php: https://gerrit.wikimedia.org/r/#/c/159526/ (duration: 00m 03s) [23:22:29] Logged the message, Master [23:22:39] !log maxsem Synchronized php-1.24wmf20/includes/specialpage/SpecialPageFactory.php: https://gerrit.wikimedia.org/r/#/c/159526/ (duration: 00m 03s) [23:22:43] Logged the message, Master [23:22:47] Reedy, ^^ [23:24:49] _joe_, go ahead [23:24:56] <_joe_> MaxSem: thanks [23:26:24] oblivian is doing a graceful restart of all apaches [23:26:39] !log oblivian gracefulled all apaches [23:26:44] Logged the message, Master [23:27:05] <_joe_> it should have worked [23:27:23] wow, that was fast! and no problems due to too many apaches being reloaded at the same time? 
[23:27:26] :) [23:27:47] <_joe_> MaxSem: it should not [23:29:30] !log deleted labstore1003.eqiad.wmnet.org from puppet stored resource db, fixes puppet runs on hosts with ssh host key collection [23:29:36] Logged the message, Master [23:29:58] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [23:30:23] <_joe_> spagewmf: I guess it should be fixed by now [23:30:29] <_joe_> or in 5 minutes tops [23:32:12] _joe_: again, thanks. I did some curls and got the Access-Control-Allow-Origin header [23:34:50] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:36:08] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:38:57] (03CR) 10Dzahn: [C: 032] contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [23:39:15] switching integration.wm and doc.wm to use apache::site ... watching on gallium [23:40:22] (03CR) 10Ori.livneh: [C: 031] delete "check_bad_apaches" monitoring [puppet] - 10https://gerrit.wikimedia.org/r/159619 (owner: 10Dzahn) [23:41:03] (03CR) 10Ori.livneh: [C: 032] Add sudo -u apache call to foreachwikiindblist [puppet] - 10https://gerrit.wikimedia.org/r/157013 (owner: 10Reedy) [23:41:46] Reedy, your fix seems to have stopped the problem, but it got replaced with another fatal:P [23:42:02] PHP Fatal error: Base lambda function for closure not found in /usr/local/apache/common-local/php-1.24wmf20/extensions/Wikidata/extensions/Wikibase/lib/config/WikibaseLib.default.php on line 18 [23:42:08] Ugh, that's APC again [23:42:10] hoo: ^^ [23:42:11] :( [23:42:17] let me find what servers [23:42:26] mh :( [23:42:36] riiight, mw1202 [23:42:45] 10.64.48.34 [23:42:52] mutante: Would you -2 the idea of giving deployers the option to restart apache2? [23:43:00] hoo: They used to be able to [23:43:08] but? 
[23:43:20] I think it broke/needs root or something [23:43:25] No one ever "fixed" it [23:43:29] (03CR) 10Dzahn: "before: 00-dummy.conf 50-qunit-localhost.conf doc.wikimedia.org integration.mediawiki.org integration.wikimedia.org (where those wer" [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [23:43:39] Can someone graceful 10.64.48.34/mw1202 please? [23:44:00] Reedy: Oh, that's an easy one to code [23:44:01] spamming APC related errors [23:44:02] :P [23:44:09] !log graceful'ed mw1202 apache [23:44:10] Reedy: done [23:44:11] There might be more to it... If mutante remembers [23:44:13] thanks [23:44:14] Logged the message, Master [23:44:16] * hoo will submit a patch in a moment... and that will then be in review for $ages [23:44:31] hoo: i don't know, it needed root in the past as Reedy says [23:44:44] to run apache-graceful-all using dsh [23:44:54] but now it's different [23:44:55] you can give ppl a limited sudo... [23:45:05] MaxSem: That's what I want to do :) [23:45:25] they will say to use salt [23:45:57] ...and salt is too scary to allow mortals?:P [23:46:15] (03PS1) 10Ori.livneh: Update remaining references to /u/l/a/common-local [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159635 [23:46:35] hoo: I guess there's a few more people working on the apache code atm, so getting reviews might be easier [23:47:14] what he says [23:47:22] Reedy: ^ patch above [23:47:53] :) [23:47:54] hoo: add ori [23:49:08] (03CR) 10Reedy: [C: 031] Update remaining references to /u/l/a/common-local [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159635 (owner: 10Ori.livneh) [23:49:32] there, doc.wikimedia, integration.wikimedia, integration.mediawiki are still up [23:49:33] (03PS1) 10Hoo man: Allow deployers to graceful apache [puppet] - 10https://gerrit.wikimedia.org/r/159636 [23:50:02] and using apache::site [23:50:24] mutante: Done [23:50:32] Do we use apache-graceful normaly? 
[23:51:15] i think it's history by now [23:51:25] heh [23:51:31] what did you just use? ;) [23:51:58] /etc/init.d/apache2 graceful :p [23:52:04] lol [23:52:25] dsh -F20 -g apaches -cM 'sudo /usr/sbin/apache2ctl graceful' [23:52:33] that's what the graceful-all does [23:52:39] it was a single server, keep it simple :p [23:52:52] (03PS1) 10Ori.livneh: Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159637 [23:52:59] the alternative is i make a new dsh group with a single member [23:53:02] :p [23:53:10] Reedy: ^ another one :) [23:53:37] I wonder why we had that originally [23:54:32] (03CR) 10Reedy: [C: 031] Get rid of MULTIVER_CDB_DIR_{APACHE,HOME} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159637 (owner: 10Ori.livneh) [23:54:49] re: <+MaxSem> do you guys make it failover to images in case of no support? [23:55:07] yes? [23:55:41] you should ask Shahyar. There's no fallback beyond the.eot, .woff, .ttf list, after that you get no icon and only text. [23:55:51] Reedy: Do you have time to look at https://gerrit.wikimedia.org/r/#/c/159493/ before the train deploy tomorrow and see if you think it will fix things for the wiktech l10n cache? [23:56:14] MaxSem: but I think front-end engineers are going back to generated .svgs in ResourceLoader [23:56:38] Reedy: It's completely untested by me in an way shape or form, but seems like it should work. [23:56:55] bd808: Right. I can't see why it wouldn't work [23:57:20] Does it get loaded in by the entry point? [23:57:23] https://wikitech.wikimedia.org/wiki/Apache#Logging [23:57:33] says fenari, and NFS and Squid :p [23:57:51] eww. andrewbogott has a window to update wikitech *before* the train deploy. [23:58:04] copies Reedy to /home/wikipedia/logs/syslog/syslog [23:58:38] bd808: I don't mind moving it later [23:58:39] Reedy: Yeah. That file is loaded by the entry point in a round about way. The function there is what wikitech's config calls immediately after laoding.