[00:00:05] The cross browser results are interesting. The beta/prod results are a red herring. There is nothing that can be correlated there. [00:00:35] The cross browser results say to me that the ve devs are running chrome. :) [00:00:57] yes, that is what they are saying in fact [00:01:15] RECOVERY - Host baham.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 52.22 ms [00:03:44] (03PS2) 10Ori.livneh: beta: switch to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159431 [00:19:31] (03PS3) 10Ori.livneh: beta: switch to /srv/mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/159431 [00:21:55] (03PS1) 10Dzahn: terbium-remove trailing comma from domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159435 [00:22:10] (03CR) 10Ori.livneh: [C: 032] "PS3 applied on labs, did the right thing." [puppet] - 10https://gerrit.wikimedia.org/r/159431 (owner: 10Ori.livneh) [00:39:09] (03CR) 10Dzahn: "no, still compiles as "identical".. what the .." [puppet] - 10https://gerrit.wikimedia.org/r/159435 (owner: 10Dzahn) [00:55:20] (03PS1) 10Dzahn: replace mexia with baham in misc/monitoring.pp [puppet] - 10https://gerrit.wikimedia.org/r/159437 [00:59:09] (03PS2) 10Dzahn: decom mexia [puppet] - 10https://gerrit.wikimedia.org/r/159437 [01:03:00] (03PS1) 10Dzahn: remove pmtpa subnets from install-server [puppet] - 10https://gerrit.wikimedia.org/r/159438 [01:12:34] (03PS1) 10Dzahn: remove 10.4.16.0/24 and host scs-c1-pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/159439 [01:14:56] (03PS1) 10Dzahn: dhcp - delete remaining Tampa db's and es's [puppet] - 10https://gerrit.wikimedia.org/r/159440 [01:19:03] (03PS1) 10Dzahn: remove pmtpa from all $domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159441 [01:21:46] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [01:32:46] (03PS1) 10Dzahn: decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 [01:34:38] (03CR) 10Dzahn: [C: 04-1] "who knows 
about this part?" [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [01:36:08] (03PS2) 10Dzahn: decom nfs1 [puppet] - 10https://gerrit.wikimedia.org/r/159442 [01:41:48] (03CR) 10Dzahn: "how come this is unmerged but the related bug is resolved?" [puppet] - 10https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: 10Hashar) [01:43:41] (03PS1) 10Rush: phab tools for migrating content [puppet] - 10https://gerrit.wikimedia.org/r/159443 [01:45:12] (03CR) 10Rush: [C: 032] phab tools for migrating content [puppet] - 10https://gerrit.wikimedia.org/r/159443 (owner: 10Rush) [01:45:29] (03Abandoned) 10Dzahn: terbium-remove trailing comma from domain_search [puppet] - 10https://gerrit.wikimedia.org/r/159435 (owner: 10Dzahn) [01:48:53] (03PS1) 10Rush: phab path fix for migration tools [puppet] - 10https://gerrit.wikimedia.org/r/159444 [01:53:19] (03CR) 10Rush: [C: 032] phab path fix for migration tools [puppet] - 10https://gerrit.wikimedia.org/r/159444 (owner: 10Rush) [01:53:22] (03PS1) 10Rush: standard no-exim for iridium [puppet] - 10https://gerrit.wikimedia.org/r/159445 [01:54:37] (03CR) 10Rush: [C: 032] standard no-exim for iridium [puppet] - 10https://gerrit.wikimedia.org/r/159445 (owner: 10Rush) [01:56:15] (03PS2) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 [01:56:33] (03PS1) 10Rush: remove git-core from phab as dupe w/ standard [puppet] - 10https://gerrit.wikimedia.org/r/159446 [01:57:28] (03CR) 10Rush: [C: 032] remove git-core from phab as dupe w/ standard [puppet] - 10https://gerrit.wikimedia.org/r/159446 (owner: 10Rush) [01:58:28] (03PS3) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 [01:58:53] (03CR) 10Dzahn: webserver - use ssl_ciphersuite in generic_vhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [02:11:25] PROBLEM - Disk space on virt0 is CRITICAL: DISK 
CRITICAL - free space: /a 3265 MB (3% inode=99%): [02:11:28] (03PS1) 10Rush: phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 [02:12:12] (03CR) 10Dzahn: [C: 032] contint: switch localvhost to apache::conf [puppet] - 10https://gerrit.wikimedia.org/r/155707 (https://bugzilla.wikimedia.org/68256) (owner: 10Hashar) [02:12:26] (03PS2) 10Rush: phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 [02:12:30] (03CR) 10Rush: [C: 032] phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 (owner: 10Rush) [02:13:36] (03CR) 10Rush: [V: 032] phab dont apply migration tools just yet [puppet] - 10https://gerrit.wikimedia.org/r/159447 (owner: 10Rush) [02:13:42] ori: what's mw1201? [02:13:52] (and is there somewhere i can look this up myself, instead of always having to bug someone?) [02:15:07] mutante: about? [02:15:39] chasemp: yes [02:15:39] jackmcbarn: it's a standard application server. you can determine this by looking it up in operations/puppet.git:manifests/site.pp (though you may have to recurse into included files), but there's an easier way that works most of the time -- go to http://ganglia.wikimedia.org/, click 'search', and type the name of the host [02:15:51] can you take a peek at puppet run on iridium [02:15:54] jackmcbarn: it should autocomplete with the cluster name [02:15:56] some salt / trebuchet messages [02:16:03] ori: does standard application server means it serves web requests? [02:16:04] ...I don't get why they would show up here [02:16:10] jackmcbarn: yes [02:16:13] "Warning: /Stage[main]/Role::Trebuchet/Salt::Grain[trebuchet_master]/Exec[/usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet" [02:16:18] on...iridium? [02:16:34] almost out of time for tonight, not sure what the deal is [02:16:38] chasemp: since you just added "standard" ? 
[02:16:48] yes that did it standard-noexim [02:17:41] jackmcbarn: if you actually pull up the node in ganglia, you can click "host overview", which will give you the kernel version and uptime, which is enough to determine if it's an HHVM box or not. (it isn't; it's running precise / php5) [02:18:04] chasemp: looking .. [02:18:22] !log started salt-minion on iridium [02:18:33] Logged the message, Master [02:18:39] thanks! [02:18:41] is that the whole deal? come on...:) [02:18:53] chasemp: i think that warning is just the side effect, the actual one is [02:18:56] Error: Could not start Service[salt-minion]: Execution of '/sbin/start salt-minion' returned 1: [02:19:00] Error: /Stage[main]/Salt::Minion/Service[salt-minion]/ensure: change from stopped to running failed: Could not start Service[salt-minion]: Execution of '/sbin/start salt-minion' returned 1: [02:19:08] chasemp: no, interestingly i can start it but puppet cant? [02:19:22] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 4192 MB (3% inode=99%): [02:19:23] is it still running? [02:19:25] hmm.. i saw some "fixup salt minion" changes earlier? [02:19:26] I guess [02:19:38] no [02:19:49] service salt-minion start [02:19:50] start: Job failed to start [02:19:52] so it's crashing. is this machine newly-reimaged? [02:19:57] /etc/init.d/salt-minion start * Starting salt minion control daemon salt-minion [ OK ] [02:20:02] ori: are you talking to me / mutante? [02:20:06] init.d says OK. upstart says FAIL :) [02:20:06] either one :) [02:20:15] it's not a new image, but it didn't have standard before [02:20:27] as I didn't wnat the monitoring noise from ppl as it was idle / in setup [02:20:35] but now w/ standard salt seems unhappy [02:20:42] let me look, just a sec [02:20:49] i won't touch anything [02:21:18] [ERROR ] This master address: 'salt' was previously resolvable but now fails to resolve! 
The previously resolved ip addr will continue to be used [02:21:18] [WARNING ] Master hostname: salt not found. Retrying in 30 seconds [02:21:33] /var/log/upstart/salt-minion.log [02:21:42] hmmmm this box got reimaged at one point and moved ip's [02:21:46] long ago [02:22:44] ok if i retry restarting salt? [02:23:01] if i try restarting the salt-minion service on iridium, i mean [02:23:12] please do [02:23:48] hmm, the master is supposedly set to palladium.eqiad.wmnet though [02:23:56] ah, the message is a red herring [02:23:57] /etc/salt/minion [02:24:02] if you try starting it manually you get: [02:24:06] [CRITICAL] The Salt Master has rejected this minion's public key! [02:24:06] To repair this issue, delete the public key for this minion on the Salt Master and restart this minion. [02:24:06] Or restart the Salt Master in open mode to clean out the keys. The Salt Minion will now exit. [02:24:10] which is the standard issue with reimages [02:24:23] ah so back when it must not have had the salt key removed [02:24:30] i can fix this if you like [02:24:33] and it's been lying in wait [02:24:43] if you have time, great [02:26:01] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:29] (03CR) 10Dzahn: [C: 032] contint: migrate localvhost to apache::site [puppet] - 10https://gerrit.wikimedia.org/r/155708 (owner: 10Hashar) [02:26:30] sure [02:26:35] this is the fix, btw: https://dpaste.de/8S3w/raw [02:26:50] excerpted from http://etherpad.wikimedia.org/p/app-server-upgrade , which is a pad _joe._ and i used for note-taking [02:27:05] got it, nice [02:28:24] !log updated salt key for iridium and restarted salt-minion [02:28:30] Logged the message, Master [02:28:54] chasemp, mutante: salt-minion is up now, so the puppet run should be ok (well, or fail for a different reason :)) [02:28:54] sweet, in business now, thanks gents [02:28:59] np! 
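Note on the salt-minion fix above: the [CRITICAL] message itself names the repair, which is to delete the minion's stale public key on the Salt master and then restart the minion. The sketch below only builds that sequence as a dry-run plan. It is not the content of the dpaste link, which is not reproduced in this log. The function name `rekey_plan` and the final `salt-key -L` check are illustrative assumptions, and whether the new key is then accepted automatically depends on how the master is configured.

```python
from typing import Dict, List


def rekey_plan(minion_id: str) -> List[Dict[str, object]]:
    """Return the ordered steps for clearing a stale minion key.

    Nothing is executed here; each step records where it would run and
    the argv to run, so the plan can be reviewed before anyone acts on it.
    """
    if not minion_id or any(ch.isspace() for ch in minion_id):
        raise ValueError("minion_id must be a non-empty name without whitespace")
    return [
        # On the master: delete the old public key (-y skips the prompt).
        {"where": "salt master", "cmd": ["salt-key", "-d", minion_id, "-y"]},
        # On the minion: restart so it presents its current key again.
        {"where": minion_id, "cmd": ["service", "salt-minion", "restart"]},
        # On the master: list keys to see what state the minion's key is in.
        {"where": "salt master", "cmd": ["salt-key", "-L"]},
    ]


if __name__ == "__main__":
    for step in rekey_plan("iridium"):
        print(step["where"], "->", " ".join(step["cmd"]))
```

Keeping the plan as data rather than running it directly fits the caution shown in the channel ("i won't touch anything") while the cause was still being confirmed.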
[02:29:02] yeah looks good [02:29:13] ori: cool [02:29:49] it didnt appear like a reinstall but makes sense when standard is added freshly [02:31:04] (03CR) 10Dzahn: "ran puppet on integration-slave1006-trusty and also on integration-slave1003 - no related issues seen just missing packages php5-parsekit " [puppet] - 10https://gerrit.wikimedia.org/r/155707 (https://bugzilla.wikimedia.org/68256) (owner: 10Hashar) [02:32:05] yeah, looks like: [02:32:21] standard includes base, base includes role::salt::minions [02:32:41] (03CR) 10Dzahn: "root@integration-slave1003:/etc/apache2/sites-enabled# ls" [puppet] - 10https://gerrit.wikimedia.org/r/155708 (owner: 10Hashar) [02:33:01] why's the beta cluster broken? [02:33:11] uhoh [02:33:12] * ori looks [02:33:27] (03PS5) 10Dzahn: contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 [02:34:17] (03CR) 10Dzahn: [C: 031] "the localvhost stuff has been merged separately, rebased now what is left here" [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [02:36:00] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-10 02:36:00+00:00 [02:36:05] Logged the message, Master [02:41:25] (03CR) 10Ori.livneh: [C: 031] contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [02:41:28] whenever i forward stuff to OTRS i get the reply instead of the actual requestor... [02:41:36] ori: :) [02:42:22] jackmcbarn: one of the app servers lost part of its mediawiki deployment dir because of a change i made earlier; i'm re-populating it. should be a couple of mins. [02:42:38] kk [02:43:12] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [02:46:51] (03PS2) 10Dzahn: put racktables behind misc. varnish [puppet] - 10https://gerrit.wikimedia.org/r/154980 [02:49:36] (03CR) 10Dzahn: [C: 032] put racktables behind misc. 
varnish [puppet] - 10https://gerrit.wikimedia.org/r/154980 (owner: 10Dzahn) [02:52:28] (03PS1) 10Dzahn: swich racktables to misc varnish cluster [dns] - 10https://gerrit.wikimedia.org/r/159448 [02:53:48] (03CR) 10Dzahn: [C: 032] swich racktables to misc varnish cluster [dns] - 10https://gerrit.wikimedia.org/r/159448 (owner: 10Dzahn) [03:01:06] RECOVERY - Disk space on virt0 is OK: DISK OK [03:08:00] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-10 03:07:59+00:00 [03:08:06] Logged the message, Master [03:22:37] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [03:45:15] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Epic puppet fail [04:04:25] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [04:17:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Sep 10 04:17:36 UTC 2014 (duration 17m 35s) [04:17:45] Logged the message, Master [04:21:13] mutante: did you get the otrs ticket fixed? [05:23:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [06:01:02] mutante: re forwarding to otrs: try setting reply-to? [06:01:40] i doubt that would work [06:01:49] but feel free to test and we'll see [06:24:17] (03CR) 10Chmarkine: [C: 031] webserver - use ssl_ciphersuite in generic_vhost [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [06:28:07] i have a reproducible zend fatal [06:28:10] (not hhvm) [06:28:50] commons. 
seems to be any category page [06:28:56] PHP fatal error in /srv/mediawiki/php-1.24wmf20/extensions/ConfirmEdit/ConfirmEdit.php line 218: [06:28:56] Cannot redeclare confirmEditSetup() (previously declared in /usr/local/apache/common-local/php-1.24wmf20/extensions/ConfirmEdit/ConfirmEdit.php:207) [06:28:58] even logged out [06:29:07] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:16] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:25] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:26] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:30] ok, not any [06:29:36] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:37] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:16] i actually got 2 different fatals. only one is reproducible [06:32:31] try https://commons.wikimedia.org/wiki/Category:Wikimania_2012 [06:37:25] <_joe_> jeremyb: I get the page perfectly rendered [06:37:32] <_joe_> but, I'm on hhvm I guess [06:37:46] <_joe_> no, not on commons [06:38:16] <_joe_> jeremyb: also, /usr/local/apache/ shouldn't be there [06:39:08] what do you mean shouldn't be there? [06:39:34] <_joe_> that we use /srv/mediawiki as a docroot [06:39:40] how can i tell from exception in logstash what page it was on? [06:39:43] huh [06:40:05] <_joe_> jeremyb: no clue, php is bad at that in genera [06:40:41] and what is testwiki? 1017? [06:40:49] <_joe_> yes [06:41:23] <_joe_> it's also in the hhvm appservers pool [06:43:13] <_joe_> I was looking at fatals on losgstash, they're not so terrible [06:43:28] right. just want to be sure e.g. 
1042 is the normal zend pool [06:43:39] <_joe_> ok so, /usr/local/apache/common-local [06:43:45] (03CR) 10JanZerebecki: [C: 031] let NDAed people login on servermon [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [06:43:48] <_joe_> is the real dir where the code is [06:45:06] huh [06:45:11] now it's magically fixed [06:45:30] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:45:36] <_joe_> what was giving that error? [06:45:40] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:45] the link above [06:45:46] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:45:55] but maybe i pasted the wrong stack [06:45:56] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:06] <_joe_> It was ok for me when I tried [06:46:08] i wonder if this was an error that happens only during l10n changes? [06:46:15] was consistent for me [06:46:26] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:46:46] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:55] or alternatively is related to a puppet run? [06:47:06] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:48:19] <_joe_> nah [06:48:23] <_joe_> no way [06:48:35] <_joe_> puppet is a no-op on appservers in my absence these days [06:48:36] <_joe_> :) [06:49:01] <_joe_> meaning - it's me the one screwing that up :P [06:50:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One minor comment." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [06:53:48] (03PS35) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [06:53:55] _joe_: https://logstash.wikimedia.org/#dashboard/temp/z8CGaXhuTIKN1J0kxUoVeg [06:55:28] i don't think that's a fluke... [06:56:36] (i already poked #mediawiki-i18n, no response yet) [06:58:20] i was hoping timing for earliest instances lined up with something in SAL. but no such luck [07:00:49] was still happening 8 mins ago [07:02:20] <_joe_> mmmmh [07:03:08] <_joe_> maybe I don't see that since I use english? [07:03:28] english what? [07:03:31] i use english too [07:04:42] <_joe_> what is funny (?) is that this happened out of blue far from any release [07:05:22] <_joe_> maybe some memcached expiration is involved; alas, I don't know much about the php code itself [07:05:34] jeremyb: and what makes you think the error is i18n related? [07:06:54] And what's the second fatal? [07:07:01] And what are the actual URLs for each? [07:07:09] Impossible to comment like this, please file bugs [07:08:29] <_joe_> I should start studying the mediawiki code better [07:09:11] (03PS1) 10Spage: Set wgContentHandlerDB true for enwiki-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159457 (https://bugzilla.wikimedia.org/49193) [07:11:16] grrrr, bad wifi is bad [07:11:33] Nemo_bis: i don't remember the url for the one i only got once [07:11:38] oh, i do actually [07:11:45] [[commons:category:foo]] [07:11:53] but that wasn't i18n. 
it was confirmedit [07:12:11] (03CR) 10JanZerebecki: "naggen2 currently expects the puppet master database to be a mysql one and searches the config file of the puppet master for the credentia" [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:12:17] and https://commons.wikimedia.org/wiki/Category:Wikimania_2012 was [07:12:18] PHP fatal error in /usr/local/apache/common-local/php-1.24wmf20/languages/classes/LanguageKk.php line 24: [07:12:18] require_once() [function.require-once]: Cannot redeclare class languageconverter [07:12:25] Nemo_bis: [07:13:01] i think i consistently got Kk. but language code varies in logstash [07:14:09] (03CR) 10Giuseppe Lavagetto: "Please do not change naggen2 to add functionalities; it's thought for simplicity and speed in prod. you may want to adapt it a little, or " [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:15:59] jeremyb: great, you now have enough info to file a bug :) [07:16:23] (03CR) 10Spage: "It would be good to deploy this a week or two before it's the default on enwiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159457 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [07:16:31] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Also, I don't see any reason to rename classes. Icinga::monitor::service makes much more sense than icinga::service." [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:17:21] Nemo_bis: but there's a moderately well defined start time for this exception. it doesn't line up with SAL [07:18:37] (00:55 UTC today) [07:20:04] but? [07:21:40] so why did it start when it did? :) [07:22:18] (03CR) 10Giuseppe Lavagetto: [C: 031] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [07:23:30] _joe_: so, what about your objection to /usr/local/apache ? 
[07:24:03] <_joe_> jeremyb: none, actually [07:24:27] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [07:27:15] (03CR) 10JanZerebecki: "I would expect naggen2 to currently use an sql adapter, so the only suggestion from me that involved changing it was for it to try if sqli" [puppet] - 10https://gerrit.wikimedia.org/r/145472 (owner: 10Dzahn) [07:27:24] (03CR) 10Filippo Giunchedi: [C: 031] contint-use apache::site,move config to templates [puppet] - 10https://gerrit.wikimedia.org/r/153959 (owner: 10Dzahn) [07:29:50] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] sudoers.erb - deprecated variable access [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [07:31:08] sigh, I'm reverting that [07:33:23] cronspam some? :) [07:33:32] indeedly [07:33:40] (03PS1) 10Filippo Giunchedi: Revert "sudoers.erb - deprecated variable access" [puppet] - 10https://gerrit.wikimedia.org/r/159458 [07:34:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "sudoers.erb - deprecated variable access" [puppet] - 10https://gerrit.wikimedia.org/r/159458 (owner: 10Filippo Giunchedi) [07:35:21] should be over shortly, apologies for the noise [07:35:39] np, at least it wasn't piped into an alarm clock :) [07:38:26] hehe perhaps there's a simple way to check sudo syntax after puppet has expanded the template [07:39:53] so, ummmmm, doesn't it take 30 mins for the next puppet run? 
[07:40:03] this is going to keep going a while [07:40:13] (03CR) 10Filippo Giunchedi: "btw this didn't work, reverted in https://gerrit.wikimedia.org/r/#/c/159458/" [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [07:41:36] heh good question, depending if puppet reads the catalog before or after the splay period [07:43:22] <_joe_> 20 [07:43:31] <_joe_> it's 20 mins [07:44:14] oh [07:44:27] so, 10 more :) [07:46:28] (03CR) 10Filippo Giunchedi: [C: 031] StrictTransportSecurity for lists.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/145500 (https://bugzilla.wikimedia.org/38516) (owner: 10Dzahn) [07:51:41] <_joe_> this was very serious btw [07:52:08] <_joe_> how come we don't check files before putting them on the server? [07:53:05] well it passed lint [07:54:57] but i guess lint couldn't do template output [07:55:03] it would need to know input [07:55:14] jeremyb: pcc is the tool to check with [07:55:30] otoh, this is something that maybe could be done with rspec-puppet [07:56:02] (but rspec-puppet is on the chopping block?) [07:56:31] jeremyb: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/ [08:00:52] <_joe_> no. We need to have a system that checks the file _on_server_ and maybe removes it if it's bogus [08:00:55] <_joe_> making puppet fail [08:01:02] <_joe_> I was pretty sure we had that [08:01:41] (03CR) 10Alexandros Kosiaris: sudoers.erb - deprecated variable access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/154372 (owner: 10Dzahn) [08:01:42] that is even better [08:01:46] i don't understand how you would make puppet fail? 
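Note on the question above (how to check a generated file on the server and fail the run if it is bogus): one common pattern is to write the rendered content to a temporary file, run a validator against it, and only move it into place when the validator exits 0. The sketch below illustrates that pattern in Python. The function name `install_if_valid` and the `%` placeholder convention are assumptions made for illustration. For sudoers the validator would typically be `visudo -c -f <file>`, and Puppet's file resource offers a `validate_cmd` parameter along these lines in versions that support it.

```python
import os
import subprocess
import tempfile
from typing import Sequence


def install_if_valid(content: str, dest: str, validate_cmd: Sequence[str]) -> bool:
    """Install content at dest only if validate_cmd accepts it.

    Any argument in validate_cmd equal to "%" is replaced with the path of
    the candidate file. Returns True when the file was installed, False
    when validation failed and dest was left untouched.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Create the candidate next to dest so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".candidate-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        argv = [tmp_path if arg == "%" else arg for arg in validate_cmd]
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return False  # caller should treat this as a failed run
        os.replace(tmp_path, dest)  # atomic swap into place
        tmp_path = None
        return True
    finally:
        # Remove the candidate if it was never installed.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

A call for the sudoers case might look like `install_if_valid(rendered, "/etc/sudoers", ["visudo", "-c", "-f", "%"])`. The sketch does not set ownership or file mode, which a real sudoers deployment would also need.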
[08:02:02] re lint: ohhhh, it's not lint, it's just documentation generation job [08:02:27] still i think in this case unit test is the best fix [08:03:11] akosiaris: heh, possibly numbing effect from many other similar reviews :( [08:03:56] it's not an error in a specific instance only, but a general error that would effect the most basic test case [08:04:16] (03PS1) 10Matanya: sudoers: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159460 [08:04:19] godog: heh. yeah. I generally runs these through the puppet compiler just to be sure [08:04:22] akosiaris: ^ [08:04:43] and you can please merge the pybal one too, if you wish :) [08:05:11] <_joe_> the pybal one? [08:05:22] <_joe_> can you wait for me to take a break? [08:05:23] <_joe_> :) [08:05:55] _joe_: :) https://gerrit.wikimedia.org/r/158086 [08:06:02] akosiaris: very true [08:06:17] once you remove it from fenari the world would be a better place [08:06:43] (03CR) 10Alexandros Kosiaris: [C: 032] pybal: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/158086 (owner: 10Matanya) [08:07:11] _joe_: I reviewed it.. and ran it through compiler, seems a noop [08:07:26] <_joe_> akosiaris: I was joking [08:07:35] k [08:08:00] <_joe_> btw, this makes me think of the "moving pybal configs" I should set myself to. [08:08:36] _joe_: what does not work on servermon btw ? [08:09:48] seemed like the per host packagelist worked ok to me [08:09:53] <_joe_> akosiaris: yesterday the "available package updates" were showing nothing [08:10:00] <_joe_> now it works, wft [08:10:05] <_joe_> *wtf [08:10:44] if you checked right after I send my email, I submitted an extra minor change that might have fixed it [08:10:52] <_joe_> I think so [08:10:59] <_joe_> very very nice btw [08:11:09] thanks. 
I hope it will help us :-) [08:13:18] (03PS1) 10Matanya: limn: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159461 [08:23:19] (03CR) 10Alexandros Kosiaris: [C: 032] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [08:38:58] PROBLEM - check google safe browsing for wikiversity.org on google is CRITICAL: Connection timed out [08:39:18] PROBLEM - check google safe browsing for wikimedia.org on google is CRITICAL: Connection timed out [08:40:19] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: Connection timed out [08:40:58] RECOVERY - check google safe browsing for wikiversity.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3925 bytes in 0.327 second response time [08:41:09] RECOVERY - check google safe browsing for wikimedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 4268 bytes in 0.346 second response time [08:41:49] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:41:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [08:42:12] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3918 bytes in 0.086 second response time [08:46:06] (03PS1) 10Matanya: rsync: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159462 [08:47:23] <_joe_> wat? [08:47:58] <_joe_> ^^ [08:48:22] what is it _joe_ ? 
[08:48:37] <_joe_> the hoarding of alerts [08:48:42] <_joe_> but it was a spike [08:48:48] <_joe_> that we should investigate [08:48:59] <_joe_> but unluckily I have no time no [08:49:03] <_joe_> *now [08:52:50] (03PS2) 10Filippo Giunchedi: swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 [08:52:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: separate access log from general log [puppet] - 10https://gerrit.wikimedia.org/r/159348 (owner: 10Filippo Giunchedi) [08:56:09] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:56:22] (03PS1) 10Matanya: salt: qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/159463 [08:58:09] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:35] (03CR) 10Alexandros Kosiaris: "Pointing out that connecting to http://racktables.wikimedia.org does not redirect to the HTTPS version of the service." [puppet] - 10https://gerrit.wikimedia.org/r/154980 (owner: 10Dzahn) [08:59:10] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [08:59:10] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [08:59:20] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:35] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [08:59:35] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:39] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:50] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [08:59:50] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [08:59:59] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [09:00:26] hmm ? 
[09:00:29] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.028 second response time [09:01:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [09:01:19] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [09:01:21] looks like the imagescalers? looking [09:02:41] whoa. re last 10ish mins in ganglia [09:02:45] https://ganglia.wikimedia.org/latest/?c=Image%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [09:02:57] yup [09:03:13] cascade from the earlier nagios alert for swift? [09:03:40] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [09:03:40] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:40] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:03:50] and there we go [09:04:54] godog: OOM showed up and killed convert [09:05:07] ETOOMANYCONVERSIONS ? [09:05:20] huh [09:05:36] this OOM, he is a killer! 
[09:05:39] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 5.020 second response time [09:05:49] akosiaris: possible, all at the same time and cascading [09:06:01] so actually swift invokes imagescaler which fetches full-size original from swift [09:06:02] i think [09:06:32] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:39] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.012 second response time [09:07:39] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:29] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 6.435 second response time [09:08:59] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:20] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.017 second response time [09:09:59] PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:59] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:11] high cpu across swift frontends, still looking [09:10:59] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:30] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:49] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.026 second response time [09:11:58] couly be related to my recent change to rsyslog, rolling restarting frontends [09:12:03] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.004 second response time [09:12:07] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:12:08] <_joe_> godog: also seems high output network traffic [09:12:09] PROBLEM - SSH on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:12:09] PROBLEM - DPKG on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:09] PROBLEM - check configured eth on ms-fe1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:12:23] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.025 second response time [09:12:54] RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.011 second response time [09:12:54] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.021 second response time [09:12:54] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.134 second response time [09:13:04] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 281 seconds ago with 0 failures [09:13:04] RECOVERY - SSH on ms-fe1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:13:04] RECOVERY - check configured eth on ms-fe1001 is OK: NRPE: Unable to read output [09:13:05] RECOVERY - DPKG on ms-fe1001 is OK: All packages OK [09:13:24] !log rolling restart swift-proxy on ms-fe1* [09:13:29] Logged the message, Master [09:14:06] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.585 second response time [09:14:07] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.308 second response time [09:14:14] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [09:14:14] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 66797 bytes in 0.511 second response time [09:14:24] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 
400 bytes in 0.047 second response time [09:14:34] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [09:14:47] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.082 second response time [09:14:48] <_joe_> I didn't see any alarm on rendering [09:14:51] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.555 second response time [09:14:55] <_joe_> are we just getting the recovery? [09:14:57] _joe_: there were [09:14:59] <_joe_> WTF? [09:15:34] <_joe_> ok sorry I was out and ran back home when I got paged [09:15:34] [08:59:50] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [09:15:39] <_joe_> yes seen now [09:15:52] _joe_: it is flapping [09:15:54] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [09:16:28] <_joe_> akosiaris: it seems stable now [09:17:07] _joe_: OOM showed up for the 3rd time on all imagescalers just a minute ago [09:17:13] killing convert [09:17:30] <_joe_> akosiaris: oom is quite regular there [09:17:44] <_joe_> as far as I remember from the last time I checked [09:18:40] <_joe_> so, the problem here could be some batch upload of a lot of images, which resulted in high load on the image scalers, and maybe us increasing the number of workers on imagescalers did more harm than good [09:19:18] <_joe_> http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Image%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false shows how imagescalers were basically completely down during that phase [09:20:45] <_joe_> http://gdash.wikimedia.org/dashboards/reqerror/ is interesting [09:21:00] I'm also trying to understand if swift and imagescaler stuff were related, imagescalers not responding seem a bit earlier [09:21:05] <_joe_> basically all our 500s are for images - I guess most are for the 
wrong arguments [09:21:25] <_joe_> godog: I think swift is a consequence of imagescalers maybe [09:21:55] not sure, there was also a puppet change for swift logging I pushed around that time [09:21:59] <_joe_> or, they may be unrelated and the swift issue may be related to the access log split? [09:22:05] probably [09:22:46] <_joe_> now, let's see if the next time I walk out of the door the alarm comes back [09:22:59] <_joe_> if that is the case, we probably found "root cause" [09:23:11] I can't understand though how the rsyslog change would do that [09:23:18] <_joe_> akosiaris: me neither [09:23:28] <_joe_> hence I guess this is related to the imagescalers [09:24:07] there was a problem with swift and syslog a while back, but while I checked it seemed fixed by the new swift version to which we upgraded [09:24:15] <_joe_> so, swift frontends do call the imagescalers, right? [09:24:33] <_joe_> or is that done by the backend? [09:24:44] that's correct, the frontends [09:24:44] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [09:24:49] <_joe_> ok so [09:25:01] <_joe_> if the frontends were busy polling the scalers [09:25:06] <_joe_> which were not responding [09:25:22] <_joe_> that seems like a classic domino effect [09:25:59] <_joe_> http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1002.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1410341146&g=network_report&z=large&c=Swift%20eqiad [09:26:14] <_joe_> NO network traffic between 8:57 and 9:07 [09:26:23] <_joe_> meaning the scalers didn't respond [09:26:26] <_joe_> basically [09:26:52] <_joe_> what we need now is to understand what made the imagescalers choke [09:27:13] <_joe_> godog: you wanted an incident report to write? 
you've been served :P [09:27:34] thanks I'm thrilled already [09:29:07] what I'm trying to understand is what could have killed the image scalers like that, basically not even gmond could report anything [09:29:39] <_joe_> some large image upload [09:30:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:30:04] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:30:22] <_joe_> or maybe a new "wikilinks is free" promotion [09:30:37] <_joe_> bbiab [09:52:01] mark: did you see https://rt.wikimedia.org/Ticket/Display.html?id=8316 ? [09:56:40] matanya: yes, he has replied already on peering@ [09:56:55] I don't have access there, thanks akosiaris [10:01:37] <_joe_> ugh, gerrit sucks [10:01:56] <_joe_> I have no way to rebase the current tree over the debian one [10:04:44] <_joe_> it's a damn git basic functionality [10:23:42] _joe_: you could force push... [10:24:02] not saying you *should*. idk [10:26:36] <_joe_> jeremyb: according to docs, gerrit should refuse it [10:30:16] depends on the repo and on who's doing the pushing [10:30:30] don't worry about that so much. 
worry more about whether you should do it :) [10:31:07] <_joe_> jeremyb: it would make sense, though I'd probably be better off just porting some of our changes to the debian package [11:05:36] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:45] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [11:10:50] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.012 second response time [11:11:09] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 44 data above and 0 below the confidence bounds [11:21:30] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [11:23:25] _joe_: if the porting to debian doesn't take care of everything: why not merge instead of rebase? [11:25:35] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [11:32:06] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [11:32:06] PROBLEM - HTTP error ratio anomaly detection on labmon1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [11:33:17] <_joe_> jzerebecki: it's equally painful [11:33:44] <_joe_> and I'd like to stop maintaining our version of hhvm packages [11:37:24] (03CR) 10Alexandros Kosiaris: "I'd like us to discuss this thoroughly. The reason is that this tool provides insight into the infrastructure that NDAed people could not " [puppet] - 10https://gerrit.wikimedia.org/r/159419 (owner: 10Dzahn) [11:38:46] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [11:47:38] any prod box, puppet --version. please [11:49:58] _joe_: godog: ping? 
[11:49:58] jeremyb: ping detected, please leave a message! [11:50:02] hah [11:50:53] actually, make it a precise box (12.04) please [11:55:13] or maybe andre__ could help? [11:56:27] jeremyb: 3.4.3 [11:56:33] with? [11:56:34] thanks [11:56:38] what do you want it for ? [11:56:39]