[00:01:41] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[00:02:33] (CR) Dzahn: [C: 2] "enables caching" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/144855 (owner: JanZerebecki)
[00:02:43] (CR) Dzahn: [V: 2] Add mtime argument to css link. [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/144855 (owner: JanZerebecki)
[00:05:46] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[00:11:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[00:13:36] (PS2) Tim Landscheidt: Fix paths in comments after modularization [operations/puppet] - https://gerrit.wikimedia.org/r/114736
[00:21:25] (Abandoned) Aaron Schulz: Give "mergehistory" to sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141892 (owner: Aaron Schulz)
[00:24:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[00:26:40] AaronSchulz: jfyi, tl_from_namespace and il_from_namespace are done. pl_from_namespace has just started
[00:32:06] springle: ??
[00:32:37] yes?
[00:34:59] \o/
[00:35:12] (PS3) Dzahn: Fix paths in comments after modularization [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[00:35:43] (CR) Dzahn: [C: 2] Fix paths in comments after modularization [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[00:38:48] (CR) Dzahn: "this just touches comments but rather important ones, thanks for the fix, merging now before it will needs rebasing again soon" [operations/puppet] - https://gerrit.wikimedia.org/r/114736 (owner: Tim Landscheidt)
[00:41:41] springle: can you explain what il_from_namespace is and/or documentation on that?
[00:42:13] Betacommand: https://gerrit.wikimedia.org/r/#/c/117373
[00:42:23] il = imagelink?
[00:42:43] http://www.mediawiki.org/wiki/Manual:Imagelinks_table
[00:42:48] https://bugzilla.wikimedia.org/show_bug.cgi?id=60618
[00:43:03] springle: ah thanks
[00:49:52] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 516 bytes in 0.012 second response time
[00:49:56] (CR) Dzahn: "ignore my last comment. can you add to the message what the actual diff is?" [operations/puppet] - https://gerrit.wikimedia.org/r/144427 (owner: Matanya)
[00:50:49] !log restarted gitblit service
[00:50:54] Logged the message, Master
[00:57:02] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 08 Jul 2014 22:56:40 UTC
[00:57:04] git.wikimedia.org is down. 503
[00:57:27] mutante just restarted it...
[00:57:33] gitblit sucks anyway
[00:57:38] give it a couple more seconds
[00:57:46] i did a "start", then a "restart"
[00:57:50] Restarted because it went down?
[00:57:52] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 53307 bytes in 0.363 second response time
[00:57:53] yes
[00:57:56] there
[00:59:54] <^d> We should just leave it down.
[00:59:59] <^d> Let it think about what it's done.
[01:00:02] <^d> (ie: nothing)
[01:00:27] lol
[01:05:11] (PS3) JanZerebecki: bugzilla: Enable strict transport security [operations/puppet] - https://gerrit.wikimedia.org/r/127256
[01:11:55] (CR) Dzahn: [C: 2] bugzilla: Enable strict transport security [operations/puppet] - https://gerrit.wikimedia.org/r/127256 (owner: JanZerebecki)
[01:12:56] (CR) Dzahn: "what, "needs verified"? it is" [operations/puppet] - https://gerrit.wikimedia.org/r/127256 (owner: JanZerebecki)
[01:13:44] (CR) Dzahn: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/127256 (owner: JanZerebecki)
[01:16:27] and this time it did more tests? wth
[01:16:54] verified +1 vs. verified +2 - jenkins?
[01:18:05] Hey Reedy, remember that "IP address is blocked" thing with Special:Contributions?
[01:18:20] Well it was fixed for that page, but not for Special:DeletedContributions
[01:18:41] The IP leakage thing that displays a "change block" for admins
[01:19:47] mutante: Yeah, the V+1 tests don't run code (just static analysis), whereas the V+2 ones can include code that might exploit vulnerabilities.
[01:20:33] James_F: but that behaviour is new? i never saw it do a +1 test and then several minutes later a +2 test
[01:20:53] mutante: Your C+2 gives it the authority to run the V+2 tests.
[01:21:07] mutante: If you'd uploaded or altered the patch it would also run the V+2 tests.
[01:21:30] mutante: But JanZerebecki isn't in the (eww) whitelist of people, so it only does the V+1 tests.
[01:21:41] mutante: usually someone who has the rights gives a C+1 so then jenkins may do the +2 tests
[01:21:53] i overlooked that, too
[01:21:57] you must be talking about mediawiki repo
[01:22:03] and this got activated here as well
[01:22:05] or something
[01:22:11] mutante: It may be new for operations/puppet, yeah.
[01:22:18] that would make sense then, yes
[01:22:22] mutante: This is how it's been for > a year in other places.
[01:22:54] yes, the whitelist thing.. got it! thanks
[01:22:58] whitelisted users
[01:23:46] the behaviour was just slightly different before, i think it did not do the +1 test either
[01:23:51] in this case
[01:24:30] it is not new in operations/puppet, it happened in that same patch in the first version also
[01:25:06] Deskana: Some users are still not appearing on loginwiki
[01:25:19] This is getting silly. 4 vandal accounts are not on there
[01:25:24] Recent ones
[01:25:35] What is making them not SUL to that special wiki?
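The Jenkins gating rule described between 01:19 and 01:22 (static V+1 tests for everyone, code-executing V+2 tests only once a trusted person is involved) could be sketched roughly as below. This is an illustration of the rule as stated in the chat, not the actual Zuul or Jenkins configuration. The names `Review` and `allowed_test_level`, the shape of the whitelist, and the assumption that Dzahn is on it are all hypothetical. The chat mentions both C+2 and C+1 as the vote that unlocks the full tests, so the threshold is left as a parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One Gerrit Code-Review vote (hypothetical helper type)."""
    reviewer: str
    code_review: int  # e.g. +1 or +2


def allowed_test_level(uploader, reviews, whitelist, min_vote=2):
    """Return "V+2" if the code-executing tests may run, else "V+1".

    Assumed rule, taken from the chat: the full tests run when the
    uploader is whitelisted, or when a whitelisted reviewer has voted
    at least `min_vote` on Code-Review. Otherwise only static checks run.
    """
    if uploader in whitelist:
        return "V+2"
    for review in reviews:
        if review.reviewer in whitelist and review.code_review >= min_vote:
            return "V+2"
    return "V+1"


# Assumption for the example: Dzahn is whitelisted, JanZerebecki is not.
whitelist = {"Dzahn"}
print(allowed_test_level("JanZerebecki", [], whitelist))
print(allowed_test_level("JanZerebecki", [Review("Dzahn", 2)], whitelist))
```

Under these assumptions the first call models the patch before any review (static tests only) and the second models it after the C+2 at 01:11:55.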
[01:26:18] !log Bugzilla - enabled https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security
[01:26:24] Logged the message, Master
[01:56:49] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jul 9 01:56:46 UTC 2014
[02:04:09] (PS2) Dzahn: bugzilla: vars in scope, no need for lookup [operations/puppet] - https://gerrit.wikimedia.org/r/144442 (owner: Matanya)
[02:15:41] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-09 02:14:38+00:00
[02:15:48] Logged the message, Master
[02:26:36] !log LocalisationUpdate completed (1.24wmf12) at 2014-07-09 02:25:33+00:00
[02:26:42] Logged the message, Master
[02:55:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 9 02:54:37 UTC 2014 (duration 54m 36s)
[02:55:48] Logged the message, Master
[03:21:45] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[03:36:45] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:37:02] !log ran puppet on neon - false puppet failure alarms
[03:37:06] Logged the message, Master
[04:48:12] (PS1) KartikMistry: Disable captcha for ca/eswikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876
[04:57:59] (PS2) Legoktm: Disable captcha for ca/eswikis on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876 (owner: KartikMistry)
[05:07:44] legoktm: thanks. otherwise it was confusing commit msg.
[05:08:02] yeah, I saw it scroll by and was a bit shocked :P
[05:08:36] :)
[06:17:50] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 04:16:48 UTC
[06:29:33] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:34] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:34] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:34] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:35] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:43] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:43] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:43] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:44] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:03] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:13] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:03] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jul 9 06:36:56 UTC 2014
[06:44:29] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:44:39] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:45:40] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:39] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:47:09] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:48:39] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:06:44] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:17:09] (PS10) Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - https://gerrit.wikimedia.org/r/143329
[07:19:03] (PS11) Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - https://gerrit.wikimedia.org/r/143329
[07:23:17] (PS2) Matanya: bugzilla: update cipher_suite [operations/puppet] - https://gerrit.wikimedia.org/r/144427
[07:27:14] <_joe_> matanya: hey
[07:27:26] hi _joe_
[07:27:38] <_joe_> did you see the shitload of deprecations for puppet 3 I posted for you? :)
[07:27:51] <_joe_> well, you just because you asked for it
[07:27:52] no, i haven't :)
[07:27:57] <_joe_> but it's up for everyone
[07:28:19] <_joe_> http://etherpad.wikimedia.org/p/Puppet3 I reused this
[07:28:28] i had crazy 24h
[07:28:41] <_joe_> hey no reason to justify :)
[07:28:51] actually week, but the last 24 are the worst
[07:29:01] <_joe_> :/ sorry to hear that
[07:37:07] good morning
[07:37:30] hashar: good morning!
[07:42:35] <_joe_> hashar: puppet scoping just bit me :(
[07:44:13] _joe_: told you it is confusing hehe
[07:44:35] <_joe_> mmmh no actually, I'm a moron.
[07:44:41] <_joe_> it's not scoping, I got that right
[07:45:05] kart_: Moritz Schubotz send a mail on wikitech to push Mathoid to production. That is a nodejs backend to render math. Would you mind having a look at the email and possibly reply to him with your cxserver experience ? :-]
[07:46:42] (PS12) Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - https://gerrit.wikimedia.org/r/143329
[07:48:39] hashar: sure!
[07:48:51] (PS1) Matanya: admin: fix var scoping [operations/puppet] - https://gerrit.wikimedia.org/r/144908
[07:48:53] kart_: awesome :-)
[07:48:53] hashar: meanwhile there is patch for you somewhere :)
[07:49:16] kart_: oh my gerrit dashboard is filled with reviews requests. Link?
[07:49:34] hashar: https://gerrit.wikimedia.org/r/#/c/144876/
[07:51:45] _joe_: https://gerrit.wikimedia.org/r/#/c/144908/ <-- jenkins please ?
[07:53:57] <_joe_> matanya: which nodes?
[07:54:14] all those with admin module
[07:54:19] need a list ?
[07:54:26] <_joe_> matanya: you need to provide me a list of suitable nodes if possible
[07:54:29] <_joe_> yes please
[07:54:55] few minutes, handling fundraising issues
[07:55:24] (CR) Hashar: [C: -1] "kart_ : just drop the statement entirely from -labs. That is the default in CommonSettings.php" (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876 (owner: KartikMistry)
[07:56:56] matanya: what do you need with jenkins?
[07:57:06] access :D
[07:57:10] to what?
[07:57:18] by default only members of the wmf LDAP group can play with jenkins
[07:57:24] puppet-compiler
[07:57:30] I know hashar
[07:57:30] ahhh yeah
[07:57:37] thanks anyways
[07:57:52] you can't trigger a run right?
[07:58:28] matanya: what is your labs account name?
[07:58:34] matanya
[07:58:48] that is convenient
[07:58:50] hashar: Since it is set true in CommonSettings.php, entire statement is not needed? or I should set it to true in -labs?
[07:59:02] yes, i try to kiss it :)
[08:00:01] matanya: added you , can you give it a look?
[08:00:14] matanya: hopefully you should be able to access the views / builds and trigger a run
[08:00:39] kart_: you can remove everything
[08:00:57] kart_: labs has all settings from CommonSettings.php (i.e. prod) then some are extended / overridden in CommonSettings-labs.php
[08:00:58] hashar: cool. thanks.
[08:01:20] kart_: I will have to figure out a way to trigger puppet compiler from a comment made in Gerrit
[08:01:43] something like: jenkins compile this on host1,host2
[08:01:49] (PS3) KartikMistry: Disable captcha on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876
[08:02:07] !log upgrade ms-be1005/1006/1007 (zone3) to swift icehouse
[08:02:12] Logged the message, Master
[08:02:17] (CR) Hashar: [C: 2] Disable captcha on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876 (owner: KartikMistry)
[08:02:23] (Merged) jenkins-bot: Disable captcha on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144876 (owner: KartikMistry)
[08:02:25] kart_: jenkins is going to deploy that :-)
[08:02:50] fast :)
[08:02:56] kart_: https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/591/console :D
[08:03:29] hashar: thanks a lot! it seems to be running :)
[08:03:37] matanya: congratulations
[08:04:09] hashar: you just saved 50% of ttm for me
[08:04:11] !log Jenkins: granted matanya the ability to manually trigger builds. Use case: the puppet compiler!
[08:04:15] Logged the message, Master
[08:04:17] ttm ?
[08:04:40] time to market, or in my case - time to merge
[08:11:38] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[08:12:48] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: manage the apache config via puppet [operations/puppet] - https://gerrit.wikimedia.org/r/143329 (owner: Giuseppe Lavagetto)
[08:13:38] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 06:12:55 UTC
[08:14:35] hashar: does puppet-compiler take grep-like syntax or must be explicit names ?
[08:15:22] matanya: for the nodes? That should be a comma separated list of fully qualified names
[08:15:30] like gallium.wikimedia.org,lanthanum.eqiad.wmnet
[08:15:42] ah, a lot of work then :)
[08:16:03] ok, thank. I thought my grep-fu was cool
[08:16:03] you probably don't need to compile on all similar nodes
[08:16:16] still 150 of them
[08:16:25] oh really
[08:16:39] I changed the admin module
[08:16:46] <_joe_> matanya: if you want to compile on all kind of nodes
[08:16:54] <_joe_> just leave the nodes field blank
[08:17:03] <_joe_> the software will select nodes for you
[08:17:17] that would be good. especially for changes in base
[08:17:23] <_joe_> (literally, it parses site.pp and matches that with the node list it had)
[08:17:32] oh man that is undocumented! :D
[08:17:48] <_joe_> hashar: well it's standard mode of operations
[08:18:07] <_joe_> hashar: http://git.wikimedia.org/blob/operations%2Fsoftware/3b610a877ebd451b7001aa9aa21778d1ae287afe/compare-puppet-catalogs%2Fpuppet_compare%2Fnodegen.py
[08:18:19] <_joe_> (then people ask why I love python)
[08:19:02] !log upgrade ms-be1009/1010/1011 (zone4) to swift icehouse
[08:19:07] Logged the message, Master
[08:19:15] I have updated the field description
[08:19:23] I love python
[08:20:37] what a joy, i can what i'm about to break ! :)
[08:20:44] *can see
[08:26:59] hashar , _joe_: I crashed jenkins host
[08:27:17] Request: POST http://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/125/logText/progressiveHtml, from 10.64.0.171 via cp1043 cp1043 ([10.64.0.171]:80), Varnish XID 512851126
[08:27:17] Forwarded for: 62.0.53.15, 10.64.0.171
[08:27:17] Error: 503, Service Unavailable at Wed, 09 Jul 2014 08:22:19 GMT
[08:27:29] matanya: yeah that happens from time to time
[08:27:40] matanya: just reload the console page
[08:27:51] ie https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/125/console
[08:28:06] yes, took several refresh tries
[08:28:10] matanya: the web interface polls the server over ajax to fetch console update or something like that
[08:28:14] and sometime the check fails
[08:28:36] <_joe_> matanya: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/125/console
[08:28:43] matanya: or hammer http://puppet-compiler.wmflabs.org/125/change/144908/html/ :D
[08:28:43] <_joe_> it is working
[08:28:49] <_joe_> always check console output :)
[08:29:07] <_joe_> matanya: when doing a full compile, use 4 threads, or even 6
[08:29:40] <_joe_> matanya: it will take time, on 2 threads it will be ~ 1 hour
[08:30:19] matanya: you should be able to abort the build. At the top right of the console output there is a progress bar with a red checkbox you can click to cancel it
[08:30:40] OK thanks. I saw that on our (day job) jenkins slaves in the past, bumping to 1.56x solved most of it
[08:30:52] <_joe_> also, check http://puppet-compiler.wmflabs.org/125/change/144908/compiled/puppet_catalogs_3_144908/mw1008.eqiad.wmnet.warnings for example, to see if the file you modified still gives any warning
[08:33:34] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jul 9 08:33:25 UTC 2014
[08:34:50] hashar: i'm way too familiar with jenkins :D
[08:40:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:42:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:44:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:44:45] <_joe_> I hate the f****** icinga-wm puppet spam
[08:45:12] * YuviPanda|zzz waves at godog
[08:45:19] YuviPanda|zzz: yo!
[08:45:23] godog: do you know of tungsten's hardware config? or where I can find that?
[08:45:24] (see what I did there?)
[08:45:42] * YuviPanda installs an app for godog
[08:46:00] YuviPanda: it is the same as the db boxes, 16 disks (IIRC) in raid10 hardware
[08:46:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:46:57] godog: ah, hmm. RAM?
[08:47:34] godog: need to pick one out of https://wikitech.wikimedia.org/wiki/Server_Spares for the labs graphite machine
[08:48:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:48:27] _joe_: are you aware naggen2 breaks puppet compile on neon ?
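The nodes field rule from the 08:15 to 08:17 exchange (a comma separated list of fully qualified names, no grep-like patterns, blank meaning "let the software pick") can be illustrated with a small validator. This is only a sketch of the rule as stated in the chat. `parse_nodes_field` is a hypothetical helper, not code from the puppet compiler, and the checks it applies (a dot in every name, no pattern characters) are assumptions about what "fully qualified" should mean here.

```python
def parse_nodes_field(field):
    """Parse the job's nodes field as described in the chat.

    Returns a list of host names, or None when the field is blank,
    which in the chat means the compiler selects nodes on its own.
    Raises ValueError for anything that looks like a pattern or a
    short host name, since explicit fully qualified names are expected.
    """
    names = [part.strip() for part in field.split(",") if part.strip()]
    if not names:
        return None  # blank field: automatic node selection
    for name in names:
        has_pattern_chars = any(ch in name for ch in "*?[]^$ ")
        if "." not in name or has_pattern_chars:
            raise ValueError("not a fully qualified host name: %r" % name)
    return names


print(parse_nodes_field("gallium.wikimedia.org,lanthanum.eqiad.wmnet"))
print(parse_nodes_field(""))
```

The first call uses the two hosts given as an example at 08:15:30; the second shows the blank case mentioned at 08:16:54.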
[08:48:31] (PS1) Giuseppe Lavagetto: mediawiki: manage with puppet on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/144917
[08:48:33] <_joe_> matanya: yes
[08:48:39] ok, thanks
[08:48:42] <_joe_> matanya: it's one of my todos
[08:48:59] your list is huge :)
[08:49:01] <_joe_> matanya: the diffs in nickel can't be triggered by your change
[08:49:06] <_joe_> matanya: and ever growing
[08:50:01] why is nickel not happy ?
[08:50:05] godog: I also emailed op@ about graphite on labs. do respond with thoughts, etc (and perhaps a summary of your experiments / situation on tungsten)
[08:50:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:51:18] YuviPanda: yeah saw that but didn't get to it yet :)) anyways 300gb ssd would probably be enough iops-wise but you'd have to juggle the 300gb perhaps (tungsten has 800gb used for example)
[08:51:30] !log Jenkins migrating jobs to use $ZUUL_URL instead of git://zuul.eqiad.wmnet Preparing to scale out Zuul merger to several nodes
[08:51:34] Logged the message, Master
[08:52:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:52:35] YuviPanda: or try the 4 3tb disks in raid10 and see if that's suitable, no idea how many metrics labs is/can be pushing
[08:53:47] !log upgrade ms-be1013/1014/1015 (zone5) to icehouse swift
[08:53:53] Logged the message, Master
[08:54:07] <_joe_> hashar: we can I think send up to two jobs to the puppet-compiler
[08:54:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:54:16] <_joe_> I don't remember how to change that setting
[08:55:13] and
[08:55:14] godog: sorry, got disconnected.
[08:55:16] I broke Jenkins \O/
[08:55:31] godog: hmm, I was thinking of RAIDing two 500GB spinning rusts
[08:56:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:57:06] YuviPanda: raid1 ? it really boils down on how many metrics you want to be able to take in, it is going to be disk io bound anyway
[08:57:18] hmm, right
[08:57:33] all of labs when it was sending metrics was at about 70k
[08:57:43] which is quite a bit below prod's 250k
[08:57:54] per minute, I am assuming
[08:57:59] yeah
[08:58:13] PROBLEM - Puppet freshness on mw1182 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 08:38:09 UTC
[08:58:43] RECOVERY - Puppet freshness on mw1182 is OK: puppet ran at Wed Jul 9 08:58:39 UTC 2014
[08:58:48] <_joe_> please reduce labs metrics to the relevant ones
[08:59:03] <_joe_> I don't see a problem in not collecting cpu stats on labs
[08:59:06] <_joe_> if not needed
[08:59:15] <_joe_> we're wasting resources imo
[08:59:30] yeah, that's one of the things I was considering doing. We need some CPU metrics, but not all
[08:59:38] one way is to submit a patch to diamond that lets us whitelist them
[09:00:17] <_joe_> YuviPanda: exactly
[09:00:52] for puppet collector, I ended up writing a minimal version, but doing that feels slightly ugly and also doesn't scale for more collectors.
[09:01:48] <_joe_> hashar: I'd need help with jenkins, when you're done fixing it :P
[09:02:01] YuviPanda: ye anyways I'll reply to your email too
[09:02:23] godog: cool, ty. Hardware discussion is on the RT ticket linked, so do respond there too if appropriate
[09:02:32] !log Jenkins killing slave process on lanthanum. Some job is stalled and unrecoverable.
[09:02:36] _joe_: sure
[09:02:36] Logged the message, Master
[09:09:39] _joe_: tin is also unhappy
[09:09:58] Error: Another local or imported resource exists with the type and title Ssh::Hostkey[db60.pmtpa.wmnet] on node tin.eqiad.wmnet
[09:11:07] <_joe_> matanya: yeah, that is the stupid puppet db
[09:11:18] ok
[09:11:44] ah I found jenkins issue
[09:11:47] <_joe_> when I refresh the facts list, I usually do wipe the db
[09:11:47] some cache got wiped out
[09:11:48] bah
[09:12:12] !log Jenkins being slow because the mediawiki-core* jobs history cache has been wiped out while updating their configuration. Jenkins is busy processing the history :(
[09:12:15] Logged the message, Master
[09:17:13] <_joe_> hashar: so, problem solved?
[09:20:55] more or less
[09:21:06] I think I will end up restarting jenkins entirely :(
[09:22:44] no!
[09:22:55] my ran is about to finish
[09:22:59] *run
[09:23:59] _joe_: shout your question. I will have to rush out soon
[09:24:04] might be able to multitask :-D
[09:30:36] !log restarted Zuul to clear out stalled items in queue
[09:30:41] Logged the message, Master
[09:32:11] _joe_: Jenkins has bring puppet-compiler02.eqiad.wmflabs offline because Disk space is too low. Only 0.743GB left on /tmp.
[09:32:41] there is a build going on. I guess it will complete https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/125/
[09:32:51] but then the node will be offline for real until the /tmp is cleaned up
[09:34:30] <_joe_> hashar: ok, damn
[09:34:34] <_joe_> I'll take a look
[09:35:22] (CR) Filippo Giunchedi: [C: 1] "looks good! just one ignorable comment/nitpick (starting with ~)" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/144612 (owner: Ori.livneh)
[09:38:16] <_joe_> hashar fixing that
[09:38:34] <_joe_> sorry I was deep in hhvm
[09:39:08] !log Jenkins had a bit of failure earlier due to the massive configuration update of mediawiki-core and mwext jobs. If that fails again the best thing is to stop Jenkins on gallium , wait for it to be killed or force kill -9 the java process then restart Jenkins. Should sort it out
[09:39:13] Logged the message, Master
[09:39:28] _joe_: I have to move now sorry :-/ will be back around 2pm
[09:39:32] hashar: 95%! :(
[09:39:45] <_joe_> hashar: np
[09:40:16] to ops : if Jenkins fails just /etc/init.d/jenkins stop . Wait for it to die or eventually kill -9 the java process. Then /etc/init.d/jenkins start . That sort things out 99% of the time
[09:40:40] matanya: :-(
[09:40:59] matanya: seems the script should not rely on /tmp or /var at all but use some extended disk space :D
[09:41:15] matanya: at least the run completed [ 07/09/2014 09:37:35 ] INFO: Run completed, you can see detailed results for your work at http://puppet-compiler.wmflabs.org//125/change/144908/html
[09:41:17] yes, the ssd mount ...
[09:41:19] matanya: then the node went offline
[09:41:39] oh, good :)
[09:41:43] thanks, and bye
[09:41:43] matanya: oh no it is running on labs instance so there is no ssd there . But we can get some LVM based disk space mounted at /srv/
[09:41:51] there is a role class for it. Something like labs::mnt::srv
[09:41:55] <_joe_> matanya: you don't need to run again btw
[09:41:58] can't remember the exact name
[09:42:10] anyway : http://puppet-compiler.wmflabs.org/125/change/144908/html/ !
[09:42:12] ok. _joe_ merge ?
[09:42:14] I am out of here
[09:47:37] (PS1) Matanya: bacula: fix var scoping [operations/puppet] - https://gerrit.wikimedia.org/r/144927
[10:16:00] <_joe_> matanya_: sorry I'm terribly busy at the moment.
[10:16:45] _joe_: scoping question: in modules/base/templates/resolv.conf.erb domain_search, nameservers_prefix and nameserver are facts/top level vars or something else ?
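The Jenkins recovery steps given at 09:39:08 and 09:40:16 (stop the service, wait for the process to exit, force kill it if it does not, then start the service) could be scripted along these lines. This is a sketch under stated assumptions, not a tool that exists on gallium. `restart_jenkins` is a hypothetical name, and the command runner, liveness check, and sleep function are passed in so the sequence can be exercised without touching a real host. The `pkill -9 -u jenkins java` step assumes the Java process runs as a user named `jenkins`, which the chat does not state.

```python
def restart_jenkins(run, is_running, sleep, timeout=60, poll=5):
    """Stop Jenkins, wait for it to die, kill -9 if needed, then start it.

    run(cmd)      -- executes a command given as a list of strings
    is_running()  -- returns True while the Jenkins java process is alive
    sleep(secs)   -- waits between liveness checks
    """
    run(["/etc/init.d/jenkins", "stop"])
    waited = 0
    # Give the process up to `timeout` seconds to exit on its own.
    while is_running() and waited < timeout:
        sleep(poll)
        waited += poll
    if is_running():
        # Assumption: the process is owned by a user called "jenkins".
        run(["pkill", "-9", "-u", "jenkins", "java"])
    run(["/etc/init.d/jenkins", "start"])


# Dry run that only prints the commands and pretends the stop succeeded.
restart_jenkins(
    run=lambda cmd: print(" ".join(cmd)),
    is_running=lambda: False,
    sleep=lambda secs: None,
)
```

On a real host the three callables might be `subprocess.call`, a check such as `lambda: subprocess.call(["pgrep", "-u", "jenkins", "java"]) == 0`, and `time.sleep`, again assuming the `jenkins` user name.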
[10:17:06] (PS1) Filippo Giunchedi: swift: add swift-dispersion-report and stats [operations/puppet] - https://gerrit.wikimedia.org/r/144932
[10:17:16] sure, when you have time
[10:21:19] (CR) Filippo Giunchedi: [C: -1] "for restarts affecting many machines and possibily user-facing I think we'd be better controlling the restarts via apache-graceful-all in " [operations/puppet] - https://gerrit.wikimedia.org/r/144917 (owner: Giuseppe Lavagetto)
[10:30:20] (PS1) KartikMistry: Disable addurl captcha trigger for es/ca wikis on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144933
[10:37:22] (PS1) Mark Bergsma: Move ulsfo traffic to eqiad for pending ulsfo move [operations/dns] - https://gerrit.wikimedia.org/r/144934
[10:40:05] (CR) Mark Bergsma: [C: 2] Move ulsfo traffic to eqiad for pending ulsfo move [operations/dns] - https://gerrit.wikimedia.org/r/144934 (owner: Mark Bergsma)
[10:59:14] <_joe_> lunch, bbiab
[11:15:11] (PS1) Phuedx: Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144938
[11:15:42] (CR) Phuedx: [C: -1] "-1 until the Growth team are ready to ploy." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144938 (owner: Phuedx)
[11:15:54] (CR) Nemo bis: "I recommend to fix the grammar first: https://gerrit.wikimedia.org/r/135533" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144938 (owner: Phuedx)
[12:35:57] !log enabled amssq47 text frontend cache in pybal for esams
[12:36:03] Logged the message, Master
[12:37:04] re
[12:52:17] RECOVERY - Disk space on ms-be1005 is OK: DISK OK
[12:52:44] !log umounted sdg1 on ms-be1005, device disappeared, errors in dmesg
[12:52:49] Logged the message, Master
[13:00:04] K4-713: The time is nigh to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T1300)
[13:09:39] mhrm /dev/sdo appeared on ms-be1005 Jul 8 21:02:45 ms-be1005 kernel: [2883228.905821] sd 0:2:6:0: [sdo] 3905945600 512-byte logical blocks: (1.99 TB/1.81 TiB)
[13:10:38] I think the perc got confused, I'm going to reboot ms-be1005 in a few if there are no ideas
[13:11:35] is there some mechanism for persistently mapping device UUIDs to /dev/sdX? if not maybe switch the mounts to UUIDs
[13:12:34] (if it really seems to be the case that /dev/sdg became /dev/sdo out of nowhere)
[13:13:06] sdg got replaced yesterday, I haven't seen this happening elsewhere after a replacement though
[13:13:10] the sdX mappings are generally stable to logical/lun ids
[13:13:25] godog: did you create a new LV instead of reusing the old?
[13:13:30] lemme have a quick look
[13:14:19] pci-0000:03:00.0-scsi-0:2:6:0 -> ../../sdo
[13:14:21] hrm, weird
[13:14:49] paravoid: I didn't intervene on the replacement, cmjohnson1 did take care of it though
[13:15:19] paravoid: unrelated, just saw your dotScale talk. nicely done :) I had kinda forgotten that we have only 17(!?!?) in ops.
[13:15:20] yeah, LD 6 wasn't reused
[13:16:00] paravoid: i swapped the disk...cleared the cache and added back
[13:16:13] oh right
[13:16:15] it appeared to have used LD 6 at least that is what the msg stated
[13:16:23] YuviPanda: we have volunteer input on top of that, too :)
[13:16:26] it's LD 6, it's just not named properly
[13:16:34] bblack: indeed, but still :)
[13:16:39] but that shouldn't matter
[13:16:53] bblack: there's a good amount of work from platform as well, I suppose (ori, bd808, etc)
[13:17:00] megacli -LDinfo -Lall -aALL is what I'm looking at fwiw
[13:17:01] yup
[13:17:12] YuviPanda: are you being humble and not including yourself? ;)
[13:17:41] (you should :)
[13:17:46] (include yourself)
[13:17:57] godog: yeah I think reboot is the appropriate response here
[13:18:01] paravoid: (you should be doing pushups instead of IRC!)
[13:18:11] not yet, that's friday onwards :)
[13:18:20] paravoid: :D not yet, though. Although of late the additional engineers in Android have let me do more work here.
[13:18:31] bblack: pushing keys makes you spend energy too!
[13:18:36] paravoid: sigh, okay!
[13:21:00] cmjohnson1: your actions were correct, as far as I can tell :)
[13:21:15] it's just LSI being LSI I guess
[13:21:42] godog: btw, the catch that cmjohnson briefly mentioned is
[13:21:44] okay..I did the same for ms-be1007 and it came back fine...if I see this in the future what should I do?
[13:22:10] when we lose a disk, the controller keeps the write buffers in the BBU expecting to see that disk again
[13:22:31] and doesn't allow you to reuse that particular logical id
[13:22:40] so the trick there is to discard that cache
[13:22:45] that's
[13:22:46] megacli -GetPreservedCacheList -a0
[13:22:54] megacli -DiscardPreservedCache -LXXX -a0
[13:23:10] and then
[13:23:11] megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
[13:23:17] to create the LD
[13:24:20] lunch, brb
[13:24:30] hah, that's what might have prevented ms-be1007 to come back up (?)
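The three megacli steps quoted between 13:22:46 and 13:23:11 can be collected into one ordered list, which makes it harder to skip the cache discard before recreating the logical drive. The sketch below only builds the command lines and runs nothing, since discarding preserved cache is destructive. `preserved_cache_recovery_commands` is a hypothetical helper. The flags are copied from the chat as written, and the logical drive number replaces the `-LXXX` placeholder, to be read from the output of the first command.

```python
def preserved_cache_recovery_commands(logical_drive, adapter=0):
    """Return the megacli commands from the chat, in the order given there.

    logical_drive -- number of the LD whose preserved cache is discarded
    adapter       -- controller index (the chat uses -a0)
    """
    if logical_drive < 0 or adapter < 0:
        raise ValueError("logical_drive and adapter must be non-negative")
    adapter_flag = "-a%d" % adapter
    return [
        # 1. List logical drives that still have preserved (pinned) cache.
        ["megacli", "-GetPreservedCacheList", adapter_flag],
        # 2. Discard the preserved cache for the affected logical drive.
        ["megacli", "-DiscardPreservedCache", "-L%d" % logical_drive, adapter_flag],
        # 3. Recreate single-disk RAID0 logical drives, as quoted in the chat.
        ["megacli", "-CfgEachDskRaid0", "WB", "RA", "Direct", "CachedBadBBU", adapter_flag],
    ]


for command in preserved_cache_recovery_commands(logical_drive=6):
    print(" ".join(command))
```

The value 6 matches the "LD 6" discussed for ms-be1005 and is used here purely as an example.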
[13:24:49] had to fight a bit with raid bios to discard the cache and so on [13:24:52] godog: no, when I replaced the disk the 2nd time there wasn't anything cached [13:25:03] maybe you cleared it b4 [13:26:27] cmjohnson1: yeah likely, it refuses to boot otherwise (it was shut down ms-be1007) [13:30:19] !log reboot ms-be1005, raid controller confused (?) after disk replacement [13:30:23] Logged the message, Master [13:32:45] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:05] ACKNOWLEDGEMENT - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% Filippo Giunchedi expected reboot [13:37:55] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [13:40:06] PROBLEM - check if dhclient is running on ms-be1005 is CRITICAL: Connection refused by host [13:40:06] PROBLEM - puppet disabled on ms-be1005 is CRITICAL: Connection refused by host [13:40:06] PROBLEM - check configured eth on ms-be1005 is CRITICAL: Connection refused by host [13:40:16] PROBLEM - swift-object-server on ms-be1005 is CRITICAL: Connection refused by host [13:40:16] PROBLEM - DPKG on ms-be1005 is CRITICAL: Connection refused by host [13:40:16] PROBLEM - swift-account-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:40:16] PROBLEM - Disk space on ms-be1005 is CRITICAL: Connection refused by host [13:40:26] PROBLEM - swift-object-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:40:37] PROBLEM - swift-container-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:40:37] PROBLEM - swift-account-reaper on ms-be1005 is CRITICAL: Connection refused by host [13:40:37] PROBLEM - swift-account-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:40:37] PROBLEM - swift-account-server on ms-be1005 is CRITICAL: Connection refused by host [13:40:46] PROBLEM - swift-container-updater on ms-be1005 is CRITICAL: Connection refused by host [13:40:46] PROBLEM - swift-container-server on ms-be1005 is 
CRITICAL: Connection refused by host [13:40:46] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [13:40:46] PROBLEM - RAID on ms-be1005 is CRITICAL: Connection refused by host [13:40:47] PROBLEM - swift-container-replicator on ms-be1005 is CRITICAL: Connection refused by host [13:40:47] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: Connection refused by host [13:40:47] PROBLEM - swift-object-updater on ms-be1005 is CRITICAL: Connection refused by host [13:41:15] icinga-wm: <3 [13:42:41] somebody already on ms-be1005 console by any chance? [13:44:13] nope [13:45:21] godog if it's still locked do racadm racreset [13:46:40] cmjohnson1: mh, after the password I get "No more sessions are available for this type of connection!" [13:47:32] (03CR) 10Hashar: Disable addurl captcha trigger for es/ca wikis on beta labs (032 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144933 (owner: 10KartikMistry) [13:48:49] godog: that is weird ...you rebooted so that would've kicked everyone [13:49:37] wait no it wouldn't....let's shut down and and I will pull power [13:50:48] cmjohnson1: ack! [13:53:20] godog: no ssh ....i am going to do it locally via crash cart [13:53:41] cmjohnson1: sigh.. thanks! [13:58:36] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:44] I went ahead and downtimed all the ulsfo stuff in icinga and updated https://office.wikimedia.org/wiki/Operations/ULSFO_Floor_Migration for what prep-work is complete [14:00:11] I think we're down to physical datacenter stuff only at this point, aside from that "update + reboot all machines" step at the end of the prep list [14:00:27] godog: you can access now... I booted it and it's at this point [14:00:30] /dev/md0: clean, 142787/3661824 files, 3423220/14639840 blocks [14:00:31] The disk drive for /srv/swift-storage/sdg1 is not ready yet or not present. 
[14:00:31] Continue to wait, or Press S to skip mounting or M for manual recovery [14:00:39] jgage: ^ [14:00:41] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 11:59:45 UTC [14:03:41] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:03:44] cmjohnson1: thanks! yeah having swift disks in "nobootwait" is on the TODO :| [14:04:21] RECOVERY - swift-account-auditor on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:04:21] RECOVERY - Disk space on ms-be1005 is OK: DISK OK [14:04:41] RECOVERY - swift-account-replicator on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:04:41] RECOVERY - swift-account-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:04:42] RECOVERY - swift-account-reaper on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:04:42] RECOVERY - swift-container-updater on ms-be1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:04:42] RECOVERY - swift-container-server on ms-be1005 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:05:35] (03CR) 10Andrew Bogott: [C: 031] swift: add swift-dispersion-report and stats [operations/puppet] - 10https://gerrit.wikimedia.org/r/144932 (owner: 10Filippo Giunchedi) [14:05:51] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:09:12] time to go home early, celebrate birthday. See you folks! 
[14:09:13] (03CR) 10ArielGlenn: "I'm pretty sure that all require statements will be applied and so if multiple groups are specified, only users that are members of all th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140881 (owner: 10Hoo man) [14:09:29] cmjohnson1: ye I think we're good, sdg is filling up again \o/ [14:09:31] apergos: What [14:09:39] That would be against apache docs [14:10:02] usually multiple require act like they are in a require any block [14:10:51] I didn't see anything useful in the docs for these directives, and when I went a-googling, folks complained that multiple directives did not do what they intended [14:12:04] mh, maybe memory fools me here [14:12:11] let me look it up quickly [14:12:23] sure [14:12:59] !log Jenkins: bringing back puppet-compiler02.eqiad.wmflabs node online. /tmp get filled when running huge catalog compilations which causes Jenkins to unpool the node :/ [14:13:02] apergos: https://httpd.apache.org/docs/current/mod/mod_authz_core.html#require [14:13:03] Logged the message, Master [14:13:15] When multiple Require directives are used in a single configuration section and are not contained in another authorization directive like <RequireAll>, they are implicitly contained within a <RequireAny> directive. Thus the first one to authorize a user authorizes the entire request, and subsequent Require directives are ignored. [14:15:18] ah ha [14:15:30] hoo: you actually want https://httpd.apache.org/docs/2.2/mod/core.html#require and not 2.4 version but the result is the same [14:15:31] I was looking for that and not finding it or anything like it [14:15:42] the default case I mean [14:15:44] godog: great news! so it seems to me that if this happens again the correct response is to reboot...assuming we do all the same things. 
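The default behavior quoted from the Apache documentation can be made explicit in configuration. The fragment below is a minimal sketch, assuming Apache 2.4 with mod_authz_core; the group names are hypothetical and are not taken from the change under review. On Apache 2.2, which the channel notes is the relevant version here, the `<RequireAny>` container is not available.

```apache
# Sketch only: hypothetical group names, Apache 2.4 syntax assumed.
# A request is authorized if ANY one of the enclosed Require lines matches,
# which is the same result as listing the Require lines without a container.
<RequireAny>
    Require group example-group-a
    Require group example-group-b
</RequireAny>
```

Writing the container out does not change behavior; it only documents the "any of these" intent for the next reader.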
[14:16:19] akosiaris: apergos I guess we can explicitly wrap it in an requireAny, if you prefer [14:16:52] it would at least make it obvious to the reader [14:17:06] sounds like a good idea [14:17:23] ok, will do that sometime [14:17:30] <_joe_> apergos: hoo is right, I did that kind of Require magic before [14:17:54] well lemme retract my comment then [14:18:23] <_joe_> apergos: I concur explicitly using <RequireAny> will make it obvious [14:18:33] <_joe_> and will avoid future confusion [14:19:14] (03CR) 10ArielGlenn: "And 'pretty sure' turns out to be wrong, the default is to act as 'RequireAny', as hoo pointed out, https://httpd.apache.org/docs/2.2/mod/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140881 (owner: 10Hoo man) [14:19:33] commented [14:19:39] thanks [14:20:41] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jul 9 14:20:31 UTC 2014 [14:21:46] (03PS2) 10Ottomata: Use hive serde jar from site's hive setup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144845 (owner: 10QChris) [14:21:51] (03PS2) 10Milimetric: Add CORS support to public files [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/144761 [14:21:54] (03CR) 10Ottomata: [C: 032 V: 032] Use hive serde jar from site's hive setup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144845 (owner: 10QChris) [14:29:29] cmjohnson1 paravoid I haven't found anything re: clearing the cache, added a section to https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings but please double check! [14:33:40] hoo: Please prepare the patches to update the submodule in mediawiki/core for your SWAT patches. [14:34:53] anomie: I have a special request... we got an intern over here and he would like to see the deployment process [14:35:03] can I maybe deploy the thing myself at 17:45 to show him? 
[14:35:27] (03PS1) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [14:35:32] hoo: 17:45 in which timezone? [14:35:42] anomie: UTC+2... sorry [14:35:54] (03CR) 10Hashar: "Puppet compilation against gallium.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/144692 (owner: 10Hashar) [14:36:06] ... in an hour and 10 [14:36:13] hoo: So 15:45 UTC. That's fine with me. [14:37:10] actually it's 16:45 [14:37:17] but I can do a little earlier [14:37:28] oh no, 15:$5 [14:37:42] ori: mwprof is still allocating memory like crazy on tungsten and getting periodically oom-killed, ideas on what we could do on mwprof itself? (I'd ulimit -v it at least for now) [14:37:43] wrong line :P [14:38:46] godog: i'll try to fix it today or tomorrow [14:39:15] hashar: thanks for fixing jenkins [14:39:27] ori: it got really bad ~7d ago though https://graphite.wikimedia.org/render/?title=Memory&from=-15days&width=1024&height=500&until=now&areaMode=none&hideLegend=&target=alias(servers.tungsten.memory.Buffers.value,%22buffers%22)&target=alias(servers.tungsten.memory.Active.value,%22active%22)&target=alias(servers.tungsten.memory.Cached.value,%22cached%22)&target=alias(servers.tungsten.memory.Inactive.value,%22inactive%22)&target=alias(serve [14:39:29] hoo: You could probably go even earlier, unless we get a bunch of last-minute SWAT patches. I'll ping you when I'm done SWATting and you can have the rest of the window. [14:39:34] matanya: :-) [14:39:35] manybubbles: I'll SWAT today [14:39:38] * godog slow claps graphite URLs [14:39:52] godog: no idea what changed then [14:39:57] anomie: I can't go earlier as the intern is away for now :P [14:39:58] godog: definitely worth investigating [14:40:34] ori: ack! let me know if I can be of help [14:41:06] godog: nod, let me know if you figure out what happened seven days ago! 
:) [14:43:37] haha yep, nothing that some rhabdomancy can't fix [14:45:02] anomie: cool [14:45:04] thanks [14:45:49] godog: are you done with swift games? [14:46:32] matanya: to your delight, no, not yet :) [14:47:22] ok. i'll wait with my rebase. want to give me birthday present? [14:47:37] a review or two? [14:47:58] haha sure, is it today? [14:48:03] yes [14:49:04] haha sure matanya [14:49:21] https://gerrit.wikimedia.org/r/144442 [14:49:44] https://gerrit.wikimedia.org/r/144908 [14:50:04] https://gerrit.wikimedia.org/r/144927 [14:50:22] James_F: Ping, SWAT in 10 minutes. [14:50:46] https://gerrit.wikimedia.org/r/144033 [14:51:08] this should be enough for now. thanks! [14:52:09] matanya: haha okay, btw if you want my attention but me in the reviewers, I tend to pay attention to gerrit emails addressed to me [14:52:20] godog: the scoping changes ran though jenkins puppet-complier [14:52:22] s/but/put/ [14:52:27] i will [14:53:15] <_joe_> matanya: also add the reference to the compiler in as a comment [14:53:21] <_joe_> that will speed up reviews [14:58:40] good idea [14:59:15] (03PS1) 10Andrew Bogott: Be a little careful about which project volumes we archive. [operations/puppet] - 10https://gerrit.wikimedia.org/r/144985 [14:59:29] anomie: we're going to do our own swat later (if greg-g is ok with it) [14:59:43] aude: Already said that ;) [14:59:45] ok [14:59:57] * hoo is also going to deploy global rename after (as a volunteer) [14:59:58] hoo: did you update the deployments page? [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T1500) [15:00:15] aude: I added the change and made it my item inside the swat [15:00:19] ok [15:00:38] hoo: so showtime is now? [15:00:43] * anomie starts SWAT and immediately waits on James_F to respond [15:01:20] matanya: For global rename? 
In an hour [15:01:50] i'm looking forward to this pain [15:02:10] heh [15:02:26] _joe_: would you be free tomorrow morning for another puppet review sprint ? :D [15:02:33] (03CR) 10Andrew Bogott: [C: 032] Be a little careful about which project volumes we archive. [operations/puppet] - 10https://gerrit.wikimedia.org/r/144985 (owner: 10Andrew Bogott) [15:02:37] <_joe_> hashar: nope sorry [15:02:48] <_joe_> hashar: unless you can convince everyone that hhvm can wait [15:02:58] <_joe_> then I'll be glad to help [15:02:58] yeah understandable :-D [15:03:02] we should delegate this to users when there are no naming conflicts [15:03:19] will finish that after my vacations hehe [15:03:30] ACKNOWLEDGEMENT - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi failure due to broken/unmountable disk [15:05:34] time for groceries ! [15:14:21] (03PS2) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [15:16:18] ottomata: do you want analytics-dell.cfg for all? [15:16:44] (03CR) 10Giuseppe Lavagetto: "This is still a little bit a WiP, in particular the init part as it seems hhvm does not handle different signals for different uses, like " [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [15:16:57] James_F: Last call for SWAT. [15:18:17] (03CR) 10Filippo Giunchedi: [C: 031] bugzilla: vars in scope, no need for lookup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144442 (owner: 10Matanya) [15:20:26] Ok, I'm done waiting. hoo, it's all yours now. [15:21:05] anomie: Ok, I'm still waiting for our build to finish (jenkins...) 
and the intern [15:21:08] but I'll stay in time [15:21:22] * anomie leaves to go work on SecurePoll [15:22:46] (03CR) 10Filippo Giunchedi: [C: 031] bacula: fix var scoping [operations/puppet] - 10https://gerrit.wikimedia.org/r/144927 (owner: 10Matanya) [15:30:32] (03Abandoned) 10KartikMistry: Disable addurl captcha trigger for es/ca wikis on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/144933 (owner: 10KartikMistry) [15:31:01] (03CR) 10Filippo Giunchedi: platform: simplify hardware specific configuration (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144033 (owner: 10Matanya) [15:32:40] (03PS1) 10Cmjohnson: adding analytics1028-1041 to dhcpd and netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/144992 [15:33:37] (03PS2) 10Giuseppe Lavagetto: mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 [15:34:16] (03CR) 10Ori.livneh: "Do we need the init script at all? I thought that was transitional" (032 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [15:35:09] (03CR) 10Ori.livneh: [C: 031] mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 (owner: 10Giuseppe Lavagetto) [15:36:03] (03PS3) 10Giuseppe Lavagetto: mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 [15:36:28] (03PS4) 10Hashar: zuul: migrate settings to role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [15:36:37] (03CR) 10Giuseppe Lavagetto: "We need to add it to the package, as it belongs here and not in puppet :)" [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [15:38:41] (03CR) 10Giuseppe Lavagetto: Add init and upstart scripts (032 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) 
[15:39:00] (03PS1) 10JanZerebecki: Give jzerebecki access to analytics data [operations/puppet] - 10https://gerrit.wikimedia.org/r/144994 [15:40:23] (03CR) 10Ori.livneh: Add init and upstart scripts (031 comment) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [15:41:23] (03PS1) 10Hashar: zuul: remove $zuul_url from zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144997 [15:43:10] (03PS3) 10Giuseppe Lavagetto: Add init and upstart scripts [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 [15:44:34] (03PS3) 10Andrew Bogott: Tools: Remove unused syslog role [operations/puppet] - 10https://gerrit.wikimedia.org/r/120347 (owner: 10Tim Landscheidt) [15:47:46] (03PS5) 10Hashar: zuul: migrate settings to role::zuul::configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 [15:47:48] (03PS2) 10Hashar: zuul: remove $zuul_url from zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/144997 [15:48:18] (03CR) 10Andrew Bogott: [C: 032] Tools: Remove unused syslog role [operations/puppet] - 10https://gerrit.wikimedia.org/r/120347 (owner: 10Tim Landscheidt) [15:49:09] (03CR) 10Hashar: zuul: migrate settings to role::zuul::configuration (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144709 (owner: 10Hashar) [15:50:14] (03CR) 10Filippo Giunchedi: [C: 031] mediawiki: manage with puppet on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/144917 (owner: 10Giuseppe Lavagetto) [15:59:57] csteipp: I'm going to run a little overtime with my Wikidata fix right now sorry [16:00:04] csteipp, legoktm, hoo: The time is nigh to deploy CentralAuth (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T1600) [16:00:04] will need 5 more minutes max. 
[16:00:05] hoo: how dare you [16:00:10] :D [16:02:58] !log hoo Synchronized php-1.24wmf12/extensions/Wikidata/: Update Wikibase to fix a fatal and various JS things (duration: 00m 14s) [16:03:08] Logged the message, Master [16:04:33] (03CR) 10Filippo Giunchedi: Add init and upstart scripts (033 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [16:04:38] * hoo is done [16:04:41] stuff verified [16:09:36] !log Shutdown WMF HQ BGP sessions on cr1-ulsfo [16:09:42] Logged the message, Master [16:10:27] !log Shutdown WMF HQ BGP sessions on cr2-ulsfo [16:10:32] Logged the message, Master [16:10:48] !log Shutdown IXP BGP sessions on cr2-ulsfo [16:10:53] Logged the message, Master [16:13:38] !log Shutdown TiNet BGP sessions on cr1-ulsfo [16:13:43] Logged the message, Master [16:14:44] (03CR) 10Ori.livneh: "yes, we need the upstart script. but do we need the init script, though?" [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/144981 (owner: 10Giuseppe Lavagetto) [16:15:07] hoo: hi [16:15:18] hey :) [16:15:56] hoo: I'm in the office right now, but csteipp isn't here :P [16:16:06] legoktm: He'll deploy from home [16:16:08] legoktm: I'm at home [16:16:25] I'm also around to test etc, [16:16:28] Didn't realize the deployment started at 9, when I'm normally on the train :) [16:16:30] just created a dummy user to rename [16:16:58] !log Shutdown NTT BGP sessions on cr2-ulsfo [16:16:58] after that we'll do one real example [16:17:03] Logged the message, Master [16:17:04] !log ulsfo is now offline [16:17:08] Is there a force option for git submodule update? 
[16:17:09] Logged the message, Master [16:17:11] and if that all works, I'll probably give the steward a thumbs up [16:17:21] csteipp: no [16:17:36] you can manually git checkout inside the repos, though [16:17:50] and then verify with git status that you're on the right commit [16:17:52] that should work [16:21:07] going to move to another room [16:21:10] be back in aminute [16:22:24] !log csteipp Started scap: Update CentralAuth for Global Rename [16:22:29] Logged the message, Master [16:30:48] * hoo is back [16:37:53] csteipp: https://meta.wikimedia.org/wiki/Special:CentralAuth/Hoo%27sRenameTest0 this is going to be my initial test case [16:38:12] ""? [16:38:30] legoktm: wait for scap to finish [16:38:36] ok :) [16:38:37] it's a new message [16:38:41] oh right [16:39:14] Yeah, that threw me too [16:39:46] Wow scap is slow again.. [16:40:01] 17 minutes [16:40:10] Probably another 5-8 minutes for average [16:41:06] Yeah, I probably shouldn't complain since it used to be 50+... I though it was running in like 10 minutes fairly recently [16:41:47] (03CR) 1020after4: "I'm trying to rework this package with dh_phppear (pkg-tools-php) and following the guidelines at http://pkg-php.alioth.debian.org/ but it" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [16:49:06] * matanya holds fingers for this change [16:51:11] !log csteipp Finished scap: Update CentralAuth for Global Rename (duration: 28m 46s) [16:51:15] Logged the message, Master [16:51:31] hoo / legoktm ^ ready for testing [16:51:36] \o/ [16:51:41] yay! [16:51:56] I'll let hoo test since I don't have an account that has the proper permissions [16:52:34] I'll do it in a moment [16:52:40] me too please :P [16:57:52] (03CR) 10Dzahn: [C: 032] bacula: fix var scoping [operations/puppet] - 10https://gerrit.wikimedia.org/r/144927 (owner: 10Matanya) [16:58:24] thanks mutante [16:58:38] csteipp: First test worked [16:58:45] wuhoo! 
[16:59:09] (03CR) 10Ottomata: [C: 04-1] "Still waiting for netboot.cfg too?" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144992 (owner: 10Cmjohnson) [17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T1700) [17:00:18] <^d> I should write a patch for that. [17:00:29] deplyoing search? [17:01:11] (03PS1) 10Chad: eswiki getting Cirrus as primary search engine [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145015 [17:01:22] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [17:02:01] ^d: ready! [17:02:28] * ^d waiting on jenkins [17:03:41] yay! [17:08:14] (03CR) 10Dzahn: [C: 032] bugzilla: vars in scope, no need for lookup [operations/puppet] - 10https://gerrit.wikimedia.org/r/144442 (owner: 10Matanya) [17:10:29] (03PS6) 10Tim Landscheidt: Tools: Unify Tools and Toolsbeta configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/102385 [17:14:28] (03PS1) 10Reedy: Swap from AdminSettings to PrivateSettings for snapshots/dumps [operations/puppet] - 10https://gerrit.wikimedia.org/r/145017 [17:14:46] (03CR) 10Reedy: [C: 04-1] Swap from AdminSettings to PrivateSettings for snapshots/dumps [operations/puppet] - 10https://gerrit.wikimedia.org/r/145017 (owner: 10Reedy) [17:14:55] (03CR) 10Reedy: "To go with https://wikitech.wikimedia.org/wiki/Incident_documentation/20140328-DB-Queries" [operations/puppet] - 10https://gerrit.wikimedia.org/r/145017 (owner: 10Reedy) [17:15:30] (03CR) 10Chad: [C: 032 V: 032] "No jenkins?" 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145015 (owner: 10Chad) [17:17:24] !log demon Synchronized wmf-config/InitialiseSettings.php: eswiki cirrus (duration: 00m 04s) [17:17:29] Logged the message, Master [17:18:02] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [17:18:29] ^d: looks like we have a load spike - odd, given that I load tested it and didn't see one [17:18:38] give it a minute to fade [17:18:46] <^d> time of day, right? [17:18:55] <^d> Your queries were all from 0 UTC I thought. [17:19:02] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "just a couple of corrections, then LGTM" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/144612 (owner: 10Ori.livneh) [17:19:12] (03PS1) 10ArielGlenn: check-raid syntax fixes, check all raids on system [operations/puppet] - 10https://gerrit.wikimedia.org/r/145018 [17:19:26] <_joe_> ori: sorry for the late review ^^ [17:20:18] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:20:50] ^d: nah - I time shifted them lately. I just skipped to later in the file [17:20:55] ^d: but, yeah, it does matter [17:21:01] ^d: also, we're calmed down now [17:21:38] <^d> yep. [17:21:38] (03CR) 10Filippo Giunchedi: "it looks like the directory/file structure is not what it is expecting, you can push what you have for review though" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [17:21:45] elastic1005 went orange for a few minutes and we had a few slow queries right as we cut over. but now we're humming along, I think [17:22:12] keep in mind varnishes are under extra traffic pressure today with the ulsfo move, which I'd think would also net some increased load at the app layer (unless ulsfo+eqiad caches tend to be identical) [17:22:41] <^d> This wouldn't affect us, search traffic's uncached anyway. 
[17:23:13] yeah I lack context [17:23:17] bblack: thanks though! [17:23:23] <^d> yeah it's all good :) [17:26:28] hoo: so...are we good? [17:26:57] legoktm: Yes :) [17:27:00] Both tests went fine [17:27:29] :D [17:27:54] you renamed someone to .js? >.> [17:28:07] User:.js/common.js [17:32:46] legoktm: Yeah :D [17:44:44] Anyone with shell around willing to help check something for me? I just need wfMessage( 'visualeditor-specialcharinspector-characterlist-insert' )->plain() from itwiki [17:45:23] sure [17:46:15] Krenair: https://dpaste.de/kd5N/raw [17:46:43] I've been getting conflicting answers about what that is - https://it.wikipedia.org/w/index.php?title=Speciale%3AMessaggi&prefix=visualeditor-specialcharinspector&filter=all&lang=it&limit=50 vs. mw.msg( 'visualeditor-specialcharinspector-characterlist-insert' ) [17:47:38] Interesting. I wonder why mw.msg is giving the english version. [17:50:04] Thanks legoktm [18:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T1800) [18:08:00] (03PS1) 10Ottomata: Add DNS entires for 14 new analytics nodes (analytics1028-analytics1041) [operations/dns] - 10https://gerrit.wikimedia.org/r/145024 [18:12:14] (03PS2) 10Cmjohnson: adding analytics1028-1041 to dhcpd and netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/144992 [18:12:21] ottomata hey you around? what is the status of this? do I need to fix anything https://gerrit.wikimedia.org/r/#/c/142483/ [18:12:48] ottomata also, i failed on the packaging for extjs :( i'm not the best person to help with that [18:13:02] dogeydogey: s'ok [18:13:06] greg-g, there are tons of require(): Unable to allocate memory for pool. in ... is it safe t odeploy? 
[18:13:08] there are some trailling spaces in some of the files [18:13:11] e.g.https://gerrit.wikimedia.org/r/#/c/142483/4/manifests/misc/firewall.pp [18:13:24] yurik: on call, Reedy ^ [18:13:54] Reedy, looking at https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [18:16:09] apparently these events have started about 45 min ago, tons of them [18:16:21] now they are still there, just not as many [18:21:44] ottomata how do you make it display whitespace like that [18:22:30] dogeydogey: do you use vim ? highlight ExtraWhitespace ctermbg=red guibg=red [18:22:52] on gerrit [18:23:11] ehm.. gerrit just does it? [18:23:15] yeah, i thought so too [18:23:19] oh ok [18:25:16] (03PS3) 10Ottomata: adding analytics1028-1041 to dhcpd and netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/144992 (owner: 10Cmjohnson) [18:26:13] cmjohnson1: those nodes should use raid1-30G [18:26:15] i updated patch [18:26:52] ottomata: okay..i couldn't recall what we talked about yesterday. [18:26:55] thanks for fixing [18:26:58] yup [18:27:13] (03CR) 10Ottomata: [C: 032 V: 032] adding analytics1028-1041 to dhcpd and netboot.cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/144992 (owner: 10Cmjohnson) [18:27:30] looks good though, merged [18:27:30] danke [18:29:27] !log yurik Synchronized php-1.24wmf11/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 24s) [18:29:32] Logged the message, Master [18:35:32] Reedy, greg-g, apparently, the APC trash picks up right after deployment [18:35:52] Right [18:36:05] Reedy, is that normal? [18:36:08] But usually only on major version deployments turning on/off [18:36:09] Yeah [18:36:13] !log yurik Synchronized php-1.24wmf12/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 22s) [18:36:19] Logged the message, Master [18:37:58] (03CR) 10Matanya: [C: 04-1] "See inline comments." 
(037 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [18:38:28] robh: gotta a sec...maybe you can help me and ottomata out. an1009 won't install, I see the request in in the logs but nothing is served up. MAC is correct in dhcpd file, IP's are right, in correct vlan and ports are enabled and up. I am outta of ideas. any thoughts? [18:39:03] is the mac address in the proper serial file for dhcp? [18:39:08] cuz ciscos are different than others [18:39:28] Can someone point me to the WP debug log? Or do a small search for me? [18:39:31] checking [18:39:32] RobH, we see the dhcp request acked in the logs [18:39:35] (pretty sure) [18:39:38] right, that doesnt mean its right [18:39:38] but installer never loads [18:39:49] the dhcp file also sets what serial redirection settitngs to use [18:39:56] awight, https://logstash.wikimedia.org [18:39:57] and, crash cart console (not login console) shows an error about uhh.... [18:39:58] so my question is what file is it in, but im just lookin it up [18:40:04] something with not being able to connect to something? [18:40:18] robh: right file [18:40:25] its in linux-host-entries.ttyS0-115200 [18:40:31] wehre it has always been, [18:40:37] we've installed this one before too : [18:40:38] :/ [18:40:40] Reedy, btw, not sure if its a known issue - i did two depls, but only one label appeared in the fatalmonitor graph [18:40:41] MaxSem: thanks! [18:40:49] MaxSem, what about logstash+ [18:41:09] wait, you guys are having a new error on isntall on a system that was working right? [18:41:15] and there have been no changes to system? [18:41:31] MaxSem: sigh, I have no access [18:41:38] you have [18:41:43] i don't see any errors...just not getting the installer [18:41:54] unless you're not in wmf LDAP group [18:42:01] blargh [18:42:30] and you guys have installed other ciscos recently, so we know the install server is serving up images ok? 
[18:42:39] im asking rather than going to look myself ;] [18:42:48] yes...An1010 yesterday [18:42:57] yurik: Yeah, it's intermittant [18:43:02] Known issue IIRC [18:43:07] apergos: re RT #7517, do you happen to remember if either of my LDAP users are in the WMF group? [18:43:29] I can log into wikitech, but that same login does not work for logstash. [18:43:45] um [18:44:06] awight: Prod logstash? You need to be a member of the 'wmf' ldap group [18:44:21] bah don't remember, I woul have to look [18:44:22] * bd808 sees that is under discussion [18:44:27] cmjohnson1: you see the dhcp hit carbon? [18:44:30] yes [18:44:31] cuz i dont see the request in carbon [18:44:34] what time? [18:44:39] i didn't do it today [18:44:42] apergos: only if u have the time [18:44:43] it was from yesterday [18:44:45] ec [18:44:46] sec [18:44:57] I just didn't get to it until now [18:45:05] RobH [18:45:11] Jul 8 19:41:21 carbon dhcpd: DHCPACK on 10.64.5.10 to 88:43:e1:c2:86:6a via 10.64.5.3 [18:45:23] in syslog.1 [18:45:44] awight: that would be a big fat nope, neither are [18:46:34] cmjohnson1: can you turn off the hp server for me if you are onsite? [18:46:37] apergos: aha. That would be wrong, AFAIK. Should I create an RT ticket for adding myself? [18:46:47] its been filling that log awhile and its annoying, and i meant to do it awhile ago [18:47:06] I'm not there now...moved over to the library. Can do in the morning [18:47:39] yeah please do and you can even assign it to me as I'm on this week, but please say which account shoul get added :-D [18:47:47] apergos: k awesome [18:49:57] cmjohnson1: nm, got in web gui, heh [18:50:29] ok..cool. that works too [18:50:47] now that its all cleared out, im going to restart the pxe boot of analytics1009 so i can watch its output [18:50:55] easier to see what happens [18:52:42] yep... yep i still dislike the cisco ilom, heh [18:54:55] RobH, have at it! 
[18:55:21] partman wasn't working probably for the analytics1010 install yesterday either, and bblack was having problems, but that is a separate problem than analytics1009 current unhappiness [18:55:30] i had to manually partition analytics1010, dunno... [18:55:41] ok, i gotta drive back to NYC now, ttyl [18:56:18] Jeff_Green: do u have access to the WP production debug logs? [18:56:35] ok ill take a look and see what i see [18:56:38] I'm hunting debug lines that start with "CetnralNotice:" [18:56:41] (PS1) Reedy: Update size related dblists [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145040 [18:56:44] no promises, but sometimes a fresh set of eyes helps =] [18:56:50] aye, danke [18:56:51] greg-g ori RoanKattouw: did you catch my addition to SWAT for this afternoon? [18:57:18] awight: probably but that part of the infrastructure is unexplored so it will take some time to figure it out [18:57:51] Or if any other opsen want to take a look at the debug logs for us? [19:00:06] Jeff_Green: my cookie crumbs are: https://wikimedia.mingle.thoughtworks.com/projects/online_fundraiser/cards/1777 [19:00:40] awight|fud: k. [19:01:35] Reedy, one of the servers seems to have stale PHP even though dir-sync didn't report it. Is there a way to force sync-dir on one? [19:02:10] run sync-common on it [19:02:27] ssh to it first, then run sync-common [19:02:36] Reedy, thx [19:02:55] Reedy, any idea why sync-dir didn't report it? [19:03:00] mw1151 [19:03:30] yurik, it might be not in DSH group [19:03:49] MaxSem, Reedy -- https://logstash.wikimedia.org/#dashboard/temp/VnK92eMCQZiiEAGPiomg3w [19:04:12] ok, synced [19:04:18] I believe you [19:04:23] :P [19:04:54] MaxSem, :-P how do i add it to DSH group? [19:05:02] StevenW, it will be looked at when time comes [19:05:04] is that a file defined somewhere? [19:05:29] MaxSem: Спасибо!
[19:05:46] reedy@tin:/a/common$ grep mw1151 /etc/dsh/group/* [19:05:46] /etc/dsh/group/apache-eqiad:mw1151 [19:05:46] /etc/dsh/group/apaches:mw1151 [19:05:46] /etc/dsh/group/mediawiki-installation:mw1151 [19:05:46] /etc/dsh/group/mw-eqiad:mw1151 [19:06:40] cmjohnson1: im not even hitting pxe... even when pxe is listed as the primary boot [19:07:11] (PS5) Scottlee: Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - https://gerrit.wikimedia.org/r/142483 [19:07:31] odd..we were getting to carbon yesterday [19:07:40] yea, im trying again [19:11:07] (CR) jenkins-bot: [V: -1] Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - https://gerrit.wikimedia.org/r/142483 (owner: Scottlee) [19:14:26] (PS6) Scottlee: Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - https://gerrit.wikimedia.org/r/142483 [19:17:01] (CR) Ori.livneh: Add jobrunner class (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/144612 (owner: Ori.livneh) [19:17:26] (PS1) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:17:38] (PS5) Ori.livneh: Add jobrunner class [operations/puppet] - https://gerrit.wikimedia.org/r/144612 [19:19:48] (CR) jenkins-bot: [V: -1] zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 (owner: Hashar) [19:20:27] (CR) Ori.livneh: [C: 2] Add jobrunner class [operations/puppet] - https://gerrit.wikimedia.org/r/144612 (owner: Ori.livneh) [19:21:35] (PS2) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:30:03] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: Fetching origin [19:32:04] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: Fetching origin [19:34:46] (PS3) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] -
https://gerrit.wikimedia.org/r/145047 [19:43:05] (PS4) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:43:07] (PS6) Hashar: zuul: migrate settings to role::zuul::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/144709 [19:43:11] (PS3) Hashar: zuul: remove $zuul_url from zuul::server [operations/puppet] - https://gerrit.wikimedia.org/r/144997 [19:44:16] (CR) Hashar: zuul: migrate settings to role::zuul::configuration (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/144709 (owner: Hashar) [19:46:41] (PS5) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:50:17] (PS6) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:50:57] pff [19:53:08] (PS2) Matanya: platform: simplify hardware specific configuration [operations/puppet] - https://gerrit.wikimedia.org/r/144033 [19:57:00] (PS3) Matanya: cache: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140678 [19:57:16] (PS7) Hashar: zuul: phase out zuulwikimedia (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/145047 [19:57:18] (PS7) Hashar: zuul: migrate settings to role::zuul::configuration [operations/puppet] - https://gerrit.wikimedia.org/r/144709 [19:57:20] (PS4) Hashar: zuul: remove $zuul_url from zuul::server [operations/puppet] - https://gerrit.wikimedia.org/r/144997 [19:59:42] integration-dev is a Zuul merger (role::zuul::merver) [19:59:42] integration-dev is a Zuul server (scheduler) (role::zuul::server) [19:59:43] !!! [19:59:59] merver you say?
:) [20:00:04] gwicke, subbu, cscott: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T2000) [20:00:23] * hashar congratulates himself and move to next meeting [20:00:47] YuviPanda: ahhhh man thanks [20:00:59] :D [20:02:09] hashar: can you add more slaves to puppet compiler ? [20:02:39] matanya: in meeting [20:02:46] sorry :/ [20:06:48] StevenW: yah, will everything be ok with it? (the swat addition) [20:07:48] !log deployed parsoid 1632288d [20:07:53] Logged the message, Master [20:08:54] (PS1) Chad: Need oxygen access to get at lsearchd logs [operations/puppet] - https://gerrit.wikimedia.org/r/145054 [20:10:35] (PS1) Tim Landscheidt: dynamicproxy: Block non-local Redis connections [operations/puppet] - https://gerrit.wikimedia.org/r/145056 [20:15:46] Dear ops: is enwiki wfDebugLog going to /dev/null? [20:17:55] <^d> Debug logging without a group and a defined $wgDebugLogGroups gets dropped, I think. [20:18:17] ^d: ok, thank you. I was able to reproduce on testwiki, so I'm in luck I think [20:18:28] (PS1) Rush: fix email auth bug in legalpad [operations/puppet] - https://gerrit.wikimedia.org/r/145058 [20:18:50] (CR) Rush: [C: 2 V: 2] "just updates the tag" [operations/puppet] - https://gerrit.wikimedia.org/r/145058 (owner: Rush) [20:24:30] matanya: ah more slave to puppet compiler [20:24:52] matanya: the one we have has been set by _joe_ hopefully it is fully puppetized and we could get another instance :D [20:25:23] my bait is that there is a few hardcoded things though [20:26:46] role::puppet_compiler ! [20:28:51] matanya: I can probably work with joe later on to have it on the CI slaves [20:30:04] matanya: or at least migrate it to a bigger instance.
The one in use only has 2 CPUs allocated [20:41:31] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [20:55:43] awight: beta cluster catch all debugs if that can help [20:55:53] awight: and wfDebug() output :] [20:56:10] hashar: yes, thank you! I ended up reproducing on testwiki and that worked for me [20:56:20] awight: great! :) [20:56:43] and I need to sleep(now) [21:00:27] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:02:17] (CR) Hashar: "and role::zuul:merger / role::zuul::server can not be used standalone because they rely on a Jenkins user :-/" [operations/puppet] - https://gerrit.wikimedia.org/r/145047 (owner: Hashar) [21:06:31] (CR) BBlack: [C: 1] update SSL cipher list for OTRS to support PFS [operations/puppet] - https://gerrit.wikimedia.org/r/144734 (owner: Dzahn) [21:17:12] (PS1) Aaron Schulz: Added jobrunner.ini file [operations/puppet] - https://gerrit.wikimedia.org/r/145130 [21:17:34] chasemp: ori do point me to any other diamond bugs you wish fixed [21:17:42] * YuviPanda just submitted a whitelist/blacklist pull req [21:26:34] (PS1) Gergő Tisza: Add thumbnail buckets for beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) [21:26:47] (CR) jenkins-bot: [V: -1] Add thumbnail buckets for beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: Gergő Tisza) [21:26:53] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:27:43] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [21:30:03] (CR) PiRSquared17: [C: -1] "2014-07-09T16:54:52 Hoo man (talk | contribs | block) changed group permissions for Special:GlobalUsers/steward.
Added centralauth-rename;" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139655 (owner: Gerrit Patch Uploader) [21:32:49] (PS2) Gergő Tisza: Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) [21:33:44] (CR) jenkins-bot: [V: -1] Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) (owner: Gergő Tisza) [21:37:26] (CR) Ori.livneh: "The convention for Puppet class parameters is lowercase_with_underscores. Otherwise LGTM." [operations/puppet] - https://gerrit.wikimedia.org/r/145130 (owner: Aaron Schulz) [21:38:46] (PS3) Gergő Tisza: Use reference thumbnails for JPEG/PNG thumbnailing on beta sites [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145132 (https://bugzilla.wikimedia.org/67525) [21:39:53] (PS2) Aaron Schulz: Added jobrunner.ini file [operations/puppet] - https://gerrit.wikimedia.org/r/145130 [21:52:17] (CR) Hashar: "Catalog run against gallium.wikimedia.org:" [operations/puppet] - https://gerrit.wikimedia.org/r/145047 (owner: Hashar) [21:53:41] (CR) Hashar: [C: -1] "gearman is firewalled and only reachable via 127.0.0.1!!" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/144709 (owner: Hashar) [22:00:05] (CR) Dzahn: [C: 2] "yep, it's the same as on cluster, the diff is removing DHE-RSA* and DHE-DSS* varieties" [operations/puppet] - https://gerrit.wikimedia.org/r/144427 (owner: Matanya) [22:00:23] (PS1) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145139 [22:03:12] (PS2) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145139 (https://bugzilla.wikimedia.org/58224) [22:03:32] (PS3) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145139 (https://bugzilla.wikimedia.org/58224) [22:04:12] (PS4) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145139 (https://bugzilla.wikimedia.org/67344) [22:05:04] (CR) Ori.livneh: [C: 1] Added jobrunner.ini file [operations/puppet] - https://gerrit.wikimedia.org/r/145130 (owner: Aaron Schulz) [22:05:27] !log restarted apache on zirconium for config change [22:05:31] Logged the message, Master [22:14:24] (PS1) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145144 [22:15:07] (Abandoned) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145139 (https://bugzilla.wikimedia.org/67344) (owner: Steinsplitter) [22:16:06] (PS2) Steinsplitter: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145144 (https://bugzilla.wikimedia.org/67344) [22:19:22] * Steinsplitter has done a mess on gerrit. :/ [22:19:56] can somone pls. abandone it [22:20:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:20:49] Steinsplitter: that one up there? [22:20:52] (Abandoned) Ori.livneh: Adding mochila_images.s3.amazonaws.com and mochila_images2.s3.amazonaws.com temporary to wgCopyUploadsDomains for GWToolset upload. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/145144 (https://bugzilla.wikimedia.org/67344) (owner: Steinsplitter) [22:22:41] ah, thanks^^. I have used git the last time months ago, it looks like i have vorgonnen the correct commands, sorry. [22:22:41] /me hides [22:22:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:24:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:26:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:27:13] puppet on tmh1002 looks fine [22:27:20] maybe neon? [22:27:33] 22:18 was ten minutes ago [22:27:55] the check keeps having some false alarms [22:28:00] also on neon itself sometimes [22:28:13] it claims there is a puppet fail, then you go look and it works and recovers [22:28:36] i wish all problems were like that [22:28:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:28:55] hah, yea [22:29:32] hmm, could it be NTP? different time on neon and monitored hosts?
[22:30:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:32:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:34:31] (CR) Dzahn: [C: 2] fundraising, replace generic::systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/138007 (owner: Rush) [22:34:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:34:48] !log Enabling NTT and HE transit links on cr2-ulsfo [22:34:52] Logged the message, Master [22:35:48] !log Enabling WMF HQ BGP sessions on cr2-ulsfo [22:35:54] Logged the message, Master [22:36:45] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 22:18:10 UTC [22:38:20] !log Enabling TiNet transit links on cr1-ulsfo [22:38:25] Logged the message, Master [22:38:25] RECOVERY - Puppet freshness on tmh1002 is OK: puppet ran at Wed Jul 9 22:38:18 UTC 2014 [22:40:23] !log Enabling WMF HQ BGP sessions on cr1-ulsfo [22:40:27] Logged the message, Master [22:40:36] tmh1002.. shhhh [22:40:59] (CR) Dzahn: "noop on aluminium" [operations/puppet] - https://gerrit.wikimedia.org/r/138007 (owner: Rush) [22:42:46] !log Enabling PAIX BGP sessions on cr2-ulsfo [22:42:50] Logged the message, Master [22:43:18] bblack, have a sec to see if the new stuff working? [22:43:44] you mean play with X-CS=TEST header on prod? [22:45:53] (CR) Andrew Bogott: [C: 2] dynamicproxy: Block non-local Redis connections [operations/puppet] - https://gerrit.wikimedia.org/r/145056 (owner: Tim Landscheidt) [22:47:25] yurikR2: ?
[22:48:09] (CR) Andrew Bogott: [C: 1] openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush) [22:48:53] andrewbogott: done on both proxies, looks ok [22:48:56] andrewbogott: thanks for the merge [22:49:07] YuviPanda: thanks for testing [22:49:13] :) [22:51:26] (PS1) Yurik: 437-01 - opera supports https [operations/puppet] - https://gerrit.wikimedia.org/r/145151 [22:51:29] bblack, ^ [22:51:31] (CR) Dzahn: [C: 2] openstack-replace generic::systemuser with user [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush) [22:52:58] ^ yay or ssl hijacking :P [22:53:11] I was hoping that was a test X-CS=ON patch coming :) [22:53:21] (PS2) BBlack: 437-01 - opera supports https [operations/puppet] - https://gerrit.wikimedia.org/r/145151 (owner: Yurik) [22:53:53] yurikR2: have you tried varnish-levle X-CS=ON in betalabs yet? [22:53:55] bblack, that we can also do - i will create a profile TESTON [22:54:46] bblack,problem is - i'm traveling tomorrow (TBD), and don't want to break things before leaving :) [22:55:11] when are you back? [22:57:14] (CR) Dzahn: "checked on labstore1001 - noop" [operations/puppet] - https://gerrit.wikimedia.org/r/138002 (owner: Rush) [22:57:15] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 09 Jul 2014 20:56:41 UTC [23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140709T2300) [23:01:57] bd808: do you know if/where we use the historical /home/wikipedia inside labs? [23:02:05] bd808: class nfs::home::wikipedia { [23:02:37] mutante: pmtpa. Not sure if anywhere else [23:02:57] bd808: that would be fenari, yea, but that class also has case $realm labs [23:03:10] well, either way, does not seem like it matters much anymore [23:03:42] bd808: thanks [23:03:47] mutante: Oh labs. Sorry.
I've never seen it there [23:04:18] yea, just https://gerrit.wikimedia.org/r/#/c/138003/4/manifests/nfs.pp [23:04:57] I wonder if the single node wiki class uses it [23:05:02] beta + pmtpa = what can happen :) [23:05:22] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [23:05:55] ^.. eh , is that swat? [23:08:13] We've had a bunch of APC thrash today according to logstash. [23:10:02] (PS1) Yurik: Allow TESTON carrier for unified zero design [operations/puppet] - https://gerrit.wikimedia.org/r/145155 [23:10:02] bblack, here's a patch for you :) [23:10:07] xmas #1 [23:10:15] (CR) Mattflaschen: "Removed -1 since we're ready." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144938 (owner: Phuedx) [23:11:14] MaxSem, Growth team is ready for our part of SWAT [23:11:35] Let me know if you have questions when deploying that part. [23:11:39] * MaxSem scratches his head [23:11:51] I didn't even volunteer:) [23:11:54] YET! [23:12:36] okay, I take the SWAT [23:13:07] yurikR2: on the url check (if (req.url ~ "\:ZeroRatedMobileAccess($|&|\?)" )) we don't have to check for zeroconfig as well, like the rfc was saying? [23:13:13] Sorry, MaxSem, StevenW said you were doing it. [23:13:19] superm401, what are the core changes? [23:13:30] MaxSem, no core changes. [23:13:34] bblack, no, because i don't do it via api [23:13:44] superm401, I mean submodule changes [23:13:47] ok [23:13:53] bblack, or maybe ... [23:13:53] MaxSem, okay, I'll create them. [23:13:54] (CR) MaxSem: [C: 2] Enable TemplateData GUI for Russian Wikipedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144857 (https://bugzilla.wikimedia.org/67704) (owner: Jforrester) [23:14:03] MaxSem: Thanks! [23:14:10] hmm, we are still using other stuff there...
let me add it just to be sure [23:14:53] (CR) Dzahn: [C: 2] "labs + beta, all other changes were noop, this is the last one that removes generic::systemuser" [operations/puppet] - https://gerrit.wikimedia.org/r/138003 (owner: Rush) [23:15:03] (PS2) Yurik: Allow TESTON carrier for unified zero design [operations/puppet] - https://gerrit.wikimedia.org/r/145155 [23:15:11] bblack, ^ [23:16:19] (PS3) BBlack: 437-01 - opera supports https [operations/puppet] - https://gerrit.wikimedia.org/r/145151 (owner: Yurik) [23:16:41] MaxSem, 1.24wmf11: https://gerrit.wikimedia.org/r/145158 [23:16:47] we need some quantum cpus to run jenkins on :p [23:17:23] the ones that you never know if they are on? [23:17:27] patch ... wait for jenkins... get distracted... oh look jenkins is done and something else is newly merged! rebase ... wait for jenkins... [23:17:54] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/144857/ (duration: 00m 04s) [23:17:59] Logged the message, Master [23:18:17] James_F, please verify ^^^ [23:18:52] MaxSem: Confirmed, working for me. Thanks! [23:19:08] MaxSem, and wmf/1.24wmf12: https://gerrit.wikimedia.org/r/145160 [23:19:18] (CR) BBlack: [C: 2] 437-01 - opera supports https [operations/puppet] - https://gerrit.wikimedia.org/r/145151 (owner: Yurik) [23:19:39] (PS3) BBlack: Allow TESTON carrier for unified zero design [operations/puppet] - https://gerrit.wikimedia.org/r/145155 (owner: Yurik) [23:19:48] (CR) Dzahn: [C: 2 V: 2] "merged all dependencies, this is not used anymore:) gtg!" [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush) [23:20:00] mutante: dude you did it [23:20:21] chasemp: you have the honour to submit :) [23:21:19] yurikR2: I forget - when we have X-CS=ON fully deployed in varnish, do we still need the per-carrier ssl/proxy logic, or is that all now app-layer?
[23:21:49] bblack, we will need it for analytics :( [23:21:58] until they fix their code [23:22:17] to use the new json api stuff from the unfragment rfc? [23:22:27] !log maxsem Synchronized php-1.24wmf11/extensions/GettingStarted/: (no message) (duration: 00m 03s) [23:22:32] Logged the message, Master [23:23:41] MaxSem, oh, there are also i18n changes, so we'll need a scap or i18n update. [23:24:06] dude [23:24:29] please warn beforehand:) [23:24:31] (PS5) Rush: generic: remove systemuser definition [operations/puppet] - https://gerrit.wikimedia.org/r/138011 [23:24:46] (CR) Rush: [C: 2 V: 2] generic: remove systemuser definition [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush) [23:24:53] bblack, correct [23:25:44] (PS4) BBlack: Allow TESTON carrier for unified zero design [operations/puppet] - https://gerrit.wikimedia.org/r/145155 (owner: Yurik) [23:25:51] (CR) BBlack: [C: 2 V: 2] Allow TESTON carrier for unified zero design [operations/puppet] - https://gerrit.wikimedia.org/r/145155 (owner: Yurik) [23:29:29] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [23:31:05] !log maxsem Started scap: SWAT, GettingStarted introduced a new message [23:31:09] Logged the message, Master [23:32:19] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [23:32:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [23:32:38] yurikR2: that doesn't look good ^ :) [23:32:56] ahem... [23:33:16] compile error?
[23:33:29] yeah [23:33:39] in the earlier one for https + opera, bad parens [23:33:48] bummer, will fix in a sec [23:34:29] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [23:34:30] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [23:35:18] (PS1) Yurik: fixed parens [operations/puppet] - https://gerrit.wikimedia.org/r/145164 [23:35:20] ^ puppetfails on cpxxxx are known-issue [23:35:21] bblack, ^ [23:35:42] its good we have that system now - it used to fail silently :D [23:35:49] (CR) BBlack: [C: 2 V: 2] fixed parens [operations/puppet] - https://gerrit.wikimedia.org/r/145164 (owner: Yurik) [23:36:10] :) [23:36:19] bblack, will it go into prod much sooner than half an hour now? [23:36:35] normally, no. we can push with salt when it's urgent [23:37:09] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jul 9 23:37:02 UTC 2014 [23:37:17] bblack, if is not hard, could you? i suspect there is a bug in my opera handling [23:37:28] ? [23:37:40] opera zero might not be getting banners [23:38:15] MaxSem, sorry, forgot until just then. [23:38:29] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Puppet has 1 failures [23:38:46] bblack, ^^ [23:38:55] again? [23:39:12] yurikR2: I don't think it's again, just still from earlier [23:39:19] :) [23:39:23] there's some latency on current puppet runs finishing -> notifying [23:40:22] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:40:22] (CR) Dzahn: "root@palladium:~# salt '*' cmd.run 'grep systemusers /etc/group' | tee /home/dzahn/salt_chk_systemusers_all.log" [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush) [23:40:32] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [23:41:09] bblack, is that the salt needed to run?
looks scary [23:41:32] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:41:32] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [23:41:32] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:42:09] yurikR2: no, for salt pushes I use e.g.: salt -G cluster:cache_mobile -t 600 cmd.run 'puppet agent -t' [23:43:04] thx ) [23:43:39] (which is running now, but one or more are responding a bit slow with finishing up the puppet execution) [23:44:48] yurikR2: it should be done everywhere now [23:44:51] !log deleted systemusers group on neon & mw1077 (to check it doesnt break anything [23:44:56] Logged the message, Master [23:46:22] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:49:31] (PS1) Rush: phabricator class for installing in labs [operations/puppet] - https://gerrit.wikimedia.org/r/145169 [23:51:43] (CR) Dzahn: phabricator class for installing in labs (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/145169 (owner: Rush) [23:57:36] !log maxsem Finished scap: SWAT, GettingStarted introduced a new message (duration: 26m 31s) [23:57:41] Logged the message, Master [23:58:18] (CR) MaxSem: [C: 2] Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/144938 (owner: Phuedx)