[00:00:07] I quickly jumped on terbium to have a look but I couldn't get hold of anything useful :/
[00:00:12] telling icinga to check for excess sleepers specifically could be done. would need to work around wikiadmin user which sleeps a lot
[00:01:51] hoo: don't know about the memcached timeout
[00:02:00] that is rather high
[00:03:01] SELECT COUNT(*) FROM PROCESSLIST WHERE user = 'wikiuser' AND time > 30;
[00:03:09] something like that would suffice, right?
[00:03:22] Will ping Ori or so tomorrow about the memcached timeout
[00:03:48] would need AND command = 'Sleep' AND info is null AND time < 1000000
[00:04:26] right... we don't care about non-sleepers (although 30s is quite awry anyway)
[00:04:34] anyway, I should go to bed if I want to start working around 11am tomorrow :P
[00:04:43] last clause because for a very brief period, new connections have time = 2147483647
[00:05:00] sure. thanks
[00:05:02] never obeserved that, I think
[00:05:05] * observed
[00:05:16] If you find anything mail me or open a bug or so
[00:05:16] 5.5 bug iirc
[00:06:36] this is the script i've been using to kill: https://git.wikimedia.org/blob/operations%2Fsoftware/a091384003604ed280cf0bf83479162ec360d5fe/dbtools%2Farbiter.pl
[00:06:45] simlar queries
[00:08:13] (PS1) Ori.livneh: Beta: use MemcachedPhpBagOStuff if running under HHVM [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122272
[00:09:13] (CR) Ori.livneh: [C: 2] Beta: use MemcachedPhpBagOStuff if running under HHVM [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122272 (owner: Ori.livneh)
[00:09:20] (Merged) jenkins-bot: Beta: use MemcachedPhpBagOStuff if running under HHVM [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122272 (owner: Ori.livneh)
[00:26:22] springle: nice one, although I'm not into perl much :P
[00:26:22] # wikiuser sleepers get killed at 300s
[00:26:24] :)
[00:26:55] that block is new, as a result of the weekend :)
[00:27:12] but its possible the other limits caught sleepers too
[00:27:59] something probably killed them, but I'm not sure how long the whole thing lasted etc.
[00:28:12] hoo: since you're awake ;)... monitoring max_connections -- how to do it sanely without using the root account i wonder
[00:28:39] and how to catch it in the act. icinga checks are quite far apart
[00:29:09] maybe check for spikes in Max_used_connections, but that would be after the fact if nagios cannot get a connection itself
[00:29:11] not sure how often icinga checks... given that these lasted like 300s maybe we should probably check every 100s or so?
[00:29:44] if we go much higher than that occasional troubles might go unnoticed
[00:30:05] these sorts of torubles are very obvious in the dberror.log and logstash
[00:30:19] maybe watching fo noise there somehow
[00:30:21] right... but does anybody really look at those
[00:30:32] well true
[00:30:33] like I only do then shit gets serious
[00:30:36] * when
[00:31:15] of course if the dberror are crying that there are no more read slaves at hand stuff is probably quite wrong
[00:31:49] so something else will complain, yes. still, it's good point about max_connections
[00:32:42] but that would be after the fact if nagios cannot get a connection itself << well it will error than anyway, right? If could probably also be smart about the error returned
[00:32:50] * It could
[00:32:56] I should go to bed, I think :P
[00:33:03] i'd like some sort of icinga dberror-is-growing-fast
[00:33:12] yes true, a failure would be critical
[00:33:26] failure *to connect
[00:34:54] ok, really leaving now... will poke people tomorrow about memcached, that path seems rather likely (and waiting for something like memcached for 250s can only have one purpose: Making sure that stuff doesn't cache stampede in case mc is down)
[00:35:29] I guess that's the reason, but who knows... that should probbaly also log
[02:13:38] !log LocalisationUpdate completed (1.23wmf19) at 2014-03-31 02:13:38+00:00
[02:13:45] Logged the message, Master
[02:30:26] !log LocalisationUpdate completed (1.23wmf20) at 2014-03-31 02:30:26+00:00
[02:30:32] Logged the message, Master
[03:08:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Mar 31 03:08:39 UTC 2014 (duration 8m 38s)
[03:08:48] Logged the message, Master
[04:27:22] (CR) Faidon Liambotis: [C: -1] Manage scap proxy rsync config in puppet (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/119677 (owner: Reedy)
[04:28:47] (CR) Faidon Liambotis: [C: -1] Update docroot_dir_allows to use network::constants::mediawiki_appservers (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/119668 (owner: Reedy)
[04:30:02] (CR) Faidon Liambotis: [C: 2] Add mw1161 and mw1201 as scap proxies for EQIAD row C and D [operations/puppet] - https://gerrit.wikimedia.org/r/119686 (owner: Reedy)
[04:33:10] (CR) Faidon Liambotis: Manage scap proxy rsync config in puppet (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/119677 (owner: Reedy)
[04:33:56] (CR) Faidon Liambotis: [C: -1] "Minor gotcha -- otherwise looks good." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/119686 (owner: Reedy)
[05:11:33] hey TimStarling
[05:12:01] hello
[05:13:16] busy?
[05:14:12] not with anything urgent
[05:15:18] it's about my email regarding memcached
[05:16:09] on the ops list?
[05:16:11] yes
[05:17:18] is this high rate of filter:minify-js queries normal?
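For reference, the sleeper check worked out between 00:03 and 00:04 can be sketched in Python as follows. This is only a minimal sketch under assumptions: the function and constant names are made up here, the `information_schema.` prefix is an assumption about which database the query in the chat was meant to run against, and the row format (dicts keyed by the `PROCESSLIST` column names) is chosen purely for illustration. It is not the logic of arbiter.pl or of any existing icinga check.

```python
# Thresholds taken from the discussion; the names are hypothetical.
MIN_SLEEP_SECONDS = 30        # "time > 30"
MAX_SANE_SECONDS = 1000000    # "time < 1000000": brand-new connections can
                              # briefly report time = 2147483647

# The query from the chat with the extra clauses folded in. The
# information_schema prefix is an assumption made for this sketch.
SLEEPER_QUERY = (
    "SELECT COUNT(*) FROM information_schema.PROCESSLIST "
    "WHERE user = 'wikiuser' AND command = 'Sleep' AND info IS NULL "
    "AND time > 30 AND time < 1000000"
)


def is_excess_sleeper(row, user="wikiuser"):
    """Apply the same conditions as SLEEPER_QUERY to one processlist row."""
    return (
        row["USER"] == user
        and row["COMMAND"] == "Sleep"
        and row["INFO"] is None
        and MIN_SLEEP_SECONDS < row["TIME"] < MAX_SANE_SECONDS
    )


def count_excess_sleepers(rows, user="wikiuser"):
    """Count the rows that the check would flag."""
    return sum(1 for row in rows if is_excess_sleeper(row, user))
```

Filtering on `user` is what keeps the wikiadmin connections, which sleep a lot by design, out of the count.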
[05:18:50] I'll just document what I've done on bug 62623 and then have a look at it
[05:19:02] okay :)
[05:23:07] the spam_blacklist_regexes that Ori found is definitely a lot of data
[05:23:26] I suspect it might be related to this Ukrainian spammer that has been hitting us for a few days now
[05:23:47] * Jasper_Deng has always wanted to ask, do we ever firewall spammers from even connecting?
[05:23:49] but I don't think it's the cause for the spike
[05:24:09] since that's just mc1004, not mc1010
[05:24:23] Jasper_Deng: firewall no, but mediawiki returns a throttled error
[05:27:46] although it's kind of tempting to firewall of this ISP entirely, I must say
[05:31:47] from the last 10.000 throttlederror requests
[05:31:51] 9917 GeoIP ASNum Edition: AS15895 Kyivstar PJSC
[05:32:07] *responses
[05:32:19] ResourceLoader::makeModuleResponse & ResourceLoader::filter have not been invoked more frequently: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1396243737.595&from=-2weeks&target=MediaWiki.ResourceLoader.filter.count&target=MediaWiki.ResourceLoader.makeModuleResponse.count
[05:32:40] that's about ~23mins
[05:32:42] oh hi ori
[05:33:00] sorry, didn't mean to ping you with the above :)
[05:33:30] Just as well, since I just logged on looking for some moment's distraction
[05:34:57] also: hello
[05:36:48] heh: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1396244200.34&from=-2weeks&target=MediaWiki.SpamBlacklist.getRegex.count
[05:36:57] I think we have a winner
[05:37:09] nice
[05:37:33] that explains mc1004
[05:37:57] also, that looks like aligned to the deployment window, not just the spammer?
[05:39:14] so you need to know what calls SpamBlacklist::getRegex()?
[05:39:30] https://gerrit.wikimedia.org/r/#/c/117231/ ?
[05:40:25] that's a very promising candidate
[05:43:57] we can comment the hook at SpamBlacklist.php as a local hack and sync
[05:44:10] we can see the effect almost immediately
[05:44:18] shall I? I'll wait for TimStarling :)
[05:45:13] go ahead
[05:45:15] mc1010 gets queried for enwiki:messages:individual:Spam-blacklist constantly
[05:47:16] !log faidon synchronized php-1.23wmf19/extensions/SpamBlacklist/SpamBlacklist.php 'local hack to test memcached traffic increase theory'
[05:47:21] Logged the message, Master
[05:48:25] we have a winner
[05:48:34] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[05:49:09] 526M -> 311M
[05:50:00] 295 and dropping
[05:52:35] dwelling too long on the question of about why 83 kilobytes have to be repeatedly evacuated from memory and retrieved across a network link is liable to make one very sad
[05:52:49] s/about //
[06:33:46] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=mc1010.eqiad.wmnet&m=cpu_report&r=month&s=by%20name&hc=4&mc=2&st=1396247583&g=network_report&z=large&c=Memcached%20eqiad
[06:33:50] I wonder if it was the same issue
[06:34:11] it probably was
[06:40:25] (PS5) Faidon Liambotis: Add ?download parameter to images [operations/puppet] - https://gerrit.wikimedia.org/r/120617 (owner: Gilles)
[06:53:01] (PS6) Faidon Liambotis: Varnish: add ?download parameter to upload [operations/puppet] - https://gerrit.wikimedia.org/r/120617 (owner: Gilles)
[06:53:13] (PS7) Faidon Liambotis: Varnish: add ?download parameter to upload [operations/puppet] - https://gerrit.wikimedia.org/r/120617 (owner: Gilles)
[07:10:42] Reedy: ping when you're around
[07:20:27] paravoid: still around ?
[07:20:52] for the next 3' or so
[07:21:04] cool, short question:
[07:21:52] otto told me to ask you. hashar (and i) would like to bring puppet-lint fron trusty to apt.wikimedia.org what should be done? i have tested it in labs and it seems sane
[07:22:03] same dep's and the like
[07:22:08] what do we need it for?
[07:22:18] better jenkins lints
[07:22:45] !log faidon synchronized php-1.23wmf19/extensions/SpamBlacklist/SpamBlacklist.php 'revert I694860b - SpamBlacklist'
[07:22:51] Logged the message, Master
[07:23:01] okay
[07:23:04] !log faidon synchronized php-1.23wmf20/extensions/SpamBlacklist/SpamBlacklist.php 'revert I694860b - SpamBlacklist'
[07:23:08] I'll reprepro include it later today
[07:23:09] Logged the message, Master
[07:23:23] thank you, do you need a ticket for it ?
[07:24:07] I don't, but I wouldn't mind it either
[07:25:04] for the sake of completence i'll create one. so we know it was reqested and handled. Thanks a lot
[07:37:37] (CR) QChris: [C: -1] Puppetizing Camus cronjob (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/121546 (owner: Ottomata)
[07:48:20] (CR) Hashar: Lint misc/icinga.pp (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/122131 (owner: Hashar)
[07:48:28] (PS2) Hashar: Lint misc/icinga.pp [operations/puppet] - https://gerrit.wikimedia.org/r/122131
[07:51:20] hashar: fwiw puppet lint != our common practice
[07:52:42] hi, i was going to check if puppet-lint update happened, and noticed apt.wm.org doesnt show me much
[07:53:40] ah, that ticket is just a couple minutes old, didn't notice:)
[07:54:16] yes, if you were around you can see backlog here
[07:54:36] yes
[07:54:58] and morning, or night, or whatever it is now for you :)
[07:56:26] thanks, it's like "season's greetings" on a daily basis
[08:12:13] (CR) Dzahn: "icinga::monitor::snmp is included twice (line 43), looks good but the lower section that splits up all the files is really hard to review " (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122131 (owner: Hashar)
[08:14:35] (CR) Dzahn: [C: 1] "i see that duplication was in there all this time, so it's not introducing anything new" [operations/puppet] - https://gerrit.wikimedia.org/r/122131 (owner: Hashar)
[08:19:28] (CR) Dzahn: Manage scap proxy rsync config in puppet (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/119677 (owner: Reedy)
[08:28:04] (CR) Hashar: "Removing dupe include." (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122131 (owner: Hashar)
[08:28:11] (PS3) Hashar: Lint misc/icinga.pp [operations/puppet] - https://gerrit.wikimedia.org/r/122131
[08:28:46] (CR) Hashar: "Alexandros, can you please run your catalog compilation script to verify this change is a no-op ? Thx" [operations/puppet] - https://gerrit.wikimedia.org/r/122131 (owner: Hashar)
[08:28:48] (PS1) Dzahn: lint fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/122331
[08:30:06] (PS1) Matanya: db: lint role [operations/puppet] - https://gerrit.wikimedia.org/r/122332
[08:30:31] mutante: hopefully we will get akosiaris catalog compilation script added in Jenkins which would provide a diff output or something
[08:30:42] I am not sure how long it takes to generate the catalogs though
[08:30:54] (PS1) Dzahn: retab role/gerrit [operations/puppet] - https://gerrit.wikimedia.org/r/122333
[08:31:17] he said a long time hashar
[08:31:35] (CR) Hashar: [C: 1] retab role/gerrit [operations/puppet] - https://gerrit.wikimedia.org/r/122333 (owner: Dzahn)
[08:31:40] hashar: yes, having that in Jenkins has been a thought, would be really nice
[08:31:56] matanya: we can make it a Jenkins jobs that folks can trigger manually by passing the change #
[08:32:47] hashar: why not just a check
[08:33:24] cause it is too long?
[08:33:33] that would delay reporting back to gerrit
[08:33:39] makes sense
[08:35:06] heh, and once we have that (manually triggered job), add to IRC bot.. !check 12345 :)
[08:35:11] that would be fun
[08:35:22] yea!
[08:35:33] btw hashar have you seen https://wikitech.wikimedia.org/wiki/Puppet_usage#Coding_Style ?
[08:35:34] can definitely do that one day
[08:35:41] upstream is adding an API to Zuul
[08:35:59] nice!
[08:43:37] (PS1) Dzahn: lint role/keystone (labs) [operations/puppet] - https://gerrit.wikimedia.org/r/122334
[08:52:29] (PS1) Dzahn: lint labsproxy.pp [operations/puppet] - https://gerrit.wikimedia.org/r/122335
[08:52:33] (CR) Matanya: lint fundraising (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122331 (owner: Dzahn)
[08:55:45] (CR) Matanya: lint role/keystone (labs) (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/122334 (owner: Dzahn)
[09:00:00] (CR) Matanya: lint labsproxy.pp (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122335 (owner: Dzahn)
[09:09:20] (PS1) Dzahn: lint role/deployment [operations/puppet] - https://gerrit.wikimedia.org/r/122338
[09:10:18] (CR) Dzahn: "here's a challenge, try aliging all => correctly in this file between line 7 and 143 :)" [operations/puppet] - https://gerrit.wikimedia.org/r/122338 (owner: Dzahn)
[09:11:29] (CR) Dzahn: "i say even if correct it would look super ugly and might be an example of the limits of lint/style. that's why i didn't do it on this one" [operations/puppet] - https://gerrit.wikimedia.org/r/122338 (owner: Dzahn)
[09:12:19] * matanya hates it :)
[09:12:49] :)
[09:17:10] (CR) Matanya: "awww, oh, ouch." (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122338 (owner: Dzahn)
[09:30:14] (PS2) Hashar: role::parsoid::beta needs contint slave scripts [operations/puppet] - https://gerrit.wikimedia.org/r/120823
[09:37:35] btw mutante you can see in the gdash role it is done nicely
[09:46:07] !log Jenkins: applied hasBrowserTests label on both labs slave. Unblocks the browser tests which were still tied to a deleted instance {{gerrit|122341}}
[09:46:13] Logged the message, Master
[09:48:18] matanya: i would challenge "no need for []". it makes it easier to add a second user?
[09:48:53] hmm, it does. but with this argument we should add it to any resource
[09:48:56] actually it appears where there was more than 1 user in the past and then got removed
[09:49:00] (as well)
[09:49:10] still
[09:54:16] (PS1) Hashar: contint::slave-scripts recurse submodules [operations/puppet] - https://gerrit.wikimedia.org/r/122342
[10:06:48] (CR) Hashar: [C: 1 V: 2] "deployed on integration-puppetmaster.eqiad.wmflabs" [operations/puppet] - https://gerrit.wikimedia.org/r/122342 (owner: Hashar)
[10:20:16] (PS2) Dzahn: lint fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/122331
[10:20:20] (CR) Dzahn: lint fundraising (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/122331 (owner: Dzahn)
[10:22:53] !change 121574 | hashar
[10:23:00] (CR) Matanya: [C: 1] lint fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/122331 (owner: Dzahn)
[10:25:41] (CR) Dzahn: [C: 2] Fix repos with checked-in .gitmodules [operations/puppet] - https://gerrit.wikimedia.org/r/121574 (owner: Ryan Lane)
[10:26:05] !gerrt 121575
[10:26:08] !gerrit 121575
[10:26:16] STUPID BOT OF DOOM IS DEAD AGAIN
[10:27:30] ah
[10:27:30] Fix repos with checked-in .gitmodules
[10:27:30] https://gerrit.wikimedia.org/r/121574
[10:27:31] :-]
[10:28:00] * mutante merges that and also a change to vagrant ori forgot to merge on palladium
[10:28:10] which changes the home directory of vagrant user
[10:28:19] ori:
[10:28:42] mutante: can you run puppetd on tin.eqiad.wmnet so I can test the fix?
[10:28:44] hashar: , yes, done. no deployment right now, heh
[10:28:52] yes
[10:29:22] running
[10:33:25] hashar: it ran once, not sure if that was it already..looking
[10:33:31] trying
[10:35:54] hashar: deploy.py is on other node ..
[10:36:00] salt_master.pp
[10:36:10] checks on palladium
[10:36:37] salt_master.pp: require => [File["${module_dir}/deploy.py"]],
[10:37:04] eh, i should say source => 'puppet:///modules/deployment/modules/deploy.py'
[10:37:16] notice: /Stage[main]/Deployment::Salt_master/File[/srv/salt/_modules/deploy.py]/content: content changed
[10:37:20] try now
[10:38:08] (CR) Dzahn: "has been deployed on salt master (palladium). notice: /Stage[main]/Deployment::Salt_master/File[/srv/salt/_modules/deploy.py]/content: con" [operations/puppet] - https://gerrit.wikimedia.org/r/121574 (owner: Ryan Lane)
[10:38:40] mutante: that fixed it thanks
[10:39:20] (CR) Hashar: "Bug fixed!" [operations/puppet] - https://gerrit.wikimedia.org/r/121574 (owner: Ryan Lane)
[10:39:32] hashar: great!
[10:39:48] thank you
[10:40:23] yw, thank Ryan for the fix
[10:40:38] (btw, i met him at salt user group last week0
[10:45:49] (PS1) Hashar: contint: browsertests need ruby1.9.1-dev [operations/puppet] - https://gerrit.wikimedia.org/r/122346
[10:48:25] (CR) Hashar: [C: 1 V: 2] "Cherry picked on integration puppet master integration-puppetmaster.eqiad.wmflabs.org" [operations/puppet] - https://gerrit.wikimedia.org/r/122346 (owner: Hashar)
[10:48:52] !log Jenkins fixed up browsertests jobs. Bundler could not compile gems on the eqiad slaves {{gerrit|122346}}
[10:48:58] Logged the message, Master
[10:58:44] (PS2) Dzahn: decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063
[10:58:46] (CR) jenkins-bot: [V: -1] decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063 (owner: Dzahn)
[11:00:07] thanks jenkins, i just clicked in web ui to change commit message :p
[11:00:21] hah, yea, rebasing
[11:05:58] (PS3) Dzahn: decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063
[11:07:11] (PS4) Dzahn: decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063
[11:13:47] (PS1) Dzahn: remove more Tampa search remnants [operations/dns] - https://gerrit.wikimedia.org/r/122350
[11:14:57] (PS5) Dzahn: decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063
[11:15:40] (CR) Dzahn: [C: 1] "to confirm, see pybal config" [operations/dns] - https://gerrit.wikimedia.org/r/120063 (owner: Dzahn)
[11:24:08] (CR) Matanya: [C: 1] remove more Tampa search remnants [operations/dns] - https://gerrit.wikimedia.org/r/122350 (owner: Dzahn)
[11:24:37] (CR) Matanya: [C: 1] decom Tampa: remove service IPs [operations/dns] - https://gerrit.wikimedia.org/r/120063 (owner: Dzahn)
[11:37:37] (CR) Faidon Liambotis: [C: 2] Varnish: add ?download parameter to upload [operations/puppet] - https://gerrit.wikimedia.org/r/120617 (owner: Gilles)
[11:57:34] !log Jenkins: updating jslint jobs to run a PHP based json linter, will bails out whenever json files are invalid. {{bug|58279}} {{gerrit|113958}}
[11:57:40] Logged the message, Master
[11:59:00] (PS1) Matanya: glance: lint [operations/puppet] - https://gerrit.wikimedia.org/r/122353
[12:21:12] hashar: salt -G 'rolename:role::ci::*' grains.items
[12:21:43] :-D
[12:21:57] (to combine the answer how to get all grains, all the values of the grains AND use a wildcard on multiple roles with common name)
[12:22:08] replied on list
[12:22:08] got documented at https://wikitech.wikimedia.org/wiki/Salt :D
[12:22:59] thx
[12:24:22] yw, so many grains
[12:24:47] we just need to make the role names follow common schema
[12:25:15] and i'd say drop "role" from the value. key is called "rolename", no need to repeat "role" in the value
[12:26:19] same as puppet class. it's already called system::role, no need to have "role::" in the name
[12:26:47] that's why the syntax becomes "rolename:role::foo::bla"
[13:01:11] hashar, you can't go :((( seems you killed the dns resolution for beta
[13:15:24] (PS1) Dzahn: fix bastion host system::roles [operations/puppet] - https://gerrit.wikimedia.org/r/122399
[13:17:20] (PS2) Dzahn: fix bastion host system::roles [operations/puppet] - https://gerrit.wikimedia.org/r/122399
[13:22:51] (PS3) Dzahn: include 'bastionhost' on bastion hosts [operations/puppet] - https://gerrit.wikimedia.org/r/122399
[13:25:40] (CR) Dzahn: "originally just wanted to make the system::role consistent, then noticed just some bastions use the module" [operations/puppet] - https://gerrit.wikimedia.org/r/122399 (owner: Dzahn)
[13:28:23] (CR) Dzahn: "duplicate of Change-Id: Ia24c0219c3b3ac3dc3b1f50c963f814b72302da1" [operations/puppet] - https://gerrit.wikimedia.org/r/122353 (owner: Matanya)
[13:32:13] (PS1) Ottomata: [WIP] Adding research posix group [operations/puppet] - https://gerrit.wikimedia.org/r/122401
[13:33:21] mutante: partial dup :)
[13:33:44] you want me to abandon, or merge changes?
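The grain targeting from 12:21 (`salt -G 'rolename:role::ci::*' grains.items`) can be illustrated with a small Python sketch. It is a deliberate simplification and not Salt's actual matcher: it assumes the target is split at the first colon into a grain key and a shell-style glob, and that a grain value is either a string or a list of strings. The function name and the role names used in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase


def grain_matches(target, grains):
    """Return True if any value of the targeted grain matches the glob.

    Simplified model: 'key:glob' is split at the FIRST colon only, so a
    value such as 'role::ci::*' may itself contain colons.
    """
    key, _, pattern = target.partition(":")
    value = grains.get(key)
    values = value if isinstance(value, list) else [value]
    return any(isinstance(v, str) and fnmatchcase(v, pattern) for v in values)
```

With this model, `grain_matches("rolename:role::ci::*", {"rolename": ["role::ci::slave"]})` is true, while a minion whose only role is `role::gerrit` does not match. If the `role::` prefix were dropped from the grain values as suggested at 12:25, the equivalent target would shrink to `rolename:ci::*`.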
[13:33:52] (PS1) Hashar: beta: update deployment-cache-upload address [operations/puppet] - https://gerrit.wikimedia.org/r/122402
[13:34:29] !log fundraising-system full-stop for hardware repairs on queue server
[13:34:35] Logged the message, Master
[13:35:13] matanya: you can pick one:)
[13:35:16] (CR) Hashar: [C: 1 V: 2] "cherry picked on deployment prep puppet master deployment-salt.eqiad.wmflabs" [operations/puppet] - https://gerrit.wikimedia.org/r/122402 (owner: Hashar)
[13:35:55] ok, merge yours and ill rebase mutante fair?
[13:39:50] ottomata: di you mail the remaining 6 stat1 account holders ?
[13:40:20] *did
[13:42:11] no
[13:42:23] plan to ?
[13:43:09] hadn't! but maybe I should!
[13:45:07] PROBLEM - Host silicon is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:22] Jeff_Green: hey
[13:46:24] ^
[13:46:49] mutante: yep, thanks. it's being rebooted for a disk replacement. we forgot to disable notification
[13:47:21] good, that's the best answer:)
[13:48:09] !log icinga alerts disabled for silicon
[13:48:14] Logged the message, Master
[13:48:22] matanya: emailed,
[13:48:24] danke
[13:48:30] ACKNOWLEDGEMENT - Host silicon is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn disk being replaced
[13:48:39] thanks to you :)
[13:50:17] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[13:52:10] Jeff_Green:can you please to change the topic to reflect you are on the duty ?
[13:52:17] -to
[13:52:25] yep.
[13:52:43] thanks
[13:55:16] (PS1) Dzahn: remove db67 from coredb,decom db64,db65,db66,db70 [operations/puppet] - https://gerrit.wikimedia.org/r/122406
[13:55:25] (PS1) Hashar: beta: update deployment-cache-upload address [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122407
[13:55:45] (CR) Hashar: [C: 2] beta: update deployment-cache-upload address [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122407 (owner: Hashar)
[13:56:13] (Merged) jenkins-bot: beta: update deployment-cache-upload address [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/122407 (owner: Hashar)
[13:58:17] (CR) Dzahn: [C: 2] "51.17.68.10.in-addr.arpa domain name pointer deployment-cache-upload02.eqiad.wmflabs." [operations/puppet] - https://gerrit.wikimedia.org/r/122402 (owner: Hashar)
[14:02:08] (CR) Dzahn: [C: 2] beta: point autoupdater to use /data/project [operations/puppet] - https://gerrit.wikimedia.org/r/121360 (owner: Hashar)
[14:03:05] (PS2) Dzahn: beta: bring in scap related scripts on bastion [operations/puppet] - https://gerrit.wikimedia.org/r/121365 (owner: Hashar)
[14:03:14] (CR) Dzahn: [C: 2] beta: bring in scap related scripts on bastion [operations/puppet] - https://gerrit.wikimedia.org/r/121365 (owner: Hashar)
[14:03:41] mutante: do you plan on a dns patch for those db's remove?
[14:05:21] matanya: if you wanna make it, go ahead:)
[14:05:23] yes
[14:05:29] ok, i will
[14:05:33] thanks!
[14:08:07] (CR) Dzahn: [C: 2] role::parsoid::beta needs contint slave scripts [operations/puppet] - https://gerrit.wikimedia.org/r/120823 (owner: Hashar)
[14:10:39] (PS1) Matanya: db: remove 64-67 and 70 [operations/dns] - https://gerrit.wikimedia.org/r/122412
[14:11:10] springle: https://gerrit.wikimedia.org/r/#/c/122406/1 :)
[14:11:15] how about db67
[14:12:36] I'm going to be messing with the logstash cluster for a bit. I'm upgrading the elasticsearch there to the latest 1.0.x version in our apt repo
[14:13:25] !log Stopped logstash on logstash1001
[14:13:31] Logged the message, Master
[14:16:07] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137
[14:16:17] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[14:16:27] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138
[14:16:31] !log Stopped elasticsearch on logstash100[123]
[14:16:37] Logged the message, Master
[14:17:21] (CR) Andrew Bogott: [C: 1] retab role/glance [operations/puppet] - https://gerrit.wikimedia.org/r/121667 (owner: Dzahn)
[14:17:34] bd808: I have switched beta cluster to EQIAD :-]
[14:17:35] bd808: thanks for the heads up
[14:18:03] hashar: yay!:)
[14:18:09] Jeff_Green: Could you please abandon the old changes at https://gerrit.wikimedia.org/r/#/q/owner:%22Jgreen+%253Cjgreen%2540wikimedia.org%253E%22+status:open,n,z that I linked at https://meta.wikimedia.org/wiki/User_talk:Jgreen#Open_review_requests_at_Gerrit? Thanks!
[14:18:30] !log beta cluster DNS entries migrated to point to the EQIAD datacenter. Keeping pmtpa instances around for a couple days
[14:18:35] Logged the message, Master
[14:19:31] (CR) Andrew Bogott: [C: 2] retab role/glance [operations/puppet] - https://gerrit.wikimedia.org/r/121667 (owner: Dzahn)
[14:19:42] :)
[14:20:30] hashar: re: eqiad switch https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+owner:hashar+status:merged+topic:beta,n,z
[14:22:37] great
[14:22:44] mutante: I am deploying them on a local puppet master :D
[14:22:47] https://gerrit.wikimedia.org/r/#/c/122338/ :p
[14:22:49] (Abandoned) Jgreen: reenable otrs GenericAgent.pm [operations/puppet] - https://gerrit.wikimedia.org/r/78819 (owner: Jgreen)
[14:23:01] (PS2) Matanya: glance: lint [operations/puppet] - https://gerrit.wikimedia.org/r/122353
[14:23:05] !log Elasticsearch upgraded to 1.0.1 on logstash100[123]
[14:23:08] andrewbogott: ^^
[14:23:10] Logged the message, Master
[14:23:27] hashar: i remember you guys set that up recently, yea
[14:23:39] mutante: yeah Bryan did :]
[14:24:06] nice
[14:24:45] (Abandoned) Jgreen: puppetize otrs exim system_filter [operations/puppet] - https://gerrit.wikimedia.org/r/77840 (owner: Jgreen)
[14:27:46] (Abandoned) Jgreen: Revert "more fighting with puppet re. exim variables" [operations/puppet] - https://gerrit.wikimedia.org/r/77475 (owner: Jgreen)
[14:28:53] !log Started logstash on logstash1001
[14:28:58] Logged the message, Master
[14:29:21] the speed of gerrit!
[14:30:17] Jeff_Green: its like the flash!
[14:30:53] (Abandoned) Jgreen: start puppetizing new OTRS host iodine [operations/puppet] - https://gerrit.wikimedia.org/r/77227 (owner: Jgreen)
[14:32:19] it was fine until you guys started working .. hush :
[14:32:21] :)
[14:33:29] manybubbles: Some shards are still recovering, but it looks like the upgrade worked as expected.
[14:33:35] (CR) Andrew Bogott: [C: 2] glance: lint [operations/puppet] - https://gerrit.wikimedia.org/r/122353 (owner: Matanya)
[14:33:36] sweet
[14:33:40] bd808: I'm happy to hear that!
[14:33:46] It'll take a few shards a while to recover
[14:33:54] the primaries come much faster
[14:35:56] (Abandoned) Jgreen: Revert "try setting some variables in my.cnf for fundraising db" [operations/puppet] - https://gerrit.wikimedia.org/r/72157 (owner: Jgreen)
[14:36:11] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/122331/1 ?
[14:36:56] mutante: accessing..
[14:36:59] accessing...
[14:37:02] accessing...
[14:37:04] * Jeff_Green dies
[14:38:31] :..yea, confirmed it became slow again
[14:39:06] and next reload.. much better
[14:39:12] try this https://gerrit.wikimedia.org/r/#/c/122331/2/manifests/role/fundraising.pp
[14:40:47] i wonder if gerrit would be happier with double the RAM
[14:41:31] there is one single thing about that, $exim_signs_dkim/bounce_collectors must not be used in any if-statements and still work when they are boolean instead of string
[14:41:56] i'd expect that the RAM option has been discussed before
[14:42:02] but not sure
[14:43:24] accessing...accessing.accessing.
[14:44:20] re. $exim_* ok
[14:46:49] andrewbogott: thanks for that merge. i have more labs infra related stuff anytime you feel like it:)
[14:47:29] yea, gerrit slowness reports
[14:48:00] mutante: merged
[14:48:03] well
[14:48:05] (CR) Jgreen: [C: 2 V: 1] lint fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/122331 (owner: Dzahn)
[14:48:11] "Working..." anyway
[14:48:15] Jeff_Green: :) yay
[14:48:29] HM
[14:48:32] 502’s?
[14:48:51] now plz nobody make me look at gerrit any more today, it's a blackhole for time
[14:48:55] Same as sjoerddebruin. Getting them quite a bit.
[14:50:58] +1
[14:51:00] nlwiki is working, zeawiki not
[14:54:05] bd808: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Logstash%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1396277586&g=network_report&z=large
[14:54:24] (PS1) Hashar: gerrit: remove one replication to gallium [operations/puppet] - https://gerrit.wikimedia.org/r/122419
[14:54:24] * Jeff_Green is confused about the whether there is an Ops meeting today
[14:55:50] mutante: That's expected. The shards are recovering after I upgraded the elasticsearch version this morning. It's close to done but will probably be high io/cpu for another 20 minutes or so.
[14:56:11] mutante: Also, thanks for checking that and pinging me
[14:56:19] (PS1) Hashar: beta: update logstash url in monitor fatals [operations/puppet] - https://gerrit.wikimedia.org/r/122420
[14:56:55] mutante: another super simple change if you want https://gerrit.wikimedia.org/r/#/c/122420/ :D
[14:57:48] bd808: yea, no worries, i just glanced at it because you announced work
[14:58:15] sjoerddebruin: zea works for me
[14:58:34] does anyone know about the 502 reports?
[14:59:49] hashar_: slowness on gerrit .. or i'd be on it already
[15:00:02] that just started around the time of switching beta ,fwiw
[15:00:11] oh interesting
[15:00:15] no idea if it can be related though
[15:00:58] I dont think
[15:01:12] well, hmm, intermittent slowness...
[15:01:18] beta hits Gerrit via an automatic Jenkins job that fetch every six minutes. But I have created that job sometime last week
[15:01:22] yea, unlikely
[15:01:29] nods
[15:02:00] cpu WIO is higher
[15:02:50] (CR) Dzahn: [C: 2] beta: update logstash url in monitor fatals [operations/puppet] - https://gerrit.wikimedia.org/r/122420 (owner: Hashar)
[15:04:26] Jeff_Green: the fundraising change is still on master..
[15:04:35] i was about to merge with hashar's change then
[15:06:04] eh, wait
[15:06:12] Gerrit Code Review: Merge "lint fundraising" into production (2aabae4)
[15:06:16] dzahn: lint fundraising (3808112)
[15:06:16] is that normal?
[15:07:30] mutante: ? I did the gerrit merge but I did not do the puppet-merge on palladium
[15:07:47]