[00:00:15] is all of the drama around db9 over and done with, or are there still issues?
[00:02:42] !log started replicating db50 from db47
[00:02:50] Logged the message, Master
[00:03:40] robla: there will be future downtime
[00:04:16] binasher: when and for how long (roughly)?
[00:05:11] i still see ant-gcj/lucid uptodate 1.7.1-4ubuntu1.1 hashar
[00:05:24] but if that's the only one and stuff works, yay :)
[00:05:45] robla: 15 min at some sooner-the-better future time, then 30 min at some later less urgent time
[00:14:59] LeslieCarr: looks like ant-gcj can be uninstalled, probably a leftover of "ant"
[00:15:04] LeslieCarr: not a big priority though
[00:15:38] so we should be fine
[00:20:48] my phone has not received my bank alerts since verizon ported and took over, vzw forums are full of this complaint, and no one ever posts what the solution is =P
[00:23:54] New patchset: Hashar; "gallium: enable ssh X11 Forwarding" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1641
[00:31:37] Change abandoned: Hashar; "X11 Forwarding not needed, just found how to install android with no GUI:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1641
[00:55:26] !log ran svn up on openstackmanager on virt1
[00:55:34] Logged the message, Master
[00:55:56] !log seems I broke labsconsole :(
[00:56:04] Logged the message, Master
[00:57:52] robla: is there a time that would be best for 15min of db9 downtime?
[00:59:07] binasher: I think the big thing is scheduling it if you know it's going to happen. I'm fine with whatever so long as there's enough advance notice
[00:59:36] robla: how much advance notice would you like?
[01:00:24] binasher: a day would be nice if it's possible
[01:00:56] binasher: could you fire off a note to wikitech-l with your plan?
[01:01:09] basically, if there's a triage or something else, it's nice to have time to reschedule
[01:01:27] anyway....gotta go to a meeting now
[01:57:23] RECOVERY - DPKG on db13 is OK: All packages OK
[02:25:56] !log fixed labsconsole. reverted aws-sdk to 1.4
[02:26:05] Logged the message, Master
[02:28:33] <^demon|dinner> bugzilla-daemon seems to have been bounced from wikibugs-l again.
[03:21:40] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10547 MB (3% inode=99%):
[03:44:42] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10457 MB (3% inode=99%):
[07:23:04] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:45:14] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[08:14:35] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:06] Change abandoned: Hashar; "per mark request, no white space cleanup." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1494
[09:56:27] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:16:38] hi
[10:16:56] apergos: finally.. http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue
[10:18:05] awesome!
[10:18:33] :)
[10:31:33] apergos: opinion on the "check_all_memcached" one.. that Can not connect to 10.0.8.6:11000 (Connection refused)? should it try to connect elsewhere / run on another host / not needed?
[10:31:49] we do need the check
[10:31:52] apergos: that's actually re: RT #1269
[10:32:30] and I don't know what the check looks like these days, there used to be a commandline script one could run
[10:32:36] I think there is a wikitech page about that
[10:32:45] you wrote there "relies on nfs and shouldn't"
[10:33:00] what was the script?
[10:33:15] "The scripts check_job_queue, check_all_memcached.php and check_MySQL.php have been copied from /home/nagios/plugins to /usr/local/nagios/libexec on spence"
[10:33:36] check_all_memcached.php
[10:33:49] yes
[10:34:05] you'll have to look at it to see where it gets its list from
[10:34:17] I would have to dig around to figure out which are the right hosts
[10:34:56] ok
[10:38:32] relies on nfs = require_once( '/home/w/common/wmf-config/mc.php' );
[10:38:45] and that has the list of servers and is publicly viewable
[10:40:51] so it can read the list of servers, because /home is mounted on spence, but can't connect from spence to (some of?) the IPs in wgMemCachedServers
[10:42:26] is that list anywhere else on spence? does spence have /usr/local/?
[10:42:58] yes, the nagios checks are in /usr/local/nagios/libexec
[10:43:13] but i didn't see that list anywhere yet
[10:45:56] spence is unbearably slow to log into
[10:46:34] seems like we should puppetize mc.php
[10:46:46] /usr/local/apache/common-local/wmf-config/mc.php
[10:46:52] no need, spence has it
[10:46:55] ah
[10:47:05] that must be a relatively new development, it didn't used to have /usr/local/apache stuff
[10:47:31] i guess related to dist-upgrade and fixes after that, installing appservers and stuff
[10:47:53] uhhuh
[10:48:29] mostly fixes Roan suggested, also fixed the mw source tree on it
[10:48:39] (was _really_ outdated before)
[10:48:52] wikimedia-task-appserver i meant
[10:49:20] right
[10:49:55] arg, no, i mixed some of that up with fenari.. but still.. yea
[10:53:53] New patchset: Dzahn; "check_all_memcached - do not rely on NFS (fix RT1269)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1642
[10:54:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1642
[10:58:33] RoanKattouw: hi. job queue check now resolved for real (in Nagios web UI) :) +1 for team work
[10:59:06] yay
[11:01:37] New review: Dzahn; "we don't want to rely on NFS, and can now require /usr/local/apache/common-local/wmf-config/mc.php i..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1642
[11:01:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1642
[11:04:52] mutante: Please revert that
[11:05:10] We don't sync things to spence:/usr/local/apache , so the mc.php copy in that dir will get out of date
[11:05:50] RoanKattouw: do you know when those files appeared there? because: < apergos> that must be a relatively new development, it didn't used to have /usr/local/apache stuff
[11:06:21] hmm.. any other idea how to "not rely on NFS"?
[11:06:47] where can i get mc.php from? should i puppetize it for spence then?
[11:06:48] Convince Ryan and Mark to set up syncing to spence?
[11:07:00] I set it up but not everyone has an account there so they told me off for it
[11:07:25] Did someone tell you to make the spence checks NFS-independent?
[11:07:29] doesn't sync work with mwdeploy now?
[11:07:31] yes, an RT ticket did
[11:07:36] and they should be
[11:07:36] apergos: Well, sort of
[11:07:40] for this specific check
[11:07:47] "sort of"?
[11:07:49] apergos: You ssh in as you, *then* sudo to mwdeploy
[11:07:57] meh
[11:08:14] mutante: Then tell them it can't be done unless spence is synced to properly, which requires accounts to be set up
[11:08:40] you know that's not really true
[11:08:44] RoanKattouw: alright, i'll revert and paste that to the ticket, k?
[11:08:55] hmm
[11:09:02] if we need a certain number of conf files from fenari, we can put em over with a cron job, rsync em or something
[11:09:16] and that can be puppetized perfectly fine
[11:09:25] I suppose it will have to be discussed though
[11:10:18] Right
[11:10:25] But that would be an entirely novel approach
[11:10:30] So yeah, it would have to be discussed
[11:11:24] given that right now we have zero approach
[11:11:30] yes, anything we do will be novel
[11:11:33] New patchset: Dzahn; "check_all_memcached - revert change, and use NFS path again, needs discussion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1643
[11:11:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1643
[11:12:16] would be great if you guys add a comment on that. seems like the right place for the discussion?
[11:12:33] PROBLEM - Host srv199 is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:18] prolly so
[11:16:34] New review: Dzahn; "what's the best way to ensure mc.php is present and up-to-date on spence?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1643
[11:16:49] New patchset: Hashar; "WikipediaMobile: add css/html for nightly builds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[11:16:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1644
[11:19:03] New patchset: Hashar; "WikipediaMobile: add css/html for nightly builds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[11:19:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1644
[11:23:35] mutante: will you be available this afternoon to build the testswarm package and make it available on our apt?
[11:23:47] (note I already built it in a lab VM, if that saves trouble) :D
[11:26:16] New review: Dzahn; "also see: http://rt.wikimedia.org/Ticket/Display.html?id=1269" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1643
[11:28:13] hashar: yes
[11:28:20] awesome :)
[11:30:00] eh, well, i thought it was built already, and it's just the "put on our repo" part though
[11:30:26] let's check wikitech again together, i may have to ask you stuff as well
[11:30:50] but we should get it done today, yep
[11:54:38] New patchset: Dzahn; "process monitoring for mobile traffic loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[11:54:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1645
[11:55:40] New patchset: Dzahn; "process monitoring for mobile traffic loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[11:55:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1645
[11:57:55] New review: Dzahn; "hashar, re: "recursive dirs". fyi: http://christian.hofstaedtler.name/blog/2008/11/puppet-managing-d..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1640
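An aside on the check itself: stripped of the MediaWiki specifics, check_all_memcached boils down to a TCP connect against every server in wgMemCachedServers. Below is a minimal Python sketch of that idea (not the actual PHP plugin; the server list is a placeholder for whatever mc.php really contains):

    #!/usr/bin/env python3
    # Sketch only: probe each memcached host:port with a TCP connect and
    # report using Nagios plugin conventions. The list below is a stand-in
    # for the wgMemCachedServers entries in mc.php.
    import socket
    import sys

    SERVERS = ["10.0.8.6:11000"]  # hypothetical entry

    def reachable(server, timeout=2.0):
        host, port = server.rsplit(":", 1)
        try:
            socket.create_connection((host, int(port)), timeout=timeout).close()
            return True
        except OSError:
            return False

    failed = [s for s in SERVERS if not reachable(s)]
    if failed:
        print("CRITICAL: can not connect to " + ", ".join(failed))
        sys.exit(2)  # Nagios exit status for CRITICAL
    print("OK: all %d memcached servers reachable" % len(SERVERS))
    sys.exit(0)

The NFS question is only about where that server list comes from; the probing logic is the same whether the list is required from /home over NFS or read from a local copy.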
[13:58:24] New patchset: Hashar; "enable testswarmm on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646
[14:10:23] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10730 MB (3% inode=99%):
[14:25:27] New review: Dzahn; "just like the other process checks just with different arguments" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1645
[14:25:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[14:39:03] well I have no idea how to merge the branches :-D
[14:39:10] going to do that manually instead :-D
[14:40:01] manually how? :)
[14:40:18] hello mark :)
[14:40:41] well I have a ton of changes related to testswarm in the 'test' branch and I need to merge them into 'production'
[14:41:09] Oh
[14:41:13] You'll want to cherry-pick them
[14:41:14] I thought I could just cherry-pick them but there are so many file moves in between that they do not cleanly apply
[14:41:29] (Please, please do it the proper way rather than by hand :) )
[14:41:39] Hmm
[14:41:48] would love to do it properly so we can keep the sha1 and history
[14:41:50] We should probably just merge test into prod again
[14:43:00] one of my issues is that my changes are in manifests/misc-server
[14:43:22] but in production that was split into a different file, manifests/misc/contint.pp
[14:43:29] so cherry-picking does not work that well
[14:43:34] Yeah
[14:43:39] Let me see how painful a merge would be
[14:43:48] so one possibility would be to create a branch back to the common ancestor
[14:43:54] cherry-pick my changes from 'test'
[14:44:01] then attempt to rebase on origin/production
[14:44:13] but I am not really sure that makes any sense or that it will work
[14:44:15] Only two files conflicted
[14:44:21] Hmm, that sounds like an interesting possibility
[14:44:22] Anyway
[14:44:26] Let me look at my merge conflicts
[14:44:37] which command have you run?
[14:44:44] out of curiosity
[14:45:03] git checkout production
[14:45:05] git merge origin/test
[14:45:13] hashar: You added class misc::contint::test { to misc-server.pp ?
[14:45:18] yes
[14:45:22] OK
[14:45:27] and I got changes there in the test branch
[14:45:54] but in production that is in manifests/misc/contint.pp
[14:45:57] I just gotta fix the conflict in lvs.pp now
[14:45:59] which got changed too
[14:47:40] we're not gonna accept that merge anyway
[14:47:46] since I don't know what can already be merged and what can't
[14:47:50] and we can't even see the diff in gerrit
[14:48:00] so best cherry-pick your changes, I guess
[14:48:30] yeah that is what I thought, but then I get issues with the file moves
[14:51:43] ! [remote rejected] HEAD -> refs/for/production (you are not allowed to upload merges)
[14:51:45] Screw you, gerrit
[14:51:56] That should be allowed
[14:52:02] no
[14:52:25] we currently have no way of seeing what it consists of
[14:52:35] I just merged test into production locally
[14:52:48] I would like to be able to push that, because it sounds useful and makes hashar's life easier
[14:52:55] but you can't do that
[14:53:08] we have no way of telling what your merge contains
[14:53:31] git fetch blah blah blah
[14:53:33] I know it sucks, but that's how it is currently :(
[14:53:36] git diff origin/production..FETCH_HEAD
[14:54:28] i'm not even willing to merge test into production MYSELF right now, as I don't know what can go in already and what can't
[14:54:31] so, the current model sucks
[14:54:40] people should have their own branches :(
[14:55:35] yeah that would be ten times easier
[14:56:57] * RoanKattouw switches to plan B for incorporating hashar's work
[14:57:52] thanks Roan, cause I am still puzzled by my changes locally :-\
[15:09:17] mark_: i still have an RT to upgrade "tarin" (poolcounter), kernel, apt, exim4, perl.. but remembering our recent talk about upgrades.. it is not really a specific issue with poolcounter itself and i would have to disable "wmgUsePoolCounter" in wmf-config for a couple minutes and then enable it again
[15:09:50] hashar: Hmph, I can't really figure out which revs have been applied and which ones haven't. Do you have a list of some osrT?
[15:09:51] *sort
[15:11:31] yeah somewhere :/
[15:11:37] I am sure I saved the list earlier
[15:11:38] mutante: so only those packages need upgrading?
[15:13:23] mark_: there is a bit more.. (meanwhile, since the ticket was created) several libs, logrotate, parted, w3m, rsync, php5-common..
[15:13:40] just upgrade those packages, not the kernel
[15:13:46] tarin is internal, right?
[15:14:07] has a public ip
[15:14:12] whut
[15:14:16] 208.80.152.174
[15:14:20] then let's make a ticket to move poolcounter to another machine
[15:14:43] ok
[15:19:51] !log installing security upgrades on tarin (includes perl and php)
[15:19:59] Logged the message, Master
[15:25:40] RT created
[15:40:16] New patchset: Dzahn; "planet - use star.wmf ssl cert, move to own file, remove hard-coded IP, add locales" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606
[15:40:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1606
[15:40:48] New patchset: Dzahn; "planet - use star.wmf ssl cert, move to own file, remove hard-coded IP, add locales" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606
[15:41:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1606
[15:57:33] anyone having better luck than me at getting ganglia up in their browser?
[15:58:52] getting the Connection refused screen
[15:59:00] i am on spence though
[16:00:10] ok
[16:01:04] not working for me either
[16:01:18] can you kick it since you're over there?
[16:01:25] ( mutante )
[16:01:31] ganglia-monitor , right
[16:01:34] then i already did
[16:01:36] ok
[16:01:48] hmm
[16:02:01] no dice
[16:02:31] New patchset: Catrope; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:02:57] hashar: ---^^
[16:03:05] which process is the one listening on 8654
[16:03:15] (if it is running)
[16:03:29] RoanKattouw: well done Roan!!
[16:03:32] will review that :)
[16:04:25] gmond is running
[16:05:40] it's like the known issue, the difference is: it isn't temporary anymore
[16:11:26] apergos: it's back
[16:11:30] yay
[16:11:33] what did you do?
[16:12:19] kill a history.cgi and wait a bit
[16:12:41] ok
[16:12:55] how is anyone invoking that? I thought you stomped on that one
[16:13:05] it's not for sure that it was really related
[16:13:13] bummer
[16:13:14] but like last time
[16:13:59] i wanted to use it again myself :p
[16:14:10] i want the history some way.. hrmm
[16:14:40] * apergos cues up the music
[16:14:41] but also see my recent mail about duplicate service definitions in nagios
[16:14:52] * you can't always git what you wa-ant..."
[16:14:58] yeah saw it
[16:15:02] puppet is pegged on spence too
[16:15:47] maybe we can truncate history, just keep the last few weeks
[16:16:02] or tweak the query to return only the last 200 rows or something
[16:19:28] hashar: So yeah, in the future you should really have your own branch for testswarm work
[16:20:00] RoanKattouw: and having the testswarm project on labs to run puppet from that branch :D
[16:20:06] apergos: hehe, yeah, especially "git" ;)
[16:20:07] Yes
[16:20:41] ah bummer, log_slow_queries is disabled on db9
[16:24:31] ohhhh, history.cgi greps the log!??!?!?! that's insane.
[16:24:39] hahaha
[16:24:54] what the hell is wrong with these people?
[16:25:02] it could tail -something | grep I guess :-D
[16:25:08] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:25:09] ok so /var/log/nagios/nagios.log is 2.6G
[16:25:13] that's not going to go well
[16:25:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647
[16:26:02] you don't think tail -something would be smart enough to seek to the end and then try walking back a few blocks?
[16:26:11] prolly not, eh
[16:26:29] not sure
[16:26:44] but also history.cgi is a binary
[16:26:54] so much for that fine idea
[16:27:39] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:27:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647
[16:28:42] we should probably rotate that log at least
[16:28:49] we don't??
[16:29:04] it's 2.6GB now, I'm guessing no
[16:29:33] first entry is 10/14
[16:29:48] october!!
[16:29:52] *eyeroll*
[16:29:58] that may have been me actually
[16:29:58] RoanKattouw: ok got the merge reviewed and fixed :-)))
[16:30:27] Yay
[16:38:23] can someone please merge production change https://gerrit.wikimedia.org/r/#change,1647
[16:38:36] that is a merge of my work on testswarm which was validated in a VM on labs
[16:38:44] the merge itself was made by Roan and I reviewed it
[16:38:54] Jeff_Green: apergos.. but log_rotation_method=d in nagios.cfg .. that should be daily
[16:39:35] log_archive_path=/var/log/nagios/archives
[16:39:36] mutante: also it seems like there's debug crap in that log
[16:39:45] looking for a verbosity toggle
[16:39:49] or maybe to have that stuff log elsewhere?
[16:39:56] use_syslog=1
[16:40:30] ah that makes sense
[16:40:36] you think that'll break history.cgi though?
[16:41:05] "If you have log rotation enabled, you can browse history information present in archived log files by using the navigational links near the top of the page."
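On the "truncate history, just keep the last few weeks" idea above: nagios.log lines start with a [unix-epoch] stamp, so pruning by age is a one-pass filter. A hedged Python sketch (paths and the three-week window are illustrative, not anything configured on spence):

    #!/usr/bin/env python3
    # Sketch: keep only recent nagios.log entries. Nagios prefixes every
    # line with "[<unix epoch>]", which is all this relies on.
    import re
    import time

    CUTOFF = time.time() - 21 * 86400  # arbitrary ~3 week retention
    STAMP = re.compile(r"^\[(\d+)\]")

    with open("/var/log/nagios/nagios.log") as src, \
            open("/var/log/nagios/nagios.log.trimmed", "w") as dst:
        for line in src:
            m = STAMP.match(line)
            if m is None or int(m.group(1)) >= CUTOFF:
                dst.write(line)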
" [16:41:06] * apergos tries not to say "but it's already broken!" [16:41:14] ha [16:41:28] apergos: at least you tried :-) [16:41:31] the strange thing is: [16:41:39] i am not suggesting to configure it like that, it is [16:41:39] heh [16:41:55] oh i see that now [16:41:59] /operations/puppet/files/nagios$ grep log nagios.cfg [16:42:19] maybe that's why log_rotation_method=d isn't working :-) [16:42:31] nagios makes me smile more :-) [16:43:24] there's nothing nagios-specific in the rsyslog.d conf [16:44:37] are we sure we're running on /etc/nagios/* and not /etc/nagios3/* [16:44:51] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [16:45:01] Jeff_Green: not sure enough ..hrmm [16:45:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647 [16:45:06] hmm I can prolly gain back about 1.5T by doing a folloup thumb cleaning job [16:45:09] guess it's worth it [16:46:07] Jeff_Green: yes file { "/etc/nagios/nagios.cfg": [16:46:49] k [16:47:24] ah. spence is about to apply puppet changes [16:47:31] oh " logged to the syslog facility, as well as the NetAlarm log" [16:56:50] I dunno, it just seems like nagios log rotation is broken and I don't see any way to debug it [16:59:03] !log manually rotated spence:/var/log/nagios/nagios.log because nagios log rotation appears broken and the file is ~2.6G [16:59:13] Logged the message, Master [17:05:30] someone remind me - who is the mobile site guy? [17:05:39] Patrick Reilly [17:05:46] preilly: [17:06:05] Prodego: ---^^ [17:06:20] ok, looks like he isn't here, I'll just leave him a message in the channel anyway [17:07:10] !log spence: check out "nagios -s /etc/nagios/nagios.cfg" for performance data - it suggests "Value for 'max_concurrent_checks' option should be >= 1231" [17:07:18] Logged the message, Master [17:07:22] preilly: looks like there is some sort of cache issue with &mobileaction=view_normal_site - if you compare http://en.m.wikipedia.org/w/index.php?title=Main_Page&useformat=mobile&mobileaction=view_normal_site to http://en.wikipedia.org/wiki/Main_Page you can see the version given by the 'view main site' link is out of date [17:08:23] !log spence: according to [http://nagios.manubulon.com/traduction/docs25en/tuning.html] we should even double that if we have "high latency values (> 10 or 15 seconds)" and we have like > 1000 [17:08:32] Logged the message, Master [17:10:12] ben-: do you think it would take too long to not prefill swift, and just let it grow as squids request them until we have an acceptable "hit rate"? [17:10:24] doing that is of course a nice way to get rid of unused thumbs... [17:12:19] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:12:34] New patchset: Dzahn; "change max_concurrent_checks from 8 to 1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:12:49] New patchset: Hashar; "jenkins: add git configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1650 [17:14:07] New review: Hashar; "I am pretty sure that is how you can kill a box hard by having nagios fork until the box is out of m..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/1649 [17:14:26] * apergos peeks in [17:14:53] New review: Hashar; "Looks fine now. Thanks Roan for the merge!" 
[17:15:05] main thing would be the possibility of the scalers falling over
[17:15:10] no
[17:15:13] scalers are not involved
[17:15:26] thumbs are requested from ms5, not the scalers
[17:15:45] yes but if the thumb isn't there then the scaler will be asked for it
[17:15:50] Yeah but that would happen anyway
[17:15:54] Swift or no Swift
[17:15:54] yes but the thumb would be there
[17:15:58] why wouldn't it be?
[17:16:13] It may not have been generated yet
[17:16:22] of course, but that's the same as now
[17:16:25] Then the scaler will have to generate it; but it'll have to do that anyway
[17:16:26] Exactly
[17:16:29] so nothing would be different for ms5 or the scalers
[17:16:39] except swift is in the middle between squid and ms5
[17:16:44] then I'm not getting your initial question
[17:16:44] No, only for whatever fallback thingy you use
[17:16:52] New review: Dzahn; "the log file is so huge because it is full of "Max concurrent service checks (8) has been reached", ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1649
[17:17:06] The Swift population thing needs to not assume that everything will be present on ms5
[17:17:20] why not?
[17:17:29] because it might not be
[17:17:33] then ms5 will get it
[17:17:35] Ahm
[17:17:37] Rephrase
[17:17:40] as it has done for years and years
[17:17:52] ah, you are not talking about the production phase
[17:17:52] The scalers will, via a 404 handler and an LVS thing
[17:18:09] What I'm saying is
[17:18:10] this is your "use swift for reads only" piece, is it?
[17:18:16] yes
[17:18:19] this is for the very soon phase :P
[17:18:25] that is what I was missing
[17:18:26] or the now phase
[17:18:32] Whatever populates Swift needs to handle the regeneration-upon-request case correctly
[17:18:52] Hmm, actually I guess that's not even technically necessary
[17:18:58] all it has to do *right now* is just ask ms5 for a copy.
[17:19:01] It's not like there's a negative presence cache in Swift
[17:19:09] mutante: you have to tune it manually
[17:19:09] If the thumb is not on ms5 it can afford to just ignore it
[17:19:17] And it'll go into Swift when it's requested for the 2nd time
[17:19:32] It would be nicer to get it in right upon creation but it's not strictly necessary to do that
[17:19:38] why?
[17:19:41] mutante: look at http://nagios.manubulon.com/traduction/docs25en/tuning.html : need to "nagios -s", find out the minimum number of concurrent checks and double that value
[17:19:42] ms5 always returns a thumb
[17:19:44] whether it's present or not
[17:19:45] mutante: that should do it
[17:19:50] assuming it's valid of course
[17:19:57] if we had to sum up hume's reason to be in one line, what would it be?
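For the record, the max_concurrent_checks rule of thumb being applied in the !log entries above is just two steps, restated here in Python using only the numbers quoted in-channel:

    # Take the minimum that "nagios -s /etc/nagios/nagios.cfg" suggests,
    # and double it when check latency is high (> 10-15 s per the tuning
    # page; spence's latencies were reported > 1000 s).
    suggested_minimum = 1231  # from nagios -s on spence
    high_latency = True

    print(2 * suggested_minimum if high_latency else suggested_minimum)
    # -> 2462, i.e. roughly the 2500 that comes up just below, before the
    #    channel settles on a conservative 64, to be adjusted from graphs.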
[17:20:00] hashar: exactly what i did, look at the commit message
[17:20:11] mutante: oh sorry
[17:20:14] I think it's fine if he prepopulates the stuff that's not on commons
[17:20:18] i'm thinking about naming schemes for its cron jobs, under manifests/misc
[17:20:19] that will give us a little testbed
[17:20:25] yeah it's fine, and will work
[17:20:38] just wondering if we really need to prefill
[17:20:43] "cleanest" would be not to
[17:20:47] but it would take a bit longer
[17:20:55] but in a month, we'd be in a position to get rid of ms5 then if we wanted
[17:20:57] that will also tell us something about the length of time it would take to prepopulate commons if we went that route
[17:20:59] earlier, if we prefill
[17:21:04] hashar: but yeah, it suggests 1231, and the tuning page would then suggest 2500 :o
[17:21:18] mutante: there must be something wrong somewhere
[17:21:30] mutante: cause we really don't want 1230 processes in parallel :)
[17:21:32] hashar: also just wanted to start a discussion about a good value, but 8 is really low.. also reading http://nagios.manubulon.com/traduction/docs14en/checkscheduling.html#max_concurrent_checks
[17:21:44] 8 is too low for sure
[17:21:58] set it to 64 or so
[17:22:11] and watch its graphs
[17:22:16] increase or decrease as necessary
[17:22:59] * apergos goes to look at space usage on ms6, out of curiosity
[17:23:10] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649
[17:24:14] 3.2T
[17:24:33] because another possibility is to prepopulate by copying off of ms6
[17:24:46] but why would you do that
[17:25:03] if we had to choose some files
[17:25:08] ms6 doesn't have everything
[17:25:32] mutante: here is the formula to compute it http://nagios.manubulon.com/traduction/docs25en/checkscheduling.html#max_concurrent_checks
[17:25:32] no, but we're already talking about not stuffing in everything right away
[17:25:56] if we had to make a good guess about thumbs to copy over, the ones cached by ms6 might be a good base
[17:25:59] I fear ms6 might have non-fresh thumbs
[17:26:00] let's not
[17:26:06] hashar: yea, that's the page i was reading too
[17:26:07] food, bbl
[17:26:15] same here, bbl
[17:26:22] have a good dinner
[17:26:37] we're serving non-fresh thumbs?
[17:29:04] ms5 always returns a thumb
[17:29:12] That depends on what you mean by returns
[17:29:20] NFS? HTTP request? Something else?
[17:29:20] if there is a source image, yes
[17:29:39] http
[17:29:42] Ah, yes
[17:29:45] HTTP will return one
[17:29:45] since that's how requests get made.
[17:29:52] I forgot that the storage server and the web server are the same
[17:30:09] and of course it's then on the filesystem too
[17:32:51] 6.5T, still using a lot on ms5
[17:43:41] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648
[17:45:14] (had to do that so it doesn't break stuff) /me away now
[17:59:10] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1649
[17:59:49] if someone has any time, it would be great to review the hacky https://gerrit.wikimedia.org/r/#change,1644
[18:00:03] it is a basic portal for our Android application nightly builds
[18:00:12] will have someone enhance the layout / css later on
[18:00:22] off for today, see you tomorrow :)
[18:10:40] mark: ms5's hit rate is between 70 and 110 qps over the course of a day. unless I can get swift's rate for cache misses > 110, I don't think it's feasible. (currently it's ~50qps)
[18:11:25] ben-: I don't understand
[18:11:43] oh
[18:11:51] you're saying swift can't write faster than 110qps?
[18:12:18] PROBLEM - Lighttpd HTTP on dataset1 is CRITICAL: Connection refused
[18:12:25] I see
[18:12:28] yes, that is a problem
[18:13:17] that's not critical, hush naggy
[18:13:20] New patchset: Jgreen; "new class for misc::maintenance stuff, cronjobs for hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:13:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1651
[18:13:47] gar.
[18:14:08] mark: I suspect that there's tuning to be done in the number of processes etc. running on the storage nodes.
[18:14:23] but I didn't see immediately in ganglia what the bottleneck was.
[18:14:37] maplebed: so basically you're saying that we need to prefill because swift is not capable of keeping up with the normal squid miss request rate if they're all writes
[18:14:39] is that correct?
[18:14:41] the ms servers are all using a surprising amount of CPU but I haven't looked at it yet.
[18:14:56] mark: yes, that's correct.
[18:14:58] are you working this week?
[18:14:59] ok
[18:15:24] normal passthrough to the image scalers is ~25qps, so it can handle the current miss rate if it's prepopulated.
[18:15:30] right
[18:15:31] but not if it's empty.
[18:15:40] but that's worryingly low, hopefully we can tune it to perform much better
[18:15:48] since that is gonna hit us anyhow
[18:15:50] that is worryingly low, it's true.
[18:16:05] I would bet it will scale very well with the number of storage nodes;
[18:16:16] yeah
[18:16:19] but I don't have a 4th node to play with yet.
[18:16:19] I certainly hope so
[18:16:20] can I run your scripts tomorrow on something?
[18:16:20] ;)
[18:16:34] fine by me.
[18:16:40] would it be feasible to add an eqiad node to the mix, or would the latency be too high?
[18:17:04] I think the latency would invalidate the test.
[18:17:11] ok
[18:17:35] I think swift could handle it, as a cluster, I'm just not sure it would get us very useful info.
[18:17:44] right, valid concern
[18:17:59] ok, i'll play tomorrow
[18:18:08] and look at bottlenecks and such
[18:18:13] do you think there's any chance of getting ms4 working?
[18:18:21] I fear it might take a while :(
[18:18:27] bummer.
[18:18:30] memory/mainboard damage is not easy to fix :(
[18:18:36] yup.
[18:18:48] apergos: ms8 is still replicating off ms7, right?
[18:18:56] so it would be a shame to reinstall that box also
[18:19:01] yes it is
[18:19:03] the current state does give us good info about cpu usage though, and that we'll want our storage bricks to be dual quad core.
[18:19:15] perhaps it's hashing
[18:19:20] are ya'll following the openstack list?
[18:19:35] i'm not
[18:19:38] jeremyb: not as closely as I should be.
[18:19:45] I'm subscribed but rarely read it.
[18:20:06] nope, not on it
[18:20:27] see this thread: https://lists.launchpad.net/openstack/msg06187.html
[18:20:59] oomph. their performance is way worse than ours, but the ratio is similar.
[18:21:15] well, increasing the number of workers would be a good thing to test anyhow
[18:21:17] (3qps writes, 25 qps reads vs. 50qps writes and 1100 qps reads)
[18:21:27] until we see cpu saturation
[18:21:35] +1 mark.
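Pulling the day's rates into one back-of-the-envelope check (every figure here is one quoted in the conversation, nothing else):

    # Can Swift go live without a prefill?
    squid_miss_qps = 110         # peak rate of thumb requests reaching ms5
    swift_write_qps = 50         # measured Swift cache-miss (write) rate
    scaler_passthrough_qps = 25  # normal passthrough to the image scalers

    # Starting empty, every squid miss becomes a Swift write:
    print(swift_write_qps >= squid_miss_qps)          # False -> must prefill
    # Prefilled, Swift only absorbs genuine scaler passthrough as writes:
    print(swift_write_qps >= scaler_passthrough_qps)  # True -> workable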
[18:21:42] ...or no increase in performance ;)
[18:22:53] so maplebed
[18:23:26] if you want to pull the commons images before my "clean up large sizes" job runs, you could do this:
[18:23:48] maplebed: ok, let me know what you find today, and i'll play tomorrow
[18:23:53] apergos: I'm not concerned
[18:23:54] any thumbs which have both a 1280px- and a 1024px- (the thumb file name will start with that), skip those two files
[18:23:56] this is just a test cluster
[18:24:01] I'm going to wipe it anyways.
[18:24:08] because those are the ones I'm going to remove anyways
[18:24:10] even if I get stuff you want to delete, it won't really have any effect.
[18:24:21] okeley dokely
[18:24:22] mark: will do.
[18:24:45] in the mean time, I eagerly await news from RobH.
[18:24:47] :)
[18:24:52] yeah
[18:25:09] RobH: could you investigate ways to make ms4 work again?
[18:25:16] like... find a new mainboard somewhere or so?
[18:25:38] it's a shame to not have that box in working condition
[18:25:43] if it doesn't cost much to fix it up anyway
[18:26:19] Ryan_Lane: anything you need now? ;)
[18:26:22] because I'm about to leave again
[18:27:13] pony!
[18:27:18] New patchset: Jgreen; "new class for misc::maintenance stuff, cronjobs for hume typofix: semicolons to commas" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:27:24] maplebed: so is this swift coming from ubuntu lucid?
[18:27:30] if so, we might want to try much newer releases?
[18:27:40] at least for comparative testing
[18:27:42] no, it's coming from ppa.
[18:27:46] ok
[18:27:48] we're on 2.4.5, which is relatively current.
[18:27:58] (might even be the most recent stable)
[18:27:58] imported into our repo?
[18:28:05] yeah.
[18:28:20] alright
[18:28:27] see http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#packages
[18:28:29] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1651
[18:28:29] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:28:48] oh, sorry. that says we're on 1.4.3, not 1.4.5.
[18:29:18] mark: umm, I don't think so
[18:29:33] Ryan_Lane: your lvs/networking changes I guess, I hope to get to that this week ;)
[18:29:39] heh
[18:29:50] yeah. wasn't going to say that one since you were leaving ;)
[18:29:58] I couldn't imagine that's a super quick change
[18:30:05] probably won't be, no
[18:30:18] alrighty
[18:30:19] see ya later
[18:30:26] * Ryan_Lane waves
[18:31:12] maplebed: give me a heads up when you decide to start running your find
[18:31:53] apergos: I would like to soon; my goal is to see if swift performance changes when it has millions of files instead of thousands.
[18:32:07] are you thinking today?
[18:32:11] yeah.
[18:32:13] ok
[18:32:22] maplebed: per container?
[18:32:33] jeremyb: yes.
[18:32:45] so tomorrow I'll see if it's still running
[18:32:57] apergos: feel free to kill it if it is.
[18:33:00] maplebed: i think ~1 million is about the ideal ceiling per container?
[18:33:05] so I'm not competing with you for the few free i/o cycles
[18:33:06] but i may be rusty
[18:33:12] nah, the test needs to get going sooner rather than later
[18:33:18] my cleanup can wait a few days
[18:33:43] apergos: maplebed: i'm not totally clear on which rates you're measuring. or where in the upload process these writes happen
[18:33:57] not always in the upload process
[18:34:09] they are in the "user wants a thumb. we don't have it. make one" process
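apergos's deletion criterion from earlier in the hour, restated as a small Python sketch; it assumes the usual one-directory-per-source-image thumb layout, and the helper name is invented:

    import os

    def cleanup_candidates(thumb_dir):
        """Yield the 1280px- and 1024px- thumbs in a directory, but only
        when both sizes exist -- the "it has both" indicator from above."""
        names = set(os.listdir(thumb_dir))
        for name in names:
            if name.startswith("1280px-"):
                sibling = "1024px-" + name[len("1280px-"):]
                if sibling in names:
                    yield os.path.join(thumb_dir, name)
                    yield os.path.join(thumb_dir, sibling)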
make one" process [18:34:48] jeremyb: this picture might help: http://wikitech.wikimedia.org/view/File:Thumbnail_request_path_all_swift.jpg [18:34:59] (or it might just make things worse) [18:35:27] heh [18:36:03] that was not written by anna lena :( [18:36:11] mark: eep still there ? [18:36:26] who's anna lena? [18:36:33] maplebed: 1sec [18:39:15] sorry, was out snagging food [18:39:20] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [18:39:34] maplebed: my call with dell is later this afternoon, it got pushed by them =P [18:39:54] annoying, but no big deal, I suppose. [18:42:38] hrmm, identi.ca's broken and snapbird's not working either :( [18:45:40] maplebed: http://commons.wikimedia.org/wiki/Category:Visual_documentation_of_Chapters%27_Meeting_2011 [18:45:55] https://twitter.com/#!/annalena https://twitter.com/#!/amanda_lyons [18:46:15] anyway, that's a tangent. /me tries to parse the picture [18:46:39] heh... no, I am not a graphic artist or visual display of information pro. [18:47:03] but the picture at least helped me get all the pieces straight in my own mind. [18:47:28] I would like to be; I have incredible respect for people good at visual information. [18:52:43] jeremyb: those make me think of http://www.youtube.com/watch?v=u6XAPnuFjJc [18:54:42] (which is awesome for content as well as presentation, but that's a different story.) [18:54:47] maplebed: why do they make it so hard (on their site not the video) to figure out what their TLA stands for? [18:55:06] i found it eventually [18:55:09] oh, the rsa? no idea. [18:55:32] at first i thought it was going to be a guy explaining crypto... [18:55:42] on a whiteboard [18:55:57] or with kids toys or something [18:56:51] maplebed: does this test have any nfs at all? [19:01:35] swift doesn't use nfs. [19:01:39] maplebed: what does dotted line mean vs. solid? [19:01:48] nfs is in the image [19:01:58] uppper right corner and bottom middle [19:02:13] right - the image scalars, ms5, and ms7 all use nfs. [19:02:24] they will continue to? [19:02:24] swift is a step towards getting us away from that. [19:02:51] i understand you may need nfs during the transition. but to simplify things, maybe first do a test with no nfs at all [19:03:07] i.e. the way that things will be after transition is complete [19:03:19] all the content for the test I'm running stops at ms5. [19:03:38] (i.e. doesn't go back to the scalers and make use of nfs) [19:03:53] so essentially, yes, this test does not touch the parts of the system that use nfs. [19:03:53] are you running your own scalers? [19:04:04] no, I'm just not using any content that needs to be scaled. [19:05:30] http://wikitech.wikimedia.org/view/Swift/Load_Thumbnail_Data is the process I'm following to test [19:05:52] by using a directory listing from ms5 as the source for the list of requests, I make sure to avoid anything that would need to be scaled. [19:07:12] maplebed: unless some one purged a few things in the mean time :) [19:07:26] heh... yeah, there is that. [19:07:54] but from the disk space listing on ms5, I'm pretty sure I'm not hitting much of that. [19:08:07] sure, I'm just being pedantic [19:09:52] maplebed: instead of `find $i -type f` you may try `find $i -name commons -a -type d -prune -o -type f -print` or something like that (i can tweak it if you like) [19:10:05] I'll be deleting prolly about 2 million thumbs over the next several days/2 weeks [19:10:09] maplebed: although that's gnu find. 
[19:10:18] I am not working off of a pre-existing list for those
[19:10:23] ( maplebed )
[19:10:49] jeremyb: that would exclude all the thumbs in commons?
[19:11:15] maplebed: yeah... been a little while since i did it. let me play with it for a min
[19:12:06] I'm ok including commons for now.
[19:12:35] apergos's scripts will just re-delete anything I accidentally recreate. :P
[19:13:12] orly
[19:13:21] they will? :-P
[19:14:02] won't they? your criteria don't say anything about creation time (except for the google stuff, which is 'newer' than when they started, so stuff I recreate counts)
[19:14:48] well I was going to do it based on creation time but I decided against it
[19:15:01] instead the "it has both 1280px and 1024px" seemed like a good enough indicator
[19:15:10] +1
[19:16:09] script ready to go. I'll start it when your find is all happy and done
[19:16:22] maplebed: so, in your test you fetch each one once only and start with empty containers? so then you end up with all misses? or are there some hits? (is this test repeatable/has it been repeated?)
[19:17:06] I've only been doing all misses or all hits, so as to get clearly differentiated numbers.
[19:17:30] so, I start with an empty container, load all the images, then load them all again, and I get two sets of stats; the first for writing misses and the second for reading hits.
[19:17:42] to repeat, I drop the container, recreate it, and do it again.
[19:19:08] soooo.... if you are really going to use a list of the commons images on ms5 now for these tests, it makes no sense for me to remove anything
[19:19:55] apergos: I don't think I'll repeat the commons test,
[19:20:21] so it's ok to delete stuff.
[19:20:30] or I can recreate the file list so as not to hit stuff you've deleted.
[19:20:30] ok
[19:20:46] ok
[19:20:57] (it could be worse, it could be zfs with snaps enabled)
[19:25:36] no Reedy... I want him
[19:25:59] I'm here?
[19:27:21] Reedy: are you not here?
[19:27:28] * AaronSchulz wonders where "here" is
[19:27:36] * Reedy goes to find a mirror
[19:27:53] <^demon|away> I'm here too!
[19:28:51] * jeremyb can't parse the Reedyness
[19:28:52] ah there you are!
[19:28:59] sorry, you were so quiet all day...
[19:29:07] is the rotatebot backed up or something?
[19:29:20] There's like 7000 images..
[19:29:21] or was
[19:29:29] that seems like a lot
[19:29:34] indeed
[19:29:39] I think people went on a tagging spree
[19:29:58] I found an SVG with a rotation request...
[19:30:00] which is great but I thought it was running a lot faster on hume now...
[19:30:09] awesome
[19:30:19] maplebed: see you later... i'll think about your artwork while i'm gone
[19:30:19] 5390 at the moment
[19:30:33] jeremyb: I've got a few more for you.
[19:30:39] As it seemed to be faster on toolserver (though, they don't seem to think it is now), it was decided easier just to let it be on toolserver
[19:30:46] oh.
[19:30:52] But saibo has been talking about getting us to clear the backlog
[19:30:54] jeremyb: http://wikitech.wikimedia.org/view/Swift#All_Swift
[19:30:54] I thought after the fixing up that it was faster on hume
[19:30:58] that puts it in context
[19:31:06] it's faster on both
[19:31:10] ok
[19:31:24] there's a couple of lines to merge in.. And we've got to decide on a sleeping factor if we run it (to save the scalers)
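The two-pass test maplebed describes above, sketched in Python for illustration. The frontend URL and list file are placeholders, and a real harness would track errors and latency percentiles rather than one crude qps figure:

    import time
    import urllib.request

    BASE = "http://swift-frontend.example/"  # hypothetical test endpoint

    def run_pass(paths):
        start = time.time()
        for p in paths:
            try:
                urllib.request.urlopen(BASE + p, timeout=10).read()
            except OSError:
                pass  # a real run would count failures separately
        return len(paths) / (time.time() - start)  # requests per second

    with open("ms5-file-list.txt") as f:  # directory listing taken on ms5
        paths = [line.strip() for line in f]

    miss_qps = run_pass(paths)  # pass 1: empty container, all writing misses
    hit_qps = run_pass(paths)   # pass 2: same requests, all reading hits
    print(miss_qps, hit_qps)

Dropping and recreating the container between runs, as described, resets the experiment.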
[19:31:29] yes
[19:32:20] maplebed: danke
[19:32:41] at a 5 sec sleep after every rotate (which might be totally superfluous) it's still going to get through things relatively quickly
[19:32:49] as long as it runs one process only
[19:32:56] it shouldn't really be a problem
[19:33:11] maplebed: erm, < jeremyb> maplebed: what does dotted line mean vs. solid?
[19:33:23] * maplebed looks
[19:33:38] ah,
[19:33:42] dotted is NFS
[19:33:55] solid is HTTP
[19:33:56] almost.
[19:34:09] heh
[19:34:16] dotted is also load balanced
[19:34:39] for the connection from LVSx to srvx or sqx, it indicates it's going to one of a pool
[19:35:23] it's a little hard to see, but pink dotted == NFS, black dotted == load balanced HTTP.
[19:35:24] yeah, ok
[19:35:34] danke
[19:35:35] (more visible at full resolution)
[19:36:29] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:36:36] (that should have been labeled. my bad.)
[19:39:35] apergos, or we could cron it like they do on TS - every X minutes, if it's not running, run it
[19:41:03] well the first run I would do by hand (7000 images)
[19:41:21] but after that why not stick it in cron for every ... dunno, 20 mins or so
[19:42:56] the other thing that it could do is run only during "off peak" hours on hume
[19:43:10] when the US is asleep
[19:43:21] yeah, which would be easy enough when in cron
[19:43:25] yup
[19:43:37] ok, that's all I had, I was just curious...
[19:43:43] preilly: pokr
[19:43:46] r->e
[19:44:00] I think it's approved to get a bot flag too
[19:44:17] Prodego: ?
[19:44:31] I would hope so, otherwise it would be filling rc!
[19:44:38] preilly: I pinged you a while back about the mobile site, did you see that message?
[19:44:42] or should I paste it again
[19:44:49] Prodego: nope, paste again
[19:44:54] preilly: looks like there is some sort of cache issue with &mobileaction=view_normal_site - if you compare http://en.m.wikipedia.org/w/index.php?title=Main_Page&useformat=mobile&mobileaction=view_normal_site to http://en.wikipedia.org/wiki/Main_Page you can see the version given by the 'view main site' link is out of date
[19:45:19] Prodego: those are the same for me
[19:48:36] preilly: they aren't for me
[19:48:44] Prodego: where are you located?
[19:48:56] for the mobile link the FA is Andalusian horse
[19:49:11] whereas it is currently McCormick Tribune Plaza & Ice Rink
[19:49:21] not for me
[19:49:25] it's correct for me
[19:49:26] US PA
[19:49:55] ok, well I'm not sure what that would be from then
[19:50:11] I've never opened that page before, so it shouldn't be a local cache
[19:50:16] and I am not the only user seeing a difference
[19:50:40] Prodego: yeah, I know - I've been trying to figure it out
[19:50:50] Prodego: I'll continue to work on it
[19:51:49] let me know if there is anything I can check for you
[19:52:36] Prodego: okay, if I delete all my cookies I see the stale page
[19:53:10] Prodego: okay, try it again
[19:53:12] logged in users can bypass the cache if they set that in their prefs
[19:53:36] still see Andalusian horse after a purge
[19:54:54] mark: any news on the exim set up?
[19:55:02] puppetization, that is
[19:55:06] Prodego: I'll continue to look into it
[19:55:14] sure
[19:57:51] Prodego: what sort of mobile device or browser are you using?
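The "every X minutes, if it's not running, run it" cron pattern mentioned above is usually a cron entry plus an exclusive lock. A minimal Python wrapper sketch; the script path, lock path, and sleep flag are all hypothetical, not rotatebot's actual interface:

    import fcntl
    import subprocess
    import sys

    # Invoked from cron every ~20 minutes; exits quietly if the previous
    # run is still holding the lock, so only one process ever runs.
    lock = open("/var/lock/rotatebot.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # still busy from the last tick
    subprocess.run(["/usr/local/bin/rotatebot", "--sleep", "5"])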
[19:58:10] I'm just using my computer, chrome browser
[19:58:17] I could check it on my phone too if we want
[19:58:28] maplebed: https://answers.launchpad.net/swift/+question/159085
[19:59:15] binasher: do you think that the &useformat=mobile&mobileaction=view_normal_site isn't getting the HTCP purges?
[19:59:35] interestingly I see the old FA both with the &mobileaction=view_normal_site and in the normal mobile view on my phone
[19:59:56] on my computer I see the correct FA without &mobileaction...
[20:00:09] Prodego: do you have any cookies set?
[20:00:18] I can't really check that on my phone
[20:00:39] AaronSchulz: you're on the email apergos sent, but the estimate for commons thumbs is 98 million.
[20:00:42] preilly: yeah, that is likely it.. and there could be a cached version of that full url with a variance that we aren't hitting in our browsers
[20:00:58] AaronSchulz: so it sounds like we should shard commons thumbs.
[20:01:00] :(
[20:01:13] binasher: any ideas how to fix it?
[20:01:20] on chome I'm logged in on the main site, so I'll have those cookies, I can check...
[20:01:22] maplebed: no SSDs? :/
[20:01:24] chrome*
[20:01:51] sharding won't be hard though, so it might be cheaper anyway
[20:02:35] 10 million. hmm
[20:02:40] if you have wiki login cookies but not the mobile beta opt-in cookie, you'll still hit cache via the mobile site
[20:02:53] 16 shards?
[20:02:56] though getting a listing of all thumbs won't be as easy, if anyone wants to do that
[20:02:58] 0, 1, 2...?
[20:03:19] (cheap but it's right there in the directory path and the filename hash so... )
[20:03:22] AaronSchulz: ssds are scaling vertically, where cost increases exponentially with scale. sharding in the URL scales horizontally, where cost scales linearly. so no, I don't think we should do the ssd route when we have an easy method of doing the horiz method.
[20:03:39] preilly: I have 3 cookies from en.m.wikipedia.org on my computer
[20:03:40] it would have to iterate through the segments, mostly can be abstracted away in the swift subclass
[20:03:42] apergos: we already shard on 2 hex digits, I'd vote we keep that and do 256.
[20:03:47] anyway, gotta run for lunch.
[20:03:48] bbl.
[20:03:54] 256 sounds fine to me
[20:04:18] clicktracking-session, mediaWiki.user.bucket:ext.articleFeedback-options, and mediaWiki.user.bucket:ext.articleFeedback-tracking
[20:04:22] maplebed: for sure, it's cheaper
[20:04:29] ben already is planning a find to get the thumb list.
[20:04:59] yeah, 256 is future-proof
[20:05:08] we could be more clever about it than that but it doesn't matter, it can just run slowly and finish eventually
[20:05:23] preilly: we could make varnish not cache any urls with mobileaction=view_normal_site
[20:05:32] well the other thing that would make this future-proof is that we don't really have to keep all thumbs we generate for all time
[20:05:38] binasher: yeah, that is probably best for now
[20:05:41] we should think about treating it like a true cache
[20:05:50] or only cache them for a short time
[20:05:52] a very very large one but one that has limits
[20:06:03] binasher: can you do that change?
[20:06:07] binasher: or, are you slammed?
[20:06:21] binasher: note that on my phone I see the wrong version in the regular mobile view too
[20:06:37] so I'm not entirely sure why that would be
[20:07:32] Prodego: what actual url are you viewing on your phone?
[20:08:15] en.m.w.../wiki/Main_page
[20:08:23] preilly: grabbing lunch in a min, but i can make that right after
[20:08:33] binasher: cool
[20:08:36] binasher: if I capitalize Page then I see today's TFA
[20:09:50] binasher: preilly and http://en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=view_normal_site&useformat=mobile gives a totally different view
[20:10:21] same URL, only the parameters are in a different order
[20:10:31] Prodego: it will be fixed shortly
[20:10:33] it must be a very top level cache problem
[20:10:44] Prodego: it is a varnish cache issue
[20:11:00] Pages get purged from cache via their canonical url. If mediawiki is going to serve the same page at MaiN_paGE, but only send a purge when it's updated for Main_Page, that isn't a varnish or mobile issue.
[20:11:21] squid would have the same issue
[20:11:54] binasher: Main_page is a redirect to Main_Page
[20:11:58] binasher: but, it should be fixed soon by your change
[20:12:02] right?
[20:12:16] so it isn't serving the same page, it is serving a different page that is a redirect, if that makes sense
[20:13:52] http://en.wikipedia.org/wiki/Main_page is not a redirect to Main_Page on the full site, and Main_page is cached by squid at that url, as a distinct object from Main_Page.
[20:13:57] RECOVERY - Lighttpd HTTP on dataset1 is OK: HTTP OK HTTP/1.0 200 OK - 1512 bytes in 0.009 seconds
[20:15:03] binasher: it is - http://en.wikipedia.org/wiki/Main_page?redirect=no
[20:15:04] Change abandoned: Hashar; "Not required. This can be configured from the Jenkins web interface." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1650
[20:15:30] Prodego: i'm talking about redirects in terms of http responses
[20:15:39] oh yes, it isn't an HTTP redirect, no
[20:16:24] re: internal mediawiki redirects - does mediawiki have logic to send purges for all distinct names that redirect to a page when that page is changed?
[20:42:49] hey ho :)
[20:43:11] are there any SF ops willing to review a nice change? That is mostly apache configuration and some ugly HTML/CSS design
[20:43:11] https://gerrit.wikimedia.org/r/#change,1644
[20:43:35] that would let us publish our android apps' nightly builds
[20:51:37] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: PROCS CRITICAL: 1 process with args varnishncsa
[21:09:21] maplebed: https://answers.launchpad.net/swift/+question/134996, you're right
[21:14:28] http://en.wikipedia.org/wiki/CAP_theorem <- fun stuff
[21:26:06] New patchset: Asher; "reduce cache ttl to 60s for "mobileaction=view_normal_site" urls since they don't get purged. also fix frontend / hack timing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:26:23] preilly: https://gerrit.wikimedia.org/r/1652
[21:27:15] binasher: what about the &useformat=mobile case?
[21:28:05] hm.. do we ever provide links relative to m.wiki that include useformat=mobile but not the mobileaction= bit?
[21:29:08] apergos: dunno if you're still around, but ganglia suggests that the ionice -c 3 trick actually means that the find has no effect on ms5 throughput. image requests per second is unchanged despite 100% disk io.
[21:29:19] New review: preilly; "This looks okay to me." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1652
[21:29:32] binasher: yes
[21:29:55] Ryan_Lane: https://gerrit.wikimedia.org/r/#change,1652
[21:30:05] preilly: ok, i'll amend the commit.
[21:30:10] Ryan_Lane: sorry, that was: https://gerrit.wikimedia.org/r/#change,1652,blowme
[21:30:28] preilly: do urls with mobileaction= always include useformat=mobile?
[21:30:35] binasher: no
[21:30:40] grr
[21:31:34] Ryan_Lane: can you merge https://gerrit.wikimedia.org/r/#change,1644
[21:32:04] preilly: why are you asking me to merge it?
[21:32:33] what's it for?
[21:33:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1644
[21:33:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[21:33:29] New patchset: Asher; "reduce cache ttl to 60s for "mobileaction=view_normal_site" urls since they don't get purged. also fix frontend / hack timing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:33:50] preilly: can you reload / re-review 1652?
[21:34:39] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1652
[21:34:56] binasher: looks good
[21:35:13] ok
[21:35:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1652
[21:35:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:37:11] preilly: ok, it's successfully deployed to the varnish servers. i'm not going to flush caches though, i don't think it's that big of a deal to wait
[21:37:21] binasher: OKAY
[21:37:27] binasher: WHAT?!?
[21:37:32] binasher: jk
[21:37:56] preilly: tell people to add &ok=WHAT to urls if they want a fresher version
[21:38:16] binasher: that would be awesome, you sir are a genius
[21:38:57] yeah I have stopped counting the number of beers I owe to the ops
[21:39:09] I guess I will just offer several rounds of beer :D
[22:21:20] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10596 MB (3% inode=99%):
[22:23:17] binasher: you're doing something with db9 tonight, right?
[22:23:38] LeslieCarr: nope, tomorrow at 6pm
[22:23:55] ah, just saw the disk warning
[22:24:01] will db9 survive until tomorrow?
[22:24:04] yup
[22:24:26] thanks for checking tho
[22:28:39] AaronSchulz: I'm gonna break the eqiad swift cluster for a bit to try out including the hash in the container name.
[22:38:32] AaronSchulz: interested in doing a code review on my change?
[22:38:56] depends what it is, I could try
[22:39:04] one sec; lemme check it in.
[22:41:36] AaronSchulz: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/106882
[22:45:02] nice move using named subgroups
[22:46:17] thanks. I always think it's confusing without, but optional substrings just make it way worse.
[22:46:53] of course, this change does increase the number of containers we will need from 2,568 to 657,408.
[22:47:02] which is a little absurd.
[22:47:19] but... I don't know that there's much of a limit on the number of containers...
[22:48:02] I don't know why you are capturing "thumb/", and you can't just assume it's thumb/x/xy, it could be thumb/archive/x/xy
[22:48:32] that's not the pattern I expected.
[22:48:36] 657,408, wow :)
[22:49:05] I thought it would be thumb/a/ab/aoeu or archived/a/ab/aoeu, not thumb/archived/a/ab/aoeu
[22:49:47] russ's code obscured it previously, but worked
[22:49:58] damn.
[22:50:10] this is why readable code is good
[22:50:54] * AaronSchulz thought he added some example urls
[22:54:08] maplebed: see r106886
[22:55:25] does the original media store also use archived/deleted in the same way?
[22:55:33] could you throw in examples for that too?
[22:55:47] there is no /deleted for thumbs
[22:55:58] some things might use /temp though
[22:56:14] ::sigh:: these are all things I do not know.
[22:56:19] temp and archive are the only special cases... and I don't even know what uses the former
[22:57:00] AaronSchulz: do you know if there is an exhaustive list of every possible URL style?
[22:57:25] look at phase3/thumb.php
[22:58:07] wfExtractThumbParams()
[22:58:53] ogg and pagetiffhandler have their own basename formats that are not in that file (see upload-scripts/thumb-handler.php)
[22:58:59] but I don't think you care about basenames
[22:59:24] I'll pick apart the regexes, though I was hoping for just a comment.
[22:59:25] ah well.
[23:00:26] maplebed: I think thumb(/temp|/archive)/x/xy/source name/thumb name.ext covers what you want
[23:01:01] thumb(/temp|/archive)/x/xy// to be clearer
[23:01:31] arg... thumb(/temp|/archive)?/x/xy//
[23:02:04] lol
[23:02:11] that's why I want it written down...
[23:02:54] there it is :)
[23:03:16] !log creating a new logical volume on streber called syslog for syslog-ng purposes
[23:03:26] Logged the message, Mistress of the network gear.
[23:03:47] x/xy can be strengthen to only be hex chars if you want, as thumb.php does
[23:04:33] AaronSchulz: I see if ( $archOrTemp == '/archive' ) { $params['archived']; }
[23:04:40] *strengthened
[23:04:41] is the URL really archive but the directory is archived?
[23:04:52] or am I reading that wrong...
[23:05:22] the URL is /archive, 'archived' is a thumbnail transform parameter
[23:06:00] well, it's really just a param to wfStreamThumb() for doing the transform, to be clearer
[23:06:10] $isOld = ( isset( $params['archived'] ) && $params['archived'] );
[23:06:18] http://www.mediawiki.org/wiki/Extension:SwiftMedia expects '-archived'
[23:06:26] this implies that that's wrong.
[23:06:53] then swiftmedia is wrong
[23:07:11] how about deleted? is that -delete or -deleted?
[23:07:31] MW repo zones map to containers, there is no archive container... unless we plan on migrating stuff around
[23:07:44] "archived" files go in the "public" zone
[23:07:50] in an /archive subdir
[23:07:59] "The middleware inserts the account name into the URL, converts the "wikipedia/commons" section into a Swift container name by replacing slash with %2F, adds "%2Fthumb" or "%2Farchived" or "%2Fdeleted" to the container name and adds the rest of the hashing and filename as the object name"
[23:09:42] yeah, that should change... it may work for the thumbnail deploy, but will suck later
[23:09:51] since it doesn't match up with MW zones
[23:10:30] given that I don't understand what it should be, would you mind filing a bug for making it correct?
[23:15:53] archived thumbs worked in rewrite.py before... the docs were wrong. I'll make a report, since it will need more handling for public/deleted/temp anyway
[23:16:49] could you remind me how to make a new svn commit refer to the one I just made?
[23:17:06] i.e. I corrected the regex to do the archive thing right; I want to make this commit refer to the one introducing the change.
[23:17:16] just mention rxxxxxx in the summary
[23:17:22] k.
[23:23:16] AaronSchulz: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/106890
[23:24:00] !log rebooting streber
[23:24:08] Logged the message, Master
[23:25:58] maplebed: https://bugzilla.wikimedia.org/show_bug.cgi?id=33286
[23:28:53] AaronSchulz: where'd the -images part come from?
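AaronSchulz's pattern above, thumb(/temp|/archive)?/x/xy/<source>/<thumb name> with x/xy tightened to hex as thumb.php does, rendered as a Python regex for illustration; the group names are invented here, not the ones rewrite.py actually uses:

    import re

    THUMB_RE = re.compile(
        r"^thumb(?P<zone>/temp|/archive)?"          # optional special case
        r"/(?P<h1>[0-9a-f])/(?P<h2>[0-9a-f]{2})"    # hash shard dirs, hex only
        r"/(?P<source>[^/]+)/(?P<thumb>[^/]+)$"     # source file, thumb name
    )

    m = THUMB_RE.match("thumb/archive/a/ab/Example.jpg/120px-Example.jpg")
    assert m and m.group("zone") == "/archive"
    assert m.group("h2") == "ab" and m.group("thumb") == "120px-Example.jpg"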
[23:29:53] the FileBackend branch merge
[23:30:54] we may store things other than local images in the future, so it's best to use decent container names
[23:31:00] might it be possible to use -media instead of -images? considering sounds and movies are in there too?
[23:31:24] hmm, good idea
[23:31:52] I'm changing that right now
[23:32:04] ok. thanks!
[23:34:24] AaronSchulz: what about -deleted?
[23:34:53] we shouldn't need a rewrite for those
[23:35:17] they are private and are streamed in MediaWiki requests using cloudfiles
[23:35:22] ok.
[23:35:27] $repo->streamFile()
[23:35:28] but do we need a -deleted bucket?
[23:35:32] yep
[23:35:45] s/bucket/container/
[23:35:55] S3 says bucket, swift says container ;)
[23:36:06] ok. what should the full name of the container be? (given the insertion of -media in the name)
[23:36:54] site-lang-media-deleted
[23:37:50] ok, then we still have 4 containers per site-lang pair, so my total container count is still 657,408.
[23:38:19] (the four being -public, -thumb, -temp, and -deleted)
[23:42:50] * AaronSchulz runs tests
[23:48:06] ok, changed
[23:52:00] AaronSchulz: could you update r33286 too?
[23:57:43] done
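A closing footnote on the container arithmetic implied above, using only numbers from the discussion (four zones per wiki, and the 2-hex-digit hash shard moved into the container name):

    # 2,568 containers = 642 site-lang pairs x 4 zones
    # (-public, -thumb, -temp, -deleted); putting the two-hex-digit hash
    # shard into the container name multiplies the count by 256.
    pairs = 2568 // 4
    print(pairs * 4 * 256)  # 657408, the figure quoted twice above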