[00:00:15] is all of the drama around db9 over and done with, or are there still issues?
[00:02:42] !log started replicating db50 from db47
[00:02:50] Logged the message, Master
[00:03:40] robla: there will be future downtime
[00:04:16] binasher: when and for how long (roughly)?
[00:05:11] i still see ant-gcj/lucid uptodate 1.7.1-4ubuntu1.1 hashar
[00:05:24] but if that's the only one and stuff works, yay :)
[00:05:45] robla: 15 min at some sooner-the-better future time, then 30 min at some later less urgent time
[00:14:59] LeslieCarr: looks like ant-gcj can be uninstalled, probably a leftover of "ant"
[00:15:04] LeslieCarr: not a big priority though
[00:15:38] so we should be fine
[00:20:48] my phone has not received my bank alerts since verizon ported and took over, vzw forums are full of this complaint, and no one ever posts what the solution is =P
[00:23:54] New patchset: Hashar; "gallium: enable ssh X11 Forwarding" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1641
[00:31:37] Change abandoned: Hashar; "X11 Forwarding not needed, just found how to install android with no GUI:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1641
[00:55:26] !log ran svn up on openstackmanager on virt1
[00:55:34] Logged the message, Master
[00:55:56] !log seems I broke labsconsole :(
[00:56:04] Logged the message, Master
[00:57:52] robla: is there a time that would be best for 15min of db9 downtime?
[00:59:07] binasher: I think the big thing is scheduling it if you know it's going to happen. I'm fine with whatever so long as there's enough advance notice
[00:59:36] robla: how much advance notice would you like?
[01:00:24] binasher: a day would be nice if it's possible
[01:00:56] binasher: could you fire off a note to wikitech-l with your plan?
[01:01:09] basically, if there's a triage or something else, it's nice to have time to reschedule
[01:01:27] anyway....gotta go to a meeting now
[01:57:23] RECOVERY - DPKG on db13 is OK: All packages OK
[02:25:56] !log fixed labsconsole. reverted aws-sdk to 1.4
[02:26:05] Logged the message, Master
[02:28:33] <^demon|dinner> bugzilla-daemon seems to have been bounced from wikibugs-l again.
[03:21:40] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10547 MB (3% inode=99%):
[03:44:42] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10457 MB (3% inode=99%):
[07:23:04] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:45:14] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[08:14:35] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:39:06] Change abandoned: Hashar; "per mark request, no white space cleanup." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1494
[09:56:27] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:16:38] hi
[10:16:56] apergos: finally.. http://nagios.wikimedia.org/nagios/cgi-bin/extinfo.cgi?type=2&host=spence&service=check_job_queue
[10:18:05] awesome!
[10:18:33] :)
[10:31:33] apergos: opinion on the "check_all_memcached" one.. that Can not connect to 10.0.8.6:11000 (Connection refused)? should it try to connect elsewhere / run on another host / not needed?
[10:31:49] we do need the check
[10:31:52] apergos: that's actually re: RT #1269
[10:32:30] and I don't know what the check looks like these days, there used to be a commandline script one could run
[10:32:36] I think there is a wikitech page about that
[10:32:45] you wrote there "relies on nfs and shouldn't"
[10:33:00] what was the script?
[10:33:15] "The scripts check_job_queue, check_all_memcached.php and check_MySQL.php have been copied from /home/nagios/plugins to /usr/local/nagios/libexec on spence"
[10:33:36] check_all_memcached.php
[10:33:49] yes
[10:34:05] you'll have to look at it to see where it gets its list from
[10:34:17] I would have to dig around to figure out which are the right hosts
[10:34:56] ok
[10:38:32] relies on nfs = require_once( '/home/w/common/wmf-config/mc.php' );
[10:38:45] and that has the list of servers and is publicly viewable
[10:40:51] so it can read the list of servers, because /home is mounted on spence, but can't connect from spence to (some of?) the IPs in wgMemCachedServers
[10:42:26] is that list anywhere else on spence? does spence have /usr/local/?
[10:42:58] yes, the nagios checks are in /usr/local/nagios/libexec
[10:43:13] but i didn't see that list anywhere yet
[10:45:56] spence is unbearably slow to log into
[10:46:34] seems like we should puppetize mc.php
[10:46:46] /usr/local/apache/common-local/wmf-config/mc.php
[10:46:52] no need, spence has it
[10:46:55] ah
[10:47:05] that must be a relatively new development, it didn't used to have /usr/local/apache stuff
[10:47:31] i guess related to dist-upgrade and fixes after that, installing appservers and stuff
[10:47:53] uhhuh
[10:48:29] mostly fixes Roan suggested, also fixed the mw source tree on it
[10:48:39] (was _really_ outdated before)
[10:48:52] wikimedia-task-appserver i meant
[10:49:20] right
[10:49:55] arg, no, i mixed some of that up with fenari.. but still.. yea
[10:53:53] New patchset: Dzahn; "check_all_memcached - do not rely on NFS (fix RT1269)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1642
[10:54:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1642
[10:58:33] RoanKattouw: hi. job queue check now resolved for real (in Nagios web UI) :) +1 for team work
[10:59:06] yay
[11:01:37] New review: Dzahn; "we don't want to rely on NFS, and can now require /usr/local/apache/common-local/wmf-config/mc.php i..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1642
[11:01:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1642
[11:04:52] mutante: Please revert that
[11:05:10] We don't sync things to spence:/usr/local/apache , so the mc.php copy in that dir will get out of date
[11:05:50] RoanKattouw: do you know when those files appeared there? because: < apergos> that must be a relatively new development, it didn't used to have /usr/local/apache stuff
[11:06:21] hmm.. any other idea how to "not rely on NFS"?
[11:06:47] where can i get mc.php from? should i puppetize it for spence then?
[11:06:48] Convince Ryan and Mark to set up syncing to spence?
[11:07:00] I set it up but not everyone has an account there so they told me off for it
[11:07:25] Did someone tell you to make the spence checks NFS-independent?
[11:07:29] doesn't sync work with mwdeploy now?
[11:07:31] yes, an RT ticket did
[11:07:36] and they should be
[11:07:36] apergos: Well, sort of
[11:07:40] for this specific check
[11:07:47] "sort of"?
[11:07:49] apergos: You ssh in as you, *then* sudo to mwdeploy
[11:07:57] meh
[11:08:14] mutante: Then tell them it can't be done unless spence is synced to properly, which requires accounts to be set up
[11:08:40] you know that's not really true
[11:08:44] RoanKattouw: alright, i'll revert and paste that to the ticket, k?
[11:08:55] hmm
[11:09:02] if we need a certain number of conf files from fenari, we can put em over with a cron job, rsync em or something
[11:09:16] and that can be puppetized perfectly fine
[11:09:25] I suppose it will have to be discussed though
[11:10:18] Right
[11:10:25] But that would be an entirely novel approach
[11:10:30] So yeah, it would have to be discussed
[11:11:24] given that right now we have zero approach
[11:11:30] yes, anything we do will be novel
[11:11:33] New patchset: Dzahn; "check_all_memcached - revert change, and use NFS path again, needs discussion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1643
[11:11:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1643
[11:12:16] would be great if you guys add a comment on that. seems like the right place for the discussion?
[11:12:33] PROBLEM - Host srv199 is DOWN: PING CRITICAL - Packet loss = 100%
[11:13:18] prolly so
[11:16:34] New review: Dzahn; "what's the best way to ensure mc.php is present and up-to-date on spence?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1643
[11:16:49] New patchset: Hashar; "WikipediaMobile: add css/html for nightly builds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[11:16:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1644
[11:19:03] New patchset: Hashar; "WikipediaMobile: add css/html for nightly builds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[11:19:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1644
[11:23:35] mutante: will you be available this afternoon to build the testswarm package and make it available on our apt?
[11:23:47] (note I already built it in a lab VM, if that saves trouble) :D
[11:26:16] New review: Dzahn; "also see: http://rt.wikimedia.org/Ticket/Display.html?id=1269" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1643
[11:28:13] hashar: yes
[11:28:20] awesome :)
[11:30:00] eh, well, i thought it was built already, and it's just the "put on our repo" part though
[11:30:26] let's check wikitech again together, i may have to ask you stuff as well
[11:30:50] but we should get it done today, yep
[11:54:38] New patchset: Dzahn; "process monitoring for mobile traffic loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[11:54:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1645
[11:55:40] New patchset: Dzahn; "process monitoring for mobile traffic loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[11:55:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1645
[11:57:55] New review: Dzahn; "hashar, re: "recursive dirs". fyi: http://christian.hofstaedtler.name/blog/2008/11/puppet-managing-d..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1640
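An aside on the check itself: stripped of the MediaWiki specifics, check_all_memcached boils down to a TCP connect against every server in wgMemCachedServers. Below is a minimal Python sketch of that idea (not the actual PHP plugin; the server list is a placeholder for whatever mc.php really contains):

    #!/usr/bin/env python3
    # Sketch only: probe each memcached host:port with a TCP connect and
    # report using Nagios plugin conventions. The list below is a stand-in
    # for the wgMemCachedServers entries in mc.php.
    import socket
    import sys

    SERVERS = ["10.0.8.6:11000"]  # hypothetical entry

    def reachable(server, timeout=2.0):
        host, port = server.rsplit(":", 1)
        try:
            socket.create_connection((host, int(port)), timeout=timeout).close()
            return True
        except OSError:
            return False

    failed = [s for s in SERVERS if not reachable(s)]
    if failed:
        print("CRITICAL: can not connect to " + ", ".join(failed))
        sys.exit(2)  # Nagios exit status for CRITICAL
    print("OK: all %d memcached servers reachable" % len(SERVERS))
    sys.exit(0)

The NFS question is only about where that server list comes from; the probing logic is the same whether the list is required from /home over NFS or read from a local copy.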
[13:58:24] New patchset: Hashar; "enable testswarmm on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646
[14:10:23] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10730 MB (3% inode=99%):
[14:25:27] New review: Dzahn; "just like the other process checks just with different arguments" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1645
[14:25:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1645
[14:39:03] well I have no idea how to merge the branches :-D
[14:39:10] going to do that manually instead :-D
[14:40:01] manually how? :)
[14:40:18] hello mark :)
[14:40:41] well I have a ton of changes related to testswarm in the 'test' branch and I need to merge them into 'production'
[14:41:09] Oh
[14:41:13] You'll want to cherry-pick them
[14:41:14] I thought I could just cherry-pick them but there are so many file moves in between that they do not cleanly apply
[14:41:29] (Please, please do it the proper way rather than by hand :) )
[14:41:39] Hmm
[14:41:48] would love to do it properly so we can keep the sha1 and history
[14:41:50] We should probably just merge test into prod again
[14:43:00] one of my issues is that my changes are in manifests/misc-server
[14:43:22] but in production that was split into a different file, manifests/misc/contint.pp
[14:43:29] so cherry-picking does not work that well
[14:43:34] Yeah
[14:43:39] Let me see how painful a merge would be
[14:43:48] so one possibility would be to create a branch back to the common ancestor
[14:43:54] cherry-pick my changes from 'test'
[14:44:01] then attempt to rebase on origin/production
[14:44:13] but I am not really sure that makes any sense or that it will work
[14:44:15] Only two files conflicted
[14:44:21] Hmm, that sounds like an interesting possibility
[14:44:22] Anyway
[14:44:26] Let me look at my merge conflicts
[14:44:37] which command have you run?
[14:44:44] out of curiosity
[14:45:03] git checkout production
[14:45:05] git merge origin/test
[14:45:13] hashar: You added class misc::contint::test { to misc-server.pp ?
[14:45:18] yes
[14:45:22] OK
[14:45:27] and I got changes there in the test branch
[14:45:54] but in production that is in manifests/misc/contint.pp
[14:45:57] I just gotta fix the conflict in lvs.pp now
[14:45:59] which got changed too
[14:47:40] we're not gonna accept that merge anyway
[14:47:46] since I don't know what can already be merged and what can't
[14:47:50] and we can't even see the diff in gerrit
[14:48:00] so best cherry-pick your changes, I guess
[14:48:30] yeah that is what I thought, but then I get issues with the file moves
[14:51:43] ! [remote rejected] HEAD -> refs/for/production (you are not allowed to upload merges)
[14:51:45] Screw you, gerrit
[14:51:56] That should be allowed
[14:52:02] no
[14:52:25] we currently have no way of seeing what it consists of
[14:52:35] I just merged test into production locally
[14:52:48] I would like to be able to push that, because it sounds useful and makes hashar's life easier
[14:52:55] but you can't do that
[14:53:08] we have no way of telling what your merge contains
[14:53:31] git fetch blah blah blah
[14:53:33] I know it sucks, but that's how it is currently :(
[14:53:36] git diff origin/production..FETCH_HEAD
[14:54:28] i'm not even willing to merge test into production MYSELF right now, as I don't know what can go in already and what can't
[14:54:31] so, the current model sucks
[14:54:40] people should have their own branches :(
[14:55:35] yeah that would be ten times easier
[14:56:57] * RoanKattouw switches to plan B for incorporating hashar's work
[14:57:52] thanks Roan, cause I am still puzzled by my changes locally :-\
[15:09:17] mark_: i still have an RT to upgrade "tarin" (poolcounter), kernel, apt, exim4, perl.. but remembering our recent talk about upgrades.. it is not really a specific issue with poolcounter itself and i would have to disable "wmgUsePoolCounter" in wmf-config for a couple minutes and then enable it again
[15:09:50] hashar: Hmph, I can't really figure out which revs have been applied and which ones haven't. Do you have a list of some osrT?
[15:09:51] *sort
[15:11:31] yeah somewhere :/
[15:11:37] I am sure I saved the list earlier
[15:11:38] mutante: so only those packages need upgrading?
[15:13:23] mark_: there is a bit more.. (meanwhile, since the ticket was created) several libs, logrotate, parted, w3m, rsync, php5-common..
[15:13:40] just upgrade those packages, not the kernel
[15:13:46] tarin is internal, right?
[15:14:07] has a public ip
[15:14:12] whut
[15:14:16] 208.80.152.174
[15:14:20] then let's make a ticket to move poolcounter to another machine
[15:14:43] ok
[15:19:51] !log installing security upgrades on tarin (includes perl and php)
[15:19:59] Logged the message, Master
[15:25:40] RT created
[15:40:16] New patchset: Dzahn; "planet - use star.wmf ssl cert, move to own file, remove hard-coded IP, add locales" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606
[15:40:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1606
[15:40:48] New patchset: Dzahn; "planet - use star.wmf ssl cert, move to own file, remove hard-coded IP, add locales" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606
[15:41:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1606
[15:57:33] anyone having better luck than me at getting ganglia up in their browser?
[15:58:52] getting the Connection refused screen
[15:59:00] i am on spence though
[16:00:10] ok
[16:01:04] not working for me either
[16:01:18] can you kick it since you're over there?
[16:01:25] ( mutante )
[16:01:31] ganglia-monitor , right
[16:01:34] then i already did
[16:01:36] ok
[16:01:48] hmm
[16:02:01] no dice
[16:02:31] New patchset: Catrope; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:02:57] hashar: ---^^
[16:03:05] which process is the one listening on 8654
[16:03:15] (if it is running)
[16:03:29] RoanKattouw: well done Roan!!
[16:03:32] will review that :)
[16:04:25] gmond is running
[16:05:40] it's like the known issue, the difference is: it isn't temporary anymore
[16:11:26] apergos: it's back
[16:11:30] yay
[16:11:33] what did you do?
[16:12:19] kill a history.cgi and wait a bit
[16:12:41] ok
[16:12:55] how is anyone invoking that? I thought you stomped on that one
[16:13:05] it's not for sure that it was really related
[16:13:13] bummer
[16:13:14] but like last time
[16:13:59] i wanted to use it again myself :p
[16:14:10] i want the history some way.. hrmm
[16:14:40] * apergos cues up the music
[16:14:41] but also see my recent mail about duplicate service definitions in nagios
[16:14:52] * you can't always git what you wa-ant..."
[16:14:58] yeah saw it
[16:15:02] puppet is pegged on spence too
[16:15:47] maybe we can truncate history, just keep the last few weeks
[16:16:02] or tweak the query to return only the last 200 rows or something
[16:19:28] hashar: So yeah, in the future you should really have your own branch for testswarm work
[16:20:00] RoanKattouw: and having the testswarm project on labs to run puppet from that branch :D
[16:20:06] apergos: hehe, yeah, especially "git" ;)
[16:20:07] Yes
[16:20:41] ah bummer, log_slow_queries is disabled on db9
[16:24:31] ohhhh, history.cgi greps the log!??!?!?! that's insane.
[16:24:39] hahaha
[16:24:54] what the hell is wrong with these people?
[16:25:02] it could tail -something | grep I guess :-D
[16:25:08] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:25:09] ok so /var/log/nagios/nagios.log is 2.6G
[16:25:13] that's not going to go well
[16:25:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647
[16:26:02] you don't think tail -something would be smart enough to seek to the end and then try walking back a few blocks?
[16:26:11] prolly not, eh
[16:26:29] not sure
[16:26:44] but also history.cgi is a binary
[16:26:54] so much for that fine idea
[16:27:39] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647
[16:27:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647
[16:28:42] we should probably rotate that log at least
[16:28:49] we don't??
[16:29:04] it's 2.6GB now, I'm guessing no
[16:29:33] first entry is 10/14
[16:29:48] october!!
[16:29:52] *eyeroll*
[16:29:58] that may have been me actually
[16:29:58] RoanKattouw: ok got the merge reviewed and fixed :-)))
[16:30:27] Yay
[16:38:23] can someone please merge production change https://gerrit.wikimedia.org/r/#change,1647
[16:38:36] that is a merge of my work on testswarm which was validated in a VM on labs
[16:38:44] the merge itself was made by Roan and I reviewed it
[16:38:54] Jeff_Green: apergos.. but log_rotation_method=d in nagios.cfg .. that should be daily
[16:39:35] log_archive_path=/var/log/nagios/archives
[16:39:36] mutante: also it seems like there's debug crap in that log
[16:39:45] looking for a verbosity toggle
[16:39:49] or maybe to have that stuff log elsewhere?
[16:39:56] use_syslog=1
[16:40:30] ah that makes sense
[16:40:36] you think that'll break history.cgi though?
[16:41:05] "If you have log rotation enabled, you can browse history information present in archived log files by using the navigational links near the top of the page."
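On the "truncate history, just keep the last few weeks" idea above: nagios.log lines start with a [unix-epoch] stamp, so pruning by age is a one-pass filter. A hedged Python sketch (paths and the three-week window are illustrative, not anything configured on spence):

    #!/usr/bin/env python3
    # Sketch: keep only recent nagios.log entries. Nagios prefixes every
    # line with "[<unix epoch>]", which is all this relies on.
    import re
    import time

    CUTOFF = time.time() - 21 * 86400  # arbitrary ~3 week retention
    STAMP = re.compile(r"^\[(\d+)\]")

    with open("/var/log/nagios/nagios.log") as src, \
            open("/var/log/nagios/nagios.log.trimmed", "w") as dst:
        for line in src:
            m = STAMP.match(line)
            if m is None or int(m.group(1)) >= CUTOFF:
                dst.write(line)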
" [16:41:06] * apergos tries not to say "but it's already broken!" [16:41:14] ha [16:41:28] apergos: at least you tried :-) [16:41:31] the strange thing is: [16:41:39] i am not suggesting to configure it like that, it is [16:41:39] heh [16:41:55] oh i see that now [16:41:59] /operations/puppet/files/nagios$ grep log nagios.cfg [16:42:19] maybe that's why log_rotation_method=d isn't working :-) [16:42:31] nagios makes me smile more :-) [16:43:24] there's nothing nagios-specific in the rsyslog.d conf [16:44:37] are we sure we're running on /etc/nagios/* and not /etc/nagios3/* [16:44:51] New patchset: Hashar; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [16:45:01] Jeff_Green: not sure enough ..hrmm [16:45:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647 [16:45:06] hmm I can prolly gain back about 1.5T by doing a folloup thumb cleaning job [16:45:09] guess it's worth it [16:46:07] Jeff_Green: yes file { "/etc/nagios/nagios.cfg": [16:46:49] k [16:47:24] ah. spence is about to apply puppet changes [16:47:31] oh " logged to the syslog facility, as well as the NetAlarm log" [16:56:50] I dunno, it just seems like nagios log rotation is broken and I don't see any way to debug it [16:59:03] !log manually rotated spence:/var/log/nagios/nagios.log because nagios log rotation appears broken and the file is ~2.6G [16:59:13] Logged the message, Master [17:05:30] someone remind me - who is the mobile site guy? [17:05:39] Patrick Reilly [17:05:46] preilly: [17:06:05] Prodego: ---^^ [17:06:20] ok, looks like he isn't here, I'll just leave him a message in the channel anyway [17:07:10] !log spence: check out "nagios -s /etc/nagios/nagios.cfg" for performance data - it suggests "Value for 'max_concurrent_checks' option should be >= 1231" [17:07:18] Logged the message, Master [17:07:22] preilly: looks like there is some sort of cache issue with &mobileaction=view_normal_site - if you compare http://en.m.wikipedia.org/w/index.php?title=Main_Page&useformat=mobile&mobileaction=view_normal_site to http://en.wikipedia.org/wiki/Main_Page you can see the version given by the 'view main site' link is out of date [17:08:23] !log spence: according to [http://nagios.manubulon.com/traduction/docs25en/tuning.html] we should even double that if we have "high latency values (> 10 or 15 seconds)" and we have like > 1000 [17:08:32] Logged the message, Master [17:10:12] ben-: do you think it would take too long to not prefill swift, and just let it grow as squids request them until we have an acceptable "hit rate"? [17:10:24] doing that is of course a nice way to get rid of unused thumbs... [17:12:19] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:12:34] New patchset: Dzahn; "change max_concurrent_checks from 8 to 1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:12:49] New patchset: Hashar; "jenkins: add git configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1650 [17:14:07] New review: Hashar; "I am pretty sure that is how you can kill a box hard by having nagios fork until the box is out of m..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/1649 [17:14:26] * apergos peeks in [17:14:53] New review: Hashar; "Looks fine now. Thanks Roan for the merge!" 
[17:15:05] main thing would be the possibility of the scalers falling over
[17:15:10] no
[17:15:13] scalers are not involved
[17:15:26] thumbs are requested from ms5, not the scalers
[17:15:45] yes but if the thumb isn't there then the scaler will be asked for it
[17:15:50] Yeah but that would happen anyway
[17:15:54] Swift or no Swift
[17:15:54] yes but the thumb would be there
[17:15:58] why wouldn't it be?
[17:16:13] It may not have been generated yet
[17:16:22] of course, but that's the same as now
[17:16:25] Then the scaler will have to generate it; but it'll have to do that anyway
[17:16:26] Exactly
[17:16:29] so nothing would be different for ms5 or the scalers
[17:16:39] except swift is in the middle between squid and ms5
[17:16:44] then I'm not getting your initial question
[17:16:44] No, only for whatever fallback thingy you use
[17:16:52] New review: Dzahn; "the log file is so huge because it is full of "Max concurrent service checks (8) has been reached", ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1649
[17:17:06] The Swift population thing needs to not assume that everything will be present on ms5
[17:17:20] why not?
[17:17:29] because it might not be
[17:17:33] then ms5 will get it
[17:17:35] Ahm
[17:17:37] Rephrase
[17:17:40] as it has done for years and years
[17:17:52] ah, you are not talking about the production phase
[17:17:52] The scalers will, via a 404 handler and an LVS thing
[17:18:09] What I'm saying is
[17:18:10] this is your "use swift for reads only" piece, is it?
[17:18:16] yes
[17:18:19] this is for the very soon phase :P
[17:18:25] that is what I was missing
[17:18:26] or the now phase
[17:18:32] Whatever populates Swift needs to handle the regeneration-upon-request case correctly
[17:18:52] Hmm, actually I guess that's not even technically necessary
[17:18:58] all it has to do *right now* is just ask ms5 for a copy.
[17:19:01] It's not like there's a negative presence cache in Swift
[17:19:09] mutante: you have to tune it manually
[17:19:09] If the thumb is not on ms5 it can afford to just ignore it
[17:19:17] And it'll go into Swift when it's requested for the 2nd time
[17:19:32] It would be nicer to get it in right upon creation but it's not strictly necessary to do that
[17:19:38] why?
[17:19:41] mutante: look at http://nagios.manubulon.com/traduction/docs25en/tuning.html : need to "nagios -s", find out the minimum number of concurrent checks and double that value
[17:19:42] ms5 always returns a thumb
[17:19:44] whether it's present or not
[17:19:45] mutante: that should do it
[17:19:50] assuming it's valid of course
[17:19:57] if we had to sum up hume's reason to be in one line, what would it be?
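For the record, the max_concurrent_checks rule of thumb being applied in the !log entries above is just two steps, restated here in Python using only the numbers quoted in-channel:

    # Take the minimum that "nagios -s /etc/nagios/nagios.cfg" suggests,
    # and double it when check latency is high (> 10-15 s per the tuning
    # page; spence's latencies were reported > 1000 s).
    suggested_minimum = 1231  # from nagios -s on spence
    high_latency = True

    print(2 * suggested_minimum if high_latency else suggested_minimum)
    # -> 2462, i.e. roughly the 2500 that comes up just below, before the
    #    channel settles on a conservative 64, to be adjusted from graphs.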
[17:20:00] hashar: exactly what i did, look at the commit message
[17:20:11] mutante: oh sorry
[17:20:14] I think it's fine if he prepopulates the stuff that's not on commons
[17:20:18] i'm thinking about naming schemes for its cron jobs, under manifests/misc
[17:20:19] that will give us a little testbed
[17:20:25] yeah it's fine, and will work
[17:20:38] just wondering if we really need to prefill
[17:20:43] "cleanest" would be not to
[17:20:47] but it would take a bit longer
[17:20:55] but in a month, we'd be in a position to get rid of ms5 then if we wanted
[17:20:57] that will also tell us something about the length of time it would take to prepopulate commons if we went that route
[17:20:59] earlier, if we prefill
[17:21:04] hashar: but yeah, it suggests 1231, and the tuning page would then suggest 2500 :o
[17:21:18] mutante: there must be something wrong somewhere
[17:21:30] mutante: cause we really don't want 1230 processes in parallel :)
[17:21:32] hashar: also just wanted to start a discussion about a good value, but 8 is really low.. also reading http://nagios.manubulon.com/traduction/docs14en/checkscheduling.html#max_concurrent_checks
[17:21:44] 8 is too low for sure
[17:21:58] set it to 64 or so
[17:22:11] and watch its graphs
[17:22:16] increase or decrease as necessary
[17:22:59] * apergos goes to look at space usage on ms6, out of curiosity
[17:23:10] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649
[17:24:14] 3.2T
[17:24:33] because another possibility is to prepopulate by copying off of ms6
[17:24:46] but why would you do that
[17:25:03] if we had to choose some files
[17:25:08] ms6 doesn't have everything
[17:25:32] mutante: here is the formula to compute it http://nagios.manubulon.com/traduction/docs25en/checkscheduling.html#max_concurrent_checks
[17:25:32] no, but we're already talking about not stuffing in everything right away
[17:25:56] if we had to make a good guess about thumbs to copy over, the ones cached by ms6 might be a good base
[17:25:59] I fear ms6 might have non-fresh thumbs
[17:26:00] let's not
[17:26:06] hashar: yea, that's the page i was reading too
[17:26:07] food, bbl
[17:26:15] same here, bbl
[17:26:22] have a good dinner
[17:26:37] we're serving non-fresh thumbs?
[17:29:04] ms5 always returns a thumb
[17:29:12] That depends on what you mean by returns
[17:29:20] NFS? HTTP request? Something else?
[17:29:20] if there is a source image, yes
[17:29:39] http
[17:29:42] Ah, yes
[17:29:45] HTTP will return one
[17:29:45] since that's how requests get made.
[17:29:52] I forgot that the storage server and the web server are the same
[17:30:09] and of course it's then on the filesystem too
[17:32:51] 6.5T, still using a lot on ms5
[17:43:41] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648
[17:45:14] (had to do that so it doesn't break stuff) /me away now
[17:59:10] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1649
[17:59:49] if someone has any time, it would be great to review the hacky https://gerrit.wikimedia.org/r/#change,1644
[18:00:03] it is a basic portal for our Android application nightly builds
[18:00:12] will have someone enhance the layout / css later on
[18:00:22] off for today, see you tomorrow :)
[18:10:40] mark: ms5's hit rate is between 70 and 110 qps over the course of a day. unless I can get swift's rate for cache misses > 110, I don't think it's feasible. (currently it's ~50qps)
[18:11:25] ben-: I don't understand
[18:11:43] oh
[18:11:51] you're saying swift can't write faster than 110qps?
[18:12:18] PROBLEM - Lighttpd HTTP on dataset1 is CRITICAL: Connection refused
[18:12:25] I see
[18:12:28] yes, that is a problem
[18:13:17] that's not critical, hush naggy
[18:13:20] New patchset: Jgreen; "new class for misc::maintenance stuff, cronjobs for hume" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:13:30] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1651
[18:13:47] gar.
[18:14:08] mark: I suspect that there's tuning to be done in the number of processes etc. running on the storage nodes.
[18:14:23] but I didn't see immediately in ganglia what the bottleneck was.
[18:14:37] maplebed: so basically you're saying that we need to prefill because swift is not capable of keeping up with the normal squid miss request rate if they're all writes
[18:14:39] is that correct?
[18:14:41] the ms servers are all using a surprising amount of CPU but I haven't looked at it yet.
[18:14:56] mark: yes, that's correct.
[18:14:58] are you working this week?
[18:14:59] ok
[18:15:24] normal passthrough to the image scalers is ~25qps, so it can handle the current miss rate if it's prepopulated.
[18:15:30] right
[18:15:31] but not if it's empty.
[18:15:40] but that's worryingly low, hopefully we can tune it to perform much better
[18:15:48] since that is gonna hit us anyhow
[18:15:50] that is worryingly low, it's true.
[18:16:05] I would bet it will scale very well with the number of storage nodes;
[18:16:16] yeah
[18:16:19] but I don't have a 4th node to play with yet.
[18:16:19] I certainly hope so
[18:16:20] can I run your scripts tomorrow on something?
[18:16:20] ;)
[18:16:34] fine by me.
[18:16:40] would it be feasible to add an eqiad node to the mix, or would the latency be too high?
[18:17:04] I think the latency would invalidate the test.
[18:17:11] ok
[18:17:35] I think swift could handle it, as a cluster, I'm just not sure it would get us very useful info.
[18:17:44] right, valid concern
[18:17:59] ok, i'll play tomorrow
[18:18:08] and look at bottlenecks and such
[18:18:13] do you think there's any chance of getting ms4 working?
[18:18:21] I fear it might take a while :(
[18:18:27] bummer.
[18:18:30] memory/mainboard damage is not easy to fix :(
[18:18:36] yup.
[18:18:48] apergos: ms8 is still replicating off ms7, right?
[18:18:56] so it would be a shame to reinstall that box also
[18:19:01] yes it is
[18:19:03] the current state does give us good info about cpu usage though, and that we'll want our storage bricks to be dual quad core.
[18:19:15] perhaps it's hashing
[18:19:20] are ya'll following the openstack list?
[18:19:35] i'm not
[18:19:38] jeremyb: not as closely as I should be.
[18:19:45] I'm subscribed but rarely read it.
[18:20:06] nope, not on it
[18:20:27] see this thread: https://lists.launchpad.net/openstack/msg06187.html
[18:20:59] oomph. their performance is way worse than ours, but the ratio is similar.
[18:21:15] well, increasing the number of workers would be a good thing to test anyhow
[18:21:17] (3qps writes, 25 qps reads vs. 50qps writes and 1100 qps reads)
[18:21:27] until we see cpu saturation
[18:21:35] +1 mark.
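Pulling the day's rates into one back-of-the-envelope check (every figure here is one quoted in the conversation, nothing else):

    # Can Swift go live without a prefill?
    squid_miss_qps = 110         # peak rate of thumb requests reaching ms5
    swift_write_qps = 50         # measured Swift cache-miss (write) rate
    scaler_passthrough_qps = 25  # normal passthrough to the image scalers

    # Starting empty, every squid miss becomes a Swift write:
    print(swift_write_qps >= squid_miss_qps)          # False -> must prefill
    # Prefilled, Swift only absorbs genuine scaler passthrough as writes:
    print(swift_write_qps >= scaler_passthrough_qps)  # True -> workable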
[18:21:42] ...or no increase in performance ;)
[18:22:53] so maplebed
[18:23:26] if you want to pull the commons images before my "clean up large sizes" job runs, you could do this:
[18:23:48] maplebed: ok, let me know what you find today, and i'll play tomorrow
[18:23:53] apergos: I'm not concerned
[18:23:54] any thumbs which have both a 1280px- and a 1024px- (the thumb file name will start with that), skip those two files
[18:23:56] this is just a test cluster
[18:24:01] I'm going to wipe it anyways.
[18:24:08] because those are the ones I'm going to remove anyways
[18:24:10] even if I get stuff you want to delete, it won't really have any effect.
[18:24:21] okeley dokely
[18:24:22] mark: will do.
[18:24:45] in the mean time, I eagerly await news from RobH.
[18:24:47] :)
[18:24:52] yeah
[18:25:09] RobH: could you investigate ways to make ms4 work again?
[18:25:16] like... find a new mainboard somewhere or so?
[18:25:38] it's a shame to not have that box in working condition
[18:25:43] if it doesn't cost much to fix it up anyway
[18:26:19] Ryan_Lane: anything you need now? ;)
[18:26:22] because I'm about to leave again
[18:27:13] pony!
[18:27:18] New patchset: Jgreen; "new class for misc::maintenance stuff, cronjobs for hume typofix: semicolons to commas" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:27:24] maplebed: so is this swift coming from ubuntu lucid?
[18:27:30] if so, we might want to try much newer releases?
[18:27:40] at least for comparative testing
[18:27:42] no, it's coming from ppa.
[18:27:46] ok
[18:27:48] we're on 2.4.5, which is relatively current.
[18:27:58] (might even be the most recent stable)
[18:27:58] imported into our repo?
[18:28:05] yeah.
[18:28:20] alright
[18:28:27] see http://wikitech.wikimedia.org/view/Swift/Hackathon_Installation_Notes#packages
[18:28:29] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1651
[18:28:29] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1651
[18:28:48] oh, sorry. that says we're on 1.4.3, not 1.4.5.
[18:29:18] mark: umm, I don't think so
[18:29:33] Ryan_Lane: your lvs/networking changes I guess, I hope to get to that this week ;)
[18:29:39] heh
[18:29:50] yeah. wasn't going to say that one since you were leaving ;)
[18:29:58] I couldn't imagine that's a super quick change
[18:30:05] probably won't be, no
[18:30:18] alrighty
[18:30:19] see ya later
[18:30:26] * Ryan_Lane waves
[18:31:12] maplebed: give me a heads up when you decide to start running your find
[18:31:53] apergos: I would like to soon; my goal is to see if swift performance changes when it has millions of files instead of thousands.
[18:32:07] are you thinking today?
[18:32:11] yeah.
[18:32:13] ok
[18:32:22] maplebed: per container?
[18:32:33] jeremyb: yes.
[18:32:45] so tomorrow I'll see if it's still running
[18:32:57] apergos: feel free to kill it if it is.
[18:33:00] maplebed: i think ~1 million is about the ideal ceiling per container?
[18:33:05] so I'm not competing with you for the few free i/o cycles
[18:33:06] but i may be rusty
[18:33:12] nah, the test needs to get going sooner rather than later
[18:33:18] my cleanup can wait a few days
[18:33:43] apergos: maplebed: i'm not totally clear on which rates you're measuring. or where in the upload process these writes happen
[18:33:57] not always in the upload process
[18:34:09] they are in the "user wants a thumb. we don't have it. make one" process
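apergos's deletion criterion from earlier in the hour, restated as a small Python sketch; it assumes the usual one-directory-per-source-image thumb layout, and the helper name is invented:

    import os

    def cleanup_candidates(thumb_dir):
        """Yield the 1280px- and 1024px- thumbs in a directory, but only
        when both sizes exist -- the "it has both" indicator from above."""
        names = set(os.listdir(thumb_dir))
        for name in names:
            if name.startswith("1280px-"):
                sibling = "1024px-" + name[len("1280px-"):]
                if sibling in names:
                    yield os.path.join(thumb_dir, name)
                    yield os.path.join(thumb_dir, sibling)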
make one" process [18:34:48] jeremyb: this picture might help: http://wikitech.wikimedia.org/view/File:Thumbnail_request_path_all_swift.jpg [18:34:59] (or it might just make things worse) [18:35:27] heh [18:36:03] that was not written by anna lena :( [18:36:11] mark: eep still there ? [18:36:26] who's anna lena? [18:36:33] maplebed: 1sec [18:39:15] sorry, was out snagging food [18:39:20] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [18:39:34] maplebed: my call with dell is later this afternoon, it got pushed by them =P [18:39:54] annoying, but no big deal, I suppose. [18:42:38] hrmm, identi.ca's broken and snapbird's not working either :( [18:45:40] maplebed: http://commons.wikimedia.org/wiki/Category:Visual_documentation_of_Chapters%27_Meeting_2011 [18:45:55] https://twitter.com/#!/annalena https://twitter.com/#!/amanda_lyons [18:46:15] anyway, that's a tangent. /me tries to parse the picture [18:46:39] heh... no, I am not a graphic artist or visual display of information pro. [18:47:03] but the picture at least helped me get all the pieces straight in my own mind. [18:47:28] I would like to be; I have incredible respect for people good at visual information. [18:52:43] jeremyb: those make me think of http://www.youtube.com/watch?v=u6XAPnuFjJc [18:54:42] (which is awesome for content as well as presentation, but that's a different story.) [18:54:47] maplebed: why do they make it so hard (on their site not the video) to figure out what their TLA stands for? [18:55:06] i found it eventually [18:55:09] oh, the rsa? no idea. [18:55:32] at first i thought it was going to be a guy explaining crypto... [18:55:42] on a whiteboard [18:55:57] or with kids toys or something [18:56:51] maplebed: does this test have any nfs at all? [19:01:35] swift doesn't use nfs. [19:01:39] maplebed: what does dotted line mean vs. solid? [19:01:48] nfs is in the image [19:01:58] uppper right corner and bottom middle [19:02:13] right - the image scalars, ms5, and ms7 all use nfs. [19:02:24] they will continue to? [19:02:24] swift is a step towards getting us away from that. [19:02:51] i understand you may need nfs during the transition. but to simplify things, maybe first do a test with no nfs at all [19:03:07] i.e. the way that things will be after transition is complete [19:03:19] all the content for the test I'm running stops at ms5. [19:03:38] (i.e. doesn't go back to the scalers and make use of nfs) [19:03:53] so essentially, yes, this test does not touch the parts of the system that use nfs. [19:03:53] are you running your own scalers? [19:04:04] no, I'm just not using any content that needs to be scaled. [19:05:30] http://wikitech.wikimedia.org/view/Swift/Load_Thumbnail_Data is the process I'm following to test [19:05:52] by using a directory listing from ms5 as the source for the list of requests, I make sure to avoid anything that would need to be scaled. [19:07:12] maplebed: unless some one purged a few things in the mean time :) [19:07:26] heh... yeah, there is that. [19:07:54] but from the disk space listing on ms5, I'm pretty sure I'm not hitting much of that. [19:08:07] sure, I'm just being pedantic [19:09:52] maplebed: instead of `find $i -type f` you may try `find $i -name commons -a -type d -prune -o -type f -print` or something like that (i can tweak it if you like) [19:10:05] I'll be deleting prolly about 2 million thumbs over the next several days/2 weeks [19:10:09] maplebed: although that's gnu find. 
[19:10:18] I am not working off of a pre-existing list for those
[19:10:23] ( maplebed )
[19:10:49] jeremyb: that would exclude all the thumbs in commons?
[19:11:15] maplebed: yeah... been a little while since i did it. let me play with it for a min
[19:12:06] I'm ok including commons for now.
[19:12:35] apergos's scripts will just re-delete anything I accidentally recreate. :P
[19:13:12] orly
[19:13:21] they will? :-P
[19:14:02] won't they? your criteria don't say anything about creation time (except for the google stuff, which is 'newer' than when they started, so stuff I recreate counts)
[19:14:48] well I was going to do it based on creation time but I decided against it
[19:15:01] instead the "it has both 1280px and 1024px" seemed like a good enough indicator
[19:15:10] +1
[19:16:09] script ready to go. I'll start it when your find is all happy and done
[19:16:22] maplebed: so, in your test you fetch each one once only and start with empty containers? so then you end up with all misses? or are there some hits? (is this test repeatable/has it been repeated?)
[19:17:06] I've only been doing all misses or all hits, so as to get clearly differentiated numbers.
[19:17:30] so, I start with an empty container, load all the images, then load them all again, and I get two sets of stats; the first for writing misses and the second for reading hits.
[19:17:42] to repeat, I drop the container, recreate it, and do it again.
[19:19:08] soooo.... if you are really going to use a list of the commons images on ms5 now for these tests, it makes no sense for me to remove anything
[19:19:55] apergos: I don't think I'll repeat the commons test,
[19:20:21] so it's ok to delete stuff.
[19:20:30] or I can recreate the file list so as not to hit stuff you've deleted.
[19:20:30] ok
[19:20:46] ok
[19:20:57] (it could be worse, it could be zfs with snaps enabled)
[19:25:36] no Reedy... I want him
[19:25:59] I'm here?
[19:27:21] Reedy: are you not here?
[19:27:28] * AaronSchulz wonders where "here" is
[19:27:36] * Reedy goes to find a mirror
[19:27:53] <^demon|away> I'm here too!
[19:28:51] * jeremyb can't parse the Reedyness
[19:28:52] ah there you are!
[19:28:59] sorry, you were so quiet all day...
[19:29:07] is the rotatebot backed up or something?
[19:29:20] There's like 7000 images..
[19:29:21] or was
[19:29:29] that seems like a lot
[19:29:34] indeed
[19:29:39] I think people went on a tagging spree
[19:29:58] I found an SVG with a rotation request...
[19:30:00] which is great but I thought it was running a lot faster on hume now...
[19:30:09] awesome
[19:30:19] maplebed: see you later... i'll think about your artwork while i'm gone
[19:30:19] 5390 at the moment
[19:30:33] jeremyb: I've got a few more for you.
[19:30:39] As it seemed to be faster on toolserver (though, they don't seem to think it is now), it was decided easier just to let it be on toolserver
[19:30:46] oh.
[19:30:52] But saibo has been talking about getting us to clear the backlog
[19:30:54] jeremyb: http://wikitech.wikimedia.org/view/Swift#All_Swift
[19:30:54] I thought after the fixing up that it was faster on hume
[19:30:58] that puts it in context
[19:31:06] it's faster on both
[19:31:10] ok
[19:31:24] there's a couple of lines to merge in.. And we've got to decide on a sleeping factor if we run it (to save the scalers)
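The two-pass test maplebed describes above, sketched in Python for illustration. The frontend URL and list file are placeholders, and a real harness would track errors and latency percentiles rather than one crude qps figure:

    import time
    import urllib.request

    BASE = "http://swift-frontend.example/"  # hypothetical test endpoint

    def run_pass(paths):
        start = time.time()
        for p in paths:
            try:
                urllib.request.urlopen(BASE + p, timeout=10).read()
            except OSError:
                pass  # a real run would count failures separately
        return len(paths) / (time.time() - start)  # requests per second

    with open("ms5-file-list.txt") as f:  # directory listing taken on ms5
        paths = [line.strip() for line in f]

    miss_qps = run_pass(paths)  # pass 1: empty container, all writing misses
    hit_qps = run_pass(paths)   # pass 2: same requests, all reading hits
    print(miss_qps, hit_qps)

Dropping and recreating the container between runs, as described, resets the experiment.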
[19:31:29] yes
[19:32:20] maplebed: danke
[19:32:41] at a 5 sec sleep after every rotate (which might be totally superfluous) it's still going to get through things relatively quickly
[19:32:49] as long as it runs one process only
[19:32:56] it shouldn't really be a problem
[19:33:11] maplebed: erm, < jeremyb> maplebed: what does dotted line mean vs. solid?
[19:33:23] * maplebed looks
[19:33:38] ah,
[19:33:42] dotted is NFS
[19:33:55] solid is HTTP
[19:33:56] almost.
[19:34:09] heh
[19:34:16] dotted is also load balanced
[19:34:39] for the connection from LVSx to srvx or sqx, it indicates it's going to one of a pool
[19:35:23] it's a little hard to see, but pink dotted == NFS, black dotted == load balanced HTTP.
[19:35:24] yeah, ok
[19:35:34] danke
[19:35:35] (more visible at full resolution)
[19:36:29] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:36:36] (that should have been labeled. my bad.)
[19:39:35] apergos, or we could cron it like they do on TS - every X minutes, if it's not running, run it
[19:41:03] well the first run I would do by hand (7000 images)
[19:41:21] but after that why not stick it in cron for every ... dunno, 20 mins or so
[19:42:56] the other thing that it could do is run only during "off peak" hours on hume
[19:43:10] when the US is asleep
[19:43:21] yeah, which would be easy enough when in cron
[19:43:25] yup
[19:43:37] ok, that's all I had, I was just curious...
[19:43:43] preilly: pokr
[19:43:46] r->e
[19:44:00] I think it's approved to get a bot flag too
[19:44:17] Prodego: ?
[19:44:31] I would hope so, otherwise it would be filling rc!
[19:44:38] preilly: I pinged you a while back about the mobile site, did you see that message?
[19:44:42] or should I paste it again
[19:44:49] Prodego: nope, paste again
[19:44:54] preilly: looks like there is some sort of cache issue with &mobileaction=view_normal_site - if you compare http://en.m.wikipedia.org/w/index.php?title=Main_Page&useformat=mobile&mobileaction=view_normal_site to http://en.wikipedia.org/wiki/Main_Page you can see the version given by the 'view main site' link is out of date
[19:45:19] Prodego: those are the same for me
[19:48:36] preilly: they aren't for me
[19:48:44] Prodego: where are you located?
[19:48:56] for the mobile link the FA is Andalusian horse
[19:49:11] whereas it is currently McCormick Tribune Plaza & Ice Rink
[19:49:21] not for me
[19:49:25] it's correct for me
[19:49:26] US PA
[19:49:55] ok, well I'm not sure what that would be from then
[19:50:11] I've never opened that page before, so it shouldn't be a local cache
[19:50:16] and I am not the only user seeing a difference
[19:50:40] Prodego: yeah, I know - I've been trying to figure it out
[19:50:50] Prodego: I'll continue to work on it
[19:51:49] let me know if there is anything I can check for you
[19:52:36] Prodego: okay, if I delete all my cookies I see the stale page
[19:53:10] Prodego: okay, try it again
[19:53:12] logged in users can bypass the cache if they set that in their prefs
[19:53:36] still see Andalusian horse after a purge
[19:54:54] mark: any news on the exim set up?
[19:55:02] puppetization, that is
[19:55:06] Prodego: I'll continue to look into it
[19:55:14] sure
[19:57:51] Prodego: what sort of mobile device or browser are you using?
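The "every X minutes, if it's not running, run it" cron pattern mentioned above is usually a cron entry plus an exclusive lock. A minimal Python wrapper sketch; the script path, lock path, and sleep flag are all hypothetical, not rotatebot's actual interface:

    import fcntl
    import subprocess
    import sys

    # Invoked from cron every ~20 minutes; exits quietly if the previous
    # run is still holding the lock, so only one process ever runs.
    lock = open("/var/lock/rotatebot.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # still busy from the last tick
    subprocess.run(["/usr/local/bin/rotatebot", "--sleep", "5"])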
[19:58:10] I'm just using my computer, chrome browser
[19:58:17] I could check it on my phone too if we want
[19:58:28] maplebed: https://answers.launchpad.net/swift/+question/159085
[19:59:15] binasher: do you think that the &useformat=mobile&mobileaction=view_normal_site isn't getting the HTCP purges?
[19:59:35] interestingly I see the old FA both with the &mobileaction=view_normal_site and in the normal mobile view on my phone
[19:59:56] on my computer I see the correct FA without &mobileaction...
[20:00:09] Prodego: do you have any cookies set?
[20:00:18] I can't really check that on my phone
[20:00:39] AaronSchulz: you're on the email apergos sent, but the estimate for commons thumbs is 98 million.
[20:00:42] preilly: yeah, that is likely it.. and there could be a cached version of that full url with a variance that we aren't hitting in our browsers
[20:00:58] AaronSchulz: so it sounds like we should shard commons thumbs.
[20:01:00] :(
[20:01:13] binasher: any ideas how to fix it?
[20:01:20] on chome I'm logged in on the main site, so I'll have those cookies, I can check...
[20:01:22] maplebed: no SSDs? :/
[20:01:24] chrome*
[20:01:51] sharding won't be hard though, so it might be cheaper anyway
[20:02:35] 10 million. hmm
[20:02:40] if you have wiki login cookies but not the mobile beta opt-in cookie, you'll still hit cache via the mobile site
[20:02:53] 16 shards?
[20:02:56] though getting a listing of all thumbs won't be as easy, if anyone wants to do that
[20:02:58] 0, 1, 2...?
[20:03:19] (cheap but it's right there in the directory path and the filename hash so... )
[20:03:22] AaronSchulz: ssds are scaling vertically, where cost increases exponentially with scale. sharding in the URL scales horizontally, where cost scales linearly. so no, I don't think we should do the ssd route when we have an easy method of doing the horiz method.
[20:03:39] preilly: I have 3 cookies from en.m.wikipedia.org on my computer
[20:03:40] it would have to iterate through the segments, mostly can be abstracted away in the swift subclass
[20:03:42] apergos: we already shard on 2 hex digits, I'd vote we keep that and do 256.
[20:03:47] anyway, gotta run for lunch.
[20:03:48] bbl.
[20:03:54] 256 sounds fine to me
[20:04:18] clicktracking-session, mediaWiki.user.bucket:ext.articleFeedback-options, and mediaWiki.user.bucket:ext.articleFeedback-tracking
[20:04:22] maplebed: for sure, it's cheaper
[20:04:29] ben already is planning a find to get the thumb list.
[20:04:59] yeah, 256 is future-proof
[20:05:08] we could be more clever about it than that but it doesn't matter, it can just run slowly and finish eventually
[20:05:23] preilly: we could make varnish not cache any urls with mobileaction=view_normal_site
[20:05:32] well the other thing that would make this future-proof is that we don't really have to keep all thumbs we generate for all time
[20:05:38] binasher: yeah, that is probably best for now
[20:05:41] we should think about treating it like a true cache
[20:05:50] or only cache them for a short time
[20:05:52] a very very large one but one that has limits
[20:06:03] binasher: can you do that change?
[20:06:07] binasher: or, are you slammed?
[20:06:21] binasher: note that on my phone I see the wrong version in the regular mobile view too
[20:06:37] so I'm not entirely sure why that would be
[20:07:32] Prodego: what actual url are you viewing on your phone?
[20:08:15] en.m.w.../wiki/Main_page
[20:08:23] preilly: grabbing lunch in a min, but i can make that right after
[20:08:33] binasher: cool
[20:08:36] binasher: if I capitalize Page then I see today's TFA
[20:09:50] binasher: preilly and http://en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=view_normal_site&useformat=mobile gives a totally different view
[20:10:21] same URL, only the parameters are in a different order
[20:10:31] Prodego: it will be fixed shortly
[20:10:33] it must be a very top level cache problem
[20:10:44] Prodego: it is a varnish cache issue
[20:11:00] Pages get purged from cache via their canonical url. If mediawiki is going to serve the same page at MaiN_paGE, but only send a purge when it's updated for Main_Page, that isn't a varnish or mobile issue.
[20:11:21] squid would have the same issue
[20:11:54] binasher: Main_page is a redirect to Main_Page
[20:11:58] binasher: but, it should be fixed soon by your change
[20:12:02] right?
[20:12:16] so it isn't serving the same page, it is serving a different page that is a redirect, if that makes sense
[20:13:52] http://en.wikipedia.org/wiki/Main_page is not a redirect to Main_Page on the full site, and Main_page is cached by squid at that url, as a distinct object from Main_Page.
[20:13:57] RECOVERY - Lighttpd HTTP on dataset1 is OK: HTTP OK HTTP/1.0 200 OK - 1512 bytes in 0.009 seconds
[20:15:03] binasher: it is - http://en.wikipedia.org/wiki/Main_page?redirect=no
[20:15:04] Change abandoned: Hashar; "Not required. This can be configured from the Jenkins web interface." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1650
[20:15:30] Prodego: i'm talking about redirects in terms of http responses
[20:15:39] oh yes, it isn't an HTTP redirect, no
[20:16:24] re: internal mediawiki redirects - does mediawiki have logic to send purges for all distinct names that redirect to a page when that page is changed?
[20:42:49] hey ho :)
[20:43:11] are there any SF ops willing to review a nice change? That is mostly apache configuration and some ugly HTML/CSS design
[20:43:11] https://gerrit.wikimedia.org/r/#change,1644
[20:43:35] that would let us publish our android apps' nightly builds
[20:51:37] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: PROCS CRITICAL: 1 process with args varnishncsa
[21:09:21] maplebed: https://answers.launchpad.net/swift/+question/134996, you're right
[21:14:28] http://en.wikipedia.org/wiki/CAP_theorem <- fun stuff
[21:26:06] New patchset: Asher; "reduce cache ttl to 60s for "mobileaction=view_normal_site" urls since they don't get purged. also fix frontend / hack timing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:26:23] preilly: https://gerrit.wikimedia.org/r/1652
[21:27:15] binasher: what about the &useformat=mobile case?
[21:28:05] hm.. do we ever provide links relative to m.wiki that include useformat=mobile but not the mobileaction= bit?
[21:29:08] apergos: dunno if you're still around, but ganglia suggests that the ionice -c 3 trick actually means that the find has no effect on ms5 throughput. image requests per second is unchanged despite 100% disk io.
[21:29:19] New review: preilly; "This looks okay to me." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1652
[21:29:32] binasher: yes
[21:29:55] Ryan_Lane: https://gerrit.wikimedia.org/r/#change,1652
[21:30:05] preilly: ok, i'll amend the commit.
[21:30:10] Ryan_Lane: sorry, that was: https://gerrit.wikimedia.org/r/#change,1652,blowme
[21:30:28] preilly: do urls with mobileaction= always include useformat=mobile?
[21:30:35] binasher: no
[21:30:40] grr
[21:31:34] Ryan_Lane: can you merge https://gerrit.wikimedia.org/r/#change,1644
[21:32:04] preilly: why are you asking me to merge it?
[21:32:33] what's it for?
[21:33:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1644
[21:33:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1644
[21:33:29] New patchset: Asher; "reduce cache ttl to 60s for "mobileaction=view_normal_site" urls since they don't get purged. also fix frontend / hack timing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:33:50] preilly: can you reload / re-review 1652?
[21:34:39] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1652
[21:34:56] binasher: looks good
[21:35:13] ok
[21:35:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1652
[21:35:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1652
[21:37:11] preilly: ok, it's successfully deployed to the varnish servers. i'm not going to flush caches though, i don't think it's that big of a deal to wait
[21:37:21] binasher: OKAY
[21:37:27] binasher: WHAT?!?
[21:37:32] binasher: jk
[21:37:56] preilly: tell people to add &ok=WHAT to urls if they want a fresher version
[21:38:16] binasher: that would be awesome, you sir are a genius
[21:38:57] yeah I have stopped counting the number of beers I owe to the ops
[21:39:09] I guess I will just offer several rounds of beer :D
[22:21:20] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10596 MB (3% inode=99%):
[22:23:17] binasher: you're doing something with db9 tonight, right?
[22:23:38] LeslieCarr: nope, tomorrow at 6pm
[22:23:55] ah, just saw the disk warning
[22:24:01] will db9 survive until tomorrow?
[22:24:04] yup
[22:24:26] thanks for checking tho
[22:28:39] AaronSchulz: I'm gonna break the eqiad swift cluster for a bit to try out including the hash in the container name.
[22:38:32] AaronSchulz: interested in doing a code review on my change?
[22:38:56] depends what it is, I could try
[22:39:04] one sec; lemme check it in.
[22:41:36] AaronSchulz: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/106882
[22:45:02] nice move using named subgroups
[22:46:17] thanks. I always think it's confusing without, but optional substrings just make it way worse.
[22:46:53] of course, this change does increase the number of containers we will need from 2,568 to 657,408.
[22:47:02] which is a little absurd.
[22:47:19] but... I don't know that there's much of a limit on the number of containers...
[22:48:02] I don't know why you are capturing "thumb/", and you can't just assume it's thumb/x/xy, it could be thumb/archive/x/xy
[22:48:32] that's not the pattern I expected.
[22:48:36] 657,408, wow :)
[22:49:05] I thought it would be thumb/a/ab/aoeu or archived/a/ab/aoeu, not thumb/archived/a/ab/aoeu
[22:49:47] russ's code obscured it previously, but worked
[22:49:58] damn.
[22:50:10] this is why readable code is good
[22:50:54] * AaronSchulz thought he added some example urls
[22:54:08] maplebed: see r106886
[22:55:25] does the original media store also use archived/deleted in the same way?
[22:55:33] could you throw in examples for that too?
[22:55:47] there is no /deleted for thumbs
[22:55:58] some things might use /temp though
[22:56:14] ::sigh:: these are all things I do not know.
[22:56:19] temp and archive are the only special cases... and I don't even know what uses the former
[22:57:00] AaronSchulz: do you know if there is an exhaustive list of every possible URL style?
[22:57:25] look at phase3/thumb.php
[22:58:07] wfExtractThumbParams()
[22:58:53] ogg and pagetiffhandler have their own basename formats that are not in that file (see upload-scripts/thumb-handler.php)
[22:58:59] but I don't think you care about basenames
[22:59:24] I'll pick apart the regexes, though I was hoping for just a comment.
[22:59:25] ah well.
[23:00:26] maplebed: I think thumb(/temp|/archive)/x/xy/source name/thumb name.ext covers what you want
[23:01:01] thumb(/temp|/archive)/x/xy// to be clearer
[23:01:31] arg... thumb(/temp|/archive)?/x/xy//
[23:02:04] lol
[23:02:11] that's why I want it written down...
[23:02:54] there it is :)
[23:03:16] !log creating a new logical volume on streber called syslog for syslog-ng purposes
[23:03:26] Logged the message, Mistress of the network gear.
[23:03:47] x/xy can be strengthen to only be hex chars if you want, as thumb.php does
[23:04:33] AaronSchulz: I see if ( $archOrTemp == '/archive' ) { $params['archived']; }
[23:04:40] *strengthened
[23:04:41] is the URL really archive but the directory is archived?
[23:04:52] or am I reading that wrong...
[23:05:22] the URL is /archive, 'archived' is a thumbnail transform parameter
[23:06:00] well, it's really just a param to wfStreamThumb() for doing the transform, to be clearer
[23:06:10] $isOld = ( isset( $params['archived'] ) && $params['archived'] );
[23:06:18] http://www.mediawiki.org/wiki/Extension:SwiftMedia expects '-archived'
[23:06:26] this implies that that's wrong.
[23:06:53] then swiftmedia is wrong
[23:07:11] how about deleted? is that -delete or -deleted?
[23:07:31] MW repo zones map to containers, there is no archive container... unless we plan on migrating stuff around
[23:07:44] "archived" files go in the "public" zone
[23:07:50] in an /archive subdir
[23:07:59] "The middleware inserts the account name into the URL, converts the "wikipedia/commons" section into a Swift container name by replacing slash with %2F, adds "%2Fthumb" or "%2Farchived" or "%2Fdeleted" to the container name and adds the rest of the hashing and filename as the object name"
[23:09:42] yeah, that should change... it may work for the thumbnail deploy, but will suck later
[23:09:51] since it doesn't match up with MW zones
[23:10:30] given that I don't understand what it should be, would you mind filing a bug for making it correct?
[23:15:53] archived thumbs worked in rewrite.py before... the docs were wrong. I'll make a report, since it will need more handling for public/deleted/temp anyway
[23:16:49] could you remind me how to make a new svn commit refer to the one I just made?
[23:17:06] i.e. I corrected the regex to do the archive thing right; I want to make this commit refer to the one introducing the change.
[23:17:16] just mention rxxxxxx in the summary
[23:17:22] k.
[23:23:16] AaronSchulz: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/106890
[23:24:00] !log rebooting streber
[23:24:08] Logged the message, Master
[23:25:58] maplebed: https://bugzilla.wikimedia.org/show_bug.cgi?id=33286
[23:28:53] AaronSchulz: where'd the -images part come from?
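AaronSchulz's pattern above, thumb(/temp|/archive)?/x/xy/<source>/<thumb name> with x/xy tightened to hex as thumb.php does, rendered as a Python regex for illustration; the group names are invented here, not the ones rewrite.py actually uses:

    import re

    THUMB_RE = re.compile(
        r"^thumb(?P<zone>/temp|/archive)?"          # optional special case
        r"/(?P<h1>[0-9a-f])/(?P<h2>[0-9a-f]{2})"    # hash shard dirs, hex only
        r"/(?P<source>[^/]+)/(?P<thumb>[^/]+)$"     # source file, thumb name
    )

    m = THUMB_RE.match("thumb/archive/a/ab/Example.jpg/120px-Example.jpg")
    assert m and m.group("zone") == "/archive"
    assert m.group("h2") == "ab" and m.group("thumb") == "120px-Example.jpg"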
[23:29:53] the FileBackend branch merge
[23:30:54] we may store things other than local images in the future, so it's best to use decent container names
[23:31:00] might it be possible to use -media instead of -images? considering sounds and movies are in there too?
[23:31:24] hmm, good idea
[23:31:52] I'm changing that right now
[23:32:04] ok. thanks!
[23:34:24] AaronSchulz: what about -deleted?
[23:34:53] we shouldn't need a rewrite for those
[23:35:17] they are private and are streamed in MediaWiki requests using cloudfiles
[23:35:22] ok.
[23:35:27] $repo->streamFile()
[23:35:28] but do we need a -deleted bucket?
[23:35:32] yep
[23:35:45] s/bucket/container/
[23:35:55] S3 says bucket, swift says container ;)
[23:36:06] ok. what should the full name of the container be? (given the insertion of -media in the name)
[23:36:54] site-lang-media-deleted
[23:37:50] ok, then we still have 4 containers per site-lang pair, so my total container count is still 657,408.
[23:38:19] (the four being -public, -thumb, -temp, and -deleted)
[23:42:50] * AaronSchulz runs tests
[23:48:06] ok, changed
[23:52:00] AaronSchulz: could you update r33286 too?
[23:57:43] done
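A closing footnote on the container arithmetic implied above, using only numbers from the discussion (four zones per wiki, and the 2-hex-digit hash shard moved into the container name):

    # 2,568 containers = 642 site-lang pairs x 4 zones
    # (-public, -thumb, -temp, -deleted); putting the two-hex-digit hash
    # shard into the container name multiplies the count by 256.
    pairs = 2568 // 4
    print(pairs * 4 * 256)  # 657408, the figure quoted twice above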