[00:00:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:00:05] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:02:40] rainman-sr: ok, cool! thank you [00:05:07] New patchset: Lcarr; "requiring facter files before running tcp tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2630 [00:05:43] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2630 [00:05:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2630 [00:08:37] Ryan_Lane: try running puppet now ? I required the script to be installed ? [00:17:39] New patchset: Lcarr; "Making initcwnd.erb only generate if puppet fact exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:17:43] Ryan_Lane: ^^ [00:18:02] yep [00:18:03] wait [00:18:04] no [00:19:03] New review: Ryan Lane; "If is slightly off. See inline comment." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2631 [00:20:28] New patchset: Lcarr; "Making initcwnd.erb only generate if puppet fact exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:20:35] Ryan_Lane: look good now ? [00:21:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:32] LeslieCarr: yep [00:24:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2631 [00:24:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:24:46] hooper==secure.wm.o ? is something broken there? (or was it finally taken offline? i thought there would be redirects) [00:24:59] I think it's down but not supposed to be [00:25:27] etherpad's up... 
[00:25:35] singer is secure [00:25:41] oh, right [00:26:09] nagios says 4 hrs [00:26:33] and only HTTP is monitored not HTTPS?!!! [00:26:44] jeremyb: eh? [00:26:56] we're being alerted from watchmouse about https on singer [00:27:09] oh, watchmouse is different [00:27:11] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=singer [00:27:23] i didn't even think to check watchmouse [00:27:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.841 seconds [00:31:07] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [00:32:28] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 193 seconds [00:33:22] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 246 seconds [00:35:55] RECOVERY - HTTP on singer is OK: HTTP OK - HTTP/1.1 302 Found - 0.001 second response time [00:36:31] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [00:37:16] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [00:38:05] woot, singer's back [00:38:34] yay Ryan_Lane ! [00:38:44] !log fixed singer by adding in ssl configuration to the planet configuration [00:38:46] Logged the message, Master [00:40:11] what about the stafford HTTP check above? is 400 really normal? why? does it need a specific path requested or something? [00:46:39] probably [00:46:42] * Ryan_Lane shrugs [00:46:53] even more likely it needs authentication [00:53:10] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [01:02:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:02] !log disabling mobile skin for the blogs - we need to fix varnish support first [01:03:04] Logged the message, Master [01:04:59] binasher: ah. 
seems w3total cache supports this [01:05:02] heh [01:05:06] sorry [01:05:06] for the blog [01:05:20] "Create a group of user agents by specifying names in the user agents field. Assign a set of user agents to use a specific theme, redirect them to another domain or if an existing mobile plugin is active, create user agent groups to ensure that a unique cache is created for each user agent group. Drag and drop groups into order (if needed) to determine their priority (top -> down)." [01:05:22] oh, yeah.. seems like it should [01:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.167 seconds [01:08:55] it might have some implicit CSRF protection [01:09:02] it's hard to tell because it just killed my browser [01:09:35] eh? [01:09:50] it's only firefox 10 [01:09:59] it's not like it killed some extremely stable browser [01:10:32] am I missing some backstory to this? [01:10:40] New patchset: Bhartshorne; "porting most recent version from svn - passes along x-forwarded-for and other headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2633 [01:11:05] ah. heh. another channel [01:11:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2633 [01:11:19] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2633 [01:11:24] oh, you don't expect me to know what channel I'm in do you? [01:11:28] :D [01:12:09] !log re-enabled the mobile plugin for the blogs, seems w3 total cache supports varying [01:12:12] Logged the message, Master [01:39:54] New patchset: Ottomata; "test commit for git branch push" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2634 [01:39:55] New patchset: Ottomata; "Adding ability to reference observation instances by just trait_sets, without trait values." 
[analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2635 [01:39:57] New patchset: Ottomata; "Hacky first work on loader classes." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2475 [01:39:58] New patchset: Ottomata; "Renaming the concept of variables to 'traits'. Allowing trait_sets to be specified so that we don't record HUGE amounts of data." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2477 [01:39:59] New patchset: Ottomata; "base.py - adding schema in comments. Got lots of work to do to make this prettier" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2476 [01:40:01] New patchset: Ottomata; "device_pipeline.py - comments about hackyness" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2479 [01:40:02] New patchset: Ottomata; "Adding loader.py - first hacky loader, just so we can get some data into mysql to work with." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2478 [01:40:03] New patchset: Ottomata; "Buncha mini changes + hackiness to parse a few things. 
This really needs more work" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [01:40:04] New patchset: Ottomata; "pipeline/user_agent.py - adding comment that this file should not be used" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2480 [01:41:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds [01:56:00] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 620s [01:56:27] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 649s [02:00:12] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld [02:02:32] Ryan_Lane: needing auth doesn't sound like a valid reason for 400 ;) (maybe 403) [02:02:48] * Ryan_Lane shrugs [02:04:24] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 5518 seconds [02:21:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:52] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:27:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.524 seconds [02:29:55] !log installed labstore1-4 [02:29:57] Logged the message, Master [02:31:33] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Fri Feb 17 02:31:25 UTC 2012 [02:34:33] RECOVERY - Disk space on search1002 is OK: DISK OK [02:34:42] RECOVERY - DPKG on search1002 is OK: All packages OK [02:34:42] RECOVERY - RAID on search1002 is OK: OK: no RAID installed [02:41:27] PROBLEM - Disk space on mw1 is CRITICAL: DISK CRITICAL - free space: /tmp 41 MB (2% inode=87%): [02:42:30] RECOVERY - NTP on search1002 is OK: NTP OK: Offset -0.01181232929 secs [02:43:50] 
!log deployed updated thumb_handler.php to ms5 to include Content-Length in generated images [02:43:52] Logged the message, Master [02:44:00] RECOVERY - Disk space on mw1 is OK: DISK OK [02:45:48] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [02:46:27] New patchset: Bhartshorne; "correcting syntax for passing through headers to ms5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2636 [02:46:33] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:46:50] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2636 [02:46:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2636 [02:47:00] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.026 second response time on port 8123 [02:56:09] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [03:03:31] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2634 [03:04:04] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2475 [03:04:04] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2475 [03:04:21] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2476 [03:04:21] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2476 [03:04:35] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2477 [03:04:35] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2477 [03:04:48] New review: Diederik; "Ok." 
[analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2478 [03:04:49] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2478 [03:05:04] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2479 [03:05:04] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2479 [03:05:21] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2623 [03:05:36] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2635 [03:05:54] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2480 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2634 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2635 [03:05:56] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2480 [04:21:57] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:18] New review: Tim Starling; "Your C skills are improving rapidly, so in this review I included some comments about style and conv..." 
[analytics/udp-filters] (refactoring); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2626 [04:23:18] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [07:38:21] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [07:40:45] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [07:41:21] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:42:08] !log upgraded mysql on db40 to 5.1.53-facebook-r3753, enabled innodb_use_purge_thread [07:42:10] Logged the message, Master [07:53:03] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [07:55:18] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [07:55:36] RECOVERY - Lucene on search9 is OK: TCP OK - 0.004 second response time on port 8123 [09:05:39] PROBLEM - Disk space on mw43 is CRITICAL: DISK CRITICAL - free space: /tmp 32 MB (1% inode=87%): [09:06:06] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:07:18] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [09:09:33] RECOVERY - Disk space on mw43 is OK: DISK OK [09:54:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [09:57:42] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:01] !log Shutdown ragweed for decommissioning [09:58:03] Logged the message, Master [10:00:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:00:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:45:06] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [10:46:00] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123 [10:49:45] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:53:30] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.010 second response time on port 8123 [10:54:15] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [10:56:39] RECOVERY - Lucene on search15 is OK: TCP OK - 2.995 second response time on port 8123 [11:10:00] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:11:21] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:35:30] PROBLEM - Disk space on mw48 is CRITICAL: DISK CRITICAL - free space: /tmp 70 MB (3% inode=87%): [11:46:00] RECOVERY - Disk space on mw48 is OK: DISK OK [12:35:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:39:42] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [12:41:39] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [13:35:03] PROBLEM - Disk space on srv233 is CRITICAL: DISK CRITICAL - free space: /tmp 46 MB (2% inode=89%): [13:39:06] RECOVERY - Disk space on srv233 is OK: DISK OK [13:40:42] <^demon> !log srv233: removed /tmp/mw-cache-1.17 to give it a little more space for now [13:40:44] Logged the message, Master [13:41:12] you know, ... *sigh* nm [13:41:49] <^demon> Hm? [13:42:44] <^demon> Theres a lot of random .jpg's that are pretty recent in /tmp. Wonder if they're FileBackend related. [13:43:05] <^demon> More leakage possibly, Aaron should investigate :) [13:43:31] jpgs? urk [13:45:40] <^demon> But yeah, the 1.17 cache freed up about ~700M so it should be content for now. [13:46:36] PROBLEM - Disk space on mw40 is CRITICAL: DISK CRITICAL - free space: /tmp 65 MB (3% inode=87%): [13:46:59] <^demon> Argh, are we going to play whack-a-mole on all the apaches since we pushed out 1.19? [13:47:00] apergos (hi) do you happen to know why all syslogs are sent to the same file at /home/wikipedia/syslog/syslog ? 
[13:47:16] <^demon> They're all in /h/w/logs/syslogs [13:47:17] because it's way easier for us to grep through one than through a bunch [13:47:31] ohhh [13:47:36] well the apache messages go to another one but [13:47:51] I am saying that because there are some apache error messages there [13:47:57] that should be redirected somewhere else [13:48:12] also the swift stuff is really spamming that file and it sounds like that traffic could be directed to another log file [13:48:16] the apache ones should land in apache.log or something like that, same directory [13:48:28] I would be fine with having the swift stuff split out [13:49:09] RECOVERY - Disk space on mw40 is OK: DISK OK [13:49:19] whackanapache [13:55:54] ok, I've figured out the last thing that is required to extend our search cluster into eqiad: that's good! [13:56:18] notpeter, so what was the problem? [13:56:20] the bad news is that it will require restarting all of the lucene procs *with a new init script*: that's bad [13:56:42] if start-stop-daemon --start --quiet --background --user lsearch --chuid lsearch --pidfile $pid --make-pidfile --exec /usr/bin/java -- -Xmx3000m -Djava.rmi.server.codebase=file://$BINDIR/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $BINDIR/LuceneSearch.jar [13:56:54] specifically: -Djava.rmi.server.hostname=$HOSTNAME [13:57:10] so, each host is in the rmi registry without an fqdn [13:57:21] so, in the example of searchidx2 [13:57:32] the host in eqiad successfully hits it [13:57:41] asks what it's name is in the rmi registry [13:57:47] then tries to use that for communication [13:57:50] which is not an fqdn [13:57:55] and it fails across DCs [13:59:34] so in the init script I can just replace $HOSTNAME with `hostname --fqdn` and it should work across DCs [13:59:51] rainman-sr: does that make sense? [14:01:20] yep, i think that's a good explanation [14:02:04] awesome. 
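The fix notpeter describes above can be sketched as follows. This is an illustration, not the real init script: the helper function and its output format are made up, and only the JVM flags are taken from the snippet pasted in the channel. The point is that registering under `hostname --fqdn` instead of the short `$HOSTNAME` gives the RMI registry a name that resolves from the other datacenter.

```shell
# Build the -D flags the init script would pass to java; the RMI
# server hostname is the part that matters for cross-DC lookups.
rmi_java_args() {
    rmi_host="$1"
    printf -- '-Xmx3000m -Djava.rmi.server.hostname=%s -jar LuceneSearch.jar' "$rmi_host"
}

# old behaviour: short name, fails across datacenters
rmi_java_args "$(hostname)"; echo
# proposed behaviour: FQDN, resolvable from both pmtpa and eqiad
rmi_java_args "$(hostname --fqdn 2>/dev/null || hostname)"; echo
```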
I'm going to do some more testing on this to make sure of it before some kinda messy transition (as this all came to me when I was pretty drunk last night...) [14:02:19] but, I'm glad that this sounds like a logical explanation [14:19:36] PROBLEM - Disk space on srv270 is CRITICAL: DISK CRITICAL - free space: /tmp 42 MB (2% inode=88%): [14:39:33] RECOVERY - Disk space on srv270 is OK: DISK OK [16:06:29] New patchset: Pyoungmeister; "adding raid5 setup for searchidx boxxies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2637 [16:06:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2637 [16:29:39] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [16:34:04] New review: RobH; "looks right, but partman is tricky" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2637 [16:34:05] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2637 [16:39:51] PROBLEM - Disk space on mw8 is CRITICAL: DISK CRITICAL - free space: /tmp 9 MB (0% inode=86%): [16:44:03] PROBLEM - Disk space on mw16 is CRITICAL: DISK CRITICAL - free space: /tmp 67 MB (3% inode=86%): [16:49:18] RECOVERY - Disk space on mw16 is OK: DISK OK [16:49:29] New patchset: Pyoungmeister; "increasing timeout on check_lucene, at least temporarily, to attempt to combat flapping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2638 [16:49:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2638 [16:50:20] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2638 [16:50:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2638 [16:51:31] just a headsup - I'm about to put swift back into production. [16:51:40] 50%, wait 5 minutes, 100%. 
[16:54:14] "mindight" is just a proxy for "time between when I went to sleep and when I woke up this morning." [16:54:39] moop. [16:54:41] ww. [16:57:58] moop? ww? [16:58:37] wrong window. [16:58:51] heh [16:58:55] !log turned swift live for 50% of all thumbnail requests [16:58:57] Logged the message, Master [16:59:07] ah bringing it back? [16:59:45] yeah... [16:59:55] yay! [16:59:57] we got three changes that will make it Better(tm) this time [17:00:42] oh? [17:01:05] we taught ms5 to send etag and content-length headers [17:01:14] oh good! [17:01:18] that's been a while in coming [17:01:43] disabled chunked-upload for putting the objects into swift (so that swift could pass in the content-length header and toss the put if the inserted content didn't match it or the etag) [17:02:01] oh, no chunked upload :-( [17:02:14] bummer.... [17:02:14] and taught swift to pass along x-forwarded-for, user-agent, and one other header. [17:02:21] it's actually not any different. [17:02:25] good about the headers, that's a win [17:02:29] and it's also separate from chunked-upload for the user. [17:02:37] ah ok [17:02:38] this is only chunking between ms5 and swift. [17:03:01] and literally, the only difference was send(size, chunk) to send(chunk). [17:03:08] ok [17:03:31] it's still inserting the data one chunk at a time; it's just also giving it a "just a heads up, I'm going to send you exactly this many bits." [17:03:39] right [17:03:42] RECOVERY - Disk space on mw8 is OK: DISK OK [17:03:54] "chunked upload" really just means "I'm not going to tell you ahead of time how long my upload is." [17:04:09] a silyl system [17:04:13] *silly [17:04:27] it's useful if you really don't know... [17:04:34] but in this case we always do. [17:05:01] we sure do [17:48:51] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:50:12] ok, I don't see anything wrong with swift; gonna switch to 100%.
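The distinction drawn above can be illustrated with the headers the two kinds of PUT would carry: a chunked upload omits Content-Length ("I'm not going to tell you ahead of time how long my upload is"), while a sized upload declares the body length up front so the receiver can reject a truncated or mismatched body. The request path and body here are made up for the example.

```shell
BODY='fake-thumbnail-bytes'

# chunked PUT: no length declared, body arrives in self-delimiting chunks
chunked_put_headers() {
    printf 'PUT /v1/thumbs/example.jpg HTTP/1.1\r\nTransfer-Encoding: chunked\r\n\r\n'
}

# sized PUT: "just a heads up, I'm going to send you exactly this many bytes"
sized_put_headers() {
    printf 'PUT /v1/thumbs/example.jpg HTTP/1.1\r\nContent-Length: %d\r\n\r\n' "${#BODY}"
}

sized_put_headers
```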
[17:51:15] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [17:52:08] !log changing squids to send 100% of thumbnail traffic to swift [17:52:10] Logged the message, Master [17:52:13] woosters: robla ^^^ [17:52:34] maplebed: coolio...when did the scripts finish? [17:53:08] robla: they started overlapping around 13:30UTC [17:53:20] woo hoo! [17:53:28] and really petered out just a little bit ago. [17:54:00] (since there were 25-30 threads running different containers, some started redoing work while there were still untouched buckets) [18:00:37] !log temporarily stopping puppet on brewster. please let me know if you need to turn it back on [18:00:39] Logged the message, and now dispaching a T1000 to your position to terminate you. [18:01:42] notpeter: manually tweaking partman stuff? [18:01:48] not me... I am basically done for the day. if I have any energy a little later I'll work more on a bit of code but that's about it [18:02:11] RobH: yep! [18:02:23] heh, thats the only reason i ever stop puppet on brewster =] [18:02:57] yep! [18:03:02] it works so well :) [18:03:02] hrmph. [18:03:06] :-) [18:05:50] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&c=Swift%20pmtpa&m=swift_HEAD_404_avg&r=hour [18:06:44] !log sending thumbnail traffic back to ms5, taking swift out of production [18:06:46] Logged the message, Master [18:07:09] ? [18:07:18] :-( [18:07:20] woosters: robla: ^^^ - for some reason, HEAD requests to nonexistent images are taking 60s. [18:07:32] GETs to those same images work just fine. [18:08:13] granted, there's only one of those every 5 seconds or something, [18:08:17] but still. 60s is bad. [18:10:32] not acceptable :-( [18:21:36] maplebed: any sense of which part of the pipeline the delay seems to be introduced? [18:21:58] if you have a sec, I'll switch to RL. 
[18:22:32] I'm doing a lot of multitasking here [18:24:47] ok - the code that is supposed to copy data into swift on 404 does the whole copy-and-send-back thing, right? well, with a HEAD it's not supposed to send any data back, so what does it do? I haven't figured out where it's stopping yet, but that's where I'm looking. [18:26:43] also it's only HEADs to things that don't exist that crap out - apparently cloudfiles is doing HEADs to containers now, and those are doing just fine (0.00x ms) [18:33:44] mark: I'm having a meeting with legal today about Labs [18:33:58] mark: have any opinion on wmf private data in labs? [18:34:20] I think either a separate zone, or very strict enforcement of project membership is enough [18:34:34] people will have to identify to have access to those projects, I assume? [18:34:45] that's a question for today [18:34:54] it could be that everyone that has a public IP will have to [18:34:58] oh my [18:35:15] since they are hosting a service on a wikimedia project, and have access to the IP logs [18:35:33] our privacy policy is incredibly strict :D [18:36:11] I'd kind of like an exception for Labs, but that would take a board decision [18:36:50] i'm not sure if i like having private data on labs [18:36:59] labs is supposed to be for playing with projects and a beta zone [18:37:10] tool labs is for analytics [18:37:22] it's supposed to be considered quasi-production [18:37:34] what about a separate pod for quasi production that has way stricter access rules ? [18:37:44] we could have a separate zone, yeah [18:38:02] but realistically, it doesn't offer much more protection than a separate project [18:38:12] well, I lie :) [18:38:40] it offers hardware separation, which protects against instance hopping via a host [18:38:58] but the data would be stored in the gluster cluster.... 
[18:39:09] so, it would be based on access to the data in gluster [18:39:34] nwfilter denied mac and ip spoofing [18:39:37] *denies [18:40:04] so, it shouldn't technically be possible for an instance to masquerade as another for data access [18:40:27] PROBLEM - Disk space on mw9 is CRITICAL: DISK CRITICAL - free space: /tmp 6 MB (0% inode=86%): [18:40:31] there's one small issue, though [18:40:59] if an instance is deleted from projectA, then gluster needs to remove it. if that IP is reassigned, it's possible for a short period of time that another instance may have the same IP [18:41:17] we can work around that by creating a new network, for use in private projects [18:41:34] we can assign that network specifically to a private project, ensuring only it gets those IPs [18:41:48] RECOVERY - Disk space on mw9 is OK: DISK OK [18:42:18] we can also change the access.conf on instance in projects marked private [18:42:32] so that only members of an approved group can ssh into them [18:43:47] the reason none this worries me much, is that this software is used for public clouds [18:44:13] where project separation is used to separate entire organizations [18:44:59] full zone separation is an option, though :) [18:45:18] hmm [18:45:33] it'll eat up a couple pieces of hardware to do so [18:45:59] worth it to allow non-identified use of some projects, I think [18:46:33] well, people have to identify if they can read ip logs of any web request [18:46:34] s/think/am pretty dang sue [18:46:48] which means if they are publically demoing something. 
[18:46:55] *sure [18:46:57] hmmmmm [18:47:04] that bites [18:47:12] that's different from accessing wmf private data, though [18:47:21] for that they'd likely need to sign an agreement [18:47:27] right [18:47:37] thankfully, we can have groups for all of this stuff [18:47:43] and enforce access via group members [18:47:48] *group membership [18:47:53] so implementation isn't terribly hard [18:47:58] that group membership code better be rock solid [18:48:08] it's posix groups via LDAP [18:48:20] using access.conf (pam_security) [18:48:32] I feel relatively confident about pam [18:48:36] same [18:48:38] I feel less so about ldap [18:48:44] LDAP just stores the data [18:48:54] if the system can't access LDAP, it can't get the groups, and denies access [18:49:13] but if someone can shoehorn their own data in there [18:49:17] (for example) [18:49:30] difficult, but possible [18:49:54] PROBLEM - ps1-d2-pmtpa-infeed-load-tower-A-phase-Y on ps1-d2-pmtpa is CRITICAL: ps1-d2-pmtpa-infeed-load-tower-A-phase-Y CRITICAL - *2525* [18:50:07] unless we have two consoles, and two LDAP domains, there's no getting around that, though [18:50:35] you undrestand I'm thinking about the serious troll/hacker who has decided (wikipedia-review style) to try to either get info, cause damange or just harm our reputation [18:50:44] maplebed [18:50:49] I could restrict write access to those specific groups to a limited subset of users [18:50:50] so most people wouldn't fall into that category, but [18:50:52] woosters [18:50:58] apergos: indeed [18:51:02] yeah I think you need to [18:51:02] what is the impact on that Head request? [18:51:03] there's no direct access to the LDAP server [18:51:14] all it takes is one person to screw that up for us [18:51:17] well, I lie. that's not totally true [18:51:24] there's access via Labs [18:51:29] *shudder* [18:51:39] woosters: I'm not sure what you mean. [18:51:44] meh. 
I could limit write access to specific IPs [18:51:54] woosters: (I took swift back out of production till I figure out what it's about) [18:51:55] so that only labsconsole, and a handful of other servers can write to it [18:52:06] that would make me happier [18:52:14] ok maplebed [18:52:22] ok let's assume the groups stuff can be tightened up [18:52:28] then the zone approach makes sense [18:52:29] hm. do I have audit log enabled? [18:52:42] pretty sure I do [18:52:46] woosters: I have something I think might fix it though; testing now. [18:53:27] apergos: groups stuff is pretty easy. I can just make two more global groups, and wiki admins can manage them via labsconsole [18:53:45] so there will be: projects with non public ips [18:53:49] projects with public ips [18:53:53] projects with private data [18:54:06] and global projects [18:54:23] global projects are projects used by all other projects (like bastion) [18:54:25] example of a globl project? [18:54:26] ah [18:54:31] in which no one has root, except ops [18:54:41] ok [18:54:59] I'm handling that via puppet, currently [18:55:09] I don't *really* love the way I'm doing it, but it works [18:55:22] maybe you'll have someone who can take that over soon :-D [18:55:27] heh [18:55:28] hopefully [18:56:04] I really need a better way of marking a project that puppet will know about [18:56:08] so in interesting-ish news, the only people who it looks like haven't honored dns ttl's and are using the mobile wap are chinese search engines [18:56:17] hahaha [18:56:17] huh [18:56:22] imagine that [18:56:33] youdao and baidu [18:56:39] mostly youdao [18:57:06] ip range? 
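The access.conf scheme described above (pam_access consulting posix groups that LDAP supplies) would look roughly like this on an instance in a private-data project. This is a sketch, and the group name is hypothetical:

```
# /etc/security/access.conf: allow logins only for root and members of
# the project's approved group; deny everyone else.
+ : root (project-private-admins) : ALL
- : ALL : ALL
```

If the system can't reach LDAP it can't resolve the group, so the fallthrough deny line refuses access, which matches the fail-closed behaviour described in the discussion.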
[18:57:18] can't tell since it's redirected [18:57:27] bah [18:57:29] just could figure out youdao by plugging in the site to google translate [18:57:37] :-D nice [18:58:42] New patchset: Lcarr; "Minor whitespace cleanup For purposes of triggering git post-merge hook on sockpuppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2639 [18:59:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2639 [18:59:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2639 [18:59:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2639 [18:59:11] so, either way. I'm going to get legal's opinion on private data in labs, then will ask for comments on ops list [18:59:31] i might just be paranoid :) [18:59:43] nah. it's a legitimate concern [18:59:58] I think it's heavily influenced by implementation, though :) [19:00:05] you're paranoid and they are out to get us [19:00:10] worksforme [19:00:12] well, some people *are* [19:00:18] yup [19:00:20] provably [19:01:11] the other thing to think about is what sort of access people will have to projects with private data [19:01:14] what I mean is [19:01:22] access to the data is one thing [19:01:32] root-level access to do stuff on the machine is another [19:01:44] !log changed sockpuppet's post-merge hook so that you need to have ssh keys forwarded (though you really would anyways due to brokenness) [19:01:45] because what you don't want is that the data becomes exposed [19:01:46] Logged the message, Mistress of the network gear. 
[19:02:01] by leaving some sort of security hole [19:02:12] PROBLEM - Disk space on mw52 is CRITICAL: DISK CRITICAL - free space: /tmp 67 MB (3% inode=87%): [19:02:32] * Ryan_Lane nods [19:02:34] yes, you did need to have ssh keys forwarded, I got bit on that about every other time till I finally drilled it into the brain [19:02:59] so, I'm already going to be breaking up the bastion instances [19:03:06] PROBLEM - Disk space on mw21 is CRITICAL: DISK CRITICAL - free space: /tmp 33 MB (1% inode=86%): [19:03:43] bastion, bastion-limited (users with shell on production cluster), bastion-identified, bastion-private, etc. [19:03:58] having that many would be confusing, though [19:04:54] RECOVERY - Disk space on mw52 is OK: DISK OK [19:05:08] I definitely want to break production shell users away from others [19:05:24] yes indeed [19:06:01] using kerberos would solve a lot of these problems [19:06:14] yeah, you did, though in theory it was forwarding a key which wasn't on the authorized keys list [19:06:18] so was a bit pointless [19:06:20] i just took that out [19:06:45] ah, that's why it never worked :-D [19:07:01] :) [19:07:22] so i have a question -we occasionally get these nagios alerts for free space and often /tmp is filled with stuff [19:07:30] are these files actually safe to delete ? [19:07:34] or if i do will it break something ? [19:09:07] for example, mw21 [19:09:46] is mw21 a scaler or what is is? [19:09:48] *it [19:09:51] PROBLEM - Disk space on mw17 is CRITICAL: DISK CRITICAL - free space: /tmp 37 MB (2% inode=86%): [19:10:05] * apergos goes to look at mw17 [19:10:57] are all the scalers fillin up or something? [19:11:06] RobH: ? [19:11:20] mw52, mw21, and mw17 all disk space warning [19:11:33] ah. [19:11:48] wtf [19:11:53] no cron job on those? [19:12:18] the "real" scalers have a cron job that clears out the cruft, runs every 5 mins [19:12:21] are those part of the image scalingc cluster? they're not in ganglia. 
[19:12:31] I dunno [19:12:32] (as part of the image scalers pmtpa cluster){ [19:12:41] I don't see how they can be but... [19:12:42] ok. [19:12:47] * maplebed goes away again [19:12:52] ok, mw17 and 21 are apaches [19:12:53] not scalers [19:13:09] well there's a pile of stuff in tmp that's just like it was a scaler [19:13:14] so I really wonder what's going on [19:13:27] phpDzwNJ2.jpg [19:13:28] etc. [19:13:48] can items in /tmp be removed on apacheS? [19:13:56] not all of em [19:14:07] hrm [19:14:08] but can remove the old ones right? [19:14:12] anything older than 10 mins that's a flat file [19:14:14] (sorry was distracted by tim tams….mmmm…) [19:14:15] lemme see if that's true [19:14:25] even if i could just clean out the 2011 stuff that should help [19:14:50] well, if we have a timeline, like 1 week [19:14:54] wurfl.xml and wurfl-2.3.xml, what are these [19:14:56] i am happy to rm all the files older than that [19:15:00] but anything that is EasyTimeline* [19:15:09] phpXXXX* (flat files) [19:15:12] those can all go [19:15:17] older than 15 mins [19:15:23] we cannot remove easytimeline? [19:15:35] you can remove EasyTimeline* and phpoXXX* flat files [19:15:38] older than 15 mins [19:15:52] ls [19:15:57] bah, wrong window ;] [19:16:00] there's some xxxxxx.png that are old [19:16:07] you can toss those too [19:16:15] the phprandom.image fiels? [19:16:21] toss [19:16:23] is there anything in tmp that cannot go away ? [19:16:34] I dunno about the *xml [19:16:46] the mw-cache* get kept [19:17:20] lost+found of course gets kept [19:17:30] the rest, *poof* [19:17:41] cool [19:17:49] if someone knew about these xml files maybe they could go to [19:17:50] o [19:18:11] ugh it's some mobile crap [19:18:22] don't want to toss [19:18:33] someone ask the mobile folks if they can go [19:19:09] RECOVERY - Disk space on mw17 is OK: DISK OK [19:20:09] should be cronned [19:21:34] heh [19:21:39] who fixed mw17? 
[19:21:42] PROBLEM - Disk space on mw41 is CRITICAL: DISK CRITICAL - free space: /tmp 7 MB (0% inode=87%): [19:21:43] or did it self correct? [19:22:05] me [19:22:11] I left a lot of cruft [19:22:12] ok, just making sure [19:22:16] but tossed some [19:22:52] i am ditching specific filetype patterns over one day old [19:23:00] easier and works with the find command i already know [19:23:02] ;] [19:23:09] great [19:23:11] i'm on mw21 [19:23:17] LeslieCarr: i am [19:23:19] already ;] [19:23:19] if you wanna do the phpXXX ones on mw17, I didn't do those [19:23:20] oh [19:23:21] hehe [19:23:28] RobH: what commands are you using ? [19:23:40] find /tmp/ -name *.png -type f -mtime +1 -delete [19:23:54] ah cool [19:23:56] kills files older than a day [19:25:21] i guess i could have used -mmin +number of minutes [19:25:52] is the mw-cache-1.17 ok to go ? [19:25:56] 15 mins [19:26:03] you know they tossed it on some others [19:26:06] i was about to ask what LeslieCarr did [19:26:07] I think you can poof it [19:26:14] we only need 1.19 right? [19:26:16] 18 [19:26:17] 19 [19:26:19] ok [19:26:57] RECOVERY - Disk space on mw41 is OK: DISK OK [19:27:17] ditching 1.17 gave me back to 72% from 94% [19:27:30] !log manually cleaned up tmp on mw21 [19:27:32] Logged the message, RobH [19:27:34] yeah me too :) looks like 1.17 is the key [19:27:40] !log manually cleaned up tmp on mw41 [19:27:42] Logged the message, Mistress of the network gear. [19:27:42] ah [19:27:45] i has an idea! [19:27:54] dsh removal of the directory across cluster? [19:27:57] puppet requiring /tmp/mw-cache-1.17 to be absent [19:28:09] i prefer ddsh solution ;] [19:28:09] RECOVERY - Disk space on mw21 is OK: DISK OK [19:28:17] wont it not come back since we moved past 1.17? [19:28:43] cool, the ddsh solution just won't get all the new machines , then again no new machines are on 1.17 so.... 
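The cleanup RobH and LeslieCarr are running above boils down to a single find invocation over the rules they settle on (flat files matching php*, EasyTimeline*, or *.png, older than a cutoff). A sketch of that pattern follows; the `cleanup_tmp` wrapper name and the `-maxdepth 1` guard are illustrative additions, not from the log. Note the `-name` patterns are quoted so the local shell cannot expand them before find sees them, which the unquoted `*.png` in the command as typed only avoids by luck:

```shell
# Sketch of the /tmp cruft cleanup discussed above. The wrapper name and
# -maxdepth 1 are illustrative; the patterns and the minutes-based cutoff
# (-mmin, as RobH notes, rather than -mtime's whole days) come from the
# conversation. -type f skips directories like mw-cache-* and lost+found.
cleanup_tmp() {
    dir=$1    # directory to sweep, e.g. /tmp
    mins=$2   # age cutoff in minutes, e.g. 15
    find "$dir" -maxdepth 1 -type f \
        \( -name 'php*' -o -name 'EasyTimeline*' -o -name '*.png' \) \
        -mmin +"$mins" -delete
}
```

With `-mmin +15` this matches the "older than 15 mins" rule apergos gives for EasyTimeline* and php* flat files; `-mtime +1` as used on mw21 is the coarser one-day variant.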
[19:28:43] yay [19:29:26] it's gone on the real scalers already [19:29:33] reedy I think got it at some point [19:30:01] so i have not used dsh in a few months, you guys mind checking my syntax? [19:30:06] ddsh -cM -g apaches -- rm -rf /tmp/mw-cache-1.17 [19:30:28] seem right? [19:30:39] (since it runs across all apaches, i prefer a reality check) [19:30:44] our scripts put the actual remote command in '' [19:31:03] oops. sorry for th gratuitous ping [19:31:05] ddsh -cM -g apaches -- "rm -rf /tmp/mw-cache-1.17" [19:31:20] Reedy: like that? [19:32:00] bueller? [19:32:08] (silence) [19:32:16] What's hte -- for? [19:32:26] its just whats on the docs and i have always used [19:32:38] http://wikitech.wikimedia.org/view/Dsh [19:33:00] based on that, your original should be fine then [19:33:18] quotes must just be a style type issue [19:33:37] I dunno what the -- is either [19:35:23] ddsh -cM -g apaches 'rm -rf /tmp/mw-cache-1.17' [19:35:30] is essentially what I was using earlier in the week [19:36:38] i will run in a moment, comcast calling me for overdue bill due to failed auto payment [19:36:41] =P [19:37:23] !log ran dsh command to remove all the /tmp/mw-cache-1.17 [19:37:25] Logged the message, RobH [19:37:43] sites still up. [19:37:45] \o/ [19:37:48] yay [19:38:08] that directory was huge [19:38:13] lots more space now [19:38:31] I know deleting the php-1.17 folder earlier this week saved about half a gig [19:38:55] :) [19:40:04] heh, when comcast calls me, i take the call [19:40:04] s [19:40:04] ince [19:40:04] [19:40:05]  [19:40:18] damned colloquy bug, restarting. (it likes to send extra returns) [19:40:55] if only other osx irc clients were as pretty. [19:41:04] form over function im afraid [19:41:34] awww [19:44:11] yeah, anyone found any better than colloquy ones ? 
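On the quoting question above: dsh joins whatever follows its options into one string and hands it to a shell on each target host, so for a metacharacter-free command like `rm -rf /tmp/mw-cache-1.17` the quoted and unquoted forms arrive identically; quotes only start to matter once globs, `$variables`, pipes, or semicolons are involved, since otherwise the local shell interprets them before dsh ever runs. A purely local simulation, with `bash -c` standing in for the remote shell (the `run_remote` name is invented for illustration):

```shell
# dsh-style dispatch: the trailing arguments are joined with spaces and
# the resulting string is executed by a shell on the target. bash -c
# stands in here for "shell on the remote host".
run_remote() {
    # "$*" joins the arguments exactly as dsh would before shipping them
    bash -c "$*"
}

# Single quotes keep the LOCAL shell from touching the command, so the
# remote shell sees it intact:
run_remote 'echo would remove /tmp/mw-cache-1.17'
# prints: would remove /tmp/mw-cache-1.17
```

As for the `--` nobody could explain: it is the conventional end-of-options marker, which keeps dsh's own option parser from mistaking the remote command's flags (like `-rf`) for dsh options, so including it is harmless and occasionally necessary.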
[19:45:10] I haven't found a client I don't hate in some fundemental way [19:45:28] I'm using adium, and finally got it to a spot where I don't want to burn it with fire [19:45:35] but it still has issues I hate [19:48:33] New patchset: Bhartshorne; "Do not write the object to swift for HEAD requests, just return it to the user." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2641 [19:48:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2641 [19:48:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2641 [19:52:21] i use adium for all but irc [19:52:25] irc in adium frustrates me. [19:53:21] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:40] !log search1001 down for reinstall [19:53:42] Logged the message, RobH [19:54:50] New patchset: Bhartshorne; "add in Accept-Ranges header to be consistent. Also it's ok if we can't get the headers." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2642 [19:55:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2642 [19:55:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2642 [19:55:34] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2642 [19:55:45] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [19:56:25] so on the es slaves, should we be worried about a full /a ? [19:56:35] i dunno whats up with 10.04 but post dhcp pre partman it seems to go to blank screen for a long time. [19:57:23] it eventually continues, but its annoying. [19:57:51] yes we should, that means the binlog won't have a place to be written right? 
[19:58:00] and that means that the slaves won't be able to stay synced [19:58:22] maplebed: setup the rotation on those, may wanna let him know [19:58:43] atleast i think he did. [19:58:45] I noticed that RobH (the blank screen thing) [19:58:45] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [19:58:45] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.12 ms [19:58:51] but it does come back after a bit [19:59:01] apergos: its been doing that since 10.04, i think it may have to do with serial console redirection during the installer load [19:59:03] it's just not very user friendly [19:59:06] and it hates bios redirecting it [19:59:10] so it just goes blank [19:59:13] plausible [19:59:21] without bios redirection after post, you dont know when to hit f12 or see it pxe boot though [19:59:26] 6 of one half dozen of the other. [19:59:28] right [19:59:33] rather have it [19:59:47] indeed, i just know that the blank screen is two minutes of not knowing if its gonna work or not [19:59:48] heh [20:01:27] PROBLEM - DPKG on search1001 is CRITICAL: Connection refused by host [20:01:45] PROBLEM - RAID on search1001 is CRITICAL: Connection refused by host [20:01:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:01:46] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:02:12] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [20:02:30] PROBLEM - Disk space on search1001 is CRITICAL: Connection refused by host [20:02:46] nagios-wm: duh. 
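The worry earlier in this stretch about a full /a on the es slaves (binlogs with no place to be written, slaves falling out of sync) is usually addressed by the binlog rotation maplebed is said to have set up. The exact mechanism isn't stated in the log; one common form, as a hedged sketch with placeholder values rather than anything known to match production, is a pair of my.cnf settings:

```ini
[mysqld]
# Cap how long binary logs are kept so the binlog partition can't fill.
# Values below are placeholders, not the production configuration.
expire_logs_days = 7
max_binlog_size  = 1G
```

`PURGE BINARY LOGS BEFORE ...` from a cron job is the other common approach when retention has to track slave positions rather than a fixed age.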
[20:03:37] :-D [20:08:12] New patchset: Bhartshorne; "Better to leave out the header than have its value be 'None'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2644 [20:08:35] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2644 [20:08:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2644 [20:08:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2644 [20:10:54] !log reinstalling search1001 and searchidx1001 [20:10:56] Logged the message, RobH [20:11:30] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:13] !log sending thumbnail traffic back to swift; head bug is fixed. [20:13:15] Logged the message, Master [20:14:10] yay [20:14:12] wait... fucking a [20:14:19] nm, thats not it [20:14:29] for a moment i thought 'maybe this is like the ssd squids with no controller' [20:14:34] but thats not right, they arent ssd [20:14:40] it doesnt show as a boot option in bios though [20:15:37] notpeter: searchidx1001 is a bitch. [20:16:19] RobH: that was my assessment as well. [20:16:25] AaronSchulz: robla: woosters: swift is back in production service [20:16:41] cool [20:16:43] 100%? [20:16:57] maplebed: congrats =] [20:16:59] yeah. [20:17:03] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [20:17:09] * maplebed holds his breath... 
[20:17:20] ;) [20:17:34] yay [20:17:49] notpeter: so something is wrong, its not showing the bios prompt for the raid controller [20:18:13] checking a few things still, but thats not right at all [20:18:23] the h700 thats inside it should prompt during post [20:19:11] i'm checking some bios setttings to ensure its not disabled from working anyplace [20:22:30] RobH: I think I was able to crtl-r into bios [20:22:37] er, raid setup [20:22:51] i just toggled some bios setttings, lessee what it does now [20:23:23] most. cursed. project. ever. [20:24:24] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Feb 17 20:23:50 UTC 2012 [20:25:59] I'm going out to get some food; page me if anythincg swift-related shows up. [20:26:12] so far the best litmus test (http://commons.wikimedia.org/wiki/Special:NewFiles) looks fine. [20:26:20] as does http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&tab=v&vn=swift [20:27:37] maplebed: i just spent quite a bit hammering at a mahcine that couldnt see the disks, because it was h700 and wont let us see the disk swithout raid [20:27:46] i know you can identify with this since you went through it months ago ;] [20:29:48] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:15] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:31:24] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 30.85 ms [20:33:04] New patchset: Lcarr; "assigning neon as a ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2645 [20:33:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2645 [20:36:17] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:36:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:41:01] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:41:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:42:52] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:43:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:43:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2646 [20:43:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:44:22] !log restarting puppet on brewster [20:44:23] Logged the message, and now dispaching a T1000 to your position to terminate you. [20:51:12] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server [20:58:11] apergos: you there, by any chance? [20:59:44] yes [20:59:48] what's up? [21:00:09] (fair warning: it's 11 pm on a Friday night. Just sayin.) [21:00:52] notpeter: [21:00:54] that's fine, just asking questions and you can leave any time :) [21:01:03] ok [21:01:08] you worked on the searchidx migration last time, yes? [21:01:31] not really no [21:01:40] hrm [21:01:40] ok [21:01:49] I just tried to keep the server afloat til the new one could be set up [21:02:20] do you have any idea why it wants mounts of ms5.pmtpa.wmnet:/export/thumbs [21:02:21] if I had worked on it I woul dhave been a lot more helpful to you while you're doing all this work... [21:02:26] it does? [21:02:26] and ms7.pmtpa.wmnet:/export/upload [21:02:27] ? [21:02:37] try removing them [21:02:50] seriously. 
[21:02:58] alright [21:03:30] !log unmounting ms nfs mounts from searchidx2 [21:03:32] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:04:29] so... why? [21:05:06] apergos: might it needs those for search on commons? [21:05:14] what for? [21:05:17] no idea [21:05:18] it's not serving images [21:05:27] true [21:05:31] it can build an index from the file description pages [21:05:39] what is it going to do with a raw jpeg? [21:06:16] make jpeg sushi? [21:06:22] meh [21:07:14] prolly taste awful, have you seen some of the images we have?? [21:07:21] hahahaha [21:07:29] * apergos hums a few bars of "commons fornication" [21:07:37] hah! [21:08:22] well [21:09:34] let's see if that breaks anything, shall we? [21:09:47] I'm not seeing much in the logs [21:10:23] but I might remount, just to try to not fuck up this house of cards [21:11:03] be strong [21:11:06] stay the course man [21:11:18] read the topic and believe! [21:12:52] but our beloved house of cards! [21:14:09] our beloved mcgyvered infrastructure of duct tape and paper clips? [21:14:12] it'll hold! [21:14:28] with only a couple of nagios pages per night ;) [21:15:07] *phone on vibrate* [21:19:24] i wrote that those were needed in wikitech? [21:19:28] the plot thickens... [21:19:36] what did past peter know that present peter doesn't know... [21:24:53] apergos: reading your comments on the searchidx1 ticket is funny [21:25:45] I don't have any memory of it [21:26:05] but I'm glad to know I have created something with entertainment vallue [21:26:20] I have turned off the snapshot writing in cron, as folks [21:26:20] will see in the logs. At least, I hope that's what I turned off. Docs [21:26:23] are opaque and the codebase is huge. [21:26:53] I laugh because domas was talking about how this should all be easy! because it's documented [21:26:58] I wrote that? I *still* have no memory of it [21:27:17] yeah. 
real easy [21:27:25] the experience that you, jeff, and I have had with the documentation has proven otherwise [21:27:27] anyone know what url.wikimedia.org is yet?? [21:27:45] and I think I was working here for over 2 years before I ever heard of cachemgr [21:27:52] nope, i'd say let's just turn off dns :) [21:28:06] :-D [21:28:16] it will turn out to be some vital service [21:28:20] :-/ [21:28:24] well it's already down [21:28:31] it is? [21:28:35] so the service would be broken [21:28:36] yeah [21:28:44] I thought ryan brought it back after fixing ssl for planet [21:28:48] yeah, it just redirects to the default of contacts [21:29:20] what is "contacts" anyways? [21:29:25] now that's the first time I've seen that page [21:29:30] not sure, but it's used by some people [21:29:35] civimail jobs? [21:29:47] the docroot of url.wm.org doesn't even exist and it's not puppetized [21:29:54] I know it doesn't exist [21:30:03] it's not mentioned *anywhere* [21:30:16] it's a mystery :) [21:30:17] maybe it's mentioned on someone's hard drive someplace [21:30:36] !log remounted nfs mounts on searchidx2. to protect our house of cards [21:30:38] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:30:43] booooooooo [21:30:46] * apergos stabs nfs [21:30:59] I emailed rainman [21:31:07] ok [21:31:09] hopefully he'll tell me why they are mounted [21:31:23] if he can't, then I'll unmount again, and... well... buy the ticket, take the ride [21:31:29] * apergos predicts: "they are mounted? I had no idea" [21:32:18] now if only I could predict whether we default next monday or not.... [21:32:23] :-/ [21:32:45] apergos: that seems likely [21:32:50] I mean, both of them =P [21:32:51] no [21:32:56] I don't think you'll default on monday [21:33:05] you really think that you're at breaking point? 
[21:33:45] I think the ecb just exchanged all its current bonds for new ones, insulating itself [21:34:02] and that the greek govt is setting up the cacs [21:34:15] and that the eurogroup will tell us something monday [21:34:32] it's really hard to know whether we're going to be in selective default already at that point [21:34:48] already? [21:34:52] by monday? [21:34:59] how defaulty are the weekends over there? [21:35:04] hahaha [21:35:11] so the cabinet is having a weekend meeting [21:35:26] I am sure that everyone will be on the phone constantly until the monday meeting [21:36:04] apergos: yeah, seems likely [21:36:10] it's intricate financial negotiations, who can figure this stuff out [21:36:27] the invisible hand of the market can figure it out [21:36:29] that's who. [21:36:45] what a load of [21:36:52] :D [21:37:19] the bond exchange program is supposed to go (maybe) from feb 22 til march.. 6? [21:37:37] if there isn't enough voluntary participation then they'll make it mandatory [21:37:44] who knows what that means [21:37:57] selective default [21:38:31] well no [21:38:39] selective defautl is the voluntary program [21:38:55] ah, selective default with extreme prejudice? [21:39:01] I mean in theory selective default is the mandatory one too but then the cdses [21:42:18] <^demon> Completely pointless bug that I noticed and don't know where to file (rt?). On nagios the left frame link to load the server admin log still points at wikitech.leuksman. [21:43:30] heh [21:43:57] apergos: well, the fact that lsof | grep mnt comes back completely empty on searchidx2 supports your theory that those nfs mounts are not actually needed... [21:44:11] :-D [21:44:38] of course my theory is based on logic, which may not apply... [21:44:49] heh [21:46:50] it's possible they are just not in use right now ;) [21:46:56] notpeter: lsof will only show you currently open files (or directories) - you may have just snuck in between requests. 
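As maplebed points out just below, `lsof | grep mnt` is only a point-in-time check: it shows files open at the instant it runs, so an empty result can mean you snuck in between requests. For reference, the same instantaneous check can be approximated by scanning /proc directly; `open_under` is an invented helper name and this is Linux-specific:

```shell
# Rough equivalent of the `lsof | grep mnt` check discussed here: list
# "PID PATH" for every file descriptor currently open under a directory,
# by reading /proc/<pid>/fd symlinks (Linux only; unreadable fds of other
# users' processes are silently skipped). Like lsof, this is a snapshot,
# not proof that a mount is unused.
open_under() {
    dir=$1
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$dir"/*) pid=${fd#/proc/}; echo "${pid%%/*} $target" ;;
        esac
    done
}
```

For example, `open_under /mnt` printing nothing over repeated runs (or left looping overnight, like the tcpdump mentioned later) builds much better evidence than a single empty lsof pass.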
[21:47:01] hah [21:47:19] maplebed: yes, I'm not taking this as conclusive evidence [21:48:19] but, on the flip side, I'm confused as to why ms5.pmtpa.wmnet:/export/thumbs and ms7.pmtpa.wmnet:/export/upload would be needed.... [21:49:47] on which system? [21:49:52] searchidx2 [21:50:00] weird [21:50:03] totally [21:50:12] puppet class including them? [21:50:13] sent rainman an email asking [21:50:41] Ryan_Lane: no.... there's a note on wikitech from the idx1 -> idx2 migration notes saying they're needed [21:50:50] is it indexing uploads based on metadata? [21:50:50] sadly, I am the one who wrote that note [21:51:02] heh [21:51:33] Ryan_Lane: seems possible. but wouldn't that come out of s4? [21:52:09] no. we don't actually have metadata in the database. how awesome is that? :) [21:52:17] unless that changed sometime recently [21:52:23] what is this i dont eve [21:52:45] Ryan_Lane: have you come across openTSDB? [21:53:01] nope. whats that? [21:53:04] or dschoon ^^? [21:53:19] http://opentsdb.net/ [21:53:42] interesting [21:54:44] buh? [21:55:08] ah, interesting. [21:55:17] Yeah, I'll check it out. [21:55:29] Weird that they use HBase. That seems a poor fit. [21:55:49] oh? [21:55:50] It's not like timeseries metrics will collide, so you really do not need strong consistency or transactions. [21:55:57] ah [21:56:10] mongo, all the way, then? ;) [21:56:14] ew. [21:56:16] No. [21:56:32] I have heard only bad things about clustered mongo. [21:56:43] but it's webscale!! [21:57:18] irc need troll face support [21:57:18] when it comes to mostly counters and event-data -- stuff that's super-high write-throughput -- something like cassandra works really well [21:57:22] hehe [21:57:39] if you're able to bulk-load, then it doesn [21:58:12] cassandra 1.0 has some real serious read/write improvements compared to 0.7 [21:58:14] t matter as much what you choose. you'll need something that supports distributed queries, mostly. 
[21:58:23] is this more a ganglia replacement, than nagios? [21:58:33] cassandra has always been fast, but yeah. 1.0 has gotten some nice bells and wistles. [21:58:48] it was fast, now it's crazy fast :) [21:58:59] i think it was around 0.7 it got atomic increment and decrement, which is what turned it into a monster for analytics. [21:59:09] check [21:59:42] personally i love most that it has an homogenous topology. i don't know of any other open system that has its level of self-healing [21:59:53] this is one of the big problems with hadoop+hbase [22:00:03] yeah that's pretty cool indeed, no spof [22:00:31] the job management system for both is reliant on zookeeper, which doesn't scale all that well. [22:00:32] the major rewrite of hadoop will also make it without a spof [22:00:52] zookeeper hanging can grind everything to a halt :/ [22:01:19] yeah. the problem is that even without a "failure" your system can become unusuable. [22:02:05] i don't have much faith in things patterned on bigtable -- the model where you elect a new coordination node when one goes down [22:02:34] in my experience, the biggest issues with big systems are never so cut-and-dry as to have a clear "failure" state. [22:02:42] New patchset: Ottomata; "Trying my darndest to clean things up here! I've cloned a new repo, and am checking in my non-committed (an non-approved?) changes into this new branch. Hopefully gerrit will be happier with me." [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2647 [22:02:57] a server can merely be slow, or intermittently responsive, and that can ruin your day [22:03:21] anti-entropy and quorum do a better job handling these middle states than failover [22:03:30] failover tends to flap horribly in those situations [22:08:02] well, a tcpdump is showing no traffic between searchidx2 and either of those ms boxxies. 
I'll leave that running over night while waiting to hear back from rainman [22:09:44] notpeter: you might want to check your tcpdump is called with -n to avoid DNS lookup :D [22:12:58] hashar: yeah, I'm grepping out arps [22:13:22] should be fine so :-D [22:14:06] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:14:19] anyone with puppet foo have time to check that out ? [22:14:26] i'm sure it's horribly broken ;) [22:14:44] I am soooo glad I'm off the clock [22:14:46] :-D [22:14:57] LeslieCarr: well the linter is not happy about this change :D [22:15:25] hehehehe [22:15:28] LeslieCarr: Could not parse for environment production: Syntax error at '/'; expected '}' at /var/lib/gerrit2/review_site/tmblablab/manifests/ganglia.pp:218 [22:15:39] ah at least that one should be easy to fix :) [22:19:24] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:20:25] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:22:11] LeslieCarr: you probably want to install puppet locally to validate your files [22:22:15] sudo gem install puppet [22:22:17] then [22:22:28] puppet parser validate some/broken/manifest.pp [22:22:30] * Ryan_Lane shudders [22:22:37] :) [22:23:25] there is puppet-lint to (hint hint, got a change that add a wrapper for it : https://gerrit.wikimedia.org/r/#change,2629 ) [22:23:56] hehe, yeah, there's also local-lint :p which i also didn't do [22:24:30] New review: Diederik; "Ok." 
[analytics/reportcard] (otto/pipeline); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2647 [22:24:30] Change merged: Diederik; [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2647 [22:29:10] why did my puppet parser disappear though ? [22:30:04] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:33:45] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:33:51] sigh, i give up on that one for now [22:34:07] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2645 [22:34:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2645 [22:40:42] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [22:42:39] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [22:48:07] New patchset: Lcarr; "Making a 2nd nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2649 [22:48:26] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2649 [22:52:25] New patchset: Lcarr; "Making a 2nd nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2650 [22:56:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2650 [22:56:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2650 [23:05:39] New patchset: Lcarr; "Removing nagios-plugins from nagios::monitor directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2652 [23:08:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - 
https://gerrit.wikimedia.org/r/2652 [23:08:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2652 [23:12:27] New patchset: Lcarr; "fixing nagios::monitor and nrpe conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2654 [23:16:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2654 [23:16:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2654 [23:34:02] New patchset: Lcarr; "Adding in more requirements for new nagios box" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2658 [23:35:20] New patchset: Ryan Lane; "Removing gluster peering support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2659 [23:40:18] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:40:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2660 [23:40:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2658 [23:40:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2658 [23:40:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2658 [23:41:41] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:42:00] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2659 [23:42:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2659 [23:43:05] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/2660 [23:44:34] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:45:51] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:46:33] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2660 [23:46:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:48:08] New patchset: Ryan Lane; "Also remove system-specific uuid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2661 [23:48:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2661 [23:48:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2661 [23:56:40] New patchset: Ryan Lane; "Remove dependency on file we're already removed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2662 [23:57:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2662 [23:57:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2662