[00:00:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:00:05] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:02:40] rainman-sr: ok, cool! thank you [00:05:07] New patchset: Lcarr; "requiring facter files before running tcp tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2630 [00:05:43] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2630 [00:05:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2630 [00:08:37] Ryan_Lane: try running puppet now ? I required the script to be installed ? [00:17:39] New patchset: Lcarr; "Making initcwnd.erb only generate if puppet fact exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:17:43] Ryan_Lane: ^^ [00:18:02] yep [00:18:03] wait [00:18:04] no [00:19:03] New review: Ryan Lane; "If is slightly off. See inline comment." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2631 [00:20:28] New patchset: Lcarr; "Making initcwnd.erb only generate if puppet fact exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:20:35] Ryan_Lane: look good now ? [00:21:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:32] LeslieCarr: yep [00:24:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2631 [00:24:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2631 [00:24:46] hooper==secure.wm.o ? is something broken there? (or was it finally taken offline? i thought there would be redirects) [00:24:59] I think it's down but not supposed to be [00:25:27] etherpad's up... 
[00:25:35] singer is secure [00:25:41] oh, right [00:26:09] nagios says 4 hrs [00:26:33] and only HTTP is monitored not HTTPS?!!! [00:26:44] jeremyb: eh? [00:26:56] we're being alerted from watchmouse about https on singer [00:27:09] oh, watchmouse is different [00:27:11] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=singer [00:27:23] i didn't even think to check watchmouse [00:27:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.841 seconds [00:31:07] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [00:32:28] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 193 seconds [00:33:22] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 246 seconds [00:35:55] RECOVERY - HTTP on singer is OK: HTTP OK - HTTP/1.1 302 Found - 0.001 second response time [00:36:31] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [00:37:16] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [00:38:05] woot, singer's back [00:38:34] yay Ryan_Lane ! [00:38:44] !log fixed singer by adding in ssl configuration to the planet configuration [00:38:46] Logged the message, Master [00:40:11] what about the stafford HTTP check above? is 400 really normal? why? does it need a specific path requested or something? [00:46:39] probably [00:46:42] * Ryan_Lane shrugs [00:46:53] even more likely it needs authentication [00:53:10] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [01:02:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:02] !log disabling mobile skin for the blogs - we need to fix varnish support first [01:03:04] Logged the message, Master [01:04:59] binasher: ah. 
seems w3total cache supports this [01:05:02] heh [01:05:06] sorry [01:05:06] for the blog [01:05:20] "Create a group of user agents by specifying names in the user agents field. Assign a set of user agents to use a specific theme, redirect them to another domain or if an existing mobile plugin is active, create user agent groups to ensure that a unique cache is created for each user agent group. Drag and drop groups into order (if needed) to determine their priority (top -> down)." [01:05:22] oh, yeah.. seems like it should [01:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.167 seconds [01:08:55] it might have some implicit CSRF protection [01:09:02] it's hard to tell because it just killed my browser [01:09:35] eh? [01:09:50] it's only firefox 10 [01:09:59] it's not like it killed some extremely stable browser [01:10:32] am I missing some backstory to this? [01:10:40] New patchset: Bhartshorne; "porting most recent version from svn - passes along x-forwarded-for and other headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2633 [01:11:05] ah. heh. another channel [01:11:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2633 [01:11:19] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2633 [01:11:24] oh, you don't expect me to know what channel I'm in do you? [01:11:28] :D [01:12:09] !log re-enabled the mobile plugin for the blogs, seems w3 total cache supports varying [01:12:12] Logged the message, Master [01:39:54] New patchset: Ottomata; "test commit for git branch push" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2634 [01:39:55] New patchset: Ottomata; "Adding ability to reference observation instances by just trait_sets, without trait values." 
[analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2635 [01:39:57] New patchset: Ottomata; "Hacky first work on loader classes." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2475 [01:39:58] New patchset: Ottomata; "Renaming the concept of variables to 'traits'. Allowing trait_sets to be specified so that we don't record HUGE amounts of data." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2477 [01:39:59] New patchset: Ottomata; "base.py - adding schema in comments. Got lots of work to do to make this prettier" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2476 [01:40:01] New patchset: Ottomata; "device_pipeline.py - comments about hackyness" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2479 [01:40:02] New patchset: Ottomata; "Adding loader.py - first hacky loader, just so we can get some data into mysql to work with." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2478 [01:40:03] New patchset: Ottomata; "Buncha mini changes + hackiness to parse a few things. 
This really needs more work" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [01:40:04] New patchset: Ottomata; "pipeline/user_agent.py - adding comment that this file should not be used" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2480 [01:41:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds [01:56:00] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 620s [01:56:27] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 649s [02:00:12] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld [02:02:32] Ryan_Lane: needing auth doesn't sound like a valid reason for 400 ;) (maybe 403) [02:02:48] * Ryan_Lane shrugs [02:04:24] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 5518 seconds [02:21:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:52] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:27:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.524 seconds [02:29:55] !log installed labstore1-4 [02:29:57] Logged the message, Master [02:31:33] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Fri Feb 17 02:31:25 UTC 2012 [02:34:33] RECOVERY - Disk space on search1002 is OK: DISK OK [02:34:42] RECOVERY - DPKG on search1002 is OK: All packages OK [02:34:42] RECOVERY - RAID on search1002 is OK: OK: no RAID installed [02:41:27] PROBLEM - Disk space on mw1 is CRITICAL: DISK CRITICAL - free space: /tmp 41 MB (2% inode=87%): [02:42:30] RECOVERY - NTP on search1002 is OK: NTP OK: Offset -0.01181232929 secs [02:43:50] 
!log deployed updated thumb_handler.php to ms5 to include Content-Length in generated images [02:43:52] Logged the message, Master [02:44:00] RECOVERY - Disk space on mw1 is OK: DISK OK [02:45:48] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 2s [02:46:27] New patchset: Bhartshorne; "correcting syntax for passing through headers to ms5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2636 [02:46:33] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:46:50] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2636 [02:46:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2636 [02:47:00] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.026 second response time on port 8123 [02:56:09] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [03:03:31] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2634 [03:04:04] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2475 [03:04:04] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2475 [03:04:21] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2476 [03:04:21] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2476 [03:04:35] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2477 [03:04:35] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2477 [03:04:48] New review: Diederik; "Ok." 
[analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2478 [03:04:49] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2478 [03:05:04] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2479 [03:05:04] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2479 [03:05:21] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2623 [03:05:36] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2635 [03:05:54] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2480 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2634 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [03:05:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2635 [03:05:56] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2480 [04:21:57] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:23:18] New review: Tim Starling; "Your C skills are improving rapidly, so in this review I included some comments about style and conv..." 
[analytics/udp-filters] (refactoring); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2626 [04:23:18] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [07:38:21] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out [07:40:45] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123 [07:41:21] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:42:08] !log upgraded mysql on db40 to 5.1.53-facebook-r3753, enabled innodb_use_purge_thread [07:42:10] Logged the message, Master [07:53:03] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [07:55:18] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [07:55:36] RECOVERY - Lucene on search9 is OK: TCP OK - 0.004 second response time on port 8123 [09:05:39] PROBLEM - Disk space on mw43 is CRITICAL: DISK CRITICAL - free space: /tmp 32 MB (1% inode=87%): [09:06:06] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:07:18] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [09:09:33] RECOVERY - Disk space on mw43 is OK: DISK OK [09:54:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [09:57:42] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [09:58:01] !log Shutdown ragweed for decommissioning [09:58:03] Logged the message, Master [10:00:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:00:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:45:06] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [10:46:00] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123 [10:49:45] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:53:30] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.010 second response time on port 8123 [10:54:15] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [10:56:39] RECOVERY - Lucene on search15 is OK: TCP OK - 2.995 second response time on port 8123 [11:10:00] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:11:21] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:35:30] PROBLEM - Disk space on mw48 is CRITICAL: DISK CRITICAL - free space: /tmp 70 MB (3% inode=87%): [11:46:00] RECOVERY - Disk space on mw48 is OK: DISK OK [12:35:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:39:42] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [12:41:39] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [13:35:03] PROBLEM - Disk space on srv233 is CRITICAL: DISK CRITICAL - free space: /tmp 46 MB (2% inode=89%): [13:39:06] RECOVERY - Disk space on srv233 is OK: DISK OK [13:40:42] <^demon> !log srv233: removed /tmp/mw-cache-1.17 to give it a little more space for now [13:40:44] Logged the message, Master [13:41:12] you know, ... *sigh* nm [13:41:49] <^demon> Hm? [13:42:44] <^demon> Theres a lot of random .jpg's that are pretty recent in /tmp. Wonder if they're FileBackend related. [13:43:05] <^demon> More leakage possibly, Aaron should investigate :) [13:43:31] jpgs? urk [13:45:40] <^demon> But yeah, the 1.17 cache freed up about ~700M so it should be content for now. [13:46:36] PROBLEM - Disk space on mw40 is CRITICAL: DISK CRITICAL - free space: /tmp 65 MB (3% inode=87%): [13:46:59] <^demon> Argh, are we going to play whack-a-mole on all the apaches since we pushed out 1.19? [13:47:00] apergos (hi) do you happen to know why all syslogs are sent to the same file at /home/wikipedia/syslog/syslog ? 
[13:47:16] <^demon> They're all in /h/w/logs/syslogs [13:47:17] because it's way easier for us to grep through one than through a bunch [13:47:31] ohhh [13:47:36] well the apache messages go to another one but [13:47:51] I am saying that because there are some apache error messages there [13:47:57] that should be redirected somewhere else [13:48:12] also the swift stuff is really spamming that file and it sounds like that traffic could be directed to another log file [13:48:16] the apache ones should land in apache.log or something like that, same directory [13:48:28] I would be fine with having the swift stuff split out [13:49:09] RECOVERY - Disk space on mw40 is OK: DISK OK [13:49:19] whackanapache [13:55:54] ok, I've figured out the last thing that is required to extend our search cluster into eqiad: that's good! [13:56:18] notpeter, so what was the problem? [13:56:20] the bad news is that it will require restarting all of the lucene procs *with a new init script*: that's bad [13:56:42] if start-stop-daemon --start --quiet --background --user lsearch --chuid lsearch --pidfile $pid --make-pidfile --exec /usr/bin/java -- -Xmx3000m -Djava.rmi.server.codebase=file://$BINDIR/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $BINDIR/LuceneSearch.jar [13:56:54] specifically: -Djava.rmi.server.hostname=$HOSTNAME [13:57:10] so, each host is in the rmi registry without an fqdn [13:57:21] so, in the example of searchidx2 [13:57:32] the host in eqiad successfully hits it [13:57:41] asks what it's name is in the rmi registry [13:57:47] then tries to use that for communication [13:57:50] which is not an fqdn [13:57:55] and it fails across DCs [13:59:34] so in the init script I can just replace $HOSTNAME with `hostname --fqdn` and it should work across DCs [13:59:51] rainman-sr: does that make sense? [14:01:20] yep, i think that's a good explanation [14:02:04] awesome. 
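The fix notpeter describes above can be sketched as follows. This is an illustration, not the real init script: the helper function and its output format are made up, and only the JVM flags are taken from the snippet pasted in the channel. The point is that registering under `hostname --fqdn` instead of the short `$HOSTNAME` gives the RMI registry a name that resolves from the other datacenter.

```shell
# Build the -D flags the init script would pass to java; the RMI
# server hostname is the part that matters for cross-DC lookups.
rmi_java_args() {
    rmi_host="$1"
    printf -- '-Xmx3000m -Djava.rmi.server.hostname=%s -jar LuceneSearch.jar' "$rmi_host"
}

# old behaviour: short name, fails across datacenters
rmi_java_args "$(hostname)"; echo
# proposed behaviour: FQDN, resolvable from both pmtpa and eqiad
rmi_java_args "$(hostname --fqdn 2>/dev/null || hostname)"; echo
```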
I'm going to do some more testing on this to make sure of it before some kinda messy transition (as this all came to me when I was pretty drunk last night...) [14:02:19] but, I'm glad that this sounds like a logical explanation [14:19:36] PROBLEM - Disk space on srv270 is CRITICAL: DISK CRITICAL - free space: /tmp 42 MB (2% inode=88%): [14:39:33] RECOVERY - Disk space on srv270 is OK: DISK OK [16:06:29] New patchset: Pyoungmeister; "adding raid5 setup for searchidx boxxies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2637 [16:06:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2637 [16:29:39] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [16:34:04] New review: RobH; "looks right, but partman is tricky" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2637 [16:34:05] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2637 [16:39:51] PROBLEM - Disk space on mw8 is CRITICAL: DISK CRITICAL - free space: /tmp 9 MB (0% inode=86%): [16:44:03] PROBLEM - Disk space on mw16 is CRITICAL: DISK CRITICAL - free space: /tmp 67 MB (3% inode=86%): [16:49:18] RECOVERY - Disk space on mw16 is OK: DISK OK [16:49:29] New patchset: Pyoungmeister; "increasing timeout on check_lucene, at least temporarily, to attempt to combat flapping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2638 [16:49:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2638 [16:50:20] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2638 [16:50:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2638 [16:51:31] just a headsup - I'm about to put swift back into production. [16:51:40] 50%, wait 5 minutes, 100%. 
[16:54:14] "mindight" is just a proxy for "time between when I went to sleep and when I woke up this morning." [16:54:39] moop. [16:54:41] ww. [16:57:58] moop? ww? [16:58:37] wrong window. [16:58:51] heh [16:58:55] !log turned swift live for 50% of all thumbnail requests [16:58:57] Logged the message, Master [16:59:07] ah bringing it back? [16:59:45] yeah... [16:59:55] yay! [16:59:57] we got three changes that will make it Better(tm) this time [17:00:42] oh? [17:01:05] we taught ms5 to send etag and content-length headers [17:01:14] oh good! [17:01:18] that's been a while in coming [17:01:43] disabled chunked-upload for putting the objects into swift (so that swift could pass in the content-length header and toss the put if the inserted content didn't match it or the etag) [17:02:01] oh, no chunked upload :-( [17:02:14] bummer.... [17:02:14] and taught swift to pass along x-forwarded-for, user-agent, and one other header. [17:02:21] it's actually not any different. [17:02:25] good about the headers, that's a win [17:02:29] and it's also separate from chunked-upload for the user. [17:02:37] ah ok [17:02:38] this is only chunking between ms5 and swift. [17:03:01] and literally, the only difference was send(size, chunk) to send(chunk). [17:03:08] ok [17:03:31] it's still inserting the data one chunk at a time; it's just also giving it a "just a heads up, I'm going to send you exactly this many bits." [17:03:39] right [17:03:42] RECOVERY - Disk space on mw8 is OK: DISK OK [17:03:54] "chunked upload" really just means "I'm not going to tell you ahead of time how long my upload is." [17:04:09] a silyl system [17:04:13] *silly [17:04:27] it's useful if you really don't know... [17:04:34] but in this case we always do. [17:05:01] we sure do [17:48:51] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:50:12] ok, I don't see anything wrong with swift; gonna switch to 100%.
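The distinction drawn above can be illustrated with the headers the two kinds of PUT would carry: a chunked upload omits Content-Length ("I'm not going to tell you ahead of time how long my upload is"), while a sized upload declares the body length up front so the receiver can reject a truncated or mismatched body. The request path and body here are made up for the example.

```shell
BODY='fake-thumbnail-bytes'

# chunked PUT: no length declared, body arrives in self-delimiting chunks
chunked_put_headers() {
    printf 'PUT /v1/thumbs/example.jpg HTTP/1.1\r\nTransfer-Encoding: chunked\r\n\r\n'
}

# sized PUT: "just a heads up, I'm going to send you exactly this many bytes"
sized_put_headers() {
    printf 'PUT /v1/thumbs/example.jpg HTTP/1.1\r\nContent-Length: %d\r\n\r\n' "${#BODY}"
}

sized_put_headers
```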
[17:51:15] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [17:52:08] !log changing squids to send 100% of thumbnail traffic to swift [17:52:10] Logged the message, Master [17:52:13] woosters: robla ^^^ [17:52:34] maplebed: coolio...when did the scripts finish? [17:53:08] robla: they started overlapping around 13:30UTC [17:53:20] woo hoo! [17:53:28] and really petered out just a little bit ago. [17:54:00] (since there were 25-30 threads running different containers, some started redoing work while there were still untouched buckets) [18:00:37] !log temporarily stopping puppet on brewster. please let me know if you need to turn it back on [18:00:39] Logged the message, and now dispaching a T1000 to your position to terminate you. [18:01:42] notpeter: manually tweaking partman stuff? [18:01:48] not me... I am basically done for the day. if I have any energy a little later I'll work more on a bit of code but that's about it [18:02:11] RobH: yep! [18:02:23] heh, thats the only reason i ever stop puppet on brewster =] [18:02:57] yep! [18:03:02] it works so well :) [18:03:02] hrmph. [18:03:06] :-) [18:05:50] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&c=Swift%20pmtpa&m=swift_HEAD_404_avg&r=hour [18:06:44] !log sending thumbnail traffic back to ms5, taking swift out of production [18:06:46] Logged the message, Master [18:07:09] ? [18:07:18] :-( [18:07:20] woosters: robla: ^^^ - for some reason, HEAD requests to nonexistent images are taking 60s. [18:07:32] GETs to those same images work just fine. [18:08:13] granted, there's only one of those every 5 seconds or something, [18:08:17] but still. 60s is bad. [18:10:32] not acceptable :-( [18:21:36] maplebed: any sense of which part of the pipeline the delay seems to be introduced? [18:21:58] if you have a sec, I'll switch to RL. 
[18:22:32] I'm doing a lot of multitasking here [18:24:47] ok - the code that is supposed to copy data into swift on 404 does the whole copy-and-send-back thing, right? well, with a HEAD it's not supposed to send any data back, so what does it do? I haven't figured out where it's stopping yet, but that's where I'm looking. [18:26:43] also it's only HEADs to things that don't exist that crap out - apparently cloudfiles is doing HEADs to containers now, and those are doing just fine (0.00x ms) [18:33:44] mark: I'm having a meeting with legal today about Labs [18:33:58] mark: have any opinion on wmf private data in labs? [18:34:20] I think either a separate zone, or very strict enforcement of project membership is enough [18:34:34] people will have to identify to have access to those projects, I assume? [18:34:45] that's a question for today [18:34:54] it could be that everyone that has a public IP will have to [18:34:58] oh my [18:35:15] since they are hosting a service on a wikimedia project, and have access to the IP logs [18:35:33] our privacy policy is incredibly strict :D [18:36:11] I'd kind of like an exception for Labs, but that would take a board decision [18:36:50] i'm not sure if i like having private data on labs [18:36:59] labs is supposed to be for playing with projects and a beta zone [18:37:10] tool labs is for analytics [18:37:22] it's supposed to be considered quasi-production [18:37:34] what about a separate pod for quasi production that has way stricter access rules ? [18:37:44] we could have a separate zone, yeah [18:38:02] but realistically, it doesn't offer much more protection than a separate project [18:38:12] well, I lie :) [18:38:40] it offers hardware separation, which protects against instance hopping via a host [18:38:58] but the data would be stored in the gluster cluster.... 
[18:39:09] so, it would be based on access to the data in gluster [18:39:34] nwfilter denied mac and ip spoofing [18:39:37] *denies [18:40:04] so, it shouldn't technically be possible for an instance to masquerade as another for data access [18:40:27] PROBLEM - Disk space on mw9 is CRITICAL: DISK CRITICAL - free space: /tmp 6 MB (0% inode=86%): [18:40:31] there's one small issue, though [18:40:59] if an instance is deleted from projectA, then gluster needs to remove it. if that IP is reassigned, it's possible for a short period of time that another instance may have the same IP [18:41:17] we can work around that by creating a new network, for use in private projects [18:41:34] we can assign that network specifically to a private project, ensuring only it gets those IPs [18:41:48] RECOVERY - Disk space on mw9 is OK: DISK OK [18:42:18] we can also change the access.conf on instance in projects marked private [18:42:32] so that only members of an approved group can ssh into them [18:43:47] the reason none this worries me much, is that this software is used for public clouds [18:44:13] where project separation is used to separate entire organizations [18:44:59] full zone separation is an option, though :) [18:45:18] hmm [18:45:33] it'll eat up a couple pieces of hardware to do so [18:45:59] worth it to allow non-identified use of some projects, I think [18:46:33] well, people have to identify if they can read ip logs of any web request [18:46:34] s/think/am pretty dang sue [18:46:48] which means if they are publically demoing something. 
[18:46:55] *sure [18:46:57] hmmmmm [18:47:04] that bites [18:47:12] that's different from accessing wmf private data, though [18:47:21] for that they'd likely need to sign an agreement [18:47:27] right [18:47:37] thankfully, we can have groups for all of this stuff [18:47:43] and enforce access via group members [18:47:48] *group membership [18:47:53] so implementation isn't terribly hard [18:47:58] that group membership code better be rock solid [18:48:08] it's posix groups via LDAP [18:48:20] using access.conf (pam_security) [18:48:32] I feel relatively confident about pam [18:48:36] same [18:48:38] I feel less so about ldap [18:48:44] LDAP just stores the data [18:48:54] if the system can't access LDAP, it can't get the groups, and denies access [18:49:13] but if someone can shoehorn their own data in there [18:49:17] (for example) [18:49:30] difficult, but possible [18:49:54] PROBLEM - ps1-d2-pmtpa-infeed-load-tower-A-phase-Y on ps1-d2-pmtpa is CRITICAL: ps1-d2-pmtpa-infeed-load-tower-A-phase-Y CRITICAL - *2525* [18:50:07] unless we have two consoles, and two LDAP domains, there's no getting around that, though [18:50:35] you undrestand I'm thinking about the serious troll/hacker who has decided (wikipedia-review style) to try to either get info, cause damange or just harm our reputation [18:50:44] maplebed [18:50:49] I could restrict write access to those specific groups to a limited subset of users [18:50:50] so most people wouldn't fall into that category, but [18:50:52] woosters [18:50:58] apergos: indeed [18:51:02] yeah I think you need to [18:51:02] what is the impact on that Head request? [18:51:03] there's no direct access to the LDAP server [18:51:14] all it takes is one person to screw that up for us [18:51:17] well, I lie. that's not totally true [18:51:24] there's access via Labs [18:51:29] *shudder* [18:51:39] woosters: I'm not sure what you mean. [18:51:44] meh. 
I could limit write access to specific IPs [18:51:54] woosters: (I took swift back out of production till I figure out what it's about) [18:51:55] so that only labsconsole, and a handful of other servers can write to it [18:52:06] that would make me happier [18:52:14] ok maplebed [18:52:22] ok let's assume the groups stuff can be tightened up [18:52:28] then the zone approach makes sense [18:52:29] hm. do I have audit log enabled? [18:52:42] pretty sure I do [18:52:46] woosters: I have something I think might fix it though; testing now. [18:53:27] apergos: groups stuff is pretty easy. I can just make two more global groups, and wiki admins can manage them via labsconsole [18:53:45] so there will be: projects with non public ips [18:53:49] projects with public ips [18:53:53] projects with private data [18:54:06] and global projects [18:54:23] global projects are projects used by all other projects (like bastion) [18:54:25] example of a globl project? [18:54:26] ah [18:54:31] in which no one has root, except ops [18:54:41] ok [18:54:59] I'm handling that via puppet, currently [18:55:09] I don't *really* love the way I'm doing it, but it works [18:55:22] maybe you'll have someone who can take that over soon :-D [18:55:27] heh [18:55:28] hopefully [18:56:04] I really need a better way of marking a project that puppet will know about [18:56:08] so in interesting-ish news, the only people who it looks like haven't honored dns ttl's and are using the mobile wap are chinese search engines [18:56:17] hahaha [18:56:17] huh [18:56:22] imagine that [18:56:33] youdao and baidu [18:56:39] mostly youdao [18:57:06] ip range? 
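The access.conf scheme described above (pam_access consulting posix groups that LDAP supplies) would look roughly like this on an instance in a private-data project. This is a sketch, and the group name is hypothetical:

```
# /etc/security/access.conf: allow logins only for root and members of
# the project's approved group; deny everyone else.
+ : root (project-private-admins) : ALL
- : ALL : ALL
```

If the system can't reach LDAP it can't resolve the group, so the fallthrough deny line refuses access, which matches the fail-closed behaviour described in the discussion.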
[18:57:18] can't tell since it's redirected [18:57:27] bah [18:57:29] just could figure out youdao by plugging in the site to google translate [18:57:37] :-D nice [18:58:42] New patchset: Lcarr; "Minor whitespace cleanup For purposes of triggering git post-merge hook on sockpuppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2639 [18:59:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2639 [18:59:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2639 [18:59:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2639 [18:59:11] so, either way. I'm going to get legal's opinion on private data in labs, then will ask for comments on ops list [18:59:31] i might just be paranoid :) [18:59:43] nah. it's a legitimate concern [18:59:58] I think it's heavily influenced by implementation, though :) [19:00:05] you're paranoid and they are out to get us [19:00:10] worksforme [19:00:12] well, some people *are* [19:00:18] yup [19:00:20] provably [19:01:11] the other thing to think about is what sort of access people will have to projects with private data [19:01:14] what I mean is [19:01:22] access to the data is one thing [19:01:32] root-level access to do stuff on the machine is another [19:01:44] !log changed sockpuppet's post-merge hook so that you need to have ssh keys forwarded (though you really would anyways due to brokenness) [19:01:45] because what you don't want is that the data becomes exposed [19:01:46] Logged the message, Mistress of the network gear. 
[19:02:01] by leaving some sort of security hole [19:02:12] PROBLEM - Disk space on mw52 is CRITICAL: DISK CRITICAL - free space: /tmp 67 MB (3% inode=87%): [19:02:32] * Ryan_Lane nods [19:02:34] yes, you did need to have ssh keys forwarded, I got bit on that about every other time till I finally drilled it into the brain [19:02:59] so, I'm already going to be breaking up the bastion instances [19:03:06] PROBLEM - Disk space on mw21 is CRITICAL: DISK CRITICAL - free space: /tmp 33 MB (1% inode=86%): [19:03:43] bastion, bastion-limited (users with shell on production cluster), bastion-identified, bastion-private, etc. [19:03:58] having that many would be confusing, though [19:04:54] RECOVERY - Disk space on mw52 is OK: DISK OK [19:05:08] I definitely want to break production shell users away from others [19:05:24] yes indeed [19:06:01] using kerberos would solve a lot of these problems [19:06:14] yeah, you did, though in theory it was forwarding a key which wasn't on the authorized keys list [19:06:18] so was a bit pointless [19:06:20] i just took that out [19:06:45] ah, that's why it never worked :-D [19:07:01] :) [19:07:22] so i have a question -we occasionally get these nagios alerts for free space and often /tmp is filled with stuff [19:07:30] are these files actually safe to delete ? [19:07:34] or if i do will it break something ? [19:09:07] for example, mw21 [19:09:46] is mw21 a scaler or what is is? [19:09:48] *it [19:09:51] PROBLEM - Disk space on mw17 is CRITICAL: DISK CRITICAL - free space: /tmp 37 MB (2% inode=86%): [19:10:05] * apergos goes to look at mw17 [19:10:57] are all the scalers fillin up or something? [19:11:06] RobH: ? [19:11:20] mw52, mw21, and mw17 all disk space warning [19:11:33] ah. [19:11:48] wtf [19:11:53] no cron job on those? [19:12:18] the "real" scalers have a cron job that clears out the cruft, runs every 5 mins [19:12:21] are those part of the image scalingc cluster? they're not in ganglia. 
[19:12:31] I dunno [19:12:32] (as part of the image scalers pmtpa cluster){ [19:12:41] I don't see how they can be but... [19:12:42] ok. [19:12:47] * maplebed goes away again [19:12:52] ok, mw17 and 21 are apaches [19:12:53] not scalers [19:13:09] well there's a pile of stuff in tmp that's just like it was a scaler [19:13:14] so I really wonder what's going on [19:13:27] phpDzwNJ2.jpg [19:13:28] etc. [19:13:48] can items in /tmp be removed on apacheS? [19:13:56] not all of em [19:14:07] hrm [19:14:08] but can remove the old ones right? [19:14:12] anything older than 10 mins that's a flat file [19:14:14] (sorry was distracted by tim tams….mmmm…) [19:14:15] lemme see if that's true [19:14:25] even if i could just clean out the 2011 stuff that should help [19:14:50] well, if we have a timeline, like 1 week [19:14:54] wurfl.xml and wurfl-2.3.xml, what are these [19:14:56] i am happy to rm all the files older than that [19:15:00] but anything that is EasyTimeline* [19:15:09] phpXXXX* (flat files) [19:15:12] those can all go [19:15:17] older than 15 mins [19:15:23] we cannot remove easytimeline? [19:15:35] you can remove EasyTimeline* and phpoXXX* flat files [19:15:38] older than 15 mins [19:15:52] ls [19:15:57] bah, wrong window ;] [19:16:00] there's some xxxxxx.png that are old [19:16:07] you can toss those too [19:16:15] the phprandom.image fiels? [19:16:21] toss [19:16:23] is there anything in tmp that cannot go away ? [19:16:34] I dunno about the *xml [19:16:46] the mw-cache* get kept [19:17:20] lost+found of course gets kept [19:17:30] the rest, *poof* [19:17:41] cool [19:17:49] if someone knew about these xml files maybe they could go to [19:17:50] o [19:18:11] ugh it's some mobile crap [19:18:22] don't want to toss [19:18:33] someone ask the mobile folks if they can go [19:19:09] RECOVERY - Disk space on mw17 is OK: DISK OK [19:20:09] should be cronned [19:21:34] heh [19:21:39] who fixed mw17? 
[19:21:42] PROBLEM - Disk space on mw41 is CRITICAL: DISK CRITICAL - free space: /tmp 7 MB (0% inode=87%): [19:21:43] or did it self correct? [19:22:05] me [19:22:11] I left a lot of cruft [19:22:12] ok, just making sure [19:22:16] but tossed some [19:22:52] i am ditching specific filetype patterns over one day old [19:23:00] easier and works with the find command i already know [19:23:02] ;] [19:23:09] great [19:23:11] i'm on mw21 [19:23:17] LeslieCarr: i am [19:23:19] already ;] [19:23:19] if you wanna do the phpXXX ones on mw17, I didn't do those [19:23:20] oh [19:23:21] hehe [19:23:28] RobH: what commands are you using ? [19:23:40] find /tmp/ -name *.png -type f -mtime +1 -delete [19:23:54] ah cool [19:23:56] kills files older than a day [19:25:21] i guess i could have used -mmin +number of minutes [19:25:52] is the mw-cache-1.17 ok to go ? [19:25:56] 15 mins [19:26:03] you know they tossed it on some others [19:26:06] i was about to ask what LeslieCarr did [19:26:07] I think you can poof it [19:26:14] we only need 1.19 right? [19:26:16] 18 [19:26:17] 19 [19:26:19] ok [19:26:57] RECOVERY - Disk space on mw41 is OK: DISK OK [19:27:17] ditching 1.17 gave me back to 72% from 94% [19:27:30] !log manually cleaned up tmp on mw21 [19:27:32] Logged the message, RobH [19:27:34] yeah me too :) looks like 1.17 is the key [19:27:40] !log manually cleaned up tmp on mw41 [19:27:42] Logged the message, Mistress of the network gear. [19:27:42] ah [19:27:45] i has an idea! [19:27:54] dsh removal of the directory across cluster? [19:27:57] puppet requiring /tmp/mw-cache-1.17 to be absent [19:28:09] i prefer ddsh solution ;] [19:28:09] RECOVERY - Disk space on mw21 is OK: DISK OK [19:28:17] wont it not come back since we moved past 1.17? [19:28:43] cool, the ddsh solution just won't get all the new machines , then again no new machines are on 1.17 so.... 
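The cleanup RobH and LeslieCarr are running above boils down to a single find invocation over the rules they settle on (flat files matching php*, EasyTimeline*, or *.png, older than a cutoff). A sketch of that pattern follows; the `cleanup_tmp` wrapper name and the `-maxdepth 1` guard are illustrative additions, not from the log. Note the `-name` patterns are quoted so the local shell cannot expand them before find sees them, which the unquoted `*.png` in the command as typed only avoids by luck:

```shell
# Sketch of the /tmp cruft cleanup discussed above. The wrapper name and
# -maxdepth 1 are illustrative; the patterns and the minutes-based cutoff
# (-mmin, as RobH notes, rather than -mtime's whole days) come from the
# conversation. -type f skips directories like mw-cache-* and lost+found.
cleanup_tmp() {
    dir=$1    # directory to sweep, e.g. /tmp
    mins=$2   # age cutoff in minutes, e.g. 15
    find "$dir" -maxdepth 1 -type f \
        \( -name 'php*' -o -name 'EasyTimeline*' -o -name '*.png' \) \
        -mmin +"$mins" -delete
}
```

With `-mmin +15` this matches the "older than 15 mins" rule apergos gives for EasyTimeline* and php* flat files; `-mtime +1` as used on mw21 is the coarser one-day variant.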
[19:28:43] yay [19:29:26] it's gone on the real scalers already [19:29:33] reedy I think got it at some point [19:30:01] so i have not used dsh in a few months, you guys mind checking my syntax? [19:30:06] ddsh -cM -g apaches -- rm -rf /tmp/mw-cache-1.17 [19:30:28] seem right? [19:30:39] (since it runs across all apaches, i prefer a reality check) [19:30:44] our scripts put the actual remote command in '' [19:31:03] oops. sorry for th gratuitous ping [19:31:05] ddsh -cM -g apaches -- "rm -rf /tmp/mw-cache-1.17" [19:31:20] Reedy: like that? [19:32:00] bueller? [19:32:08] (silence) [19:32:16] What's hte -- for? [19:32:26] its just whats on the docs and i have always used [19:32:38] http://wikitech.wikimedia.org/view/Dsh [19:33:00] based on that, your original should be fine then [19:33:18] quotes must just be a style type issue [19:33:37] I dunno what the -- is either [19:35:23] ddsh -cM -g apaches 'rm -rf /tmp/mw-cache-1.17' [19:35:30] is essentially what I was using earlier in the week [19:36:38] i will run in a moment, comcast calling me for overdue bill due to failed auto payment [19:36:41] =P [19:37:23] !log ran dsh command to remove all the /tmp/mw-cache-1.17 [19:37:25] Logged the message, RobH [19:37:43] sites still up. [19:37:45] \o/ [19:37:48] yay [19:38:08] that directory was huge [19:38:13] lots more space now [19:38:31] I know deleting the php-1.17 folder earlier this week saved about half a gig [19:38:55] :) [19:40:04] heh, when comcast calls me, i take the call [19:40:04] s [19:40:04] ince [19:40:04] [19:40:05]  [19:40:18] damned colloquy bug, restarting. (it likes to send extra returns) [19:40:55] if only other osx irc clients were as pretty. [19:41:04] form over function im afraid [19:41:34] awww [19:44:11] yeah, anyone found any better than colloquy ones ? 
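On the quoting question above: dsh joins whatever follows its options into one string and hands it to a shell on each target host, so for a metacharacter-free command like `rm -rf /tmp/mw-cache-1.17` the quoted and unquoted forms arrive identically; quotes only start to matter once globs, `$variables`, pipes, or semicolons are involved, since otherwise the local shell interprets them before dsh ever runs. A purely local simulation, with `bash -c` standing in for the remote shell (the `run_remote` name is invented for illustration):

```shell
# dsh-style dispatch: the trailing arguments are joined with spaces and
# the resulting string is executed by a shell on the target. bash -c
# stands in here for "shell on the remote host".
run_remote() {
    # "$*" joins the arguments exactly as dsh would before shipping them
    bash -c "$*"
}

# Single quotes keep the LOCAL shell from touching the command, so the
# remote shell sees it intact:
run_remote 'echo would remove /tmp/mw-cache-1.17'
# prints: would remove /tmp/mw-cache-1.17
```

As for the `--` nobody could explain: it is the conventional end-of-options marker, which keeps dsh's own option parser from mistaking the remote command's flags (like `-rf`) for dsh options, so including it is harmless and occasionally necessary.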
[19:45:10] I haven't found a client I don't hate in some fundemental way [19:45:28] I'm using adium, and finally got it to a spot where I don't want to burn it with fire [19:45:35] but it still has issues I hate [19:48:33] New patchset: Bhartshorne; "Do not write the object to swift for HEAD requests, just return it to the user." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2641 [19:48:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2641 [19:48:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2641 [19:52:21] i use adium for all but irc [19:52:25] irc in adium frustrates me. [19:53:21] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:40] !log search1001 down for reinstall [19:53:42] Logged the message, RobH [19:54:50] New patchset: Bhartshorne; "add in Accept-Ranges header to be consistent. Also it's ok if we can't get the headers." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2642 [19:55:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2642 [19:55:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2642 [19:55:34] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2642 [19:55:45] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [19:56:25] so on the es slaves, should we be worried about a full /a ? [19:56:35] i dunno whats up with 10.04 but post dhcp pre partman it seems to go to blank screen for a long time. [19:57:23] it eventually continues, but its annoying. [19:57:51] yes we should, that means the binlog won't have a place to be written right? 
[19:58:00] and that means that the slaves won't be able to stay synced [19:58:22] maplebed: setup the rotation on those, may wanna let him know [19:58:43] atleast i think he did. [19:58:45] I noticed that RobH (the blank screen thing) [19:58:45] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [19:58:45] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.12 ms [19:58:51] but it does come back after a bit [19:59:01] apergos: its been doing that since 10.04, i think it may have to do with serial console redirection during the installer load [19:59:03] it's just not very user friendly [19:59:06] and it hates bios redirecting it [19:59:10] so it just goes blank [19:59:13] plausible [19:59:21] without bios redirection after post, you dont know when to hit f12 or see it pxe boot though [19:59:26] 6 of one half dozen of the other. [19:59:28] right [19:59:33] rather have it [19:59:47] indeed, i just know that the blank screen is two minutes of not knowing if its gonna work or not [19:59:48] heh [20:01:27] PROBLEM - DPKG on search1001 is CRITICAL: Connection refused by host [20:01:45] PROBLEM - RAID on search1001 is CRITICAL: Connection refused by host [20:01:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:01:46] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:02:12] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [20:02:30] PROBLEM - Disk space on search1001 is CRITICAL: Connection refused by host [20:02:46] nagios-wm: duh. 
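The worry earlier in this stretch about a full /a on the es slaves (binlogs with no place to be written, slaves falling out of sync) is usually addressed by the binlog rotation maplebed is said to have set up. The exact mechanism isn't stated in the log; one common form, as a hedged sketch with placeholder values rather than anything known to match production, is a pair of my.cnf settings:

```ini
[mysqld]
# Cap how long binary logs are kept so the binlog partition can't fill.
# Values below are placeholders, not the production configuration.
expire_logs_days = 7
max_binlog_size  = 1G
```

`PURGE BINARY LOGS BEFORE ...` from a cron job is the other common approach when retention has to track slave positions rather than a fixed age.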
[20:03:37] :-D [20:08:12] New patchset: Bhartshorne; "Better to leave out the header than have its value be 'None'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2644 [20:08:35] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2644 [20:08:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2644 [20:08:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2644 [20:10:54] !log reinstalling search1001 and searchidx1001 [20:10:56] Logged the message, RobH [20:11:30] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:13] !log sending thumbnail traffic back to swift; head bug is fixed. [20:13:15] Logged the message, Master [20:14:10] yay [20:14:12] wait... fucking a [20:14:19] nm, thats not it [20:14:29] for a moment i thought 'maybe this is like the ssd squids with no controller' [20:14:34] but thats not right, they arent ssd [20:14:40] it doesnt show as a boot option in bios though [20:15:37] notpeter: searchidx1001 is a bitch. [20:16:19] RobH: that was my assessment as well. [20:16:25] AaronSchulz: robla: woosters: swift is back in production service [20:16:41] cool [20:16:43] 100%? [20:16:57] maplebed: congrats =] [20:16:59] yeah. [20:17:03] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [20:17:09] * maplebed holds his breath... 
[20:17:20] ;) [20:17:34] yay [20:17:49] notpeter: so something is wrong, its not showing the bios prompt for the raid controller [20:18:13] checking a few things still, but thats not right at all [20:18:23] the h700 thats inside it should prompt during post [20:19:11] i'm checking some bios setttings to ensure its not disabled from working anyplace [20:22:30] RobH: I think I was able to crtl-r into bios [20:22:37] er, raid setup [20:22:51] i just toggled some bios setttings, lessee what it does now [20:23:23] most. cursed. project. ever. [20:24:24] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Feb 17 20:23:50 UTC 2012 [20:25:59] I'm going out to get some food; page me if anythincg swift-related shows up. [20:26:12] so far the best litmus test (http://commons.wikimedia.org/wiki/Special:NewFiles) looks fine. [20:26:20] as does http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=&tab=v&vn=swift [20:27:37] maplebed: i just spent quite a bit hammering at a mahcine that couldnt see the disks, because it was h700 and wont let us see the disk swithout raid [20:27:46] i know you can identify with this since you went through it months ago ;] [20:29:48] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:31:15] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:31:24] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 30.85 ms [20:33:04] New patchset: Lcarr; "assigning neon as a ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2645 [20:33:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2645 [20:36:17] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:36:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:41:01] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:41:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:42:52] New patchset: Pyoungmeister; "turns out, no software raid on searchidx boxxies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:43:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2646 [20:43:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2646 [20:43:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2646 [20:44:22] !log restarting puppet on brewster [20:44:23] Logged the message, and now dispaching a T1000 to your position to terminate you. [20:51:12] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server [20:58:11] apergos: you there, by any chance? [20:59:44] yes [20:59:48] what's up? [21:00:09] (fair warning: it's 11 pm on a Friday night. Just sayin.) [21:00:52] notpeter: [21:00:54] that's fine, just asking questions and you can leave any time :) [21:01:03] ok [21:01:08] you worked on the searchidx migration last time, yes? [21:01:31] not really no [21:01:40] hrm [21:01:40] ok [21:01:49] I just tried to keep the server afloat til the new one could be set up [21:02:20] do you have any idea why it wants mounts of ms5.pmtpa.wmnet:/export/thumbs [21:02:21] if I had worked on it I woul dhave been a lot more helpful to you while you're doing all this work... [21:02:26] it does? [21:02:26] and ms7.pmtpa.wmnet:/export/upload [21:02:27] ? [21:02:37] try removing them [21:02:50] seriously. 
[21:02:58] alright [21:03:30] !log unmounting ms nfs mounts from searchidx2 [21:03:32] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:04:29] so... why? [21:05:06] apergos: might it needs those for search on commons? [21:05:14] what for? [21:05:17] no idea [21:05:18] it's not serving images [21:05:27] true [21:05:31] it can build an index from the file description pages [21:05:39] what is it going to do with a raw jpeg? [21:06:16] make jpeg sushi? [21:06:22] meh [21:07:14] prolly taste awful, have you seen some of the images we have?? [21:07:21] hahahaha [21:07:29] * apergos hums a few bars of "commons fornication" [21:07:37] hah! [21:08:22] well [21:09:34] let's see if that breaks anything, shall we? [21:09:47] I'm not seeing much in the logs [21:10:23] but I might remount, just to try to not fuck up this house of cards [21:11:03] be strong [21:11:06] stay the course man [21:11:18] read the topic and believe! [21:12:52] but our beloved house of cards! [21:14:09] our beloved mcgyvered infrastructure of duct tape and paper clips? [21:14:12] it'll hold! [21:14:28] with only a couple of nagios pages per night ;) [21:15:07] *phone on vibrate* [21:19:24] i wrote that those were needed in wikitech? [21:19:28] the plot thickens... [21:19:36] what did past peter know that present peter doesn't know... [21:24:53] apergos: reading your comments on the searchidx1 ticket is funny [21:25:45] I don't have any memory of it [21:26:05] but I'm glad to know I have created something with entertainment vallue [21:26:20] I have turned off the snapshot writing in cron, as folks [21:26:20] will see in the logs. At least, I hope that's what I turned off. Docs [21:26:23] are opaque and the codebase is huge. [21:26:53] I laugh because domas was talking about how this should all be easy! because it's documented [21:26:58] I wrote that? I *still* have no memory of it [21:27:17] yeah. 
real easy [21:27:25] the experience that you, jeff, and I have had with the documentation has proven otherwise [21:27:27] anyone know what url.wikimedia.org is yet?? [21:27:45] and I think I was working here for over 2 years before I ever heard of cachemgr [21:27:52] nope, i'd say let's just turn off dns :) [21:28:06] :-D [21:28:16] it will turn out to be some vital service [21:28:20] :-/ [21:28:24] well it's already down [21:28:31] it is? [21:28:35] so the service would be broken [21:28:36] yeah [21:28:44] I thought ryan brought it back after fixing ssl for planet [21:28:48] yeah, it just redirects to the default of contacts [21:29:20] what is "contacts" anyways? [21:29:25] now that's the first time I've seen that page [21:29:30] not sure, but it's used by some people [21:29:35] civimail jobs? [21:29:47] the docroot of url.wm.org doesn't even exist and it's not puppetized [21:29:54] I know it doesn't exist [21:30:03] it's not mentioned *anywhere* [21:30:16] it's a mystery :) [21:30:17] maybe it's mentioned on someone's hard drive someplace [21:30:36] !log remounted nfs mounts on searchidx2. to protect our house of cards [21:30:38] Logged the message, and now dispaching a T1000 to your position to terminate you. [21:30:43] booooooooo [21:30:46] * apergos stabs nfs [21:30:59] I emailed rainman [21:31:07] ok [21:31:09] hopefully he'll tell me why they are mounted [21:31:23] if he can't, then I'll unmount again, and... well... buy the ticket, take the ride [21:31:29] * apergos predicts: "they are mounted? I had no idea" [21:32:18] now if only I could predict whether we default next monday or not.... [21:32:23] :-/ [21:32:45] apergos: that seems likely [21:32:50] I mean, both of them =P [21:32:51] no [21:32:56] I don't think you'll default on monday [21:33:05] you really think that you're at breaking point? 
[21:33:45] I think the ecb just exchanged all its current bonds for new ones, insulating itself [21:34:02] and that the greek govt is setting up the cacs [21:34:15] and that the eurogroup will tell us something monday [21:34:32] it's really hard to know whether we're going to be in selective default already at that point [21:34:48] already? [21:34:52] by monday? [21:34:59] how defaulty are the weekends over there? [21:35:04] hahaha [21:35:11] so the cabinet is having a weekend meeting [21:35:26] I am sure that everyone will be on the phone constantly until the monday meeting [21:36:04] apergos: yeah, seems likely [21:36:10] it's intricate financial negotiations, who can figure this stuff out [21:36:27] the invisible hand of the market can figure it out [21:36:29] that's who. [21:36:45] what a load of [21:36:52] :D [21:37:19] the bond exchange program is supposed to go (maybe) from feb 22 til march.. 6? [21:37:37] if there isn't enough voluntary participation then they'll make it mandatory [21:37:44] who knows what that means [21:37:57] selective default [21:38:31] well no [21:38:39] selective defautl is the voluntary program [21:38:55] ah, selective default with extreme prejudice? [21:39:01] I mean in theory selective default is the mandatory one too but then the cdses [21:42:18] <^demon> Completely pointless bug that I noticed and don't know where to file (rt?). On nagios the left frame link to load the server admin log still points at wikitech.leuksman. [21:43:30] heh [21:43:57] apergos: well, the fact that lsof | grep mnt comes back completely empty on searchidx2 supports your theory that those nfs mounts are not actually needed... [21:44:11] :-D [21:44:38] of course my theory is based on logic, which may not apply... [21:44:49] heh [21:46:50] it's possible they are just not in use right now ;) [21:46:56] notpeter: lsof will only show you currently open files (or directories) - you may have just snuck in between requests. 
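As maplebed points out just below, `lsof | grep mnt` is only a point-in-time check: it shows files open at the instant it runs, so an empty result can mean you snuck in between requests. For reference, the same instantaneous check can be approximated by scanning /proc directly; `open_under` is an invented helper name and this is Linux-specific:

```shell
# Rough equivalent of the `lsof | grep mnt` check discussed here: list
# "PID PATH" for every file descriptor currently open under a directory,
# by reading /proc/<pid>/fd symlinks (Linux only; unreadable fds of other
# users' processes are silently skipped). Like lsof, this is a snapshot,
# not proof that a mount is unused.
open_under() {
    dir=$1
    for fd in /proc/[0-9]*/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$dir"/*) pid=${fd#/proc/}; echo "${pid%%/*} $target" ;;
        esac
    done
}
```

For example, `open_under /mnt` printing nothing over repeated runs (or left looping overnight, like the tcpdump mentioned later) builds much better evidence than a single empty lsof pass.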
[21:47:01] hah [21:47:19] maplebed: yes, I'm not taking this as conclusive evidence [21:48:19] but, on the flip side, I'm confused as to why ms5.pmtpa.wmnet:/export/thumbs and ms7.pmtpa.wmnet:/export/upload would be needed.... [21:49:47] on which system? [21:49:52] searchidx2 [21:50:00] weird [21:50:03] totally [21:50:12] puppet class including them? [21:50:13] sent rainman an email asking [21:50:41] Ryan_Lane: no.... there's a note on wikitech from the idx1 -> idx2 migration notes saying they're needed [21:50:50] is it indexing uploads based on metadata? [21:50:50] sadly, I am the one who wrote that note [21:51:02] heh [21:51:33] Ryan_Lane: seems possible. but wouldn't that come out of s4? [21:52:09] no. we don't actually have metadata in the database. how awesome is that? :) [21:52:17] unless that changed sometime recently [21:52:23] what is this i dont eve [21:52:45] Ryan_Lane: have you come across openTSDB? [21:53:01] nope. whats that? [21:53:04] or dschoon ^^? [21:53:19] http://opentsdb.net/ [21:53:42] interesting [21:54:44] buh? [21:55:08] ah, interesting. [21:55:17] Yeah, I'll check it out. [21:55:29] Weird that they use HBase. That seems a poor fit. [21:55:49] oh? [21:55:50] It's not like timeseries metrics will collide, so you really do not need strong consistency or transactions. [21:55:57] ah [21:56:10] mongo, all the way, then? ;) [21:56:14] ew. [21:56:16] No. [21:56:32] I have heard only bad things about clustered mongo. [21:56:43] but it's webscale!! [21:57:18] irc need troll face support [21:57:18] when it comes to mostly counters and event-data -- stuff that's super-high write-throughput -- something like cassandra works really well [21:57:22] hehe [21:57:39] if you're able to bulk-load, then it doesn [21:58:12] cassandra 1.0 has some real serious read/write improvements compared to 0.7 [21:58:14] t matter as much what you choose. you'll need something that supports distributed queries, mostly. 
[21:58:23] is this more a ganglia replacement, than nagios? [21:58:33] cassandra has always been fast, but yeah. 1.0 has gotten some nice bells and wistles. [21:58:48] it was fast, now it's crazy fast :) [21:58:59] i think it was around 0.7 it got atomic increment and decrement, which is what turned it into a monster for analytics. [21:59:09] check [21:59:42] personally i love most that it has an homogenous topology. i don't know of any other open system that has its level of self-healing [21:59:53] this is one of the big problems with hadoop+hbase [22:00:03] yeah that's pretty cool indeed, no spof [22:00:31] the job management system for both is reliant on zookeeper, which doesn't scale all that well. [22:00:32] the major rewrite of hadoop will also make it without a spof [22:00:52] zookeeper hanging can grind everything to a halt :/ [22:01:19] yeah. the problem is that even without a "failure" your system can become unusuable. [22:02:05] i don't have much faith in things patterned on bigtable -- the model where you elect a new coordination node when one goes down [22:02:34] in my experience, the biggest issues with big systems are never so cut-and-dry as to have a clear "failure" state. [22:02:42] New patchset: Ottomata; "Trying my darndest to clean things up here! I've cloned a new repo, and am checking in my non-committed (an non-approved?) changes into this new branch. Hopefully gerrit will be happier with me." [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2647 [22:02:57] a server can merely be slow, or intermittently responsive, and that can ruin your day [22:03:21] anti-entropy and quorum do a better job handling these middle states than failover [22:03:30] failover tends to flap horribly in those situations [22:08:02] well, a tcpdump is showing no traffic between searchidx2 and either of those ms boxxies. 
I'll leave that running over night while waiting to hear back from rainman [22:09:44] notpeter: you might want to check your tcpdump is called with -n to avoid DNS lookup :D [22:12:58] hashar: yeah, I'm grepping out arps [22:13:22] should be fine so :-D [22:14:06] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:14:19] anyone with puppet foo have time to check that out ? [22:14:26] i'm sure it's horribly broken ;) [22:14:44] I am soooo glad I'm off the clock [22:14:46] :-D [22:14:57] LeslieCarr: well the linter is not happy about this change :D [22:15:25] hehehehe [22:15:28] LeslieCarr: Could not parse for environment production: Syntax error at '/'; expected '}' at /var/lib/gerrit2/review_site/tmblablab/manifests/ganglia.pp:218 [22:15:39] ah at least that one should be easy to fix :) [22:19:24] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:20:25] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:22:11] LeslieCarr: you probably want to install puppet locally to validate your files [22:22:15] sudo gem install puppet [22:22:17] then [22:22:28] puppet parser validate some/broken/manifest.pp [22:22:30] * Ryan_Lane shudders [22:22:37] :) [22:23:25] there is puppet-lint to (hint hint, got a change that add a wrapper for it : https://gerrit.wikimedia.org/r/#change,2629 ) [22:23:56] hehe, yeah, there's also local-lint :p which i also didn't do [22:24:30] New review: Diederik; "Ok." 
[analytics/reportcard] (otto/pipeline); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2647 [22:24:30] Change merged: Diederik; [analytics/reportcard] (otto/pipeline) - https://gerrit.wikimedia.org/r/2647 [22:29:10] why did my puppet parser disappear though ? [22:30:04] New patchset: Lcarr; "Trying to create multiple gmond-*.conf files using info found https://groups.google.com/group/puppet-users/browse_thread/thread/efbe92386ca2c441" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:33:45] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2648 [22:33:51] sigh, i give up on that one for now [22:34:07] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2645 [22:34:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2645 [22:40:42] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [22:42:39] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [22:48:07] New patchset: Lcarr; "Making a 2nd nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2649 [22:48:26] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2649 [22:52:25] New patchset: Lcarr; "Making a 2nd nagios server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2650 [22:56:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2650 [22:56:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2650 [23:05:39] New patchset: Lcarr; "Removing nagios-plugins from nagios::monitor directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2652 [23:08:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - 
https://gerrit.wikimedia.org/r/2652 [23:08:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2652 [23:12:27] New patchset: Lcarr; "fixing nagios::monitor and nrpe conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2654 [23:16:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2654 [23:16:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2654 [23:34:02] New patchset: Lcarr; "Adding in more requirements for new nagios box" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2658 [23:35:20] New patchset: Ryan Lane; "Removing gluster peering support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2659 [23:40:18] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:40:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2660 [23:40:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2658 [23:40:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2658 [23:40:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2658 [23:41:41] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:42:00] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2659 [23:42:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2659 [23:43:05] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/2660 [23:44:34] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:45:51] New patchset: Lcarr; "ensuring directory structure for nagios machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:46:33] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2660 [23:46:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2660 [23:48:08] New patchset: Ryan Lane; "Also remove system-specific uuid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2661 [23:48:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2661 [23:48:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2661 [23:56:40] New patchset: Ryan Lane; "Remove dependency on file we're already removed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2662 [23:57:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2662 [23:57:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2662