[00:02:18] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 75 MB (1% inode=61%): /var/lib/ureadahead/debugfs 75 MB (1% inode=61%): [00:06:57] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 264 MB (3% inode=61%): /var/lib/ureadahead/debugfs 264 MB (3% inode=61%): [00:08:27] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=61%): /var/lib/ureadahead/debugfs 281 MB (3% inode=61%): [00:11:09] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 300 seconds [00:12:48] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=61%): /var/lib/ureadahead/debugfs 284 MB (3% inode=61%): [00:15:30] RECOVERY - Disk space on srv221 is OK: DISK OK [00:16:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:09] RECOVERY - Disk space on srv220 is OK: DISK OK [00:19:08] hexmode: You about? [00:19:13] I am working on your bugzilla ticket. [00:19:23] by chance, I am [00:19:23] ie: template changes [00:19:35] cool, what template file is this, do you know or should i start grepping? [00:19:59] im already disappointed in our bugzilla implementation, seems it isnt very puppetized =[ [00:20:21] I was hoping for shell access so I could test that more... I can find it and test tomorrow [00:20:28] yes, puppet would be great [00:20:38] you more than likely will not be getting shell, as only ops get that. [00:20:48] on servers like this, even priyanka when she did bz stuff didnt have shell. [00:20:59] and changing things on kaulen would need root, not just shell. [00:21:03] RECOVERY - Disk space on srv223 is OK: DISK OK [00:21:04] :'( [00:21:12] RECOVERY - Disk space on srv222 is OK: DISK OK [00:21:14] nothing personal intended, we just dont hand out root [00:21:20] not even every ops person has root now ;] [00:21:47] RobH: maybe the best thing to do, then, is puppetize it? [00:21:50] we are working to better grain our access controls so in the future we can be a bit more flexible [00:22:01] that will take longer than me just hacking in your changes, you dont wanna wait on that =] [00:22:12] if ya dunno what file it is no worries, i just figured best to ask [00:22:14] if you can do that, I can work on the templates tomorrow [00:22:21] ok [00:22:21] cuz if ya did, would make it super fast [00:22:29] i look for it =] [00:22:43] bz needs puppetization, but that is a longer than one night process. 
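A minimal sketch of the grepping approach floated above for finding the right Bugzilla template, assuming a stock Bugzilla layout where templates live under template/en/default/ and local overrides under template/en/custom/; the install root and the search string are hypothetical placeholders, not the actual layout on kaulen:

    # locate the template that renders a given piece of page text
    cd /srv/bugzilla                          # hypothetical install root
    grep -rn --include='*.tmpl' 'Visible page text here' template/en/default/
    # local customizations, if present, live in template/en/custom/ and take precedence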
[00:22:52] ie: puppetize in labs, then import into production [00:23:02] ok, I think you'll probably need more from me tomrrow, let me know [00:23:08] while it needs to happen, dont let anyone suggest holding up bz changes for it, cuz it will take too long [00:23:25] and waiting for it to puppetize for this kind of change is a bit too long a wait =] [00:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.078 seconds [00:27:07] but yea, we need to puppetize it then anyone could submit these kinds of changes [00:27:19] ops person then just does code review and pushes into production [00:58:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.918 seconds [01:06:28] * Jamesofur whines [01:06:29] No wonder people keep getting the security warning from visiting https://shop.wikimedia.org (instead of http) even if they don't have the httpsEverywhere extension. It looks like we force https (maybe just for our addresses? ) even if written out without protorel. So for example even when I'm careful to write out [http://shop.wikimedia.org a link] it still turns into https for people browsing from https on the sites… is there anywa [01:06:29] exempt the shop from that for now? [01:16:49] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [01:38:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [02:13:49] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [02:16:31] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 196 MB (2% inode=61%): /var/lib/ureadahead/debugfs 196 MB (2% inode=61%): [02:18:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:43] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%): [02:24:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.207 seconds [02:31:23] RECOVERY - Disk space on srv221 is OK: DISK OK [02:31:23] RECOVERY - Disk space on srv224 is OK: DISK OK [03:02:08] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 309 seconds [03:02:17] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 319 seconds [03:02:26] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 329 seconds [03:02:35] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 336 seconds [03:02:35] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: CRIT replication delay 338 seconds [03:02:44] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 345 seconds [03:02:53] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: CRIT replication delay 353 seconds [03:03:11] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 372 seconds [03:03:47] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 407 seconds [03:06:47] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [03:07:14] RECOVERY - MySQL Slave Delay on db47 is OK: OK 
replication delay 0 seconds [03:08:26] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [03:09:58] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [03:12:56] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay 0 seconds [03:14:44] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [03:14:53] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [03:15:02] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 0 seconds [03:15:38] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 265 seconds [03:27:56] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [03:28:14] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [04:50:28] RECOVERY - Disk space on search1021 is OK: DISK OK [04:50:28] RECOVERY - Disk space on search1022 is OK: DISK OK [04:56:46] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3463 MB (3% inode=99%): [04:56:46] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3463 MB (3% inode=99%): [05:37:16] !log rebooting db42 to finish upgrades [05:37:20] Logged the message, Master [05:41:02] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay seconds [05:41:56] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay seconds [05:42:28] !log db42 - reboot worked despite the grub warning about unreliable blocklists [05:42:30] Logged the message, Master [05:44:20] PROBLEM - mysqld processes on db42 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:46:26] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [05:49:43] !log db42 - mysql did not autostart after boot, added using update-rc.d [05:49:44] Logged the message, Master [05:51:59] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:25] !log installed security upgrades on brewser, cadmium, capella (apache,mysql,ruby,apt..) 
[05:58:26] Logged the message, Master [06:00:59] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4159 MB (3% inode=99%): [06:06:09] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 1068249 seconds [06:06:27] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1068222 seconds [06:17:51] ACKNOWLEDGEMENT - Host lily is DOWN: CRITICAL - Host Unreachable (91.198.174.121) daniel_zahn has been replaced by sodium [06:21:42] !log installed more package upgrades on sodium [06:21:45] Logged the message, Master [06:25:04] !log powercycling sq40 [06:25:08] Logged the message, Master [06:32:49] ACKNOWLEDGEMENT - Host sq40 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware failure, just added to existing RT 2581 [06:32:49] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3436 MB (2% inode=99%): [06:37:01] PROBLEM - Disk space on search2 is CRITICAL: DISK CRITICAL - free space: /a 1572 MB (1% inode=99%): [06:50:22] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 159 MB (2% inode=61%): /var/lib/ureadahead/debugfs 159 MB (2% inode=61%): [07:06:25] RECOVERY - Disk space on srv221 is OK: DISK OK [07:11:58] PROBLEM - Disk space on search7 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=98%): [07:28:46] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196, [07:28:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [07:31:28] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [07:40:10] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: Connection refused [07:42:07] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.055 seconds [07:45:03] New patchset: Asher; "inline C to use x-forwarded-for for zero/digi acl match" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3898 [07:45:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3898 [08:08:19] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:10:54] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3898 [08:11:19] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [08:22:25] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [08:22:25] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [08:30:25] New patchset: Hashar; "jenkins: add existing users to existing group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:30:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3903 [08:34:14] New review: Dzahn; "yea, we just had a talk about the "add existing user to groups"-issue with the puppet provider (addi..." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3903 [08:34:16] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [08:34:16] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [08:34:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:52:46] New patchset: Hashar; "Revert "jenkins: add existing users to existing group"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3904 [08:53:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3904 [08:53:13] New review: Hashar; "Patchset reverting this is https://gerrit.wikimedia.org/r/3904" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:58:05] New review: Dzahn; "yeah, the "plusignment" doesn't do it. Either "Duplicate definition: Group[jenkins]" or "Only subcla..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3904 [08:58:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3904 [08:59:52] http://blog.archive.org/2012/03/29/wayback-machine-machines-are-moving/ [09:08:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=61%): /var/lib/ureadahead/debugfs 268 MB (3% inode=61%): [09:10:51] !log gallium - added demon,hashar,reedy to group jenkins as it's a problem using puppet when users and groups already exist [09:10:54] Logged the message, Master [09:12:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 244 MB (3% inode=61%): /var/lib/ureadahead/debugfs 244 MB (3% inode=61%): [09:12:47] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 186 MB (2% inode=61%): /var/lib/ureadahead/debugfs 186 MB (2% inode=61%): [09:12:47] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 101 MB (1% inode=61%): /var/lib/ureadahead/debugfs 101 MB (1% inode=61%): [09:16:59] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 280 MB (3% inode=61%): /var/lib/ureadahead/debugfs 280 MB (3% inode=61%): [09:21:11] RECOVERY - Disk space on srv221 is OK: DISK OK [09:21:11] RECOVERY - Disk space on srv224 is OK: DISK OK [09:21:11] RECOVERY - Disk space on srv219 is OK: DISK OK [09:22:17] <-- if you wonder about that debugfs,, here is something: [09:22:20] "The kernel filesystem tracers that ureadahead uses to 'profile' the boot process are expected to be at the debugfs mountpoint, /sys/kernel/debug. If a quick test reveals that the mountpoint isn't up yet, rather than wait for the mountpoint (and potentially missing more profiling it could do), ureadahead mounts a temporary debugfs at /var/lib/ureadhead/debugfs so that it can get to the filesystem tracers (do_sys_open, open_exec & uselib sys [09:22:26] According to the dev in this bug report (https://bugs.launchpad.net/ubuntu/+s...ad/+bug/499773), a left-over temporary mountpoint indicates that ureadahead crashed out, leaving the mountpoint in /etc/mtab. A quick scan of the source seems to confirm this, but maybe something was overlooked. [09:25:59] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 19, down: 0, shutdown: 1 [09:27:38] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
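A minimal sketch of the check-and-cleanup this discussion converges on, following the bug-report explanation quoted above; whether the extra debugfs mount is real or only a stale /etc/mtab entry left by a crashed ureadahead decides which step applies:

    # what the kernel actually has mounted (authoritative)
    grep debugfs /proc/self/mountinfo
    # what mtab claims, which may be stale if ureadahead crashed out
    grep debugfs /etc/mtab
    # if the second mount is real, drop it; the one on /sys/kernel/debug stays
    umount /var/lib/ureadahead/debugfs
    # if only mtab is stale, the leftover line in /etc/mtab is all that needs removing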
[09:29:08] PROBLEM - LVS Lucene on search-pool3.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:29:44] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:31:41] RECOVERY - Disk space on srv220 is OK: DISK OK [09:31:41] RECOVERY - Disk space on srv222 is OK: DISK OK [09:31:50] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [09:32:44] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:35:02] mutante: maybe we could have an init.d script that would umount /var/lib/ureadahead/debugfs post boot ? [09:35:17] RECOVERY - LVS Lucene on search-pool3.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [09:35:44] RECOVERY - Lucene on search6 is OK: TCP OK - 0.004 second response time on port 8123 [09:36:32] !log restarted defunct lsearchd on search6 [09:36:34] Logged the message, Master [09:36:38] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.011 second response time on port 8123 [09:37:23] I was jus ton there and did that [09:37:50] hashar: possibly, but it happens while the server is running, then disappears after 10 min [09:37:56] apergos: how about search7? [09:38:02] no not yet [09:38:05] apergos: i see it still running there though [09:38:14] I killed things that wouldn't die [09:38:18] then restarted it on search6 [09:38:20] well, "java" using 106% CPU [09:38:36] ok, lets do the same there [09:38:48] it's all you [09:39:23] up to 675% [09:39:26] ok [09:40:16] !log kill and start lsearchd on search7 [09:40:18] Logged the message, Master [09:40:53] mutante: what I mean is that we have two debugfs mounted [09:41:03] it seems only one is needed, the one on /sys/kernel/debug [09:41:10] hashar@srv221:~$ mount |grep debugfs [09:41:11] none on /sys/kernel/debug type debugfs (rw) [09:41:11] none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime) [09:41:16] https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/499773 [09:41:49] hmmm [09:42:09] "mountall doesn't record unmounts, so it never "forgets" this path" [09:42:53] so to me, the workaround would be to manually unmount /var/lib/ureadahead/debugfs [09:42:56] cat /proc/self/mountinfo | grep debug [09:43:02] would have to try on a server out of production ;) [09:43:35] "useless use of cat"-award :p [09:43:43] or we add a script that take care of amounting /var/lib/ureadahead post boot ;) [09:43:55] it looks like the bug is that it says it is mounted, when it really is not [09:44:10] it says to check /proc/self/mountinfo instead [09:46:05] the hard way is to just reboot those systems so they end up possibly in a clean state [09:46:24] anyway, you might want to raise the issue on the secret ops mailing list ;-) [09:47:08] "Why has this been set to importance: low when tens of thousands of users are told their hard disks are full, and they can no longer store files on them?" :( [09:48:36] hashar: i did [09:49:11] hashar: how about editing /etc/mtab by hand :? [09:51:14] or... mv /etc/init/ureadahead.conf /etc/init/ureadahead.conf.disable [09:51:52] https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/736512 [09:51:52] or just remove ureadahead ;) [09:52:02] it is just mean to speed up the boot process [09:52:11] something we really do not care about on a server [09:52:22] "bug was fixed in the package mountall - 2.25" ..hmm .checking [09:53:13] src/mountall.c: ignore ureadahead's potential mount of [09:53:13] /var/lib/ureadahead/debugfs (LP: #736512). 
[09:53:16] http://changelogs.ubuntu.com/changelogs/pool/main/m/mountall/mountall_2.35/changelog [09:53:27] \O/ [09:53:44] ii mountall 2.15.3 [09:57:10] hashar: +1 on just disabling it then . mv /etc/init/ureadahead.conf /etc/init/ureadahead.conf.disable [09:57:25] might work [09:57:31] but that would be for next reboot [09:57:39] you still have to umount the fs [09:58:05] combine it with installing new kernel then [09:58:23] i'll do this one, srv221, now [09:59:44] !log srv221, disabling ureadahead, installing package upgrades and new kernel, rebooting [09:59:45] Logged the message, Master [10:05:56] what kernel are yo u going to? (ouot of curiosity only) [10:06:48] 2.6.32-40-server [10:06:55] eh ok [10:10:59] PROBLEM - Apache HTTP on srv221 is CRITICAL: Connection refused [10:13:05] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.102 second response time [10:49:55] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:21] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2446 [10:51:52] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:18:16] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [11:20:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.980 seconds [11:56:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:00:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.020 seconds [12:09:19] PROBLEM - Host db47 is DOWN: PING CRITICAL - Packet loss = 100% [12:15:19] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [12:27:11] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [12:31:04] RECOVERY - Lucene on search15 is OK: TCP OK - 2.997 second response time on port 8123 [12:34:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.275 seconds [13:11:06] !log trimming logs and such on search1-20 [13:11:08] Logged the message, notpeter [13:12:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:18] RECOVERY - Disk space on search7 is OK: DISK OK [13:15:30] RECOVERY - Disk space on search2 is OK: DISK OK [13:16:44] is there something I can run to do that? [13:17:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.153 seconds [13:17:01] because I'm happy to trim them on a host if I see nagios is whining [13:17:32] well, I need to figure out how, for example, search7 has made 72 gigs of logs in 6 hours [13:17:41] woah! [13:17:53] hurray for java stacktraces! [13:17:58] yuck [13:18:03] better you than me [13:18:16] I'm just going to turn log level up to crit... [13:24:22] ugh. this is so annoying. when you move an index off of a host, it doesn't know to delete the index that's no longer assigned to it. so search2, which is dying for space, has 5x as many indexes as it should... [13:24:29] *sigh* time to do some housekeeping. 
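A minimal sketch of the housekeeping pass described above on the search hosts, assuming the indexes and logs sit under /a as the disk-space alerts suggest; the exact lsearchd paths are guesses:

    # largest consumers on the data partition, biggest first (sizes in MB)
    du -xsm /a/* 2>/dev/null | sort -rn | head -20
    # log files that ballooned in the last day (the stack-trace flood)
    find /a -name '*.log*' -mtime -1 -size +1G -ls
    # candidate leftovers from indexes no longer assigned to this host
    find /a -maxdepth 2 -mtime +365 -ls       # review before deleting anything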
[13:24:38] ouch [13:25:03] at least it will get some room back after that [13:25:10] quite a bit, yes [13:25:22] but I'm going to go through all of the nodes and see what I can free up [13:25:30] cool [13:25:37] that said, I don't know if it can free up 72 gigs/6 hours worth ;) [13:26:44] well if you turned down the log level can't you jus tmove the logs, restart the indexer or whatever it is and then toss theold ones? [13:27:28] yep [13:27:45] I'll have to do some yucky puppet stuff to turn it down for just the hosts that have exploding logs [13:27:51] but yeah, shoulnd't be too bad [13:28:28] oh hey, look at thta, half of search2's disk is free! [13:29:56] yeah... lots of crap from 2008.... [13:30:01] seems like it can go ;) [13:30:17] um [13:33:12] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:51] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:08] apergos mark mutante bits lvs is in fact down in europe [13:41:47] cp3001 is dead, cp3002 is overloaded [13:42:03] And it seems that's all the backends for bits esams ?!? [13:42:10] hurray [13:43:22] oh joy [13:43:26] mutante is sleeping [13:43:28] mark is on vacation [13:43:37] <^demon> Leslie? [13:43:43] is not on line yet [13:43:48] it's not sf waking hours [13:44:01] <^demon> Yeah, but she was the next person I thought of we could ping offline. [13:44:44] ah [13:44:44] Ryan is on vacation too [13:44:46] <^demon> Too late for Tim? [13:44:49] popping a shell [13:44:54] 12:44am [13:45:04] nope [13:45:07] I think rebooting cp300{1,2} ought to do it [13:45:08] can get in via impi [13:45:14] but shell wont come up [13:45:17] going to reboot [13:45:26] !log rebooting (mostly) down cp3001 [13:45:26] <^demon> Ouch yeah, let's avoid pinging him after midnight if we can avoid it :) [13:45:27] Logged the message, notpeter [13:47:29] Jeff is also active in -labs [13:47:33] it's booting up [13:47:37] Jeff_Green: hey [13:47:44] notpeter: hi [13:48:14] I don't think I know how to get on those boxes [13:48:27] I have not ever poked at them [13:49:02] RoanKattouw: can you try to load? [13:49:06] RECOVERY - Host cp3001 is UP: PING WARNING - Packet loss = 28%, RTA = 163.79 ms [13:49:10] pybal logs look/.... [13:49:13] not fully awesome [13:49:18] I'll just look at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [13:49:19] but might be starting to come up? [13:49:20] network traffic has bounced back though [13:49:24] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3975 bytes in 0.328 seconds [13:49:27] hurray!~ [13:49:30] bits network spiked back up [13:50:02] Wikipedia loads reasonably now [13:50:11] Yeah, working for me [13:50:38] ok, cool [13:50:55] soo, crisis averted for now? [13:51:10] I suspect cp3002 might want rebooting, as it's running 15k+ processes [13:51:53] Or at least check what it's actually doing with 97% system cpu [13:51:54] ;) [13:51:56] well, that will take site down again.... 
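A minimal sketch of the out-of-band reboot path used for cp3001 above ("can get in via impi but shell wont come up"); the management hostname, user, and interface type are hypothetical, not the actual WMF management setup:

    MGMT=cp3001.mgmt.example.wmnet            # hypothetical BMC/DRAC address
    ipmitool -I lanplus -H "$MGMT" -U root chassis power status
    ipmitool -I lanplus -H "$MGMT" -U root chassis power cycle
    # then watch it boot over serial-over-LAN
    ipmitool -I lanplus -H "$MGMT" -U root sol activate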
[13:52:42] I think this is a new record for silliness: "load average: 13183.22, 12837.42, 9839.27" [13:52:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:44] it's not serving any network traffic [13:52:47] Jeff_Green: No [13:52:52] Jeff_Green: Let me dig up this tweet of mine [13:53:02] well, it's down to 12k now ;) [13:53:08] so I think it's improving [13:53:25] the box is oddly responsive for reporting that load avg [13:53:26] It would seem cp3001 has been down for a few hours [13:54:09] hhhhmmmm, do we think a restart of varnish is in order? [13:54:10] Jeff_Green: Wait, no, you're right. The one I tweeted about was only 7000. 13000 is insane [13:54:28] wtf is it doing? [13:54:31] Yeah we should be able to restart Varnish on cp3002 I think [13:54:37] not a lot, apparently [13:54:53] there's free RAM, disk io is mellow, cpu is not burning [13:55:33] 2360% varnishd [13:55:38] that's interesting [13:56:09] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.218 seconds [13:56:15] it really is serving up almost no dta [13:56:23] any thoughts on restarting varnishd? [13:56:28] I think I'm pro [13:56:35] just because it is already doing so little [13:56:52] strace on varnish: [13:56:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.898 seconds [13:57:02] read(9, "ping\n", 8191) = 5 [13:57:02] writev(12, [{"200 19 \n", 13}, {"PONG 1333029400 1.0", 19}, {"\n", 1}], 3) = 33 [13:57:02] poll([{fd=9, events=POLLIN}], 1, -1) = 1 ([{fd=9, revents=POLLIN}]) [13:57:19] one blat like that every few seconds is all [13:57:39] Jeff_Green: yeah, seems safe [13:57:44] ok [13:57:45] go for it [13:58:40] load is down to a mere 7k now ;) [13:58:55] yeah, shit's rapidly returning to normal [13:59:08] I'm not sure what to expect, the init script didn't seem to kill it the first time [13:59:18] have you started again? [13:59:23] so I ran it a second time, and it did better [13:59:28] no, issued stop 2x [13:59:46] start it up? [13:59:49] what's varnishstat? it's defunct [13:59:59] haven't started it yet, I'm looking for processes to settle [14:00:01] a fine question, sir [14:00:08] [varnishstat] <----dislike [14:00:20] started [14:01:00] oh, you know, if we were smart, we would have adjusted teh weight before readding.... [14:01:41] !log restarted varnish on on cp3002 because it was thrashing futiley [14:01:43] Logged the message, Master [14:01:55] the weight where? in pybal? [14:02:00] ja [14:02:08] although, the bits appservers seem just fine [14:02:30] mrfpth. what's the point of running a load distributor if it can't handle a backend hiccup?? [14:03:04] hrm [14:03:12] well, I'm also not sure how persistent the cache is [14:03:17] so I mgiht be full of crap [14:03:50] if I'd been awake enough I would have looked for wikitech docs to answer these questions [14:05:09] load seems a bit more stable, it's hovering under 20 which seems reasonable for the number of cpu cores [14:06:04] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=2730 #2730: usb modem in error again, unplug and replug + pull info for chris [14:06:15] just added you since you were the one who let me know ;] [14:06:20] (as a requestor) [14:06:26] heh, ok [14:06:30] although... it's toast [14:06:37] also have him pulling the info from it for me to get a new one. 
[14:06:42] nice [14:07:14] maybe we should temporarily switch to email gateways until the sms dongle is replaced [14:08:05] i think tim's http solution sounded more stable. [14:08:18] at minimum using the smtp of the gateway provider, not our own. [14:08:43] I think that ben was implementing... something [14:08:46] because if we do a half ass solution, no one will bother to improve it. [14:09:00] so... to the implementer goes the decision-making! [14:09:05] at elast until we get a new dongle [14:09:13] the permanent solution is the sms device, until mark/CT/mgmt says otherwise [14:09:25] but for temp, whatever, i dont care to argue against email being a poor method anymore [14:09:35] email of any form will be about 5 orders of magnitude more stable than what we've been tolerating [14:09:49] yes, so lets hack a temp solution, which will then become permanent [14:10:00] and no one will bother to actually put in place a real one. [14:10:11] it happens on everything. [14:10:28] * apergos goes back to hacking a less temporary solution to the mirror rsyncs [14:10:39] so then let's make sure to get another sms dongle :) [14:10:44] and then we can move back to that [14:10:47] RobH: I don't understand the issue I guess. Email gateways were rock solid for years and years for Craigslist. [14:11:03] i thought the email list conversation was pretty clear [14:11:13] <^demon> RobH: Please don't tell me those PDUs were a temporary hack :p [14:11:25] if i had to choose a permanent solution i would have chosen email again without a second though [14:11:30] if we are goign to do email we should use the gateway provider smtp, and then have a secondary system for alerting if those emails queued up without result. [14:11:46] otherwise we could lead to the issue of a bunch of queued emails going to no where [14:11:50] i'm not sure what you mean re. gateway provider? [14:12:18] you mean, spence talks directly to the carrier's sms email gateway? [14:12:20] if we use company X which is a email to SMS gateway [14:12:35] we should use X's smtp, and have some system in palce that if we fail to route out to them [14:12:41] would then attempt another method of alerting us. [14:13:21] so we configure spence to mail direct instead of relaying through our systems, and we put a mailing list on spence with each person's email-to-sms-address [14:13:30] Tim's suggestion: "If you're worried about speed then maybe you should use HTTP post [14:13:31] instead of email. Clickatell provides such a service, and probably [14:13:31] lots of other SMS gateways." [14:13:33] i.e. 4154015522.sms.verizon.com or whatever [14:13:46] I don't have an email to sms address. I don't know what they are for cosmote [14:13:52] I tried looking into it once [14:13:59] came up empty-handed [14:14:08] apergos: ah, that is a problem [14:14:20] we didn't run into that since we were all on standard US carriers [14:14:23] we're an international outfit [14:14:31] we cannot rely on the carrier email to sms [14:14:39] yeah my tcom phone has one but I keep that off here for obvious reasons [14:14:42] we would have to use a service that does email to direct SMS messages [14:14:54] *tmobile [14:15:47] apergos: he, yeah this is not the place to exchange those addresses I suppose [14:16:32] no I mean [14:16:46] my tmobile phone would have serious roaming costs if I turned it on here [14:16:52] oh oh oh [14:16:56] it would be unaffordable by anyone's standards [14:17:08] so how many phones do we have that can't be messaged this way I wonder? 
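A minimal sketch of the HTTP-post route Tim suggests above; the endpoint and parameters are purely hypothetical stand-ins for whatever gateway (Clickatell or another provider) would actually be used, not that provider's real API:

    # hypothetical gateway endpoint and credentials
    curl -s --data-urlencode 'user=wmf-nagios' --data-urlencode 'password=SECRET' \
            --data-urlencode 'to=+15551234567' \
            --data-urlencode 'text=PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL' \
            https://sms-gateway.example.com/send
    # an HTTP status and response body give the delivery confirmation that plain SMTP relaying lacks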
[14:17:49] good question [14:18:06] RobH: why can't we rely on carrier email to sms? because of the phones on carriers who don't provide that? or because of some difference in expected reliablilty? [14:19:05] reliability is also an issue [14:19:17] I knonw I used to get some smses with a long delay on t-mobile [14:19:17] * Jeff_Green dies [14:19:24] * Jeff_Green dies again [14:19:25] quite variable between 5 minutes to 2 hours [14:19:53] no reasonable explanation, could have been something in the local network, could have been anything [14:20:12] but you can't really lean on the phone company because you need 5 9s reliability [14:20:16] so, we ran exactly this way for 11 years at craigslist [14:20:27] we sent literally hundreds of messages per day per phone for years [14:20:27] reachability (binary works/doesn't work) would be an issue even before reliability (delay) I guess ... (i.e. can your location reach the gateway via the internet) [14:20:46] i thought my stance was clear, and Im really sick of repeating the concerns i raised in email [14:21:16] the proper solution is two nagios instances, one to check the other is up and can send, and both with a carrier to sms gateway that uses a confirmed method of smtp transmission, or HTTP postback to confirm its being received [14:21:30] RobH: I'm sick of having a totally defective messaging system, and being blocked on that for the wrong reasons [14:21:38] i get that folks disagree and think a single nagios with email to sms is reliable, i simply dont agree [14:21:46] then get mark/ct to agree with you and do it [14:21:53] i dont make the decision, and im really sick of arguing [14:22:03] i dont even get to decide what shit i work on a daily basis. [14:22:12] ok, point taken [14:22:40] for now i have to deal with making the soution we have work, and everytime it breaks im resenting getting the third degree about it. [14:23:17] that's understandable. I can't figure out why we're blocked on the permanent fix I guess [14:23:20] sorry if im being a dick, but i have not slept over 4 hours in a night in over a week, and i have been on site every single day. [14:23:28] my degree of patience is nonexistant [14:23:30] no worries there [14:23:46] RobH: so, just for (my) clarification: you currently run one nagios instance, and if that can't reach the sms gateway, no one knows that it is broken? [14:23:48] Jeff_Green: sorry if it seems im taking it out on you, its not meant to be that way [14:23:52] I'm just trying to figure out a way to move us past the broken situation [14:24:22] from my perspective, it should take us about an hour of actual effort to retool to a functional messaging approach I'd be happy with as a permanent solution [14:25:00] meaning--an approach we used for a decade at a comparably high-volume and sensitive site and had very few issues with [14:25:12] T3rminat0r: its a single instance with sms modem attached and a backup system of watchmouse [14:25:26] so we get paged for major events no matter what but its broken and not a good long term solution as it stands [14:25:46] as the modem is dying we are relying on watchmouse, which i hate [14:25:51] (i hate relyin gon a single instance) [14:25:59] RobH: yea, single point of failure with the sms modem plus the GSM net coverage on that location [14:28:36] can we poke email addresses into nagios as messaging destinations alongside the sms stuff? why not do a mix? 
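A minimal sketch of the "why not do a mix" idea in the last line above, in Nagios 3 contact syntax; the notification command names and the email-to-SMS address are illustrative, not the definitions actually running on spence:

    define contact{
        contact_name                    ops-oncall
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,r
        service_notification_options    c,r
        ; both paths fire for every notification, so either one failing still pages
        host_notification_commands      host-notify-by-email,host-notify-by-sms-gateway
        service_notification_commands   notify-by-email,notify-by-sms-gateway
        email                           oncall@example.org
        pager                           4155551234@carrier-sms-gateway.example   ; per person, where the carrier offers it
        }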
[14:28:55] Jeff_Green: I think that ben was working on that yesterday [14:29:03] as at least a temp solution until new usb dongle [14:29:07] maybe touch base with him [14:29:13] ok [14:29:32] i'll be happy to switch to email, then we'll have two paths as long as spence/nagios itself is alive [14:29:54] !log stopping puppet runs on brewster so my hacking at the dhcpd.conf file won't get overwritten until I have it working right [14:29:56] Logged the message, RobH [14:32:44] re cp3002--it's totally calm now, was this all just varnish exploding or did something else change? [14:33:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:08] I mean, after cp3001 died, yes, it was just that varnish exploded on cp3002 [14:34:35] did cp3001 come back up? [14:34:46] i missed the event sequence [14:35:01] it looks like the explosion coincided with peak traffic in europe [14:35:05] yes, I rebooted cp3001 [14:35:08] and it came up just fine [14:35:12] i see [14:35:35] was more or less unresponsive [14:35:44] cp3001 had been dead for about 5.5h before the explosion [14:36:39] and was cp3001 back up before or after cp3002 skyrocketed? [14:37:42] !log updating dns for virt1001 testing [14:37:44] Logged the message, RobH [14:37:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.922 seconds [14:38:10] after [14:38:21] k [14:38:24] thx [14:38:39] cp3002 seems to have skyrocketed after 3001 had been down for a while, and right at peak europe traffic time [14:39:16] !log all nameservers still online after udpate [14:39:18] Logged the message, RobH [14:47:16] !log did virt1001 wrong, reupdating dns [14:47:18] Logged the message, RobH [14:47:39] if i get this to work i will roll my changes on brewster into puppet and push it live. [14:48:39] pxe arp timeout.... grrrrrr [14:50:31] sigh... why is pxe timing out on tfptp. [14:57:07] New patchset: RobH; "added in vm subnet, cleaned up file indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3915 [14:57:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3915 [14:58:11] so anyone wanna check that for me or shall i just review my own work? ;] [14:58:28] still not sure whats up with the pxe timeout, but i dont think its dhcp related [14:59:45] New review: RobH; "good old self review, that never causes problems right?" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3915 [14:59:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3915 [15:02:35] !log virt1001 pxe boots via dhcp and fails tftp download, i have to hold off on further troubleshooting until i have a network admin [15:02:37] Logged the message, RobH [15:02:44] !log brewster puppet re-enabled [15:02:47] Logged the message, RobH [15:02:59] every morning, one very tiny step in right direction on ciscos. 
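A minimal sketch of the dhcpd.conf shape the change above adds (a subnet block for the new vm range plus a host entry so virt1001 can PXE); all addresses and the MAC are hypothetical, and next-server is whatever host serves TFTP (brewster, in this log):

    subnet 10.65.0.0 netmask 255.255.255.0 {
        option routers 10.65.0.1;             # has to sit inside this subnet
        option domain-name-servers 10.0.0.11;
        next-server 10.64.0.16;               # TFTP/install server (address hypothetical)
        filename "pxelinux.0";

        host virt1001 {
            hardware ethernet 00:11:22:33:44:55;   # placeholder MAC
            fixed-address 10.65.0.11;
        }
    }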
[15:09:39] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:09:48] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:09:57] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:10:06] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:11:54] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:13:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:27] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 132 MB (1% inode=57%): [15:15:00] !log restarting lsearchd on search3 [15:15:01] Logged the message, notpeter [15:17:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.360 seconds [15:26:27] RECOVERY - Disk space on srv220 is OK: DISK OK [15:26:36] RECOVERY - Disk space on srv222 is OK: DISK OK [15:26:36] RECOVERY - Disk space on srv219 is OK: DISK OK [15:26:54] RECOVERY - Disk space on srv223 is OK: DISK OK [15:26:54] RECOVERY - Disk space on srv224 is OK: DISK OK [15:32:54] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 176 MB (2% inode=61%): /var/lib/ureadahead/debugfs 176 MB (2% inode=61%): [15:33:44] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 141 MB (1% inode=61%): /var/lib/ureadahead/debugfs 141 MB (1% inode=61%): [15:35:50] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:35:50] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): [15:37:56] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:40:02] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 80 MB (1% inode=61%): /var/lib/ureadahead/debugfs 80 MB (1% inode=61%): [15:42:00] !log finished clearning up all pmtpa search hosts. hey look! they all have lots of space now! 
[15:42:02] Logged the message, notpeter [15:46:20] RECOVERY - Disk space on srv220 is OK: DISK OK [15:48:26] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=61%): /var/lib/ureadahead/debugfs 99 MB (1% inode=61%): [15:52:38] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 200 MB (2% inode=61%): /var/lib/ureadahead/debugfs 200 MB (2% inode=61%): [15:53:32] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [15:54:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:16] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3815 [15:58:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:58:56] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%): [16:00:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.248 seconds [16:01:02] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=61%): /var/lib/ureadahead/debugfs 284 MB (3% inode=61%): [16:05:14] RECOVERY - Disk space on srv220 is OK: DISK OK [16:05:14] RECOVERY - Disk space on srv224 is OK: DISK OK [16:05:14] RECOVERY - Disk space on srv219 is OK: DISK OK [16:05:23] RECOVERY - Disk space on srv222 is OK: DISK OK [16:11:41] RECOVERY - Disk space on srv221 is OK: DISK OK [16:15:53] RECOVERY - Disk space on srv223 is OK: DISK OK [16:17:41] urgh [16:17:47] just routing the heavy ass power cables took forever [16:17:53] ^demon|away: is missed today ;] [16:19:38] * ^demon|away has class today [16:27:47] heh [16:34:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:32] New patchset: RobH; "typo correction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3919 [16:39:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3919 [16:39:53] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 58984 seconds [16:40:32] New review: RobH; "whadda mean the router has to be in the same subnet ;P" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3919 [16:40:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3919 [16:40:55] nice commit name RobH [16:40:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [16:41:46] it would be funny if it had not brought my work to a standstill earlier =P [16:43:58] huh, neat.... now the cisco has the issue that ryan was seeing. [16:44:05] its gettting dhcp, its hitting the tftp step [16:44:11] then its blanking its screen and giving no feedback. [16:44:28] so the loader must be pushing output to the wrong interfaces, lemme seee [16:44:52] oh thats right we dont update physcial console. [16:45:35] CISCO Serial Over LAN disabled [16:45:43] apergos: so you found the problem, thank you =] [16:45:53] now we are on to a different set of issues, progress! [16:46:30] yay [16:46:58] now i have to find how to enable serial over lan, which I am going to do after i eat some food. [16:47:03] so is it a matter of waiting for it to finally show up? 
[16:47:17] i saw a second dhcp request, which is the installer [16:47:19] or it just never gives you output? [16:47:34] so the installer is running on virt1001 with no way to access the serial console for output [16:47:41] major progress. [16:48:36] forgot to load insomniaX, laptop suspended on lid close =P [16:49:07] so messed up it takes a kernel hack to undo that functionality [16:50:24] uuuggghhh [16:54:38] heh, so about 4 months after you guys were here [16:55:06] i discovered the upstairs mezzanine lounge has windows + fridge + microwave [16:55:09] windows ftw. [16:55:16] the physical kind that is. [16:55:28] windows are good [16:56:20] in particular since the inside of the datacenter floor is like working in a large cave. [16:57:58] woah [16:58:00] windows in a dc ? [16:58:02] that's luxury [16:58:32] <^demon> That's the fanciest cave I've ever seen. [16:59:04] <^demon> No bears or guano. [16:59:08] it's a pretty nice cage all right [16:59:10] space age lighting [16:59:19] bending machines and microwave [16:59:53] the goddamn blue lights are stupid [16:59:53] <^demon> Bending machines? [16:59:56] <^demon> Ouch. [16:59:59] they could have put in white leds for the same thing [17:00:09] same cost, same electrical pull, more usability [17:00:17] the blue is to impress folks who dont know better. =[ [17:00:22] <^demon> Yeah, but they don't look nearly as cool ;-) [17:00:29] vending [17:00:38] the upstairs lounge has no vending. [17:00:38] for those of us with typing disabilities :-P [17:00:45] I like the blue lights [17:00:48] also no windows at ground level for security reasons, and the ones up here dont open [17:00:51] hey look, pager fixed [17:01:01] oh yay... [17:01:01] heh [17:01:06] some search poll is broken [17:01:08] pool [17:01:34] was this morning! [17:01:35] no longer [17:01:47] hahaha [17:02:13] it's really that logrotate cripples the servers [17:02:16] which is... sad [17:02:18] but so it goes [17:02:27] I think I can maybe solve that today... [17:02:46] phone on vibrate til it finishes [17:03:05] anyone want to send out a "false alarm" sms? [17:04:25] more importantly... did anyone do anything to "fix" the sms dongle? [17:04:49] notpeter: i just reseated it when I pulled info for robh [17:05:08] oh! [17:05:09] ok [17:05:59] hrmm, i bet it can be fixed if there was a way to use the os to power down the usb port. [17:06:06] then just have it do that every 24 hours. [17:07:53] !log disabling notifications for search lvs nagios checks for 24 hours to test fix [17:07:56] Logged the message, notpeter [17:08:06] meh, quick googling makes it seem difficult. [17:09:33] PROBLEM - Lucene on search3 is CRITICAL: Connection refused [17:09:41] New patchset: preilly; "add comment to Opera browser check and XFF client IP replacement and change header name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3922 [17:09:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3922 [17:11:04] New patchset: preilly; "add comment to Opera browser check and XFF client IP replacement and change header name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3923 [17:11:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3923 [17:13:54] PROBLEM - Host sanger is DOWN: CRITICAL - Host Unreachable (208.80.152.187) [17:15:11] uh oh. 
[17:15:25] New review: preilly; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3898 [17:15:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:50] today is decidedly a not-awesome day [17:15:51] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [17:16:13] RobH: what all runs on sanger? [17:16:59] imap [17:17:24] oh [17:17:26] pppfff [17:17:28] that's fine [17:17:45] that's just mar_k and ariel's email [17:17:50] they need to get with the cloud [17:17:55] apergos may disagree [17:18:02] and i hate our gmail [17:18:04] * apergos stabs notpeter [17:18:07] i want our open source mail to keep working. [17:18:07] =P [17:18:11] yes. I disagree. :-P [17:18:12] yes [17:18:25] but here we are... [17:18:35] so, it's dead dea [17:18:36] d [17:18:40] anyone want to reboot? [17:18:42] someone should hop on its mgmt and find out why. [17:18:46] okok I'll look at it [17:18:52] not me, im eating lunch and not working for a minimum of 15 minutes. [17:19:17] apergos: it's not responding to pings or ssh... [17:19:21] safe to just reboot, I'd say [17:19:29] unless the console has something exciting to say [17:19:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.102 seconds [17:19:53] I'm on it and looking to see if there's any messages [17:20:02] also pull racadm getsel [17:20:08] not so much there aren't [17:20:10] gives you the service event log [17:20:48] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [17:20:59] last thing is from 2011 [17:21:08] Description: System Board PS Redundancy: PS redundancy sensor for System Board, redundancy regained [17:21:14] so gonna call that a loss [17:21:35] gonna powercycle it [17:21:39] speak now or else. [17:21:41] New review: preilly; "As, requested by Ben Hartshorne I've added a comment before the XFF head..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3898 [17:21:46] go for it [17:22:23] did so [17:22:46] watching it boot [17:22:54] New review: preilly; "This is basically a copy of:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3922 [17:24:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3922 [17:24:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3922 [17:25:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3923 [17:25:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3923 [17:29:03] hmm [17:29:51] hung [17:29:57] * Starting web server apache2 [ OK ] [17:30:00] that's the last message I got [17:30:48] nothing exceptional on bootup [17:31:01] a whine from gmond about a missing conf file [17:31:19] a warning from dovecot that the fd limit could be raised [17:31:20] that's it [17:32:08] gmond--i think that's related to changes in ubuntu ganglia packages [17:33:14] shouldn't hang the box though [17:33:23] and yet. [17:33:26] no it shouldn't [17:33:56] where I've seen that happen I've had better luck with an alternate init scrip, i think it was ganglia-monitor (?) 
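A minimal sketch of the DRAC-side checks mentioned earlier in this exchange ("pull racadm getsel"); run remotely against sanger's management controller with a hypothetical address and credentials, or locally from the DRAC's own ssh session without the -r/-u/-p options:

    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET getsel          # service event log
    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET getsysinfo      # firmware, NICs, power state
    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET serveraction powercycle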
[17:34:52] given that I can't get on the box, it makes it a bit harder [17:35:06] apache is the last started and it claims to beok [17:35:13] so either the next things doesn't even get out the door [17:35:25] os some previous thing fries it [17:35:47] not even pingable [17:35:52] did you reboot it? that's my vote all along [17:36:00] that's what I did first [17:36:05] watched it boot up [17:36:08] with the results reported here [17:38:03] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Thu Mar 29 17:38:01 UTC 2012 [17:38:07] any thoughts? [17:40:31] preilly: http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=cpu_report&s=by+name&c=Mobile+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 is what I'm watching. [17:41:31] apergos: if it's not pingable that means net isn't loaded or it's not working. That should happen long before something like apache or gmond, meaning your problem's not in the tail end of the boot cycle. [17:41:55] I admit to be less than psyched about booting single user and etc in my evening [17:42:02] servesme right for hoping to fix anything fast [17:42:07] is it pingable during part of the boot process? i.e. reboot again and keep a ping running. if it shows up then goes away again, that hurts. If it never shows up, blame the network. [17:42:25] oh, single user? it may not be loading the network. [17:42:37] I have not booted singel user [17:42:42] but logically that would be the next step [17:43:00] I am happy to reboot it and try pinging during the duration however [17:48:45] nada [17:48:57] here we are after "startin webserver apache" [17:49:03] apache2. whatever [17:49:08] it never responded to its ping? [17:49:09] no ping the entire time [17:49:10] nope [17:49:14] blame the network. [17:49:18] great [17:49:22] IP conflict? [17:49:24] you could ask leslie to check the switch or ask chris to check the cable [17:49:30] n/m no then you'd get a pig [17:49:31] ping [17:49:36] typing is hrad. [17:49:41] yes it is [17:49:43] s/h// [17:49:53] a pig [17:49:54] (if you're typing in the 80s) [17:50:06] * apergos thinks about pigs [17:50:09] hehe [17:50:13] on the wing? [17:50:13] nom nom nom ? [17:50:19] I am hungry [17:50:23] ur pigs were blue uniforms [17:50:25] *our [17:50:28] is the mail server down? [17:50:35] (http://en.wikipedia.org/wiki/Pigs_on_the_Wing) [17:50:39] yes, if you have an imap account [17:50:39] so sanger's console isn't working ? [17:50:43] nimish_g: dunno. are you not getting mail? [17:50:45] ;) [17:50:45] nimish_g: give into the gmail ... [17:50:49] the console hangs after the [17:50:57] * Starting web server apache2 [ OK ] [17:50:57] NEVAAR!! [17:51:00] nothing else shows up. [17:51:10] it's never pingable durin boot. [17:51:32] there were no messages available on the console when it went out to lunch (so that it needed to be rebooted). [17:51:44] apergos: which server on you having a problem connecting to? [17:51:47] sanger [17:51:55] maplebed: yeah, thunderbird tells me the imap server is down. that or no one likes me and hasn't sent me email all day [17:52:11] you should be so lucky [17:52:18] * apergos forwards nimish all their ops cron spam  [17:52:21] let me check cables [17:52:24] ok [17:53:01] !log search1021 coming down for ssd fit test [17:53:03] Logged the message, RobH [17:54:14] sanger's not going up (network port isn't going up and down) [17:54:27] nimish_g: well we were going to wait to tell you ... 
[17:54:31] see we fixed the glitch [17:54:45] +1 LeslieCarr [17:55:00] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:19] now the port is up [17:55:21] apergos: network cable was not fully engage [17:55:25] huh [17:55:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:28] wonder how that happened [17:55:31] did cmjohnson1 fix it ? [17:55:31] responds to a ping! [17:55:34] yay cmjohnson1 [17:55:35] yes [17:55:37] yay cmjohnson1 !!! [17:55:45] thaat is it, it's pinging now [17:56:03] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:56:08] how long has it been down? [17:56:35] 40 mins? [17:57:15] apergos: most likely my fault...i was over there picking apart dataset1 [17:57:20] I wonder why it wasn't responding on mgmt console before though [17:57:35] hey, now you save some of ds1 for us so we can destroy it properly [17:57:38] no idea...the cabling on that rack is a disgusting mess [17:58:00] the mgmt console part is weird [17:58:34] i went ahead and replaced that cable...i thought that was the problem [17:59:40] Mar 29 17:11:10 sanger kernel: [5598778.468228] bnx2: eth0 NIC Copper Link is Down [17:59:46] Mar 29 17:12:00 sanger nagios3: SERVICE ALERT: gateway;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100% [17:59:59] !log search1021 coming back up, done with tests [18:00:00] Logged the message, RobH [18:00:18] so it was there and should have rsponded with login prompt on console, no idea why it didn't [18:00:21] anyways, thanks for fixing [18:00:24] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:00:42] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused [18:00:47] oh come on [18:00:55] I just cleared off of there too [18:01:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.032 seconds [18:02:08] anyone want to take over ? [18:03:51] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [18:04:35] New patchset: Jgreen; "granting zexly non-root shell access on storage3 for banner impression logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3927 [18:04:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3927 [18:05:13] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3927 [18:05:16] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3927 [18:05:30] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 95 MB (1% inode=61%): /var/lib/ureadahead/debugfs 95 MB (1% inode=61%): [18:10:54] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%): /var/lib/ureadahead/debugfs 278 MB (3% inode=61%): [18:12:51] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [18:16:27] so seriously. 
opendj is running on sanger, I even restarted it for goodmeasure [18:22:07] RECOVERY - Disk space on srv224 is OK: DISK OK [18:22:16] RECOVERY - Disk space on srv223 is OK: DISK OK [18:23:37] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [18:23:37] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [18:26:55] PROBLEM - NTP on sanger is CRITICAL: NTP CRITICAL: Offset unknown [18:30:00] grrr [18:35:37] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [18:35:37] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [18:36:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:25] RECOVERY - NTP on sanger is OK: NTP OK: Offset 0.06688630581 secs [18:40:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.163 seconds [18:44:05] LeslieCarr is kicking opendj again [18:47:09] it looks like there's iptables rules that didn't start up for some reason [18:48:25] maybe it sohuld get rebooted now that there is a network link [18:48:35] weird things can happen if the network is unavailable during boot [18:52:26] ahha, maplebed figured it out [18:52:42] there's a template for the init script … and the addresses aren't getting sourced properly in puppet [18:52:52] and because it wasn't restarted forever, we didn't realize what would happen [18:53:01] restarting opendj [18:53:55] and it didn't do what I expected. [18:53:59] grumble. [18:54:31] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.007 second response time on port 636 [18:54:32] oh wow [18:54:50] restartingc opendj again [18:54:52] sleeping bug waiting to break [18:55:11] looks like it was part of getting ldap to work right in labs [18:55:34] RECOVERY - LDAP on sanger is OK: TCP OK - 0.000 second response time on port 389 [18:55:50] hey! [18:55:55] nagios thinks its working. [18:57:07] and mchenry thinks it is as well [18:57:29] test mail sent and recieved. [18:57:32] it's working. [18:57:39] ... until we reboot sanger again. [18:57:40] \o/ [19:01:41] RobH: you still need the template updates for Bz from me, right? [19:02:31] I should look at puppet... LeslieCarr did give me that quick overview. I mean... how hard can it be? [19:02:41] hehehe [19:02:44] * hexmode hears distant mocking laughter [19:02:50] hexmode: i recommend just looking at one specific thing [19:03:00] like look through one machines' chain of information [19:03:04] browsing files is info overload [19:03:49] LeslieCarr: sure... is there a base image to use? Something that I can say "this + install bz package + these templates" [19:04:00] 'cause that seems the thing to do [19:05:08] "standard" [19:05:15] is the normal image to start with [19:06:34] easy enough [19:07:17] hexmode: I am on site @ eqiad today, so I doubt I will be getting to anything software related today. [19:07:23] LeslieCarr: I'll probably have more questions when the labs instance i make fails. [19:07:27] RobH: np [19:07:29] by the time i leave here, i am pretty well exhausted [19:07:31] =P [19:07:40] hehe [19:07:45] hexmode: heh, +1 to labs bz instance [19:07:50] cuz if you edidnt do it, i was gonna have to [19:07:51] did you drag demon with you again ? [19:07:59] since the change you guys asked for will require using custom templates [19:08:09] and hacking at those on the live bz instance is not recommended. [19:08:27] RobH: no doubt. 
why can't computers be easy? [19:08:43] this is 2012 -- they should read my mind [19:16:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [19:57:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:34] !log restarting lsearchd on search7 to del the logfile to end all logfiles [20:02:35] Logged the message, notpeter [20:04:28] how large was that one? [20:05:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [20:07:40] !log restarting lsearchd on search2 to del the logfile to end all logfiles [20:07:42] Logged the message, notpeter [20:21:06] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=61%): /var/lib/ureadahead/debugfs 277 MB (3% inode=61%): [20:25:18] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%): /var/lib/ureadahead/debugfs 278 MB (3% inode=61%): [20:35:48] RECOVERY - Disk space on srv222 is OK: DISK OK [20:35:57] RECOVERY - Disk space on srv223 is OK: DISK OK [20:37:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds [20:56:48] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 137 MB (1% inode=61%): /var/lib/ureadahead/debugfs 137 MB (1% inode=61%): [21:00:15] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 189 seconds [21:00:51] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 214 seconds [21:01:09] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 223 seconds [21:01:54] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 242 seconds [21:02:57] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [21:03:15] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [21:04:00] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [21:04:27] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [21:09:42] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 137 MB (1% inode=61%): /var/lib/ureadahead/debugfs 137 MB (1% inode=61%): [21:13:45] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=61%): /var/lib/ureadahead/debugfs 135 MB (1% inode=61%): [21:17:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:03] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [21:20:48] !log rebooting db47 [21:20:50] Logged the message, Mistress of the network gear. 
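The diagnostic suggested for sanger earlier in the log, reboot the box and keep a ping running the whole time, is worth spelling out: if the host answers and then drops off, the problem is somewhere in the boot sequence; if it never answers at all, blame the network and get the switch port and cable checked, which is exactly how the half-seated cable was found. Below is a minimal Python sketch of that kind of watch; the polling interval and duration are made-up defaults, and it simply shells out to the ordinary Linux ping.

#!/usr/bin/env python3
"""Watch whether a host ever answers ping while it reboots.

A rough sketch only: the polling interval and duration below are made-up
defaults for illustration, not anything the ops team actually runs.
"""
import subprocess
import sys
import time


def pingable(host):
    # One ICMP echo with a 1 second timeout (Linux iputils ping flags).
    return subprocess.call(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0


def watch(host, duration=600, interval=2):
    """Print every up/down transition seen over `duration` seconds."""
    state = None
    deadline = time.time() + duration
    while time.time() < deadline:
        up = pingable(host)
        if up != state:
            print("%s %s is %s" % (time.strftime("%H:%M:%S"), host,
                                   "pingable" if up else "not pingable"))
            state = up
        time.sleep(interval)
    return state


if __name__ == "__main__":
    # Run from another machine, then reboot the target, e.g.:
    #   python3 ping_watch.py sanger
    if not watch(sys.argv[1]):
        print("never (or no longer) pingable: suspect the switch port or cable")

Run from another machine while the target reboots, a watcher like this makes the "never pingable at any point during boot" case obvious, which points at the link itself rather than at late-boot services like apache.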
[21:23:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [21:25:18] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 1 seconds [21:25:36] RECOVERY - Host db47 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [21:26:21] RECOVERY - Disk space on srv220 is OK: DISK OK [21:26:30] RECOVERY - Disk space on srv219 is OK: DISK OK [21:26:39] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [21:30:15] PROBLEM - MySQL Recent Restart on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:15] PROBLEM - RAID on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:24] PROBLEM - MySQL disk space on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:42] PROBLEM - SSH on db47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:52] PROBLEM - mysqld processes on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - MySQL Slave Running on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - Disk space on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - MySQL Idle Transactions on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:18] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: Connection refused by host [21:31:18] PROBLEM - DPKG on db47 is CRITICAL: Connection refused by host [21:31:18] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: Connection refused by host [21:32:12] RECOVERY - MySQL Recent Restart on db47 is OK: OK seconds since restart [21:32:12] RECOVERY - RAID on db47 is OK: OK: State is Optimal, checked 2 logical device(s) [21:32:21] RECOVERY - MySQL disk space on db47 is OK: DISK OK [21:32:22] hah now db47 gives messages [21:32:25] thanks for nothing nagios :p [21:32:39] RECOVERY - SSH on db47 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:33:06] RECOVERY - MySQL Slave Running on db47 is OK: OK replication [21:33:06] RECOVERY - MySQL Idle Transactions on db47 is OK: OK longest blocking idle transaction sleeps for seconds [21:33:06] RECOVERY - Disk space on db47 is OK: DISK OK [21:33:15] RECOVERY - MySQL Slave Delay on db47 is OK: OK replication delay seconds [21:33:15] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay seconds [21:33:15] RECOVERY - DPKG on db47 is OK: All packages OK [21:55:17] RECOVERY - mysqld processes on db47 is OK: PROCS OK: 1 process with command name mysqld [21:58:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:11] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: CRIT replication delay 33127 seconds [21:59:29] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: CRIT replication delay 32922 seconds [22:04:43] New patchset: Lcarr; "decommissioning old servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3950 [22:04:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3950 [22:05:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [22:15:59] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 36 MB (0% inode=61%): /var/lib/ureadahead/debugfs 36 MB (0% inode=61%): [22:17:18] New patchset: Lcarr; "deleting br1-knams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3952 [22:17:29] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [22:17:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3952 [22:17:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3950 [22:17:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3950 [22:18:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3952 [22:18:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3952 [22:30:41] RECOVERY - Disk space on srv219 is OK: DISK OK [22:33:32] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:19] New patchset: Lcarr; "TESTING seeing if relative variables make a difference to icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3954 [22:34:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3954 [22:35:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3954 [22:35:02] RECOVERY - MySQL Slave Delay on db47 is OK: OK replication delay 0 seconds [22:35:02] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [22:35:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3954 [22:39:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.715 seconds [22:56:34] New patchset: Lcarr; "TESTING icinga stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3956 [22:56:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3956 [22:57:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3956 [22:57:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3956 [23:10:53] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=61%): /var/lib/ureadahead/debugfs 274 MB (3% inode=61%): [23:21:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:35] RECOVERY - Disk space on srv223 is OK: DISK OK [23:25:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.085 seconds [23:27:33] AaronSchulz: have you turned off the thumbnail writes on the copper cluster while you're testing originals (so it would be MW writing the thumbnail, not swift)? [23:28:09] I haven't touched rewrite.py [23:28:24] AaronSchulz: it's configured in the proxy config file, not in rewrite. [23:29:34] ok. [23:29:37] it'd be interesting to try that. 
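The db47 readings above, "CRIT replication delay 33127 seconds" right after the reboot and a recovery to 0 once the replica caught up, come from checks that turn a measured lag into Nagios OK/WARN/CRIT states. Below is a minimal sketch of that threshold logic; it reads Seconds_Behind_Master from SHOW SLAVE STATUS, whereas the "MySQL Replication Heartbeat" checks seen in this log derive the lag from a heartbeat row written on the master. The pymysql dependency, the credentials file and the warn/crit values are assumptions for illustration.

#!/usr/bin/env python3
"""Nagios-style replication lag check: a sketch, not the production plugin.

Reads Seconds_Behind_Master from SHOW SLAVE STATUS; the "MySQL Replication
Heartbeat" checks in this log instead derive lag from a heartbeat row written
on the master. The pymysql dependency, the ~/.my.cnf credentials file and the
warn/crit thresholds are assumptions for illustration.
"""
import os
import sys

import pymysql

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def replication_delay(host):
    """Return lag in seconds, or None if the host is not replicating."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if not row:
        return None                          # not configured as a replica
    return row["Seconds_Behind_Master"]      # None while a thread is stopped


def main(host, warn=120, crit=300):
    delay = replication_delay(host)
    if delay is None:
        print("UNKNOWN replication not running")
        return UNKNOWN
    if delay >= crit:
        print("CRIT replication delay %d seconds" % delay)
        return CRITICAL
    if delay >= warn:
        print("WARN replication delay %d seconds" % delay)
        return WARNING
    print("OK replication delay %d seconds" % delay)
    return OK


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))

The five-digit delay on db47 immediately after its reboot is simply the replica reporting how far behind it is while it replays what it missed, which is why the same checks recover on their own at 22:35 without further intervention.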
[23:29:56] I want to know if the Traceback that's showing up in /var/log/syslog disappears when the swift writingc is turned off. [23:30:14] * AaronSchulz hasn't touched that config [23:30:38] no should be hitting it atm [23:33:48] *no one [23:33:49] arg [23:51:42] New patchset: Bhartshorne; "tersifying nagios alert messaging for email->sms messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3961 [23:51:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3961 [23:53:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3961 [23:53:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3961
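The last change merged above, "tersifying nagios alert messaging for email->sms messages", is about making notifications survive SMS length limits. The actual patchset lives in operations/puppet and edits the notification commands; the snippet below is only a rough illustration of the idea, keeping state, host and service while trimming the plugin output to fit a single 160-character message.

#!/usr/bin/env python3
"""Rough illustration of "tersifying" a Nagios notification for SMS.

Not the operations/puppet change merged above (that edits the puppet-managed
notification commands); this only shows the idea: keep state, host and
service, and trim the plugin output so the whole message fits in one
160-character SMS.
"""
SMS_LIMIT = 160


def terse(host, service, state, output, limit=SMS_LIMIT):
    head = "%s %s/%s: " % (state, host, service)
    if len(head) + 3 >= limit:      # pathological: no room for any output
        return head[:limit]
    room = limit - len(head)
    if len(output) > room:
        output = output[:room - 3] + "..."
    return head + output


if __name__ == "__main__":
    print(terse("stafford", "Puppetmaster HTTPS", "CRITICAL",
                "CRITICAL - Socket timeout after 10 seconds"))

For the stafford alert seen throughout this log the result is "CRITICAL stafford/Puppetmaster HTTPS: CRITICAL - Socket timeout after 10 seconds", already short enough that nothing needs to be dropped.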
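Back at 23:29 the open question was whether the Traceback showing up in /var/log/syslog disappears once swift is no longer writing thumbnails. That comes down to counting tracebacks before and after the config change; a small sketch follows, in which the "proxy-server" string used to pick out swift proxy lines is an assumption rather than something taken from the log.

#!/usr/bin/env python3
"""Count Python tracebacks in /var/log/syslog, for a before/after comparison
around a config change such as disabling swift thumbnail writes.

The logfile path comes from the discussion above; the "proxy-server" string
used to pick out swift proxy lines is an assumption about how that daemon
tags its syslog output.
"""
import re
import sys

LOGFILE = "/var/log/syslog"
# syslog lines start with something like "Mar 29 23:29:56 somehost proxy-server: ..."
TIMESTAMP = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8})")


def traceback_times(path=LOGFILE, process="proxy-server"):
    """Return the syslog timestamps of lines mentioning a Traceback."""
    times = []
    with open(path, errors="replace") as log:
        for line in log:
            if "Traceback" in line and process in line:
                match = TIMESTAMP.match(line)
                times.append(match.group(1) if match else "?")
    return times


if __name__ == "__main__":
    hits = traceback_times(*sys.argv[1:])
    print("%d tracebacks" % len(hits))
    if hits:
        print("first: %s  last: %s" % (hits[0], hits[-1]))

Run it once while thumbnail writes are still enabled and again after they are turned off in the proxy config; if the count stops growing, that supports the idea that the traceback was coming from the swift write path.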