[00:02:18] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 75 MB (1% inode=61%): /var/lib/ureadahead/debugfs 75 MB (1% inode=61%): [00:06:57] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 264 MB (3% inode=61%): /var/lib/ureadahead/debugfs 264 MB (3% inode=61%): [00:08:27] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=61%): /var/lib/ureadahead/debugfs 281 MB (3% inode=61%): [00:11:09] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 300 seconds [00:12:48] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=61%): /var/lib/ureadahead/debugfs 284 MB (3% inode=61%): [00:15:30] RECOVERY - Disk space on srv221 is OK: DISK OK [00:16:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:09] RECOVERY - Disk space on srv220 is OK: DISK OK [00:19:08] hexmode: You about? [00:19:13] I am working on your bugzilla ticket. [00:19:23] by chance, I am [00:19:23] ie: template changes [00:19:35] cool, what template file is this, do you know or should i start grepping? [00:19:59] im already disappointed in our bugzilla implementation, seems it isnt very puppetized =[ [00:20:21] I was hoping for shell access so I could test that more... I can find it and test tomorrow [00:20:28] yes, puppet would be great [00:20:38] you more than likely will not be getting shell, as only ops get that. [00:20:48] on servers like this, even priyanka when she did bz stuff didnt have shell. [00:20:59] and changing things on kaulen would need root, not just shell. [00:21:03] RECOVERY - Disk space on srv223 is OK: DISK OK [00:21:04] :'( [00:21:12] RECOVERY - Disk space on srv222 is OK: DISK OK [00:21:14] nothing personal intended, we just dont hand out root [00:21:20] not even every ops person has root now ;] [00:21:47] RobH: maybe the best thing to do, then, is puppetize it? [00:21:50] we are working to better grain our access controls so in the future we can be a bit more flexible [00:22:01] that will take longer than me just hacking in your changes, you dont wanna wait on that =] [00:22:12] if ya dunno what file it is no worries, i just figured best to ask [00:22:14] if you can do that, I can work on the templates tomorrow [00:22:21] ok [00:22:21] cuz if ya did, would make it super fast [00:22:29] i look for it =] [00:22:43] bz needs puppetization, but that is a longer than one night process. 
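A minimal sketch of the grepping approach floated above for finding the right Bugzilla template, assuming a stock Bugzilla layout where templates live under template/en/default/ and local overrides under template/en/custom/; the install root and the search string are hypothetical placeholders, not the actual layout on kaulen:

    # locate the template that renders a given piece of page text
    cd /srv/bugzilla                          # hypothetical install root
    grep -rn --include='*.tmpl' 'Visible page text here' template/en/default/
    # local customizations, if present, live in template/en/custom/ and take precedence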
[00:22:52] ie: puppetize in labs, then import into production [00:23:02] ok, I think you'll probably need more from me tomrrow, let me know [00:23:08] while it needs to happen, dont let anyone suggest holding up bz changes for it, cuz it will take too long [00:23:25] and waiting for it to puppetize for this kind of change is a bit too long a wait =] [00:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.078 seconds [00:27:07] but yea, we need to puppetize it then anyone could submit these kinds of changes [00:27:19] ops person then just does code review and pushes into production [00:58:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.918 seconds [01:06:28] * Jamesofur whines [01:06:29] No wonder people keep getting the security warning from visiting https://shop.wikimedia.org (instead of http) even if they don't have the httpsEverywhere extension. It looks like we force https (maybe just for our addresses? ) even if written out without protorel. So for example even when I'm careful to write out [http://shop.wikimedia.org a link] it still turns into https for people browsing from https on the sites… is there anywa [01:06:29] exempt the shop from that for now? [01:16:49] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [01:38:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:44:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [02:13:49] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [02:16:31] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 196 MB (2% inode=61%): /var/lib/ureadahead/debugfs 196 MB (2% inode=61%): [02:18:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:43] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%): [02:24:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.207 seconds [02:31:23] RECOVERY - Disk space on srv221 is OK: DISK OK [02:31:23] RECOVERY - Disk space on srv224 is OK: DISK OK [03:02:08] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 309 seconds [03:02:17] PROBLEM - MySQL Replication Heartbeat on db1006 is CRITICAL: CRIT replication delay 319 seconds [03:02:26] PROBLEM - MySQL Slave Delay on db1006 is CRITICAL: CRIT replication delay 329 seconds [03:02:35] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 336 seconds [03:02:35] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: CRIT replication delay 338 seconds [03:02:44] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 345 seconds [03:02:53] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: CRIT replication delay 353 seconds [03:03:11] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 372 seconds [03:03:47] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 407 seconds [03:06:47] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [03:07:14] RECOVERY - MySQL Slave Delay on db47 is OK: OK 
replication delay 0 seconds [03:08:26] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [03:09:58] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [03:12:56] RECOVERY - MySQL Slave Delay on db1006 is OK: OK replication delay 0 seconds [03:14:44] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [03:14:53] RECOVERY - MySQL Replication Heartbeat on db1006 is OK: OK replication delay 0 seconds [03:15:02] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 0 seconds [03:15:38] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 265 seconds [03:27:56] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds [03:28:14] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds [04:50:28] RECOVERY - Disk space on search1021 is OK: DISK OK [04:50:28] RECOVERY - Disk space on search1022 is OK: DISK OK [04:56:46] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3463 MB (3% inode=99%): [04:56:46] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3463 MB (3% inode=99%): [05:37:16] !log rebooting db42 to finish upgrades [05:37:20] Logged the message, Master [05:41:02] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay seconds [05:41:56] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay seconds [05:42:28] !log db42 - reboot worked despite the grub warning about unreliable blocklists [05:42:30] Logged the message, Master [05:44:20] PROBLEM - mysqld processes on db42 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:46:26] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [05:49:43] !log db42 - mysql did not autostart after boot, added using update-rc.d [05:49:44] Logged the message, Master [05:51:59] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:25] !log installed security upgrades on brewser, cadmium, capella (apache,mysql,ruby,apt..) 
[05:58:26] Logged the message, Master [06:00:59] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4159 MB (3% inode=99%): [06:06:09] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 1068249 seconds [06:06:27] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1068222 seconds [06:17:51] ACKNOWLEDGEMENT - Host lily is DOWN: CRITICAL - Host Unreachable (91.198.174.121) daniel_zahn has been replaced by sodium [06:21:42] !log installed more package upgrades on sodium [06:21:45] Logged the message, Master [06:25:04] !log powercycling sq40 [06:25:08] Logged the message, Master [06:32:49] ACKNOWLEDGEMENT - Host sq40 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn hardware failure, just added to existing RT 2581 [06:32:49] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3436 MB (2% inode=99%): [06:37:01] PROBLEM - Disk space on search2 is CRITICAL: DISK CRITICAL - free space: /a 1572 MB (1% inode=99%): [06:50:22] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 159 MB (2% inode=61%): /var/lib/ureadahead/debugfs 159 MB (2% inode=61%): [07:06:25] RECOVERY - Disk space on srv221 is OK: DISK OK [07:11:58] PROBLEM - Disk space on search7 is CRITICAL: DISK CRITICAL - free space: /a 0 MB (0% inode=98%): [07:28:46] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196, [07:28:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [07:31:28] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [07:40:10] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: Connection refused [07:42:07] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.055 seconds [07:45:03] New patchset: Asher; "inline C to use x-forwarded-for for zero/digi acl match" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3898 [07:45:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3898 [08:08:19] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100% [08:10:54] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3898 [08:11:19] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [08:22:25] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [08:22:25] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [08:30:25] New patchset: Hashar; "jenkins: add existing users to existing group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:30:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3903 [08:34:14] New review: Dzahn; "yea, we just had a talk about the "add existing user to groups"-issue with the puppet provider (addi..." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3903 [08:34:16] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [08:34:16] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [08:34:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:52:46] New patchset: Hashar; "Revert "jenkins: add existing users to existing group"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3904 [08:53:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3904 [08:53:13] New review: Hashar; "Patchset reverting this is https://gerrit.wikimedia.org/r/3904" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3903 [08:58:05] New review: Dzahn; "yeah, the "plusignment" doesn't do it. Either "Duplicate definition: Group[jenkins]" or "Only subcla..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3904 [08:58:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3904 [08:59:52] http://blog.archive.org/2012/03/29/wayback-machine-machines-are-moving/ [09:08:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=61%): /var/lib/ureadahead/debugfs 268 MB (3% inode=61%): [09:10:51] !log gallium - added demon,hashar,reedy to group jenkins as it's a problem using puppet when users and groups already exist [09:10:54] Logged the message, Master [09:12:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 244 MB (3% inode=61%): /var/lib/ureadahead/debugfs 244 MB (3% inode=61%): [09:12:47] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 186 MB (2% inode=61%): /var/lib/ureadahead/debugfs 186 MB (2% inode=61%): [09:12:47] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 101 MB (1% inode=61%): /var/lib/ureadahead/debugfs 101 MB (1% inode=61%): [09:16:59] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 280 MB (3% inode=61%): /var/lib/ureadahead/debugfs 280 MB (3% inode=61%): [09:21:11] RECOVERY - Disk space on srv221 is OK: DISK OK [09:21:11] RECOVERY - Disk space on srv224 is OK: DISK OK [09:21:11] RECOVERY - Disk space on srv219 is OK: DISK OK [09:22:17] <-- if you wonder about that debugfs,, here is something: [09:22:20] "The kernel filesystem tracers that ureadahead uses to 'profile' the boot process are expected to be at the debugfs mountpoint, /sys/kernel/debug. If a quick test reveals that the mountpoint isn't up yet, rather than wait for the mountpoint (and potentially missing more profiling it could do), ureadahead mounts a temporary debugfs at /var/lib/ureadhead/debugfs so that it can get to the filesystem tracers (do_sys_open, open_exec & uselib sys [09:22:26] According to the dev in this bug report (https://bugs.launchpad.net/ubuntu/+s...ad/+bug/499773), a left-over temporary mountpoint indicates that ureadahead crashed out, leaving the mountpoint in /etc/mtab. A quick scan of the source seems to confirm this, but maybe something was overlooked. [09:25:59] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 19, down: 0, shutdown: 1 [09:27:38] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
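A minimal sketch of the check-and-cleanup this discussion converges on, following the bug-report explanation quoted above; whether the extra debugfs mount is real or only a stale /etc/mtab entry left by a crashed ureadahead decides which step applies:

    # what the kernel actually has mounted (authoritative)
    grep debugfs /proc/self/mountinfo
    # what mtab claims, which may be stale if ureadahead crashed out
    grep debugfs /etc/mtab
    # if the second mount is real, drop it; the one on /sys/kernel/debug stays
    umount /var/lib/ureadahead/debugfs
    # if only mtab is stale, the leftover line in /etc/mtab is all that needs removing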
[09:29:08] PROBLEM - LVS Lucene on search-pool3.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:29:44] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [09:31:41] RECOVERY - Disk space on srv220 is OK: DISK OK [09:31:41] RECOVERY - Disk space on srv222 is OK: DISK OK [09:31:50] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [09:32:44] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [09:35:02] mutante: maybe we could have an init.d script that would umount /var/lib/ureadahead/debugfs post boot ? [09:35:17] RECOVERY - LVS Lucene on search-pool3.svc.pmtpa.wmnet is OK: TCP OK - 0.002 second response time on port 8123 [09:35:44] RECOVERY - Lucene on search6 is OK: TCP OK - 0.004 second response time on port 8123 [09:36:32] !log restarted defunct lsearchd on search6 [09:36:34] Logged the message, Master [09:36:38] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.011 second response time on port 8123 [09:37:23] I was jus ton there and did that [09:37:50] hashar: possibly, but it happens while the server is running, then disappears after 10 min [09:37:56] apergos: how about search7? [09:38:02] no not yet [09:38:05] apergos: i see it still running there though [09:38:14] I killed things that wouldn't die [09:38:18] then restarted it on search6 [09:38:20] well, "java" using 106% CPU [09:38:36] ok, lets do the same there [09:38:48] it's all you [09:39:23] up to 675% [09:39:26] ok [09:40:16] !log kill and start lsearchd on search7 [09:40:18] Logged the message, Master [09:40:53] mutante: what I mean is that we have two debugfs mounted [09:41:03] it seems only one is needed, the one on /sys/kernel/debug [09:41:10] hashar@srv221:~$ mount |grep debugfs [09:41:11] none on /sys/kernel/debug type debugfs (rw) [09:41:11] none on /var/lib/ureadahead/debugfs type debugfs (rw,relatime) [09:41:16] https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/499773 [09:41:49] hmmm [09:42:09] "mountall doesn't record unmounts, so it never "forgets" this path" [09:42:53] so to me, the workaround would be to manually unmount /var/lib/ureadahead/debugfs [09:42:56] cat /proc/self/mountinfo | grep debug [09:43:02] would have to try on a server out of production ;) [09:43:35] "useless use of cat"-award :p [09:43:43] or we add a script that take care of amounting /var/lib/ureadahead post boot ;) [09:43:55] it looks like the bug is that it says it is mounted, when it really is not [09:44:10] it says to check /proc/self/mountinfo instead [09:46:05] the hard way is to just reboot those systems so they end up possibly in a clean state [09:46:24] anyway, you might want to raise the issue on the secret ops mailing list ;-) [09:47:08] "Why has this been set to importance: low when tens of thousands of users are told their hard disks are full, and they can no longer store files on them?" :( [09:48:36] hashar: i did [09:49:11] hashar: how about editing /etc/mtab by hand :? [09:51:14] or... mv /etc/init/ureadahead.conf /etc/init/ureadahead.conf.disable [09:51:52] https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/736512 [09:51:52] or just remove ureadahead ;) [09:52:02] it is just mean to speed up the boot process [09:52:11] something we really do not care about on a server [09:52:22] "bug was fixed in the package mountall - 2.25" ..hmm .checking [09:53:13] src/mountall.c: ignore ureadahead's potential mount of [09:53:13] /var/lib/ureadahead/debugfs (LP: #736512). 
[09:53:16] http://changelogs.ubuntu.com/changelogs/pool/main/m/mountall/mountall_2.35/changelog [09:53:27] \O/ [09:53:44] ii mountall 2.15.3 [09:57:10] hashar: +1 on just disabling it then . mv /etc/init/ureadahead.conf /etc/init/ureadahead.conf.disable [09:57:25] might work [09:57:31] but that would be for next reboot [09:57:39] you still have to umount the fs [09:58:05] combine it with installing new kernel then [09:58:23] i'll do this one, srv221, now [09:59:44] !log srv221, disabling ureadahead, installing package upgrades and new kernel, rebooting [09:59:45] Logged the message, Master [10:05:56] what kernel are yo u going to? (ouot of curiosity only) [10:06:48] 2.6.32-40-server [10:06:55] eh ok [10:10:59] PROBLEM - Apache HTTP on srv221 is CRITICAL: Connection refused [10:13:05] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.102 second response time [10:49:55] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:50:21] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2446 [10:51:52] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [11:18:16] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [11:20:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.980 seconds [11:56:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:00:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.020 seconds [12:09:19] PROBLEM - Host db47 is DOWN: PING CRITICAL - Packet loss = 100% [12:15:19] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [12:27:11] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [12:31:04] RECOVERY - Lucene on search15 is OK: TCP OK - 2.997 second response time on port 8123 [12:34:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.275 seconds [13:11:06] !log trimming logs and such on search1-20 [13:11:08] Logged the message, notpeter [13:12:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:18] RECOVERY - Disk space on search7 is OK: DISK OK [13:15:30] RECOVERY - Disk space on search2 is OK: DISK OK [13:16:44] is there something I can run to do that? [13:17:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.153 seconds [13:17:01] because I'm happy to trim them on a host if I see nagios is whining [13:17:32] well, I need to figure out how, for example, search7 has made 72 gigs of logs in 6 hours [13:17:41] woah! [13:17:53] hurray for java stacktraces! [13:17:58] yuck [13:18:03] better you than me [13:18:16] I'm just going to turn log level up to crit... [13:24:22] ugh. this is so annoying. when you move an index off of a host, it doesn't know to delete the index that's no longer assigned to it. so search2, which is dying for space, has 5x as many indexes as it should... [13:24:29] *sigh* time to do some housekeeping. 
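A minimal sketch of the housekeeping pass described above on the search hosts, assuming the indexes and logs sit under /a as the disk-space alerts suggest; the exact lsearchd paths are guesses:

    # largest consumers on the data partition, biggest first (sizes in MB)
    du -xsm /a/* 2>/dev/null | sort -rn | head -20
    # log files that ballooned in the last day (the stack-trace flood)
    find /a -name '*.log*' -mtime -1 -size +1G -ls
    # candidate leftovers from indexes no longer assigned to this host
    find /a -maxdepth 2 -mtime +365 -ls       # review before deleting anything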
[13:24:38] ouch [13:25:03] at least it will get some room back after that [13:25:10] quite a bit, yes [13:25:22] but I'm going to go through all of the nodes and see what I can free up [13:25:30] cool [13:25:37] that said, I don't know if it can free up 72 gigs/6 hours worth ;) [13:26:44] well if you turned down the log level can't you jus tmove the logs, restart the indexer or whatever it is and then toss theold ones? [13:27:28] yep [13:27:45] I'll have to do some yucky puppet stuff to turn it down for just the hosts that have exploding logs [13:27:51] but yeah, shoulnd't be too bad [13:28:28] oh hey, look at thta, half of search2's disk is free! [13:29:56] yeah... lots of crap from 2008.... [13:30:01] seems like it can go ;) [13:30:17] um [13:33:12] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:51] PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:41:08] apergos mark mutante bits lvs is in fact down in europe [13:41:47] cp3001 is dead, cp3002 is overloaded [13:42:03] And it seems that's all the backends for bits esams ?!? [13:42:10] hurray [13:43:22] oh joy [13:43:26] mutante is sleeping [13:43:28] mark is on vacation [13:43:37] <^demon> Leslie? [13:43:43] is not on line yet [13:43:48] it's not sf waking hours [13:44:01] <^demon> Yeah, but she was the next person I thought of we could ping offline. [13:44:44] ah [13:44:44] Ryan is on vacation too [13:44:46] <^demon> Too late for Tim? [13:44:49] popping a shell [13:44:54] 12:44am [13:45:04] nope [13:45:07] I think rebooting cp300{1,2} ought to do it [13:45:08] can get in via impi [13:45:14] but shell wont come up [13:45:17] going to reboot [13:45:26] !log rebooting (mostly) down cp3001 [13:45:26] <^demon> Ouch yeah, let's avoid pinging him after midnight if we can avoid it :) [13:45:27] Logged the message, notpeter [13:47:29] Jeff is also active in -labs [13:47:33] it's booting up [13:47:37] Jeff_Green: hey [13:47:44] notpeter: hi [13:48:14] I don't think I know how to get on those boxes [13:48:27] I have not ever poked at them [13:49:02] RoanKattouw: can you try to load? [13:49:06] RECOVERY - Host cp3001 is UP: PING WARNING - Packet loss = 28%, RTA = 163.79 ms [13:49:10] pybal logs look/.... [13:49:13] not fully awesome [13:49:18] I'll just look at http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [13:49:19] but might be starting to come up? [13:49:20] network traffic has bounced back though [13:49:24] RECOVERY - LVS HTTP on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3975 bytes in 0.328 seconds [13:49:27] hurray!~ [13:49:30] bits network spiked back up [13:50:02] Wikipedia loads reasonably now [13:50:11] Yeah, working for me [13:50:38] ok, cool [13:50:55] soo, crisis averted for now? [13:51:10] I suspect cp3002 might want rebooting, as it's running 15k+ processes [13:51:53] Or at least check what it's actually doing with 97% system cpu [13:51:54] ;) [13:51:56] well, that will take site down again.... 
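A minimal sketch of the out-of-band reboot path used for cp3001 above ("can get in via impi but shell wont come up"); the management hostname, user, and interface type are hypothetical, not the actual WMF management setup:

    MGMT=cp3001.mgmt.example.wmnet            # hypothetical BMC/DRAC address
    ipmitool -I lanplus -H "$MGMT" -U root chassis power status
    ipmitool -I lanplus -H "$MGMT" -U root chassis power cycle
    # then watch it boot over serial-over-LAN
    ipmitool -I lanplus -H "$MGMT" -U root sol activate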
[13:52:42] I think this is a new record for silliness: "load average: 13183.22, 12837.42, 9839.27" [13:52:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:44] it's not serving any network traffic [13:52:47] Jeff_Green: No [13:52:52] Jeff_Green: Let me dig up this tweet of mine [13:53:02] well, it's down to 12k now ;) [13:53:08] so I think it's improving [13:53:25] the box is oddly responsive for reporting that load avg [13:53:26] It would seem cp3001 has been down for a few hours [13:54:09] hhhhmmmm, do we think a restart of varnish is in order? [13:54:10] Jeff_Green: Wait, no, you're right. The one I tweeted about was only 7000. 13000 is insane [13:54:28] wtf is it doing? [13:54:31] Yeah we should be able to restart Varnish on cp3002 I think [13:54:37] not a lot, apparently [13:54:53] there's free RAM, disk io is mellow, cpu is not burning [13:55:33] 2360% varnishd [13:55:38] that's interesting [13:56:09] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.218 seconds [13:56:15] it really is serving up almost no dta [13:56:23] any thoughts on restarting varnishd? [13:56:28] I think I'm pro [13:56:35] just because it is already doing so little [13:56:52] strace on varnish: [13:56:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.898 seconds [13:57:02] read(9, "ping\n", 8191) = 5 [13:57:02] writev(12, [{"200 19 \n", 13}, {"PONG 1333029400 1.0", 19}, {"\n", 1}], 3) = 33 [13:57:02] poll([{fd=9, events=POLLIN}], 1, -1) = 1 ([{fd=9, revents=POLLIN}]) [13:57:19] one blat like that every few seconds is all [13:57:39] Jeff_Green: yeah, seems safe [13:57:44] ok [13:57:45] go for it [13:58:40] load is down to a mere 7k now ;) [13:58:55] yeah, shit's rapidly returning to normal [13:59:08] I'm not sure what to expect, the init script didn't seem to kill it the first time [13:59:18] have you started again? [13:59:23] so I ran it a second time, and it did better [13:59:28] no, issued stop 2x [13:59:46] start it up? [13:59:49] what's varnishstat? it's defunct [13:59:59] haven't started it yet, I'm looking for processes to settle [14:00:01] a fine question, sir [14:00:08] [varnishstat] <----dislike [14:00:20] started [14:01:00] oh, you know, if we were smart, we would have adjusted teh weight before readding.... [14:01:41] !log restarted varnish on on cp3002 because it was thrashing futiley [14:01:43] Logged the message, Master [14:01:55] the weight where? in pybal? [14:02:00] ja [14:02:08] although, the bits appservers seem just fine [14:02:30] mrfpth. what's the point of running a load distributor if it can't handle a backend hiccup?? [14:03:04] hrm [14:03:12] well, I'm also not sure how persistent the cache is [14:03:17] so I mgiht be full of crap [14:03:50] if I'd been awake enough I would have looked for wikitech docs to answer these questions [14:05:09] load seems a bit more stable, it's hovering under 20 which seems reasonable for the number of cpu cores [14:06:04] notpeter: https://rt.wikimedia.org/Ticket/Display.html?id=2730 #2730: usb modem in error again, unplug and replug + pull info for chris [14:06:15] just added you since you were the one who let me know ;] [14:06:20] (as a requestor) [14:06:26] heh, ok [14:06:30] although... it's toast [14:06:37] also have him pulling the info from it for me to get a new one. 
[14:06:42] nice [14:07:14] maybe we should temporarily switch to email gateways until the sms dongle is replaced [14:08:05] i think tim's http solution sounded more stable. [14:08:18] at minimum using the smtp of the gateway provider, not our own. [14:08:43] I think that ben was implementing... something [14:08:46] because if we do a half ass solution, no one will bother to improve it. [14:09:00] so... to the implementer goes the decision-making! [14:09:05] at elast until we get a new dongle [14:09:13] the permanent solution is the sms device, until mark/CT/mgmt says otherwise [14:09:25] but for temp, whatever, i dont care to argue against email being a poor method anymore [14:09:35] email of any form will be about 5 orders of magnitude more stable than what we've been tolerating [14:09:49] yes, so lets hack a temp solution, which will then become permanent [14:10:00] and no one will bother to actually put in place a real one. [14:10:11] it happens on everything. [14:10:28] * apergos goes back to hacking a less temporary solution to the mirror rsyncs [14:10:39] so then let's make sure to get another sms dongle :) [14:10:44] and then we can move back to that [14:10:47] RobH: I don't understand the issue I guess. Email gateways were rock solid for years and years for Craigslist. [14:11:03] i thought the email list conversation was pretty clear [14:11:13] <^demon> RobH: Please don't tell me those PDUs were a temporary hack :p [14:11:25] if i had to choose a permanent solution i would have chosen email again without a second though [14:11:30] if we are goign to do email we should use the gateway provider smtp, and then have a secondary system for alerting if those emails queued up without result. [14:11:46] otherwise we could lead to the issue of a bunch of queued emails going to no where [14:11:50] i'm not sure what you mean re. gateway provider? [14:12:18] you mean, spence talks directly to the carrier's sms email gateway? [14:12:20] if we use company X which is a email to SMS gateway [14:12:35] we should use X's smtp, and have some system in palce that if we fail to route out to them [14:12:41] would then attempt another method of alerting us. [14:13:21] so we configure spence to mail direct instead of relaying through our systems, and we put a mailing list on spence with each person's email-to-sms-address [14:13:30] Tim's suggestion: "If you're worried about speed then maybe you should use HTTP post [14:13:31] instead of email. Clickatell provides such a service, and probably [14:13:31] lots of other SMS gateways." [14:13:33] i.e. 4154015522.sms.verizon.com or whatever [14:13:46] I don't have an email to sms address. I don't know what they are for cosmote [14:13:52] I tried looking into it once [14:13:59] came up empty-handed [14:14:08] apergos: ah, that is a problem [14:14:20] we didn't run into that since we were all on standard US carriers [14:14:23] we're an international outfit [14:14:31] we cannot rely on the carrier email to sms [14:14:39] yeah my tcom phone has one but I keep that off here for obvious reasons [14:14:42] we would have to use a service that does email to direct SMS messages [14:14:54] *tmobile [14:15:47] apergos: he, yeah this is not the place to exchange those addresses I suppose [14:16:32] no I mean [14:16:46] my tmobile phone would have serious roaming costs if I turned it on here [14:16:52] oh oh oh [14:16:56] it would be unaffordable by anyone's standards [14:17:08] so how many phones do we have that can't be messaged this way I wonder? 
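A minimal sketch of the HTTP-post route Tim suggests above; the endpoint and parameters are purely hypothetical stand-ins for whatever gateway (Clickatell or another provider) would actually be used, not that provider's real API:

    # hypothetical gateway endpoint and credentials
    curl -s --data-urlencode 'user=wmf-nagios' --data-urlencode 'password=SECRET' \
            --data-urlencode 'to=+15551234567' \
            --data-urlencode 'text=PROBLEM - LVS HTTP on bits.esams.wikimedia.org is CRITICAL' \
            https://sms-gateway.example.com/send
    # an HTTP status and response body give the delivery confirmation that plain SMTP relaying lacks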
[14:17:49] good question [14:18:06] RobH: why can't we rely on carrier email to sms? because of the phones on carriers who don't provide that? or because of some difference in expected reliablilty? [14:19:05] reliability is also an issue [14:19:17] I knonw I used to get some smses with a long delay on t-mobile [14:19:17] * Jeff_Green dies [14:19:24] * Jeff_Green dies again [14:19:25] quite variable between 5 minutes to 2 hours [14:19:53] no reasonable explanation, could have been something in the local network, could have been anything [14:20:12] but you can't really lean on the phone company because you need 5 9s reliability [14:20:16] so, we ran exactly this way for 11 years at craigslist [14:20:27] we sent literally hundreds of messages per day per phone for years [14:20:27] reachability (binary works/doesn't work) would be an issue even before reliability (delay) I guess ... (i.e. can your location reach the gateway via the internet) [14:20:46] i thought my stance was clear, and Im really sick of repeating the concerns i raised in email [14:21:16] the proper solution is two nagios instances, one to check the other is up and can send, and both with a carrier to sms gateway that uses a confirmed method of smtp transmission, or HTTP postback to confirm its being received [14:21:30] RobH: I'm sick of having a totally defective messaging system, and being blocked on that for the wrong reasons [14:21:38] i get that folks disagree and think a single nagios with email to sms is reliable, i simply dont agree [14:21:46] then get mark/ct to agree with you and do it [14:21:53] i dont make the decision, and im really sick of arguing [14:22:03] i dont even get to decide what shit i work on a daily basis. [14:22:12] ok, point taken [14:22:40] for now i have to deal with making the soution we have work, and everytime it breaks im resenting getting the third degree about it. [14:23:17] that's understandable. I can't figure out why we're blocked on the permanent fix I guess [14:23:20] sorry if im being a dick, but i have not slept over 4 hours in a night in over a week, and i have been on site every single day. [14:23:28] my degree of patience is nonexistant [14:23:30] no worries there [14:23:46] RobH: so, just for (my) clarification: you currently run one nagios instance, and if that can't reach the sms gateway, no one knows that it is broken? [14:23:48] Jeff_Green: sorry if it seems im taking it out on you, its not meant to be that way [14:23:52] I'm just trying to figure out a way to move us past the broken situation [14:24:22] from my perspective, it should take us about an hour of actual effort to retool to a functional messaging approach I'd be happy with as a permanent solution [14:25:00] meaning--an approach we used for a decade at a comparably high-volume and sensitive site and had very few issues with [14:25:12] T3rminat0r: its a single instance with sms modem attached and a backup system of watchmouse [14:25:26] so we get paged for major events no matter what but its broken and not a good long term solution as it stands [14:25:46] as the modem is dying we are relying on watchmouse, which i hate [14:25:51] (i hate relyin gon a single instance) [14:25:59] RobH: yea, single point of failure with the sms modem plus the GSM net coverage on that location [14:28:36] can we poke email addresses into nagios as messaging destinations alongside the sms stuff? why not do a mix? 
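A minimal sketch of the "why not do a mix" idea in the last line above, in Nagios 3 contact syntax; the notification command names and the email-to-SMS address are illustrative, not the definitions actually running on spence:

    define contact{
        contact_name                    ops-oncall
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,r
        service_notification_options    c,r
        ; both paths fire for every notification, so either one failing still pages
        host_notification_commands      host-notify-by-email,host-notify-by-sms-gateway
        service_notification_commands   notify-by-email,notify-by-sms-gateway
        email                           oncall@example.org
        pager                           4155551234@carrier-sms-gateway.example   ; per person, where the carrier offers it
        }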
[14:28:55] Jeff_Green: I think that ben was working on that yesterday [14:29:03] as at least a temp solution until new usb dongle [14:29:07] maybe touch base with him [14:29:13] ok [14:29:32] i'll be happy to switch to email, then we'll have two paths as long as spence/nagios itself is alive [14:29:54] !log stopping puppet runs on brewster so my hacking at the dhcpd.conf file won't get overwritten until I have it working right [14:29:56] Logged the message, RobH [14:32:44] re cp3002--it's totally calm now, was this all just varnish exploding or did something else change? [14:33:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:08] I mean, after cp3001 died, yes, it was just that varnish exploded on cp3002 [14:34:35] did cp3001 come back up? [14:34:46] i missed the event sequence [14:35:01] it looks like the explosion coincided with peak traffic in europe [14:35:05] yes, I rebooted cp3001 [14:35:08] and it came up just fine [14:35:12] i see [14:35:35] was more or less unresponsive [14:35:44] cp3001 had been dead for about 5.5h before the explosion [14:36:39] and was cp3001 back up before or after cp3002 skyrocketed? [14:37:42] !log updating dns for virt1001 testing [14:37:44] Logged the message, RobH [14:37:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.922 seconds [14:38:10] after [14:38:21] k [14:38:24] thx [14:38:39] cp3002 seems to have skyrocketed after 3001 had been down for a while, and right at peak europe traffic time [14:39:16] !log all nameservers still online after udpate [14:39:18] Logged the message, RobH [14:47:16] !log did virt1001 wrong, reupdating dns [14:47:18] Logged the message, RobH [14:47:39] if i get this to work i will roll my changes on brewster into puppet and push it live. [14:48:39] pxe arp timeout.... grrrrrr [14:50:31] sigh... why is pxe timing out on tfptp. [14:57:07] New patchset: RobH; "added in vm subnet, cleaned up file indentation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3915 [14:57:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3915 [14:58:11] so anyone wanna check that for me or shall i just review my own work? ;] [14:58:28] still not sure whats up with the pxe timeout, but i dont think its dhcp related [14:59:45] New review: RobH; "good old self review, that never causes problems right?" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3915 [14:59:47] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3915 [15:02:35] !log virt1001 pxe boots via dhcp and fails tftp download, i have to hold off on further troubleshooting until i have a network admin [15:02:37] Logged the message, RobH [15:02:44] !log brewster puppet re-enabled [15:02:47] Logged the message, RobH [15:02:59] every morning, one very tiny step in right direction on ciscos. 
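A minimal sketch of the dhcpd.conf shape the change above adds (a subnet block for the new vm range plus a host entry so virt1001 can PXE); all addresses and the MAC are hypothetical, and next-server is whatever host serves TFTP (brewster, in this log):

    subnet 10.65.0.0 netmask 255.255.255.0 {
        option routers 10.65.0.1;             # has to sit inside this subnet
        option domain-name-servers 10.0.0.11;
        next-server 10.64.0.16;               # TFTP/install server (address hypothetical)
        filename "pxelinux.0";

        host virt1001 {
            hardware ethernet 00:11:22:33:44:55;   # placeholder MAC
            fixed-address 10.65.0.11;
        }
    }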
[15:09:39] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:09:48] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:09:57] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:10:06] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:11:54] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:13:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:27] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 132 MB (1% inode=57%): [15:15:00] !log restarting lsearchd on search3 [15:15:01] Logged the message, notpeter [15:17:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.360 seconds [15:26:27] RECOVERY - Disk space on srv220 is OK: DISK OK [15:26:36] RECOVERY - Disk space on srv222 is OK: DISK OK [15:26:36] RECOVERY - Disk space on srv219 is OK: DISK OK [15:26:54] RECOVERY - Disk space on srv223 is OK: DISK OK [15:26:54] RECOVERY - Disk space on srv224 is OK: DISK OK [15:32:54] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 176 MB (2% inode=61%): /var/lib/ureadahead/debugfs 176 MB (2% inode=61%): [15:33:44] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 141 MB (1% inode=61%): /var/lib/ureadahead/debugfs 141 MB (1% inode=61%): [15:35:50] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:35:50] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): [15:37:56] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:40:02] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 80 MB (1% inode=61%): /var/lib/ureadahead/debugfs 80 MB (1% inode=61%): [15:42:00] !log finished clearning up all pmtpa search hosts. hey look! they all have lots of space now! 
[15:42:02] Logged the message, notpeter [15:46:20] RECOVERY - Disk space on srv220 is OK: DISK OK [15:48:26] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=61%): /var/lib/ureadahead/debugfs 99 MB (1% inode=61%): [15:52:38] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 200 MB (2% inode=61%): /var/lib/ureadahead/debugfs 200 MB (2% inode=61%): [15:53:32] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [15:54:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:16] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3815 [15:58:56] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [15:58:56] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%): [16:00:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.248 seconds [16:01:02] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=61%): /var/lib/ureadahead/debugfs 284 MB (3% inode=61%): [16:05:14] RECOVERY - Disk space on srv220 is OK: DISK OK [16:05:14] RECOVERY - Disk space on srv224 is OK: DISK OK [16:05:14] RECOVERY - Disk space on srv219 is OK: DISK OK [16:05:23] RECOVERY - Disk space on srv222 is OK: DISK OK [16:11:41] RECOVERY - Disk space on srv221 is OK: DISK OK [16:15:53] RECOVERY - Disk space on srv223 is OK: DISK OK [16:17:41] urgh [16:17:47] just routing the heavy ass power cables took forever [16:17:53] ^demon|away: is missed today ;] [16:19:38] * ^demon|away has class today [16:27:47] heh [16:34:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:32] New patchset: RobH; "typo correction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3919 [16:39:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3919 [16:39:53] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 58984 seconds [16:40:32] New review: RobH; "whadda mean the router has to be in the same subnet ;P" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3919 [16:40:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3919 [16:40:55] nice commit name RobH [16:40:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [16:41:46] it would be funny if it had not brought my work to a standstill earlier =P [16:43:58] huh, neat.... now the cisco has the issue that ryan was seeing. [16:44:05] its gettting dhcp, its hitting the tftp step [16:44:11] then its blanking its screen and giving no feedback. [16:44:28] so the loader must be pushing output to the wrong interfaces, lemme seee [16:44:52] oh thats right we dont update physcial console. [16:45:35] CISCO Serial Over LAN disabled [16:45:43] apergos: so you found the problem, thank you =] [16:45:53] now we are on to a different set of issues, progress! [16:46:30] yay [16:46:58] now i have to find how to enable serial over lan, which I am going to do after i eat some food. [16:47:03] so is it a matter of waiting for it to finally show up? 
[16:47:17] i saw a second dhcp request, which is the installer [16:47:19] or it just never gives you output? [16:47:34] so the installer is running on virt1001 with no way to access the serial console for output [16:47:41] major progress. [16:48:36] forgot to load insomniaX, laptop suspended on lid close =P [16:49:07] so messed up it takes a kernel hack to undo that functionality [16:50:24] uuuggghhh [16:54:38] heh, so about 4 months after you guys were here [16:55:06] i discovered the upstairs mezzanine lounge has windows + fridge + microwave [16:55:09] windows ftw. [16:55:16] the physical kind that is. [16:55:28] windows are good [16:56:20] in particular since the inside of the datacenter floor is like working in a large cave. [16:57:58] woah [16:58:00] windows in a dc ? [16:58:02] that's luxury [16:58:32] <^demon> That's the fanciest cave I've ever seen. [16:59:04] <^demon> No bears or guano. [16:59:08] it's a pretty nice cage all right [16:59:10] space age lighting [16:59:19] bending machines and microwave [16:59:53] the goddamn blue lights are stupid [16:59:53] <^demon> Bending machines? [16:59:56] <^demon> Ouch. [16:59:59] they could have put in white leds for the same thing [17:00:09] same cost, same electrical pull, more usability [17:00:17] the blue is to impress folks who dont know better. =[ [17:00:22] <^demon> Yeah, but they don't look nearly as cool ;-) [17:00:29] vending [17:00:38] the upstairs lounge has no vending. [17:00:38] for those of us with typing disabilities :-P [17:00:45] I like the blue lights [17:00:48] also no windows at ground level for security reasons, and the ones up here dont open [17:00:51] hey look, pager fixed [17:01:01] oh yay... [17:01:01] heh [17:01:06] some search poll is broken [17:01:08] pool [17:01:34] was this morning! [17:01:35] no longer [17:01:47] hahaha [17:02:13] it's really that logrotate cripples the servers [17:02:16] which is... sad [17:02:18] but so it goes [17:02:27] I think I can maybe solve that today... [17:02:46] phone on vibrate til it finishes [17:03:05] anyone want to send out a "false alarm" sms? [17:04:25] more importantly... did anyone do anything to "fix" the sms dongle? [17:04:49] notpeter: i just reseated it when I pulled info for robh [17:05:08] oh! [17:05:09] ok [17:05:59] hrmm, i bet it can be fixed if there was a way to use the os to power down the usb port. [17:06:06] then just have it do that every 24 hours. [17:07:53] !log disabling notifications for search lvs nagios checks for 24 hours to test fix [17:07:56] Logged the message, notpeter [17:08:06] meh, quick googling makes it seem difficult. [17:09:33] PROBLEM - Lucene on search3 is CRITICAL: Connection refused [17:09:41] New patchset: preilly; "add comment to Opera browser check and XFF client IP replacement and change header name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3922 [17:09:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3922 [17:11:04] New patchset: preilly; "add comment to Opera browser check and XFF client IP replacement and change header name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3923 [17:11:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3923 [17:13:54] PROBLEM - Host sanger is DOWN: CRITICAL - Host Unreachable (208.80.152.187) [17:15:11] uh oh. 
[17:15:25] New review: preilly; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3898 [17:15:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:50] today is decidedly a not-awesome day [17:15:51] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [17:16:13] RobH: what all runs on sanger? [17:16:59] imap [17:17:24] oh [17:17:26] pppfff [17:17:28] that's fine [17:17:45] that's just mar_k and ariel's email [17:17:50] they need to get with the cloud [17:17:55] apergos may disagree [17:18:02] and i hate our gmail [17:18:04] * apergos stabs notpeter [17:18:07] i want our open source mail to keep working. [17:18:07] =P [17:18:11] yes. I disagree. :-P [17:18:12] yes [17:18:25] but here we are... [17:18:35] so, it's dead dea [17:18:36] d [17:18:40] anyone want to reboot? [17:18:42] someone should hop on its mgmt and find out why. [17:18:46] okok I'll look at it [17:18:52] not me, im eating lunch and not working for a minimum of 15 minutes. [17:19:17] apergos: it's not responding to pings or ssh... [17:19:21] safe to just reboot, I'd say [17:19:29] unless the console has something exciting to say [17:19:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.102 seconds [17:19:53] I'm on it and looking to see if there's any messages [17:20:02] also pull racadm getsel [17:20:08] not so much there aren't [17:20:10] gives you the service event log [17:20:48] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [17:20:59] last thing is from 2011 [17:21:08] Description: System Board PS Redundancy: PS redundancy sensor for System Board, redundancy regained [17:21:14] so gonna call that a loss [17:21:35] gonna powercycle it [17:21:39] speak now or else. [17:21:41] New review: preilly; "As, requested by Ben Hartshorne I've added a comment before the XFF head..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3898 [17:21:46] go for it [17:22:23] did so [17:22:46] watching it boot [17:22:54] New review: preilly; "This is basically a copy of:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3922 [17:24:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3922 [17:24:27] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3922 [17:25:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3923 [17:25:15] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3923 [17:29:03] hmm [17:29:51] hung [17:29:57] * Starting web server apache2 [ OK ] [17:30:00] that's the last message I got [17:30:48] nothing exceptional on bootup [17:31:01] a whine from gmond about a missing conf file [17:31:19] a warning from dovecot that the fd limit could be raised [17:31:20] that's it [17:32:08] gmond--i think that's related to changes in ubuntu ganglia packages [17:33:14] shouldn't hang the box though [17:33:23] and yet. [17:33:26] no it shouldn't [17:33:56] where I've seen that happen I've had better luck with an alternate init scrip, i think it was ganglia-monitor (?) 
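A minimal sketch of the DRAC-side checks mentioned earlier in this exchange ("pull racadm getsel"); run remotely against sanger's management controller with a hypothetical address and credentials, or locally from the DRAC's own ssh session without the -r/-u/-p options:

    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET getsel          # service event log
    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET getsysinfo      # firmware, NICs, power state
    racadm -r sanger.mgmt.example.wmnet -u root -p SECRET serveraction powercycle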
[17:34:52] given that I can't get on the box, it makes it a bit harder [17:35:06] apache is the last started and it claims to beok [17:35:13] so either the next things doesn't even get out the door [17:35:25] os some previous thing fries it [17:35:47] not even pingable [17:35:52] did you reboot it? that's my vote all along [17:36:00] that's what I did first [17:36:05] watched it boot up [17:36:08] with the results reported here [17:38:03] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Thu Mar 29 17:38:01 UTC 2012 [17:38:07] any thoughts? [17:40:31] preilly: http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=cpu_report&s=by+name&c=Mobile+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 is what I'm watching. [17:41:31] apergos: if it's not pingable that means net isn't loaded or it's not working. That should happen long before something like apache or gmond, meaning your problem's not in the tail end of the boot cycle. [17:41:55] I admit to be less than psyched about booting single user and etc in my evening [17:42:02] servesme right for hoping to fix anything fast [17:42:07] is it pingable during part of the boot process? i.e. reboot again and keep a ping running. if it shows up then goes away again, that hurts. If it never shows up, blame the network. [17:42:25] oh, single user? it may not be loading the network. [17:42:37] I have not booted singel user [17:42:42] but logically that would be the next step [17:43:00] I am happy to reboot it and try pinging during the duration however [17:48:45] nada [17:48:57] here we are after "startin webserver apache" [17:49:03] apache2. whatever [17:49:08] it never responded to its ping? [17:49:09] no ping the entire time [17:49:10] nope [17:49:14] blame the network. [17:49:18] great [17:49:22] IP conflict? [17:49:24] you could ask leslie to check the switch or ask chris to check the cable [17:49:30] n/m no then you'd get a pig [17:49:31] ping [17:49:36] typing is hrad. [17:49:41] yes it is [17:49:43] s/h// [17:49:53] a pig [17:49:54] (if you're typing in the 80s) [17:50:06] * apergos thinks about pigs [17:50:09] hehe [17:50:13] on the wing? [17:50:13] nom nom nom ? [17:50:19] I am hungry [17:50:23] ur pigs were blue uniforms [17:50:25] *our [17:50:28] is the mail server down? [17:50:35] (http://en.wikipedia.org/wiki/Pigs_on_the_Wing) [17:50:39] yes, if you have an imap account [17:50:39] so sanger's console isn't working ? [17:50:43] nimish_g: dunno. are you not getting mail? [17:50:45] ;) [17:50:45] nimish_g: give into the gmail ... [17:50:49] the console hangs after the [17:50:57] * Starting web server apache2 [ OK ] [17:50:57] NEVAAR!! [17:51:00] nothing else shows up. [17:51:10] it's never pingable durin boot. [17:51:32] there were no messages available on the console when it went out to lunch (so that it needed to be rebooted). [17:51:44] apergos: which server on you having a problem connecting to? [17:51:47] sanger [17:51:55] maplebed: yeah, thunderbird tells me the imap server is down. that or no one likes me and hasn't sent me email all day [17:52:11] you should be so lucky [17:52:18] * apergos forwards nimish all their ops cron spam  [17:52:21] let me check cables [17:52:24] ok [17:53:01] !log search1021 coming down for ssd fit test [17:53:03] Logged the message, RobH [17:54:14] sanger's not going up (network port isn't going up and down) [17:54:27] nimish_g: well we were going to wait to tell you ... 
[17:54:31] see we fixed the glitch [17:54:45] +1 LeslieCarr [17:55:00] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:19] now the port is up [17:55:21] apergos: network cable was not fully engage [17:55:25] huh [17:55:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:28] wonder how that happened [17:55:31] did cmjohnson1 fix it ? [17:55:31] responds to a ping! [17:55:34] yay cmjohnson1 [17:55:35] yes [17:55:37] yay cmjohnson1 !!! [17:55:45] thaat is it, it's pinging now [17:56:03] RECOVERY - Host sanger is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:56:08] how long has it been down? [17:56:35] 40 mins? [17:57:15] apergos: most likely my fault...i was over there picking apart dataset1 [17:57:20] I wonder why it wasn't responding on mgmt console before though [17:57:35] hey, now you save some of ds1 for us so we can destroy it properly [17:57:38] no idea...the cabling on that rack is a disgusting mess [17:58:00] the mgmt console part is weird [17:58:34] i went ahead and replaced that cable...i thought that was the problem [17:59:40] Mar 29 17:11:10 sanger kernel: [5598778.468228] bnx2: eth0 NIC Copper Link is Down [17:59:46] Mar 29 17:12:00 sanger nagios3: SERVICE ALERT: gateway;PING;CRITICAL;SOFT;1;PING CRITICAL - Packet loss = 100% [17:59:59] !log search1021 coming back up, done with tests [18:00:00] Logged the message, RobH [18:00:18] so it was there and should have rsponded with login prompt on console, no idea why it didn't [18:00:21] anyways, thanks for fixing [18:00:24] PROBLEM - LDAP on sanger is CRITICAL: Connection refused [18:00:42] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused [18:00:47] oh come on [18:00:55] I just cleared off of there too [18:01:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.032 seconds [18:02:08] anyone want to take over ? [18:03:51] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [18:04:35] New patchset: Jgreen; "granting zexly non-root shell access on storage3 for banner impression logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3927 [18:04:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3927 [18:05:13] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3927 [18:05:16] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3927 [18:05:30] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 95 MB (1% inode=61%): /var/lib/ureadahead/debugfs 95 MB (1% inode=61%): [18:10:54] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%): /var/lib/ureadahead/debugfs 278 MB (3% inode=61%): [18:12:51] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [18:16:27] so seriously. 
opendj is running on sanger, I even restarted it for goodmeasure [18:22:07] RECOVERY - Disk space on srv224 is OK: DISK OK [18:22:16] RECOVERY - Disk space on srv223 is OK: DISK OK [18:23:37] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [18:23:37] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [18:26:55] PROBLEM - NTP on sanger is CRITICAL: NTP CRITICAL: Offset unknown [18:30:00] grrr [18:35:37] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [18:35:37] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [18:36:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:25] RECOVERY - NTP on sanger is OK: NTP OK: Offset 0.06688630581 secs [18:40:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.163 seconds [18:44:05] LeslieCarr is kicking opendj again [18:47:09] it looks like there's iptables rules that didn't start up for some reason [18:48:25] maybe it sohuld get rebooted now that there is a network link [18:48:35] weird things can happen if the network is unavailable during boot [18:52:26] ahha, maplebed figured it out [18:52:42] there's a template for the init script … and the addresses aren't getting sourced properly in puppet [18:52:52] and because it wasn't restarted forever, we didn't realize what would happen [18:53:01] restarting opendj [18:53:55] and it didn't do what I expected. [18:53:59] grumble. [18:54:31] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.007 second response time on port 636 [18:54:32] oh wow [18:54:50] restartingc opendj again [18:54:52] sleeping bug waiting to break [18:55:11] looks like it was part of getting ldap to work right in labs [18:55:34] RECOVERY - LDAP on sanger is OK: TCP OK - 0.000 second response time on port 389 [18:55:50] hey! [18:55:55] nagios thinks its working. [18:57:07] and mchenry thinks it is as well [18:57:29] test mail sent and recieved. [18:57:32] it's working. [18:57:39] ... until we reboot sanger again. [18:57:40] \o/ [19:01:41] RobH: you still need the template updates for Bz from me, right? [19:02:31] I should look at puppet... LeslieCarr did give me that quick overview. I mean... how hard can it be? [19:02:41] hehehe [19:02:44] * hexmode hears distant mocking laughter [19:02:50] hexmode: i recommend just looking at one specific thing [19:03:00] like look through one machines' chain of information [19:03:04] browsing files is info overload [19:03:49] LeslieCarr: sure... is there a base image to use? Something that I can say "this + install bz package + these templates" [19:04:00] 'cause that seems the thing to do [19:05:08] "standard" [19:05:15] is the normal image to start with [19:06:34] easy enough [19:07:17] hexmode: I am on site @ eqiad today, so I doubt I will be getting to anything software related today. [19:07:23] LeslieCarr: I'll probably have more questions when the labs instance i make fails. [19:07:27] RobH: np [19:07:29] by the time i leave here, i am pretty well exhausted [19:07:31] =P [19:07:40] hehe [19:07:45] hexmode: heh, +1 to labs bz instance [19:07:50] cuz if you edidnt do it, i was gonna have to [19:07:51] did you drag demon with you again ? [19:07:59] since the change you guys asked for will require using custom templates [19:08:09] and hacking at those on the live bz instance is not recommended. [19:08:27] RobH: no doubt. 
why can't computers be easy? [19:08:43] this is 2012 -- they should read my mind [19:16:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [19:57:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:34] !log restarting lsearchd on search7 to del the logfile to end all logfiles [20:02:35] Logged the message, notpeter [20:04:28] how large was that one? [20:05:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [20:07:40] !log restarting lsearchd on search2 to del the logfile to end all logfiles [20:07:42] Logged the message, notpeter [20:21:06] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=61%): /var/lib/ureadahead/debugfs 277 MB (3% inode=61%): [20:25:18] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%): /var/lib/ureadahead/debugfs 278 MB (3% inode=61%): [20:35:48] RECOVERY - Disk space on srv222 is OK: DISK OK [20:35:57] RECOVERY - Disk space on srv223 is OK: DISK OK [20:37:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:43:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds [20:56:48] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 137 MB (1% inode=61%): /var/lib/ureadahead/debugfs 137 MB (1% inode=61%): [21:00:15] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 189 seconds [21:00:51] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 214 seconds [21:01:09] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 223 seconds [21:01:54] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 242 seconds [21:02:57] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [21:03:15] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [21:04:00] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [21:04:27] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [21:09:42] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 137 MB (1% inode=61%): /var/lib/ureadahead/debugfs 137 MB (1% inode=61%): [21:13:45] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=61%): /var/lib/ureadahead/debugfs 135 MB (1% inode=61%): [21:17:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:03] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [21:20:48] !log rebooting db47 [21:20:50] Logged the message, Mistress of the network gear. 
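The diagnostic suggested for sanger earlier in the log, reboot the box and keep a ping running the whole time, is worth spelling out: if the host answers and then drops off, the problem is somewhere in the boot sequence; if it never answers at all, blame the network and get the switch port and cable checked, which is exactly how the half-seated cable was found. Below is a minimal Python sketch of that kind of watch; the polling interval and duration are made-up defaults, and it simply shells out to the ordinary Linux ping.

#!/usr/bin/env python3
"""Watch whether a host ever answers ping while it reboots.

A rough sketch only: the polling interval and duration below are made-up
defaults for illustration, not anything the ops team actually runs.
"""
import subprocess
import sys
import time


def pingable(host):
    # One ICMP echo with a 1 second timeout (Linux iputils ping flags).
    return subprocess.call(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ) == 0


def watch(host, duration=600, interval=2):
    """Print every up/down transition seen over `duration` seconds."""
    state = None
    deadline = time.time() + duration
    while time.time() < deadline:
        up = pingable(host)
        if up != state:
            print("%s %s is %s" % (time.strftime("%H:%M:%S"), host,
                                   "pingable" if up else "not pingable"))
            state = up
        time.sleep(interval)
    return state


if __name__ == "__main__":
    # Run from another machine, then reboot the target, e.g.:
    #   python3 ping_watch.py sanger
    if not watch(sys.argv[1]):
        print("never (or no longer) pingable: suspect the switch port or cable")

Run from another machine while the target reboots, a watcher like this makes the "never pingable at any point during boot" case obvious, which points at the link itself rather than at late-boot services like apache.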
[21:23:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [21:25:18] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 1 seconds [21:25:36] RECOVERY - Host db47 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [21:26:21] RECOVERY - Disk space on srv220 is OK: DISK OK [21:26:30] RECOVERY - Disk space on srv219 is OK: DISK OK [21:26:39] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [21:30:15] PROBLEM - MySQL Recent Restart on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:15] PROBLEM - RAID on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:24] PROBLEM - MySQL disk space on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:42] PROBLEM - SSH on db47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:52] PROBLEM - mysqld processes on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - MySQL Slave Running on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - Disk space on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:09] PROBLEM - MySQL Idle Transactions on db47 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:18] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: Connection refused by host [21:31:18] PROBLEM - DPKG on db47 is CRITICAL: Connection refused by host [21:31:18] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: Connection refused by host [21:32:12] RECOVERY - MySQL Recent Restart on db47 is OK: OK seconds since restart [21:32:12] RECOVERY - RAID on db47 is OK: OK: State is Optimal, checked 2 logical device(s) [21:32:21] RECOVERY - MySQL disk space on db47 is OK: DISK OK [21:32:22] hah now db47 gives messages [21:32:25] thanks for nothing nagios :p [21:32:39] RECOVERY - SSH on db47 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:33:06] RECOVERY - MySQL Slave Running on db47 is OK: OK replication [21:33:06] RECOVERY - MySQL Idle Transactions on db47 is OK: OK longest blocking idle transaction sleeps for seconds [21:33:06] RECOVERY - Disk space on db47 is OK: DISK OK [21:33:15] RECOVERY - MySQL Slave Delay on db47 is OK: OK replication delay seconds [21:33:15] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay seconds [21:33:15] RECOVERY - DPKG on db47 is OK: All packages OK [21:55:17] RECOVERY - mysqld processes on db47 is OK: PROCS OK: 1 process with command name mysqld [21:58:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:11] PROBLEM - MySQL Replication Heartbeat on db47 is CRITICAL: CRIT replication delay 33127 seconds [21:59:29] PROBLEM - MySQL Slave Delay on db47 is CRITICAL: CRIT replication delay 32922 seconds [22:04:43] New patchset: Lcarr; "decommissioning old servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3950 [22:04:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3950 [22:05:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.022 seconds [22:15:59] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 36 MB (0% inode=61%): /var/lib/ureadahead/debugfs 36 MB (0% inode=61%): [22:17:18] New patchset: Lcarr; "deleting br1-knams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3952 [22:17:29] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [22:17:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3952 [22:17:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3950 [22:17:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3950 [22:18:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3952 [22:18:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3952 [22:30:41] RECOVERY - Disk space on srv219 is OK: DISK OK [22:33:32] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [22:34:19] New patchset: Lcarr; "TESTING seeing if relative variables make a difference to icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3954 [22:34:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3954 [22:35:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3954 [22:35:02] RECOVERY - MySQL Slave Delay on db47 is OK: OK replication delay 0 seconds [22:35:02] RECOVERY - MySQL Replication Heartbeat on db47 is OK: OK replication delay 0 seconds [22:35:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3954 [22:39:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.715 seconds [22:56:34] New patchset: Lcarr; "TESTING icinga stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3956 [22:56:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3956 [22:57:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3956 [22:57:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3956 [23:10:53] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=61%): /var/lib/ureadahead/debugfs 274 MB (3% inode=61%): [23:21:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:25:35] RECOVERY - Disk space on srv223 is OK: DISK OK [23:25:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.085 seconds [23:27:33] AaronSchulz: have you turned off the thumbnail writes on the copper cluster while you're testing originals (so it would be MW writing the thumbnail, not swift)? [23:28:09] I haven't touched rewrite.py [23:28:24] AaronSchulz: it's configured in the proxy config file, not in rewrite. [23:29:34] ok. [23:29:37] it'd be interesting to try that. 
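The db47 readings above, "CRIT replication delay 33127 seconds" right after the reboot and a recovery to 0 once the replica caught up, come from checks that turn a measured lag into Nagios OK/WARN/CRIT states. Below is a minimal sketch of that threshold logic; it reads Seconds_Behind_Master from SHOW SLAVE STATUS, whereas the "MySQL Replication Heartbeat" checks seen in this log derive the lag from a heartbeat row written on the master. The pymysql dependency, the credentials file and the warn/crit values are assumptions for illustration.

#!/usr/bin/env python3
"""Nagios-style replication lag check: a sketch, not the production plugin.

Reads Seconds_Behind_Master from SHOW SLAVE STATUS; the "MySQL Replication
Heartbeat" checks in this log instead derive lag from a heartbeat row written
on the master. The pymysql dependency, the ~/.my.cnf credentials file and the
warn/crit thresholds are assumptions for illustration.
"""
import os
import sys

import pymysql

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def replication_delay(host):
    """Return lag in seconds, or None if the host is not replicating."""
    conn = pymysql.connect(host=host,
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()
    if not row:
        return None                          # not configured as a replica
    return row["Seconds_Behind_Master"]      # None while a thread is stopped


def main(host, warn=120, crit=300):
    delay = replication_delay(host)
    if delay is None:
        print("UNKNOWN replication not running")
        return UNKNOWN
    if delay >= crit:
        print("CRIT replication delay %d seconds" % delay)
        return CRITICAL
    if delay >= warn:
        print("WARN replication delay %d seconds" % delay)
        return WARNING
    print("OK replication delay %d seconds" % delay)
    return OK


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))

The five-digit delay on db47 immediately after its reboot is simply the replica reporting how far behind it is while it replays what it missed, which is why the same checks recover on their own at 22:35 without further intervention.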
[23:29:56] I want to know if the Traceback that's showing up in /var/log/syslog disappears when the swift writingc is turned off. [23:30:14] * AaronSchulz hasn't touched that config [23:30:38] no should be hitting it atm [23:33:48] *no one [23:33:49] arg [23:51:42] New patchset: Bhartshorne; "tersifying nagios alert messaging for email->sms messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3961 [23:51:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3961 [23:53:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3961 [23:53:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3961
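The last change merged above, "tersifying nagios alert messaging for email->sms messages", is about making notifications survive SMS length limits. The actual patchset lives in operations/puppet and edits the notification commands; the snippet below is only a rough illustration of the idea, keeping state, host and service while trimming the plugin output to fit a single 160-character message.

#!/usr/bin/env python3
"""Rough illustration of "tersifying" a Nagios notification for SMS.

Not the operations/puppet change merged above (that edits the puppet-managed
notification commands); this only shows the idea: keep state, host and
service, and trim the plugin output so the whole message fits in one
160-character SMS.
"""
SMS_LIMIT = 160


def terse(host, service, state, output, limit=SMS_LIMIT):
    head = "%s %s/%s: " % (state, host, service)
    if len(head) + 3 >= limit:      # pathological: no room for any output
        return head[:limit]
    room = limit - len(head)
    if len(output) > room:
        output = output[:room - 3] + "..."
    return head + output


if __name__ == "__main__":
    print(terse("stafford", "Puppetmaster HTTPS", "CRITICAL",
                "CRITICAL - Socket timeout after 10 seconds"))

For the stafford alert seen throughout this log the result is "CRITICAL stafford/Puppetmaster HTTPS: CRITICAL - Socket timeout after 10 seconds", already short enough that nothing needs to be dropped.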
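Back at 23:29 the open question was whether the Traceback showing up in /var/log/syslog disappears once swift is no longer writing thumbnails. That comes down to counting tracebacks before and after the config change; a small sketch follows, in which the "proxy-server" string used to pick out swift proxy lines is an assumption rather than something taken from the log.

#!/usr/bin/env python3
"""Count Python tracebacks in /var/log/syslog, for a before/after comparison
around a config change such as disabling swift thumbnail writes.

The logfile path comes from the discussion above; the "proxy-server" string
used to pick out swift proxy lines is an assumption about how that daemon
tags its syslog output.
"""
import re
import sys

LOGFILE = "/var/log/syslog"
# syslog lines start with something like "Mar 29 23:29:56 somehost proxy-server: ..."
TIMESTAMP = re.compile(r"^(\w{3}\s+\d+\s[\d:]{8})")


def traceback_times(path=LOGFILE, process="proxy-server"):
    """Return the syslog timestamps of lines mentioning a Traceback."""
    times = []
    with open(path, errors="replace") as log:
        for line in log:
            if "Traceback" in line and process in line:
                match = TIMESTAMP.match(line)
                times.append(match.group(1) if match else "?")
    return times


if __name__ == "__main__":
    hits = traceback_times(*sys.argv[1:])
    print("%d tracebacks" % len(hits))
    if hits:
        print("first: %s  last: %s" % (hits[0], hits[-1]))

Run it once while thumbnail writes are still enabled and again after they are turned off in the proxy config; if the count stops growing, that supports the idea that the traceback was coming from the swift write path.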