[00:06:49] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[00:11:01] RECOVERY - LVS HTTP on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.106 second response time
[00:14:18] i'm about to do some bgp changes that should be seamless in eqiad but if something goes wrong will suck - doing a 5 minute rollback timer in case
[00:25:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.939 seconds
[00:48:04] PROBLEM - NTP on sq34 is CRITICAL: NTP CRITICAL: No response from NTP server
[00:48:49] LeslieCarr: can you do me a huge favor and flush the mobile varnish cache (http://wikitech.wikimedia.org/view/MobileFrontend#Flushing_the_cache)
[00:48:58] okay awjr
[00:49:04] u r my hero LeslieCarr
[00:50:11] awjr: look ok ?
[00:51:37] LeslieCarr no but… it might not be your fault. hang on
[00:53:25] LeslieCarr: we're still seeing stale pages
[00:53:50] hrm
[00:53:53] LeslieCarr did you purge or flush?
[00:53:54] just redid it
[00:53:57] dsh -g mobile "varnishadm ban.url . ; varnishadm -n frontend ban.url ." ?
[00:54:06] yeah that looks right
[00:55:04] LeslieCarr: on Fenari?
[00:55:08] yeah
[00:55:25] LeslieCarr: and that is the command that you just ran?
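The one-liner quoted above fans the Varnish ban out over every host in the "mobile" dsh group, hitting both the backend instance and the `-n frontend` instance on each. A minimal sketch of that pattern (the `DRY_RUN` guard and the wrapper function are illustrative additions, not part of the wikitech procedure):

```shell
# Sketch of the mobile Varnish cache flush discussed above.
# "ban.url ." bans every cached object, since the regex "." matches all URLs.
# DRY_RUN=1 only prints the command instead of running dsh for real.
DRY_RUN=1

flush_mobile_cache() {
    cmd='varnishadm ban.url . ; varnishadm -n frontend ban.url .'
    if [ "$DRY_RUN" = 1 ]; then
        echo "dsh -g mobile \"$cmd\""
    else
        dsh -g mobile "$cmd"
    fi
}

flush_mobile_cache
```

Note the log's follow-up: the second run needed `-M` so varnishadm talked to the management port with authentication; the exact invocation is in the wikitech page linked above.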
[00:55:32] yes
[00:55:45] the 2nd time with "-M"
[00:57:39] LeslieCarr you're the best thanks - it was looking wonky because i forgot to push a particular file :$
[00:57:55] ah
[00:58:05] glad it's working out now
[01:04:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:11:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.645 seconds
[01:46:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.113 seconds
[02:17:30] PROBLEM - mysqld processes on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:17:48] PROBLEM - Full LVS Snapshot on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:17:57] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:18:24] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:18:24] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:18:33] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:19:27] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[02:19:45] RECOVERY - Full LVS Snapshot on db1047 is OK: OK no full LVM snapshot volumes
[02:19:54] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[02:20:21] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 271886 seconds since restart
[02:20:21] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[02:20:30] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 2 logical device(s)
[02:25:36] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 313 seconds
[02:25:36] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 313 seconds
[02:28:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:34:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.034 seconds
[02:36:06] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[02:36:06] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[02:41:03] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Mar 27 02:40:53 UTC 2012
[03:09:42] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 317 seconds
[03:09:51] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 325 seconds
[03:19:18] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:27] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:36] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:30] PROBLEM - DPKG on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:39] PROBLEM - mysqld processes on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:20:57] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:06] PROBLEM - Full LVS Snapshot on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:23:03] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[03:23:21] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 275672 seconds since restart
[03:23:30] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 2 logical device(s)
[03:25:18] RECOVERY - Full LVS Snapshot on db1047 is OK: OK no full LVM snapshot volumes
[03:26:57] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[03:27:51] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[03:28:45] RECOVERY - DPKG on db1047 is OK: All packages OK
[04:02:11] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[04:02:38] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 1 seconds
[04:27:14] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[04:53:22] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3596 MB (3% inode=99%):
[04:55:46] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3593 MB (3% inode=99%):
[05:10:28] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4301 MB (3% inode=99%):
[05:47:34] PROBLEM - Puppet freshness on search6 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:34] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:48:28] PROBLEM - Puppet freshness on search1016 is CRITICAL: Puppet has not run in the last 10 hours
[05:59:52] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4297 MB (3% inode=99%):
[06:02:34] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[06:02:34] PROBLEM - Puppet freshness on search1006 is CRITICAL: Puppet has not run in the last 10 hours
[06:08:52] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[06:13:31] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[06:13:31] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[06:17:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 188 MB (2% inode=61%): /var/lib/ureadahead/debugfs 188 MB (2% inode=61%):
[06:21:28] RECOVERY - Disk space on srv223 is OK: DISK OK
[06:25:41] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[06:25:41] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[06:27:38] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3585 MB (3% inode=99%):
[06:38:17] PROBLEM - Host emery is DOWN: CRITICAL - Host Unreachable (208.80.152.184)
[06:47:16] PROBLEM - Host kaulen is DOWN: CRITICAL - Host Unreachable (208.80.152.149)
[06:50:52] RECOVERY - Host kaulen is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[06:59:02] !log powercycled emery, it was unresponsive via the mgmt console and not pingable
[06:59:28] no bot either, I'll log it by hand later
[07:00:01] RECOVERY - Host emery is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms
[07:01:47] yeah it's not really up yet, it's doing a fsck
[07:04:40] PROBLEM - SSH on emery is CRITICAL: Connection refused
[07:04:58] PROBLEM - DPKG on emery is CRITICAL: Connection refused by host
[07:05:07] PROBLEM - udp2log log age on emery is CRITICAL: Connection refused by host
[07:05:25] PROBLEM - udp2log processes on emery is CRITICAL: Connection refused by host
[07:05:25] PROBLEM - Disk space on emery is CRITICAL: Connection refused by host
[07:06:19] PROBLEM - RAID on emery is CRITICAL: Connection refused by host
[07:14:42] either these are the world's slowest fscks ever or there's something more wrong
[07:14:49] but I'm going to let it go for a while yet
[07:21:39] and patience pays off...
[07:21:46] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[07:22:04] RECOVERY - Disk space on emery is OK: DISK OK
[07:22:04] RECOVERY - udp2log processes on emery is OK: OK: all filters present
[07:23:31] RECOVERY - RAID on emery is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[07:23:55] ah. kswapper, bxn kernel stacktraces
[07:23:58] RECOVERY - SSH on emery is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:24:01] *bnx
[07:24:16] RECOVERY - DPKG on emery is OK: All packages OK
[07:25:31] yep, page alloc failures
[07:30:17] New patchset: ArielGlenn; "avoid bnx2 alloc issues on emery tweaking vm.min_free_kbytes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3792
[07:30:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3792
[07:31:38] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3792
[07:31:40] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3792
[07:38:09] ok, tweaked vm.min_free_kbytes on emery
[07:47:18] oh my
[07:47:19] puppet
[07:47:26] does not reload irc bots :-D
[07:48:04] no. I do.
[07:48:30] I have made change 2675 to kick out nagios-wm from #wikimedia-tech .
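The emery fix above raises `vm.min_free_kbytes`, the kernel's reserved-free-memory floor, so that atomic allocations made by the bnx2 NIC driver in interrupt context stop failing under memory pressure (the "page alloc failures" seen in the stack traces). A minimal sketch of persisting such a setting; the value and file path here are illustrative assumptions, not the numbers from the actual puppet change (gerrit 3792):

```shell
# Write a sysctl.d-style snippet (roughly what a puppet-managed sysctl
# fragment would produce). Using a temp path as a stand-in for
# /etc/sysctl.d/ so this sketch does not need root; value is illustrative.
CONF="${TMPDIR:-/tmp}/60-bnx2-alloc.conf"
echo 'vm.min_free_kbytes = 262144' > "$CONF"

# Applying it live would be:
#   sysctl -p "$CONF"          # or: sysctl -w vm.min_free_kbytes=262144
cat "$CONF"
```

A larger reserve gives the page allocator headroom for GFP_ATOMIC requests that cannot sleep or reclaim, at the cost of slightly less memory for everything else.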
[07:48:35] leslie merged it yesterday
[07:48:36] https://gerrit.wikimedia.org/r/#patch,unified,2675,2,manifests/site.pp
[07:48:46] I see
[07:48:46] nagios-wm live on spence.wikimedia.org
[07:48:58] well I have a different task
[07:49:15] I get to walk through this channel and wikitech, find all the stuff people thought they were logging, and actually log it
[07:49:26] it is not a big deal, it will be eventually reloaded one day :)
[07:50:09] logging using something like a !log command in the irc channels ?
[07:50:14] yes
[07:50:25] sounds like a task for perl :D
[07:50:39] yeah but my logs have a pile of html and other crap in them
[07:50:49] get the .txt ones from http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/
[07:51:03] ok so
[07:51:18] do the lab bots run in here? or not?
[07:51:26] I tried to find something about that and failed
[07:51:57] we have so many bots :/
[07:52:07] uh huh
[07:52:13] but this is a very simple question:
[07:52:25] is there an instance of morebots that runs from labs
[07:52:28] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100%
[07:52:32] or do we still use the copy on wikitech?
[07:53:04] unfortunately, to find that out I likely have to ask someone else who isn't here, like with almost everything else
[07:53:13] :-]
[07:53:23] maybe petan|wk is aware about that
[07:53:49] RECOVERY - Host cp1017 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms
[07:54:17] petan's bot is wm-bot
[07:54:24] wm-bot: hello you
[07:59:47] !log archived old server admin logs since the old page was too long for my connection to download :-/
[08:00:03] you were archiving it while I was editing it
[08:00:08] my changes are gone now
[08:00:11] oh sorry :-/
[08:00:27] should have warned :-((((((((((((
[08:00:32] would you mind not archiving the current log? at least
[08:00:37] keep the last two weeks or something
[08:00:49] sure
[08:00:54] let me copy paste
[08:00:55] or all of march. whatever
[08:01:59] thanks
[08:03:00] New review: Hashar; "now we need to have nagios-wm bot to be reload on spence." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675
[08:05:13] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:05:31] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:05:39] I see that your attempt to log failed btw
[08:05:47] means the bot is still not working
[08:06:06] I have made one from fenari using a personal script
[08:06:19] I'm adding your log entry
[08:06:32] that uses /home/wikipedia/bin/dologmsg on fenari
[08:06:56] which sends a message to port 53412 , probably a listener that writes to a file somewhere
[08:07:55] PROBLEM - Full LVS Snapshot on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:55] PROBLEM - DPKG on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:40] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:44] apergos: https://wikitech.wikimedia.org/view/Logmsgbot
[08:08:49] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:49] PROBLEM - RAID on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:53] that is logmsgbot which died somehow
[08:09:17] morebots should actually log
[08:09:28] I will look at it in a minute as soon as I get the rest of the missing entries in
[08:09:43] PROBLEM - mysqld processes on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:52] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:49] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[08:14:58] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 293163 seconds since restart
[08:14:58] RECOVERY - RAID on db1047 is OK: OK: State is Optimal, checked 2 logical device(s)
[08:15:06] New patchset: Hashar; "support project wildcard for irc notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3528
[08:15:18] !log test you silly morebot
[08:15:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3528
[08:15:20] Logged the message, Master
[08:15:27] \o/
[08:15:29] yay
[08:15:34] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[08:15:52] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[08:15:52] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[08:16:01] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[08:16:10] RECOVERY - Full LVS Snapshot on db1047 is OK: OK no full LVM snapshot volumes
[08:16:10] RECOVERY - DPKG on db1047 is OK: All packages OK
[08:19:19] New patchset: Hashar; "use wildcards for gerrit IRC notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3529
[08:19:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3529
[08:20:04] New review: Hashar; "I have rebased change." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3528
[08:20:13] New review: Hashar; "I have rebased change." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3529
[08:24:45] New patchset: Hashar; "gerrit IRC bot now join #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3530
[08:24:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3530
[08:25:09] New review: Hashar; "rebased change" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3530
[08:26:20] New patchset: Hashar; "analytics/integration IRC notificiation in #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3531
[08:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3531
[08:26:38] New review: Hashar; "rebased change" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3531
[08:27:19] Nowadays, I just feel like I am spending my life in Gerrit
[08:41:14] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:43:11] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[09:10:11] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:39] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:32:39] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:37:00] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%):
[09:43:18] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 215 MB (3% inode=62%): /var/lib/ureadahead/debugfs 215 MB (3% inode=62%):
[09:51:33] RECOVERY - Disk space on srv224 is OK: DISK OK
[09:51:42] RECOVERY - Disk space on srv223 is OK: DISK OK
[09:51:51] RECOVERY - Disk space on srv222 is OK: DISK OK
[09:51:51] RECOVERY - Disk space on srv219 is OK: DISK OK
[10:01:00] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100%
[10:08:21] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:05:21] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[12:07:18] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.036 seconds response time. www.wikipedia.org returns 208.80.154.225
[12:09:42] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 98 MB (1% inode=61%): /var/lib/ureadahead/debugfs 98 MB (1% inode=61%):
[12:09:42] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 252 MB (3% inode=61%): /var/lib/ureadahead/debugfs 252 MB (3% inode=61%):
[12:20:12] RECOVERY - Disk space on srv219 is OK: DISK OK
[12:26:39] RECOVERY - Disk space on srv223 is OK: DISK OK
[13:46:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 10, down: 1, shutdown: 0BRPeering with AS1257 not established - The + flag cannot be used with the sub-query features described below.BR
[14:02:09] PROBLEM - MySQL Replication Heartbeat on db16 is CRITICAL: CRIT replication delay 185 seconds
[14:02:27] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 187 seconds
[14:03:53] there is a little bug with puppet bot
[14:04:08] it sends log messages to labs by default cause a change has not been merged yet
[14:16:06] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:29:09] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours
[14:37:15] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:57:30] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Tue Mar 27 14:57:17 UTC 2012
[14:58:24] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Tue Mar 27 14:58:03 UTC 2012
[15:19:02] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 28 seconds
[15:20:32] RECOVERY - MySQL Replication Heartbeat on db16 is OK: OK replication delay 21 seconds
[15:32:50] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 10, down: 0, shutdown: 1
[15:45:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 10, down: 1, shutdown: 0BRPeering with AS1257 not established - The + flag cannot be used with the sub-query features described below.BR
[15:47:50] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 11, down: 0, shutdown: 0
[15:49:29] PROBLEM - Puppet freshness on search6 is CRITICAL: Puppet has not run in the last 10 hours
[16:04:29] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[16:15:26] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[16:15:26] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[16:26:53] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[16:26:53] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[16:48:21] New patchset: Pyoungmeister; "well, that explains some things" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3812
[16:48:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3812
[16:49:53] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3812
[16:49:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3812
[17:10:41] RECOVERY - Puppet freshness on search15 is OK: puppet ran at Tue Mar 27 17:10:11 UTC 2012
[17:12:33] can I do: define some_name($a, $b=$a)
[17:12:48] so that $b will default to $a, but can also be explicitly defined
[17:13:25] geppetto's syntax checking seems to think it's ok...
[17:15:38] RECOVERY - Puppet freshness on search6 is OK: puppet ran at Tue Mar 27 17:15:37 UTC 2012
[17:18:56] New patchset: Pyoungmeister; "making lsearch user on search hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3813
[17:19:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3813
[17:19:23] mark: if you're around can you take a look at ^^ and let me know if that will work/do what I want or not?
[17:19:45] (what I want being to create an lsearch user that has search as its default group)
[17:22:49] Ryan_Lane: you around?
[17:24:02] notpeter: systemuser is a definition, not a class
[17:24:58] ah, ok. that's... a good point. is the modification I made to it reasonable in your eyes?
[17:25:17] yes
[17:26:54] cool. thank you!
[17:28:12] New patchset: Pyoungmeister; "making lsearch user on search hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3813
[17:28:23] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 279 MB (3% inode=61%): /var/lib/ureadahead/debugfs 279 MB (3% inode=61%):
[17:28:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3813
[17:36:47] RECOVERY - Disk space on srv220 is OK: DISK OK
[18:00:01] hashar: hey
[18:00:06] did you get the chance to rebase shit ?
[18:00:36] yup
[18:00:45] cool, i'll check it all out again
[18:00:47] Is ops also outsourcing its git-fu to hashar now? :)
[18:00:52] I moved that huge pile of s*** in front of your cubicle
[18:00:57] aww you even marked it for me
[18:00:58] :-D
[18:01:00] :)
[18:01:27] RoanKattouw: since I cause mark to face palm constantly, I try to avoid submitting puppet stuff during european day
[18:01:36] hehe
[18:01:44] so you can get changes merged then go to bed? ;)
[18:01:45] and rely on the US ops to scream about my changes
[18:01:50] yup
[18:02:04] it surely makes everything longer to handle
[18:02:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3528
[18:02:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3528
[18:02:33] at least it saves some scares from mark fronthead
[18:03:23] New review: Lcarr; "hehe, poor #dev" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3529
[18:03:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3529
[18:03:44] New review: Lcarr; "poor #dev" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3530
[18:03:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3530
[18:04:04] New review: Lcarr; "poor #dev" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3531
[18:04:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3531
[18:04:20] hihi all. specifically, RobH
[18:04:37] how're the ciscos?
[18:04:40] hashar: merged it
[18:06:34] LeslieCarr: so now I think we need a bot to be restarted
[18:06:45] gerrit is going to write to the new files
[18:07:10] dschoon: thats been awaiting some network loving, as they are required to go in their own subnet, I am poking our network folks about it
[18:07:17] as I made them promise me some forward movement ;]
[18:07:18] or maybe not
[18:07:28] in my defense, i've been very busy
[18:07:33] also, BA has been making me cry
[18:07:42] are the pxe troubles cleared up?
[18:08:43] dschoon: until i have the network stuff, i cannot touch them
[18:08:52] ah
[18:09:01] its not forgotten, im pushing on it daily =]
[18:09:02] LeslieCarr: we will need puppet to run on Gerrit host and the ircecho process to be reloaded. That will make the bot join #wikimedia-dev
[18:09:16] (aka apply https://gerrit.wikimedia.org/r/#patch,unified,3530,3,manifests/gerrit.pp )
[18:09:56] thank you, RobH and LeslieCarr
[18:10:18] it's very very high pri for us, as it blocks forward movement on the analytics cluster
[18:13:28] hashar: rerunning puppet on manganese (gerrit) now
[18:13:47] * hashar crosses fingers
[18:13:58] dschoon: i'll get this all made up for you as soon as i stop crying over airline pricing schemes
[18:14:09] hehe
[18:14:57] good god, we're talking months!
[18:15:01] New review: Hashar; "We have migrated to git on March 21st and l10n changes are applied since saturday if I am correct." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2786
[18:15:05] !!
[18:15:58] yay
[18:16:01] not in tech :)
[18:16:30] and a comment I did to integration/* ended up in #wikimedia-dev
[18:16:46] sweet :)
[18:16:56] all operations/* notification should now land here
[18:17:03] like operations/dumps
[18:17:38] now I need to write a small message to wikitech-l
[18:17:54] or people will complain about the new organization of bots that nobody told them about. Thanks Leslie!
[18:18:15] :)
[18:18:38] Hey ops folks, could we get https://gerrit.wikimedia.org/r/#q,2786,n,z deployed please?
[18:18:54] Not super urgent but would be nice to get deployed before 5pm PDT today
[18:18:59] hehe
[18:28:48] LeslieCarr: I have made everyone aware about your IRC deployment :)
[18:28:54] LeslieCarr: thanks!!
[18:28:56] noes
[18:29:00] now they'll be mad at me for spam
[18:29:56] apergos will surely be when he will be looking for notifications to operations/dumps ;-] (that will be #wikimedia-dev Ariel)
[18:30:14] LeslieCarr: can you have a look at 2786 ? The l10n script?
[18:30:22] give me a minute
[18:30:32] I am going to rebase it just to be sure
[18:31:43] (it does apply cleanly)
[18:37:36] ok hashar checking
[18:37:55] did you want to rebase it ?
[18:39:09] do we need it ?
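To notpeter's question above: yes, in Puppet a defined type's parameter default may reference an earlier parameter of the same define, so `$b=$a` behaves as a fallback that an explicit argument overrides. A minimal sketch (the resource and parameter names here are illustrative, not the actual `systemuser` define from operations/puppet):

```puppet
# A parameter default can refer to a previously declared parameter.
define systemuser_sketch($username, $ugroup=$username) {
    group { $ugroup:
        ensure => present,
    }
    user { $username:
        ensure => present,
        gid    => $ugroup,
        require => Group[$ugroup],
    }
}

# Default: group name falls back to the username.
systemuser_sketch { 'examplebot':
    username => 'examplebot',
}

# Explicit override, as with the lsearch user whose default group is "search":
systemuser_sketch { 'lsearch':
    username => 'lsearch',
    ugroup   => 'search',
}
```

This also illustrates Ryan_Lane's point that `systemuser` is a definition (instantiated per resource title), not a class (a singleton).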
[18:39:24] I have merged it locally on top of current gerrit/master and it applied cleanly [18:39:35] so [Submit] should not cause any trouble [18:40:48] Yeah just submit it [18:40:56] ok [18:41:12] and you're sure you want to keep the --svnurl named as that ? [18:42:08] hashar/RoanKattouw ? [18:42:10] Yes [18:42:12] ok [18:42:19] The script it's invoking didn't actually change [18:42:31] New review: Lcarr; "git switched over now. double cheked that --svnurl should stay the same." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2786 [18:42:36] It doesn't know it's being pointed to a git repo instead of an SVN repo, but that's fine in this case [18:42:44] New review: Lcarr; "git switched over now. double cheked that --svnurl should stay the same." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2786 [18:43:02] hrm [18:43:20] oh, need to undo mark's unverified [18:43:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2786 [18:59:42] LeslieCarr: if you have time to play with bots, can you reload the irc echo one on spence (for nagios) [19:00:02] with merged change https://gerrit.wikimedia.org/r/#q,2675,n,z , that should kick out nagios notifications from #wikimedia-dev [19:00:13] err #wikimedia-tech [19:00:20] or any other ops reading this :-] [19:00:46] hashar: rerunning puppet [19:00:49] it'll take a while on spence [19:03:25] because of all the nagios templates I suppose? [19:05:18] yeah [19:06:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.4674729464 (gt 8.0) [19:06:46] New patchset: Reedy; "Fix sync-dblist to work without spamming permission errors to hell and back (sudo -u mwdeploy) and also add SetupTimeout config per other scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3815 [19:07:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3815 [19:07:22] does scap require sudo at all? [19:08:14] Yeah [19:08:30] all of the scripts do stuff as mwdeploy (or should!) [19:08:46] /usr/bin/scap-1 etc [19:12:08] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [19:12:44] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.00242292035 [19:14:52] hhhmmm, ok [19:26:42] LeslieCarr: is puppet still running on spence ? Feel free to get out for lunch :-] [19:27:04] I don't want you to be starving just because some nagios-wm does not want to leave a channel. [19:27:13] hehe [19:27:21] puppet just finished actually [19:27:43] * hashar feels like he has a sixth sense [19:27:46] restarted ircecho [19:27:52] oh only came back in this room [19:27:53] yay [19:28:10] hopefully nagios notifications still works [19:28:20] bots did not join #wikimedia-tech so we are safe there now [19:28:38] shh they can hear us in this channel [19:30:35] thanks again Leslie [19:32:36] New review: Hashar; "Bot reloaded by Leslie. Nagios notifications are not out of #wikimedia-tech !" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675 [19:36:53] New review: Hashar; "My english keep getting worse and worse." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2675 [20:03:18] New patchset: Pyoungmeister; "removed rainman user from operations loop after creating lsearch user system user. also made things a little cleaner (no -, actual booleans, etc)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3821 [20:03:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3821 [20:05:01] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3813 [20:05:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3813 [20:09:50] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [20:15:25] LeslieCarr: I'm pushing out that thing of roan's that you merged [20:15:27] that is ok, yes? [20:17:37] well... it's merged in... so I'm going to go with yes... [20:37:22] notpeter: yes [20:37:23] sorry [20:37:24] was eating [20:37:57] cool, thanks [20:42:32] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 273 MB (3% inode=61%): /var/lib/ureadahead/debugfs 273 MB (3% inode=61%): [20:44:37] can I ask the git question of the day? [20:44:55] just did git commit --amend -a [20:45:01] now running git-review [20:45:19] I get: error refusing to lose untracked file at 'Makefile' [20:45:24] how do I fix thi? [20:47:22] i have no idea … [20:47:29] try asking in wikimedia-tech as well ? [20:47:44] will do, thx [20:50:56] RECOVERY - Disk space on srv221 is OK: DISK OK [20:53:55] New patchset: Pyoungmeister; "if assigning to a specific group, that group will, in fact, be defined somewhere else..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3826 [20:54:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3826
[20:55:39] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3826
[20:55:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3826
[21:04:52] New patchset: Bhartshorne; "correcting forgotten number typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3828
[21:05:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3828
[21:05:25] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3828
[21:05:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:07:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.748 seconds
[21:27:54] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3821
[21:27:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3821
[21:28:09] Change abandoned: Bhartshorne; "git failed me." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3828
[21:29:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: commonswiki (23303)
[21:41:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:45:21] * schoolcraftT dusts off a kitchen towel and slaps it at Thehelpfulone
[21:47:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.338 seconds
[21:55:10] New patchset: Pyoungmeister; "boolean cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3833
[21:55:24] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3833
[21:55:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3833
[21:55:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3833
[22:22:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:26:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.149 seconds
[22:42:59] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 291 seconds
[22:43:26] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 318 seconds
[22:43:44] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 336 seconds
[22:44:11] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 363 seconds
[22:44:56] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[22:45:50] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[22:46:17] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[22:47:47] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[22:55:53] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 299 seconds
[22:56:11] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 317 seconds
[22:56:11] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 317 seconds
[22:56:38] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 344 seconds
[22:56:47] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 353 seconds
[22:56:47] PROBLEM - MySQL Slave Delay on db30 is CRITICAL: CRIT replication delay 353 seconds
[22:57:05] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CRIT replication delay 372 seconds
[22:57:32] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 398 seconds
[22:57:41] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CRIT replication delay 407 seconds
[23:00:50] RECOVERY - MySQL Slave Delay on db30 is OK: OK replication delay 0 seconds
[23:01:17] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds
[23:01:53] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds
[23:02:02] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds
[23:02:56] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds
[23:02:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:06:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.478 seconds
[23:07:17] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 380 seconds
[23:07:17] PROBLEM - MySQL Slave Delay on db1002 is CRITICAL: CRIT replication delay 380 seconds
[23:11:02] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[23:12:23] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[23:13:08] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds
[23:13:35] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds
[23:15:41] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[23:15:50] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[23:24:28] where can i see the squid configs? i'd like to double check a particular ACL but i can't find them in the puppet repo
[23:31:13] awjr: /home/w/conf/squid
[23:31:57] awjr: http://wikitech.wikimedia.org/view/Squid#Configuration
[23:35:59] preilly: ^^^
[23:40:38] maplebed: yeah
[23:42:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:46:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.444 seconds
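[Editor's note: the replication-delay alerts earlier in the log follow a standard Nagios threshold pattern. The sketch below illustrates it; the 290-second critical threshold and the SHOW SLAVE STATUS query in the comment are assumptions for illustration, not taken from the actual check's configuration.]

```shell
# Nagios-style replication delay check (illustrative sketch).
check_delay() {
    # $1 = replication delay in seconds
    local delay=$1 crit=290
    if [ "$delay" -gt "$crit" ]; then
        echo "CRIT replication delay ${delay} seconds"
    else
        echo "OK replication delay ${delay} seconds"
    fi
}

# In production the delay would be read from the replica itself, e.g.:
#   delay=$(mysql -e 'SHOW SLAVE STATUS\G' |
#           awk '/Seconds_Behind_Master/ {print $2}')
check_delay 291    # -> CRIT replication delay 291 seconds
check_delay 0      # -> OK replication delay 0 seconds
```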