[00:00:19] [15:35] binasher AaronSchulz: tp50 == 50% of calls finish in <= that time value in ms. not 1 in 50
[00:00:22] binasher: ?
[00:00:41] who said 1/50?
[00:01:20] AaronSchulz: i overheard that when you and ben were speaking, just wanted to make sure you guys didn't have that misconception
[00:01:32] no, I was talking about MW's sample rate
[00:01:37] which is 1/50 requests
[00:01:43] not what tp50 was ;)
[00:06:04] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[00:13:52] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[00:34:48] RECOVERY - Squid on brewster is OK: TCP OK - 0.004 second response time on port 8080
[00:53:24] PROBLEM - Puppet freshness on streber is CRITICAL: Puppet has not run in the last 10 hours
[01:07:13] !log another adjustment to redirects.conf and apache-graceful-all for RT#2488
[01:07:16] Logged the message, Master
[01:27:00] !log manually updated packages and restarted apache on srv198, srv229, srv262, srv268, mw40 because their apache redirect configs failed to update after sync-apache and restart
[01:27:04] Logged the message, Master
[01:42:18] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[01:46:12] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[01:52:12] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[02:04:23] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[02:10:23] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[02:16:14] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[02:43:05] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[03:05:35] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[03:11:44] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[03:17:35] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[03:22:31] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[03:27:54] !log installing a couple upgrades on fenari (apache2-utils, update-manager-core, cvs, ruby, libxml*, libopenssl-ruby*...)
[03:27:58] Logged the message, Master
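
Editor's note: on the exchange at the top of this log, tp50 is a latency percentile (half of all calls complete within that many milliseconds), while 1/50 is MediaWiki's profiling sample rate; the two are unrelated. A minimal Python sketch of the distinction, with made-up names and data for illustration only:

    import random

    def tp50(latencies_ms):
        """50% of the sampled calls finished in <= this many ms."""
        ordered = sorted(latencies_ms)
        return ordered[len(ordered) // 2]

    # The sample rate decides *which* requests get measured at all;
    # tp50 then summarizes the measurements that were collected.
    SAMPLE_RATE = 50  # roughly 1 in 50 requests

    def should_profile():
        return random.randrange(SAMPLE_RATE) == 0

    print(tp50([12, 15, 9, 300, 14]))  # -> 14
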
[03:28:22] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[03:32:03] !log upgrading apache2 packages, base-files, kernel, several libs on bast1001
[03:32:06] Logged the message, Master
[03:32:16] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[03:33:23] !log rebooting bast1001 for kernel upgrade
[03:33:26] Logged the message, Master
[03:48:10] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[03:55:58] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[04:06:01] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[04:07:58] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[04:14:00] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[04:23:54] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[04:33:48] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[04:45:39] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[05:21:36] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]
[05:24:00] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[05:28:39] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[05:28:39] PROBLEM - Puppet freshness on hooper is CRITICAL: Puppet has not run in the last 10 hours
[05:33:45] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[05:45:36] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[05:49:30] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[05:56:51] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100%
[05:57:27] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[06:05:46] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[06:13:43] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[06:17:46] RECOVERY - udp2log processes on locke is OK: OK: all filters present
[06:23:37] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk,
[06:25:34] RECOVERY - udp2log processes on locke is OK: OK: all filters present
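
Editor's note: the udp2log checks that flap through the night above assert that each configured log filter process is actually running on locke. A minimal sketch of that kind of presence check, assuming Linux /proc scanning and Nagios exit-code conventions; this is illustrative, not the actual WMF plugin:

    import os
    import sys

    # Example filter path taken from the alerts above.
    EXPECTED_FILTERS = ["/a/squid/urjc.awk"]

    def running_cmdlines():
        """Yield the command line of every process on the box."""
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/cmdline" % pid, "rb") as f:
                    yield f.read().replace(b"\0", b" ").decode("utf-8", "replace")
            except OSError:
                continue  # process exited while we were scanning

    cmdlines = list(running_cmdlines())
    absent = [p for p in EXPECTED_FILTERS if not any(p in c for c in cmdlines)]
    if absent:
        print("CRITICAL: filters absent: %s," % ", ".join(absent))
        sys.exit(2)  # Nagios CRITICAL
    print("OK: all filters present")
    sys.exit(0)
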
[06:57:37] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[07:03:01] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[07:11:25] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[07:11:25] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[07:26:25] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[08:20:05] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Mar 6 08:20:00 UTC 2012
[08:22:38] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-eqiad:xe-5/2/1 (FPL/GBLX, CV71026) [10Gbps wave]
[08:36:19] !log on hooper: puppet broken due to dependency Package[libapache2-mod-php5] for Service[apache2]
[08:36:23] Logged the message, Master
[08:43:38] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:52:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[09:02:02] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[09:02:02] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[09:33:49] New patchset: Dzahn; "add nagios to the Debian-exim group to allow it to check_disk the tmpfs mount - should fix CRIT on sodium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2927
[09:36:05] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[09:43:39] New patchset: Dzahn; "include class {'webserver::php5': ssl => 'true'; } in misc::racktables to fix broken puppet dependency on hooper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2928
[09:44:49] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2928
[09:44:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2928
[09:50:18] New patchset: Dzahn; "ugh, need to remove duplicate service definition for apache2 as well then (hooper/racktables)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2929
[09:50:54] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2929
[09:50:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2929
[09:53:41] New patchset: Dzahn; "..and another duplicate..Apache_module[ssl] already defined in the generic class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2930
[09:54:20] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2930
[09:54:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2930
[09:55:53] RECOVERY - Puppet freshness on hooper is OK: puppet ran at Tue Mar 6 09:55:41 UTC 2012
[09:56:11] New review: Dzahn; "alright, now puppet runs again on hooper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2930
[10:00:54] New review: Dzahn; "it's still true that puppet can't manage existing users, right? so using an Exec .." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2927
[11:25:45] * Meech9z prout
[11:50:24] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:52:12] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:44:36] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[15:30:05] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[17:04:39] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[17:12:36] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[17:12:36] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[17:27:36] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[17:42:45] mutante: can you give us some warning next time you upgrade bastion hosts? i had some screen sessions there i would like to have saved some state on
[17:51:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0466279646 (gt 8.0)
[17:57:45] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.584431071429
[17:57:58] !log taking the opportunity to apply security updates to virt0-4
[17:58:01] Logged the message, Master
[18:04:08] New patchset: Lcarr; "Adding in neon as a ntp monitoring server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2934
[22:08:57] RECOVERY - SSH on search1006 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:09:03] RECOVERY - SSH on search1007 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:09:03] PROBLEM - Lucene on search1010 is CRITICAL: Connection timed out
[22:09:36] RECOVERY - SSH on search1009 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:10:03] RECOVERY - RAID on search1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[22:10:21] RECOVERY - SSH on search1010 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:11:15] PROBLEM - Lucene on search1012 is CRITICAL: Connection timed out
[22:11:24] RECOVERY - Disk space on search1001 is OK: DISK OK
[22:11:33] PROBLEM - DPKG on search1016 is CRITICAL: Connection refused by host
[22:11:33] PROBLEM - RAID on search1015 is CRITICAL: Connection refused by host
[22:11:33] PROBLEM - DPKG on search1015 is CRITICAL: Connection refused by host
[22:11:33] PROBLEM - SSH on search1016 is CRITICAL: Connection refused
[22:11:42] PROBLEM - Disk space on search1015 is CRITICAL: Connection refused by host
[22:12:00] PROBLEM - SSH on search1015 is CRITICAL: Connection refused
[22:12:09] PROBLEM - RAID on search1020 is CRITICAL: Connection refused by host
[22:12:09] PROBLEM - SSH on search1020 is CRITICAL: Connection refused
[22:12:27] RECOVERY - SSH on search1012 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:12:36] RECOVERY - SSH on search1011 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:12:45] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused
[22:12:54] PROBLEM - RAID on search1016 is CRITICAL: Connection refused by host
[22:12:54] PROBLEM - Disk space on search1016 is CRITICAL: Connection refused by host
[22:12:54] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused
[22:13:03] PROBLEM - Disk space on search1020 is CRITICAL: Connection refused by host
[22:13:12] PROBLEM - DPKG on search1020 is CRITICAL: Connection refused by host
[22:13:30] RECOVERY - SSH on search1013 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:16:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:30] RECOVERY - NTP on search1001 is OK: NTP OK: Offset -0.00751376152 secs
[22:17:06] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused
[22:18:00] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:18:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 285 seconds
[22:18:45] PROBLEM - Lucene on search1020 is CRITICAL: Connection timed out
[22:18:54] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[22:19:48] RECOVERY - SSH on search1016 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:20:15] RECOVERY - SSH on search1019 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[22:20:24] RECOVERY - SSH on search1020 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
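
Editor's note: the Lucene alerts in the burst above are plain TCP connect checks against the search daemon's port (8123, per the recovery at 23:34:39 below). A minimal sketch of such a check; host and port here are illustrative:

    import socket
    import sys
    import time

    def check_tcp(host, port, timeout=10.0):
        """Open a TCP connection and report Nagios-style status,
        like the Lucene and SSH checks above."""
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                elapsed = time.time() - start
            print("TCP OK - %.3f second response time on port %d" % (elapsed, port))
            return 0
        except socket.timeout:
            print("CRITICAL: Connection timed out")
            return 2
        except OSError:  # refused, unreachable, DNS failure, ...
            print("CRITICAL: Connection refused")
            return 2

    # Illustrative usage; the hostname and port echo the alerts in this log.
    sys.exit(check_tcp("search1001", 8123))
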
[22:20:24] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 399 seconds
[22:21:18] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:21:36] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:22:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.937 seconds
[22:22:48] PROBLEM - NTP on search1011 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:23:06] PROBLEM - NTP on search1004 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:25:03] PROBLEM - NTP on search1005 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:25:21] PROBLEM - NTP on search1006 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:25:21] PROBLEM - NTP on search1007 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:26:17] New patchset: Pyoungmeister; "well that's a weird dependency loop..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2947
[22:27:18] PROBLEM - NTP on search1009 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:27:37] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2947
[22:28:39] PROBLEM - NTP on search1010 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:30:45] PROBLEM - NTP on search1012 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:30:54] PROBLEM - NTP on search1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:30:54] PROBLEM - NTP on search1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:31:03] New patchset: Pyoungmeister; "well that's a weird dependency loop..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2947
[22:32:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2947
[22:33:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2947
[22:33:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2947
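
Editor's note: change 2947 above works around a dependency loop in the Puppet catalog; Puppet builds a graph of resources and their ordering edges and aborts when that graph contains a cycle. A generic sketch of that kind of cycle detection, not Puppet's implementation, with hypothetical resource names echoing the hooper breakage earlier in the log:

    def find_cycle(graph):
        """Return one dependency cycle in graph (node -> list of deps),
        or None if the graph is acyclic."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}

        def visit(node, path):
            color[node] = GRAY
            for dep in graph.get(node, []):
                if color.get(dep, WHITE) == GRAY:
                    return path + [dep]  # back edge: walked into our own ancestry
                if color.get(dep, WHITE) == WHITE:
                    found = visit(dep, path + [dep])
                    if found:
                        return found
            color[node] = BLACK
            return None

        for node in list(graph):
            if color.get(node, WHITE) == WHITE:
                cycle = visit(node, [node])
                if cycle:
                    return cycle
        return None

    # Hypothetical resources, for illustration only:
    print(find_cycle({
        "Service[apache2]": ["Package[libapache2-mod-php5]"],
        "Package[libapache2-mod-php5]": ["Service[apache2]"],
    }))
    # -> ['Service[apache2]', 'Package[libapache2-mod-php5]', 'Service[apache2]']
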
[22:34:56] TimStarling: hello
[22:35:21] hi
[22:35:24] PROBLEM - NTP on search1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:37:21] PROBLEM - NTP on search1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:37:57] PROBLEM - NTP on search1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:38:11] New patchset: Pyoungmeister; "easy way." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2948
[22:40:59] New patchset: Pyoungmeister; "easy way." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2948
[22:41:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2948
[22:41:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2948
[22:45:27] New patchset: Asher; "testing a varnish instance in front of gdash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2949
[22:45:54] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[22:54:38] RobH: are you in equinix today ?
[22:54:59] New review: Asher; "but without a lint check.. here goes nothing" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2949
[22:55:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2949
[22:55:23] hi notpeter, are you around?
[22:56:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:02:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.426 seconds
[23:21:36] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[23:25:48] RECOVERY - DPKG on search1001 is OK: All packages OK
[23:27:09] RECOVERY - DPKG on search1002 is OK: All packages OK
[23:27:36] RECOVERY - RAID on search1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[23:28:21] RECOVERY - Disk space on search1002 is OK: DISK OK
[23:31:12] RECOVERY - NTP on search1002 is OK: NTP OK: Offset 0.04822313786 secs
[23:33:42] New patchset: Asher; "looks like xff_sources is required for varnish defs now to avoid an unused acl fatal error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2950
[23:34:39] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.026 second response time on port 8123
[23:37:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:39:08] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2950
[23:39:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2950
[23:40:03] RECOVERY - SSH on ms-be4 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[23:43:53] LeslieCarr: sorry about that, thought it was unused and didn't check well enough for screen sessions, just for current logins
[23:45:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.032 seconds
[23:50:23] New patchset: Ryan Lane; "Test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2951
[23:51:14] so, the lint check is happening. why is the verify not coming in?
[23:51:27] [2012-03-06 23:50:33,211 +0000] 45485ffc gerrit2 a/14 'gerrit approve --verified "+1" -m '\''Lint check passed.'\'' bc135562792d54928a122cfdd75b4d7ea95d247d' 0ms 324ms killed
[23:51:29] killed?
[23:51:30] why?
[23:52:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2951
[23:52:14] -_-
[23:52:21] it works when I run it directly
[23:52:27] !log deploying new frontend squid config to include googlebot in mobile redirects
[23:52:30] Logged the message, Master
[23:56:15] RECOVERY - RAID on search1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[23:56:15] RECOVERY - Disk space on search1003 is OK: DISK OK
[23:57:36] RECOVERY - DPKG on search1003 is OK: All packages OK
[23:58:58] Ryan_Lane: time for strace?
[23:59:10] it's java.
[23:59:11] for gerrit
[23:59:24] yeah i was thinking that... might help a little
[23:59:31] I think it's likely the way I'm handling things in the hooks
[23:59:35] which are python
[23:59:51] RECOVERY - DPKG on search1004 is OK: All packages OK
[23:59:51] RECOVERY - RAID on search1004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
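
Editor's note: the hook being debugged above posts a Verified vote through Gerrit's SSH command-line interface; the exact `gerrit approve` invocation appears in the killed-command log entry at 23:51:27. A minimal sketch of that pattern, assuming a hypothetical hook script, host name, and timeout; only the `gerrit approve` syntax comes from the log:

    import subprocess
    import sys

    GERRIT_HOST = "gerrit.example.org"  # hypothetical host, not the real config

    def verify_change(commit_sha, message="Lint check passed.", score="+1"):
        """Post a Verified vote via Gerrit's SSH CLI, mirroring the
        'gerrit approve' invocation captured in the log above."""
        cmd = [
            "ssh", "-p", "29418", GERRIT_HOST,
            "gerrit", "approve",
            "--verified", score,
            "-m", "'%s'" % message,  # the CLI wants the message quoted
            commit_sha,
        ]
        try:
            # Gerrit can kill hooks that misbehave or block too long,
            # which is one plausible reading of the 'killed' entry above.
            return subprocess.call(cmd, timeout=30)
        except subprocess.TimeoutExpired:
            return 1

    if __name__ == "__main__":
        sys.exit(verify_change(sys.argv[1]))
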