[00:00:00] i checked for a line with just a dot!
[00:00:18] that wouldn't do it
[00:00:27] the mailman archive has this known bug
[00:00:43] you are correct
[00:01:23] how evil
[00:01:32] ok, well well known at least
[00:01:40] thanks
[02:18:46] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1408s
[02:23:38] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1423s
[02:33:38] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:38:48] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:46:58] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Dec 27 02:46:50 UTC 2011
[04:30:32] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:55:05] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[06:04:55] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 1 process with args varnishncsa
[09:26:30] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[09:53:51] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:02:04] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[10:11:34] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 1 process with args varnishncsa
[14:26:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:39:53] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[14:54:02] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:28:41] bugzilla seems to be hung again
[15:28:49] last time it was swap death
[15:30:44] nm, it seems to be back
[15:31:59] hexmode, see -tech, there's a few very busy report.cgi processes
[15:32:00] or hmm... maybe not... very slow pulling up https://bugzilla.wikimedia.org/show_bug.cgi?id=27283
[15:32:12] Reedy: thanks
[16:54:31] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:58:11] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:14:41] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.001 seconds
[17:17:21] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[17:18:41] !log Power cycled amslvs3
[17:18:50] Logged the message, Master
[17:24:11] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0
[17:24:21] RECOVERY - Host amslvs3 is UP: PING OK - Packet loss = 0%, RTA = 110.23 ms
[17:40:11] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Dec 27 17:40:03 UTC 2011
[17:43:01] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[17:50:58] !log removed files out of srv223's /tmp directory to free up its space
[17:51:07] Logged the message, Mistress of the network gear.
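
The Seconds_Behind_Master alerts above come from a Nagios plugin that reads replica status and compares the lag against warning and critical thresholds. Purely as a sketch of that kind of check (this is not the plugin actually running against storage3; the pymysql client, credentials, and thresholds below are assumptions), a minimal version might look like:

    #!/usr/bin/env python
    """Minimal sketch of a Nagios-style MySQL replication lag check.
    Credentials and thresholds are placeholders, not production values."""
    import sys
    import pymysql.cursors

    WARN_SECONDS = 600
    CRIT_SECONDS = 1200

    def main(host):
        conn = pymysql.connect(host=host, user="nagios", password="secret",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()

        if row is None:
            print("UNKNOWN - host is not configured as a replica")
            return 3

        lag = row.get("Seconds_Behind_Master")
        if lag is None:
            # Replication is stopped, so lag cannot be measured at all.
            print("CRITICAL - replication not running")
            return 2
        if lag >= CRIT_SECONDS:
            print("CRITICAL - Seconds_Behind_Master : %ds" % lag)
            return 2
        if lag >= WARN_SECONDS:
            print("WARNING - Seconds_Behind_Master : %ds" % lag)
            return 1
        print("OK - Seconds_Behind_Master : %ds" % lag)
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))

Nagios reads the one-line status text and the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which is why the PROBLEM/RECOVERY lines above carry exactly one line of plugin output each.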
[17:55:06] RECOVERY - Disk space on srv223 is OK: DISK OK
[18:24:14] !log powercycled kaulen yet again
[18:24:23] Logged the message, Master
[18:35:48] New patchset: Lcarr; "Re-adding localhost allow rule to iptables for udp2log machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1717
[18:42:16] !log clearing some logs on brewster and restarting lighttpd and squid
[18:42:25] Logged the message, Master
[18:43:06] RECOVERY - Squid on brewster is OK: TCP OK - 0.006 second response time on port 8080
[18:50:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1717
[18:50:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1717
[18:52:52] bad puppet iptables! putting new rules at the very bottom..
[18:57:09] oh i had an incorrect service on the purge statement, that's probably why
[19:03:34] New patchset: Lcarr; "fixing purge rule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1718
[19:03:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1718
[19:03:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1718
[19:07:54] !log removing manual localhost accept all iptables rule on locke (replacing with puppetized rule)
[19:08:03] Logged the message, Mistress of the network gear.
[19:35:27] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[19:42:47] !log removed puppetdlock on brewster
[19:42:56] Logged the message, Mistress of the network gear.
[19:43:07] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Dec 27 19:42:52 UTC 2011
[21:25:53] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out
[21:31:23] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:35:23] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Cannot make SSL connection
[21:35:23] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Cannot make SSL connection
[21:36:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:36:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:36:33] Jeff_Green: should we be worried ?
[21:36:37] heh
[21:36:45] monitoring works! yay
[21:36:51] yeah we're already chattering about it
[21:37:29] the nagios check is basically a simple HTTPS request
[21:37:44] my guess is network trouble or something blew up at gc's end but the fundraising folks are investigating
[21:38:12] where GC == global collect, payments handler
[21:39:25] gotcha
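
As described above, check_gcsip amounts to a timed HTTPS request against each payments host, going CRITICAL on a timeout or an SSL failure. A minimal sketch of that kind of probe (the 61-second timeout and the output strings below are modeled on the alert text; the real plugin, its URL, and its options are not shown here):

    #!/usr/bin/env python
    """Minimal sketch of an HTTPS availability check in the spirit of
    check_gcsip: timed GET, CRITICAL on timeout or SSL failure."""
    import http.client
    import socket
    import ssl
    import sys
    import time

    TIMEOUT = 61  # seconds, matching the socket timeout seen in the alerts

    def main(host, path="/"):
        start = time.time()
        try:
            conn = http.client.HTTPSConnection(host, timeout=TIMEOUT)
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()
        except ssl.SSLError:
            print("CRITICAL - Cannot make SSL connection")
            return 2
        except socket.timeout:
            print("CRITICAL - Socket timeout after %d seconds" % TIMEOUT)
            return 2
        except OSError as exc:
            # Covers refused connections, DNS failures, etc.
            print("CRITICAL - %s" % exc)
            return 2

        elapsed = time.time() - start
        if resp.status == 200:
            print("HTTP OK: HTTP/1.1 200 OK - %d bytes in %.3f second response time"
                  % (len(body), elapsed))
            return 0
        print("CRITICAL - HTTP %d %s" % (resp.status, resp.reason))
        return 2

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))

The distinct "Connection timed out", "Socket timeout after 61 seconds", and "Cannot make SSL connection" messages in the log correspond to the different points at which such a probe can fail.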
[21:40:13] RECOVERY - check_gcsip on payments1 is OK: OK
[21:40:13] RECOVERY - check_gcsip on payments4 is OK: OK
[21:41:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:41:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:43:16] ugh bugzilla
[21:43:59] I see how it is a piece of evil, but without knowing a lot more about how the reporting stuff is supposed to work I don't see an easy fix short of breaking it
[21:45:23] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.557 second response time
[21:45:23] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.652 second response time
[23:03:00] New patchset: Asher; "adding percona nagios checks - code with license" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1719
[23:10:34] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1719
[23:10:34] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1719
[23:59:36] New patchset: Asher; "class to install percona nagios monitors (just the files so far)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1723
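
The Percona Nagios checks being puppetized in the last few patchsets are Percona's own monitoring plugins; the puppet class at this point only ships the plugin files. Purely as an illustration of the general shape of such a check (the status variable, credentials, and thresholds below are invented, and this is not the Percona code), a threshold check against a MySQL status counter could look like:

    #!/usr/bin/env python
    """Illustrative sketch of a MySQL status-variable threshold check,
    in the general style of Nagios MySQL monitoring plugins."""
    import sys
    import pymysql

    WARN = 50
    CRIT = 100

    def main(host, variable="Threads_running"):
        conn = pymysql.connect(host=host, user="nagios", password="secret")
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW GLOBAL STATUS LIKE %s", (variable,))
                row = cur.fetchone()
        finally:
            conn.close()

        if row is None:
            print("UNKNOWN - status variable %s not found" % variable)
            return 3

        value = int(row[1])
        if value >= CRIT:
            print("CRITICAL - %s = %d" % (variable, value))
            return 2
        if value >= WARN:
            print("WARNING - %s = %d" % (variable, value))
            return 1
        print("OK - %s = %d" % (variable, value))
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))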