[02:04:40] !log LocalisationUpdate completed (1.18) at Tue Dec 27 02:04:40 UTC 2011
[02:04:53] Logged the message, Master
[02:18:46] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1408s
[02:23:37] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1423s
[02:33:37] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:38:47] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:46:57] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Dec 27 02:46:50 UTC 2011
[04:30:32] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:08:05] incident note: a page vandalised and reverted 4 days ago (via cluebot) was still displaying vandalised revision until manually purged.
[05:55:05] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[06:04:55] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 1 process with args varnishncsa
[09:26:30] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[09:53:50] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:02:04] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[10:11:34] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 1 process with args varnishncsa
[13:10:56] !log nikerabbit synchronized wmf-config/ 'cs gender aliases bug 33367'
[13:11:07] Logged the message, Master
[13:48:36] !log nikerabbit synchronizing Wikimedia installation... : I18ndeploy r107378
[13:48:44] Logged the message, Master
[13:52:47] sync done.
[14:26:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:36:41] Reedy, hi. Do you have any more doubts about bug:33319
[14:38:44] Nope
[14:39:09] I just wanted to confirm you wanted it varialising, not putting everything into a pt category
[14:39:41] ok
[14:39:53] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[14:39:58] same applies to 33320
[14:40:12] Indeed
[14:40:19] I'll probably get to them later today
[14:40:22] ok
[14:40:24] thanks
[14:54:02] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:22:33] Is bugzilla really slow at the moment?
[15:23:28] siebrand: slow as hell for me (Germany)
[15:23:43] hoo: *nod* NL here...
[15:31:10] siebrand, hoo 2 report.cgi processes keep maxing out the cpu
[15:31:16] load average 6.33 10.35 10.39
[15:31:25] ouch
[15:31:50] Reedy: you are maintaining testswarm at integration.mediawiki.org?
[15:32:00] Reedy: seeing some strange things...
[15:32:21] I've got access to some of it, but not actively working on it
[15:33:05] if BZ keeps being slow, I'll see if we can find someone to shoot the processes, I guess I won't have enough rights to do so
[15:33:06] Reedy: nothing immediate. Just a few issues wrt. client stability and configuration.
[15:33:42] Ah. I was trying to help krinkle a couple of days with a bad database row
[15:33:51] he said it's likely stopping clients so they need restarting
[15:58:55] Reedy: yep, that's what I'm seeing. Once they've gone through the work, they stop.
[15:59:50] If you refresh they "should" carry on
[16:00:00] Reedy: also, when they're out of work, they reconnect every 30 seconds, I believe. I'd suggest to up that to at least a minute, if not 5. Also the clients that are not supported for tests need cleaning from the main page, the client versions need to be indicated in text on the test page, maybe some more suggestions...
[16:00:31] Reedy: they do -- but manual intervention isn't that great. I don't want to set my alarm every 30 minutes to hit buttons in 12 or so browsers..
[16:00:43] Quite a few of those should be logged upstream... Even if we patch it and pass that back
[16:54:31] PROBLEM - Apache HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:58:11] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:09:45] !log powercycled kaulen, it was (presumably) swapping away. report.cgi was to blame
[17:09:55] Logged the message, Master
[17:14:41] RECOVERY - Apache HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.001 seconds
[17:17:21] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[17:24:11] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0
[17:24:21] RECOVERY - Host amslvs3 is UP: PING OK - Packet loss = 0%, RTA = 110.23 ms
[17:40:11] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Tue Dec 27 17:40:03 UTC 2011
[17:43:01] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[17:55:06] RECOVERY - Disk space on srv223 is OK: DISK OK
[18:35:49] New patchset: Lcarr; "Re-adding localhost allow rule to iptables for udp2log machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1717
[18:43:06] RECOVERY - Squid on brewster is OK: TCP OK - 0.006 second response time on port 8080
[18:50:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1717
[18:50:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1717
[19:03:35] New patchset: Lcarr; "fixing purge rule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1718
[19:03:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1718
[19:03:51] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1718
[19:35:27] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[19:43:07] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Dec 27 19:42:52 UTC 2011
[21:24:23] !log catrope synchronized php-1.18/extensions/ArticleFeedbackv5/modules/ext.articleFeedbackv5/ext.articleFeedbackv5.startup.js 'r107427'
[21:24:32] Logged the message, Master
[21:25:52] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out
[21:25:53] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out
[21:31:23] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:31:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:35:23] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Cannot make SSL connection
[21:35:23] PROBLEM - check_gcsip on payments4 is CRITICAL: CRITICAL - Cannot make SSL connection
[21:36:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:36:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:40:12] RECOVERY - check_gcsip on payments1 is OK: OK
[21:40:13] RECOVERY - check_gcsip on payments4 is OK: OK
[21:41:23] PROBLEM - check_gcsip on payments3 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:41:23] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds
[21:45:22] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.557 second response time
[21:45:23] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.652 second response time
[23:03:00] New patchset: Asher; "adding percona nagios checks - code with license" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1719
[23:10:34] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1719
[23:10:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1719
[23:34:04] gn8 folks
[23:59:37] New patchset: Asher; "class to install percona nagios monitors (just the files so far)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1723