[00:03:03] New patchset: Lcarr; "switched /etc/nagios-plugins to /etc/nagios-plugins/config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2782
[00:04:29] RobH: no way you're still around, right?
[00:05:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2782
[00:05:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2782
[00:05:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2709
[00:05:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2709
[00:06:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:06:53] PROBLEM - DPKG on searchidx1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:12:26] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100%
[00:12:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.450 seconds
[00:13:47] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 30.93 ms
[00:18:08] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[00:18:32] !log stopping indexer on searchidx2 again :/
[00:18:34] Logged the message, and now dispatching a T1000 to your position to terminate you.
[00:18:53] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused
[00:27:13] !log starting indexer on searchidx2
[00:27:15] Logged the message, and now dispatching a T1000 to your position to terminate you.
[00:35:14] RECOVERY - DPKG on searchidx1001 is OK: All packages OK
[00:39:53] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.179 seconds
[00:39:53] domas: I wish report.py had calls/req
[00:40:47] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27546 bytes in 0.107 seconds
[00:48:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:48:35] PROBLEM - SSH on sq39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:50:50] PROBLEM - Host sq39 is DOWN: PING CRITICAL - Packet loss = 100%
[00:52:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.475 seconds
[00:53:17] New patchset: Catrope; "Point the l10nupdate script to git instead of SVN" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2783
[00:57:13] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2783
[00:57:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2783
[00:58:09] New patchset: Catrope; "Revert "Point the l10nupdate script to git instead of SVN"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2784
[00:59:21] New patchset: Ryan Lane; "Revert "Point the l10nupdate script to git instead of SVN"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2785
[00:59:44] Change abandoned: Ryan Lane; "dupe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2785
[00:59:50] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:59:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2784
[00:59:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2784
[01:02:21] New patchset: Catrope; "Point the l10nupdate script to git instead of SVN" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2786
[01:02:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2786
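The 00:53–01:02 patchsets above record the l10nupdate script being pointed from SVN to git, reverted, and resubmitted as r2786, but only the commit messages survive in the log. As a rough sketch of what such a change typically amounts to, with the checkout path and remote entirely assumed rather than taken from operations/puppet:

    # Hypothetical sketch only; the real l10nupdate paths are not in this log.
    # Before: refresh the localisation checkout from Subversion
    #   svn update /var/lib/l10nupdate/mediawiki
    # After: refresh the same checkout from git instead
    cd /var/lib/l10nupdate/mediawiki   # assumed checkout location
    git pull --ff-only origin master   # fast-forward to the latest translations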
[01:15:08] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100%
[01:17:05] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 30.82 ms
[01:17:12] !log rebooted cp1043
[01:17:16] Logged the message, Mistress of the network gear.
[01:21:08] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[01:22:08] !log cp1043 is missing /var/lib/varnish/frontend
[01:22:10] Logged the message, Mistress of the network gear.
[01:23:14] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa
[01:23:25] New review: Catrope; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2787
[01:23:25] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/2787
[01:23:37] Boo yah
[01:25:09] !log reloading cp1043 again
[01:25:13] Logged the message, Mistress of the network gear.
[01:25:46] !log Creating a 'wikimaniatranscode' user locally on cadmium because I don't really want to run ffmpeg as root
[01:25:47] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100%
[01:25:49] Logged the message, Mr. Obvious
[01:26:14] !log Installing ffmpeg2theora on cadmium
[01:26:16] Logged the message, Mr. Obvious
[01:28:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:11] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms
[01:31:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.506 seconds
[01:32:59] PROBLEM - Varnish HTTP mobile-frontend on cp1043 is CRITICAL: Connection refused
[01:33:26] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[01:34:17] !log Installing screen on cadmium
[01:34:20] Logged the message, Mr. Obvious
[01:34:56] RECOVERY - Varnish HTTP mobile-frontend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.062 seconds
[01:35:23] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa
[01:38:12] so i think cp1043 is ok now
[01:47:15] aaronschulz: there's a version that does :)
[01:47:34] the secret-domas-nobody-mess-with-it version?
[01:48:47] it was in ng/ but something is broken there now :)
[01:53:52] !log Started transcode jobs on cadmium, 16 parallel jobs running in screen
[01:53:54] Logged the message, Mr. Obvious
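The cadmium entries between 01:25:46 and 01:53:52 amount to a small how-to: create an unprivileged user so ffmpeg never runs as root, install ffmpeg2theora and screen, then fan out 16 parallel transcode jobs inside a screen session. A minimal sketch of those steps follows; the input directory, file extension, and output naming are assumptions, since the log records only the job count:

    # Hypothetical reconstruction of the logged steps; paths and filenames assumed.
    sudo useradd --system --create-home wikimaniatranscode  # never transcode as root
    sudo apt-get install -y ffmpeg2theora screen

    # One detached screen session, one window per job, capped at 16 as logged.
    sudo -u wikimaniatranscode screen -dmS transcode
    ls /srv/wikimania/*.dv | head -n 16 | while read -r f; do
        sudo -u wikimaniatranscode screen -S transcode -X screen \
            ffmpeg2theora "$f" -o "${f%.dv}.ogv"
    done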
[02:08:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:12:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.733 seconds
[02:54:53] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Sat Feb 25 02:54:43 UTC 2012
[04:50:03] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[05:34:18] PROBLEM - Disk space on search1018 is CRITICAL: DISK CRITICAL - free space: /a 3691 MB (2% inode=99%):
[05:34:36] PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 3692 MB (2% inode=99%):
[06:25:11] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[06:31:11] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[06:31:11] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[08:14:44] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[08:20:44] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[08:20:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[09:02:18] New review: Hashar; "(no comment)" [analytics/reportcard] (master) C: 1; - https://gerrit.wikimedia.org/r/2417
[10:28:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:14] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[11:23:54] New patchset: Hashar; "use MWScript in 'sql' script for centralauth DB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2788
[14:52:04] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours
[16:26:38] PROBLEM - Puppet freshness on mw1010 is CRITICAL: Puppet has not run in the last 10 hours
[16:32:38] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours
[16:32:38] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours
[17:03:23] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 205 seconds
[17:04:52] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100%
[18:15:53] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[18:21:49] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[18:21:49] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[18:31:07] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay 6 seconds
[22:45:16] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (10223)
[23:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:04:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.079 seconds
[23:38:11] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[23:40:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:44:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.949 seconds
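Hashar's 11:23:54 patchset (r2788) proposes routing the centralauth case of the 'sql' wrapper through MWScript, MediaWiki's multiversion entry point that selects the right code branch per wiki before running a maintenance script. The log carries only the commit message, so the following is a hedged sketch of the idea; the wrapper path, the --wiki value, and the surrounding shell logic are all assumptions:

    # Hypothetical sketch of the r2788 idea: let MWScript resolve the
    # MediaWiki version instead of hard-coding a PHP entry point.
    # The path and --wiki value below are assumptions, not from the log.
    if [ "$db" = "centralauth" ]; then
        php /home/wikipedia/common/multiversion/MWScript.php \
            maintenance/sql.php --wiki=metawiki
    fi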