[00:01:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:04:20] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: Puppet has not run in the last 10 hours [00:05:23] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours [00:07:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.590 seconds [00:17:01] Ryan_Lane: catrope@srv256:/usr/local/apache/common$ grep -Rn 404.php . [00:17:03] This is taking a while [00:17:11] heh [00:17:22] I figured I'd run it on bare metal rather than NFS [00:17:34] likely a good idea [00:18:19] !log powercycling ssl1001 [00:18:23] Logged the message, Master [00:18:30] !log powercycling ssl1003 [00:18:33] Logged the message, Master [00:20:14] RECOVERY - SSH on ssl1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:22:20] RECOVERY - SSH on ssl1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:23:50] RECOVERY - Puppet freshness on ssl1001 is OK: puppet ran at Sat Mar 10 00:23:20 UTC 2012 [00:26:32] RECOVERY - Puppet freshness on ssl1003 is OK: puppet ran at Sat Mar 10 00:26:22 UTC 2012 [00:42:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.067 seconds [00:52:29] RoanKattouw: giving me a problem I can't figure out on a friday is just mean [00:52:41] hehe [00:52:43] I'm sorry dude [00:52:45] :D [00:52:58] Benny passed me two problems, I fixed the other one [00:54:14] heh [01:01:00] New patchset: Bhartshorne; "adding a manager to call swiftcleaner multiple times on newly created objects." [operations/software] (master) - https://gerrit.wikimedia.org/r/3040 [01:04:32] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3040 [01:04:34] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3040 [01:06:41] New patchset: Ryan Lane; "Grasping at straws here." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3041 [01:06:52] !log rebalanced the swift rings to finish decreasing traffic sent to ms1 and ms2 [01:06:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3041 [01:06:55] Logged the message, Master [01:06:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3041 [01:06:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3041 [01:07:06] !log started swiftcleaner on owa1 looking for (and purging) bad objects [01:07:09] Logged the message, Master [01:15:33] New patchset: Ryan Lane; "More grasping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3042 [01:15:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3042 [01:15:48] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3042 [01:15:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3042 [01:18:24] New patchset: Ryan Lane; "More troubleshooting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3043 [01:18:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3043 [01:19:02] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3043 [01:19:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3043 [01:21:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:46] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [01:25:46] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [01:27:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.684 seconds [02:03:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:07:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.375 seconds [02:56:43] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [03:06:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:07:31] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [03:07:31] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [03:08:34] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [03:08:34] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [03:08:34] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [03:10:31] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [03:10:31] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [03:10:31] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [03:10:31] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [03:13:31] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [03:15:28] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [03:15:28] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:25:31] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [03:44:34] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [03:44:34] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [03:51:28] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [03:51:29] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [03:51:29] PROBLEM - Puppet freshness on 
knsq25 is CRITICAL: Puppet has not run in the last 10 hours [03:51:30] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [03:51:30] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [03:59:26] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [03:59:27] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [03:59:28] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [03:59:28] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [03:59:28] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [03:59:29] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [03:59:29] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [03:59:30] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [04:00:38] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [04:01:41] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [04:01:41] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [04:01:41] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [04:02:44] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [04:03:38] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [04:04:41] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [04:05:35] PROBLEM - Puppet 
freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [04:06:38] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [04:06:38] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [04:06:38] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [04:07:41] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [04:21:20] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [04:21:38] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [04:43:50] New patchset: Dzahn; "nagios - profiler-to-carbon - work around incorrect process count issues with check_procs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3044 [04:44:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3044 [04:45:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3044 [04:45:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3044 [04:51:02] PROBLEM - Disk space on db57 is CRITICAL: Connection refused by host [04:51:29] PROBLEM - MySQL disk space on db57 is CRITICAL: Connection refused by host [04:52:24] PROBLEM - MySQL Slave Delay on db25 is CRITICAL: Connection refused by host [04:52:24] PROBLEM - mysqld processes on db25 is CRITICAL: Connection refused by host [04:52:24] PROBLEM - Disk space on mw1072 is CRITICAL: Connection refused by host [04:52:33] PROBLEM - Disk space on snapshot1001 is CRITICAL: Connection refused by host [04:52:42] PROBLEM - RAID on capella is CRITICAL: Connection refused by host [04:52:42] PROBLEM - Disk space on db25 is CRITICAL: Connection refused by host [04:52:42] PROBLEM - MySQL Slave Running on db25 is CRITICAL: Connection refused by host [04:52:42] PROBLEM - RAID on db57 is CRITICAL: Connection refused by host [04:52:51] PROBLEM - RAID on search1006 is CRITICAL: Connection refused by host [04:53:00] PROBLEM - MySQL disk space on db25 is CRITICAL: Connection refused by host [04:53:09] PROBLEM - RAID on mw1072 is CRITICAL: Connection refused by host [04:53:18] PROBLEM - MySQL Idle Transactions on db25 is CRITICAL: Connection refused by host [04:53:27] PROBLEM - DPKG on db57 is CRITICAL: Connection refused by host [04:53:36] PROBLEM - DPKG on search1006 is CRITICAL: Connection refused by host [04:53:36] PROBLEM - RAID on srv224 is CRITICAL: Connection refused by host [04:53:36] PROBLEM - DPKG on capella is CRITICAL: Connection refused by host [04:53:36] PROBLEM - DPKG on snapshot1001 is CRITICAL: Connection refused by host [04:53:36] PROBLEM - Disk space on capella is CRITICAL: Connection refused by host [04:53:45] PROBLEM - MySQL Recent Restart on db25 is CRITICAL: Connection refused by host [04:53:54] PROBLEM - Disk space on search1006 is CRITICAL: Connection refused by host [04:53:54] PROBLEM - DPKG on srv210 is CRITICAL: Connection refused by host [04:54:03] PROBLEM - MySQL Replication Heartbeat on db25 is CRITICAL: Connection refused by host [04:54:11] uhh? didnt touch anything related this time.. 
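The check_procs change above (3044) works around miscounting of the profiler-to-carbon process, and the follow-up a little later in the log (3045) drops -a once --ereg-argument-array is in use, since matching on the regex alone is enough. A minimal sketch of that style of check, with the thresholds, plugin path and match string as illustrative assumptions rather than the values in the actual manifest:

    # count processes whose argument list matches the regex; expect exactly one
    /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 \
        --ereg-argument-array 'profiler-to-carbon'
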
[04:54:12] PROBLEM - Disk space on srv210 is CRITICAL: Connection refused by host [04:54:21] PROBLEM - DPKG on db25 is CRITICAL: Connection refused by host [04:54:39] PROBLEM - DPKG on mw1072 is CRITICAL: Connection refused by host [04:54:39] PROBLEM - DPKG on cp1016 is CRITICAL: Connection refused by host [04:54:48] PROBLEM - DPKG on srv224 is CRITICAL: Connection refused by host [04:56:30] checking [05:00:11] oh yeah, the usual nagios-nrpe fails to restart issue after config change [05:01:06] !log starting nagios-nrpe-server on all via dsh (fail to restart on config change issue) [05:01:10] Logged the message, Master [05:11:38] !log doing more (cp*, db*, msbe-* ,mw*) by hand / for loop [05:11:42] Logged the message, Master [05:21:26] what the.. stopped again ?! [05:41:36] RECOVERY - DPKG on virt2 is OK: All packages OK [05:42:26] hrmm.. and now we'll see if this happens again [05:47:36] New patchset: Dzahn; "profiler-to-carbon process check - remove -a when using --ereg-argument-array" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3045 [05:47:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3045 [05:48:24] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3045 [05:48:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3045 [06:37:44] RECOVERY - Disk space on srv196 is OK: DISK OK [06:37:44] RECOVERY - DPKG on srv265 is OK: All packages OK [06:37:44] RECOVERY - RAID on srv274 is OK: OK: no RAID installed [06:38:02] RECOVERY - Disk space on srv235 is OK: DISK OK [06:38:29] RECOVERY - RAID on srv256 is OK: OK: no RAID installed [06:59:31] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [07:03:25] RECOVERY - Puppet freshness on db1022 is OK: puppet ran at Sat Mar 10 07:03:18 UTC 2012 [07:03:48] !log ran puppet on db1022, another one that works fine manually but somehow did not by itself [07:03:52] Logged the message, Master [07:32:40] RECOVERY - Disk space on ms1004 is OK: DISK OK [08:23:37] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [10:52:20] PROBLEM - Disk space on search1018 is CRITICAL: DISK CRITICAL - free space: /a 3253 MB (2% inode=99%): [10:54:26] PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 2121 MB (1% inode=99%): [11:02:02] PROBLEM - Disk space on search1018 is CRITICAL: DISK CRITICAL - free space: /a 4807 MB (3% inode=99%): [11:06:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.238 seconds [11:27:23] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [11:27:23] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [11:44:56] New patchset: Mark Bergsma; "Don't use probes for upload backend, use upload squids as backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3047 [11:44:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:08] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3047 [11:46:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3047 [11:46:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3047 [11:48:05] PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 3368 MB (2% inode=99%): [11:48:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.746 seconds [12:04:42] New patchset: Mark Bergsma; "Put all probes in every VCL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3048 [12:04:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3048 [12:05:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3048 [12:05:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3048 [12:10:26] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [12:24:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.023 seconds [12:42:46] New patchset: Mark Bergsma; "Cache objects for 1 hour (frontend) or 30 days (backend) by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3049 [12:42:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3049 [12:44:05] New patchset: Mark Bergsma; "Restrict target domain to upload.wikimedia.org on frontends as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3050 [12:44:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3049 [12:44:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3049 [12:44:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3050 [12:44:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3050 [12:44:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3050 [12:58:01] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [13:04:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:08:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.879 seconds [13:09:07] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [13:09:07] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [13:10:01] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [13:10:01] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [13:10:01] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [13:12:07] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [13:12:07] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [13:12:07] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [13:12:07] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [13:15:07] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [13:17:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:17:04] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [13:27:07] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [13:43:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:35] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [13:46:35] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [13:47:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.177 seconds [13:52:35] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [13:52:35] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [13:52:36] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [13:52:36] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [13:52:37] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [13:52:37] PROBLEM - Puppet freshness 
on knsq25 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [14:00:32] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [14:00:33] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [14:00:33] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [14:00:34] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [14:00:34] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [14:00:35] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:00:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [14:00:36] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [14:02:29] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [14:03:32] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [14:03:32] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [14:03:32] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [14:04:35] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [14:05:29] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [14:06:32] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [14:07:35] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [14:08:29] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [14:08:29] PROBLEM - Puppet 
freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [14:08:29] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [14:09:32] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [14:23:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.425 seconds [15:03:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds [15:18:10] New patchset: Mark Bergsma; "Test modified varnishhtcpd with inline http port" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3051 [15:18:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3051 [15:19:13] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3051 [15:19:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3051 [15:21:18] New patchset: Mark Bergsma; "Typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3052 [15:21:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3052 [15:21:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3052 [15:21:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3052 [15:24:41] New patchset: Mark Bergsma; "Purge both Varnish instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3053 [15:24:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3053 [15:25:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3053 [15:25:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3053 [15:31:02] New patchset: Mark Bergsma; "Convert mobile servers to new htcppurger class, purging both varnish instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3054 [15:31:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3054 [15:31:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3054 [15:31:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3054 [15:43:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.247 seconds [16:08:20] New patchset: Mark Bergsma; "Let's not have Varnish writing to the same file concurrently, shall we" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3055 [16:08:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3055 [16:09:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3055 [16:09:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3055 [16:23:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.675 seconds [17:02:18] New patchset: Mark Bergsma; "Add serve IPs to X-Cache headers for debugging purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3056 [17:02:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3056 [17:02:49] New patchset: Mark Bergsma; "Add server IPs to X-Cache headers for debugging purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3056 [17:03:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3056 [17:03:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3056 [17:03:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3056 [17:03:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:09] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [17:09:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.325 seconds [17:14:13] New patchset: Mark Bergsma; "server.ip is not a string, so use hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3057 [17:14:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3057 [17:14:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3057 [17:14:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3057 [17:30:19] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:19] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:19] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:19] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:16] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:17] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:17] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:17] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:22] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:22] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:31] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:32] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:44:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:19] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:19] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:20] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:20] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.442 seconds [17:50:16] New patchset: Mark Bergsma; "Cache 4xx on upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3058 [17:50:16] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3058 [17:50:34] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:34] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:34] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:51:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3058 [17:51:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3058 [17:55:22] RECOVERY - check_minfraud_secondary on payments3 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 5.748 second response time [17:55:22] PROBLEM - check_minfraud_secondary on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:22] PROBLEM - check_minfraud_secondary on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:55:50] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:51] New patchset: Mark Bergsma; "Specify cache4xx as a time period" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3059 [17:59:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3059 [17:59:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3059 [17:59:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3059 [18:00:19] RECOVERY - check_minfraud_secondary on payments2 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.593 second response time [18:00:19] RECOVERY - check_minfraud_secondary on payments4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.604 second response time [18:00:19] RECOVERY - check_minfraud_secondary on payments1 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.583 second response time [18:06:09] New patchset: Mark Bergsma; "Don't return(hit_for_pass) when caching 4xx" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3060 [18:06:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3060 [18:06:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3060 [18:06:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3060 [18:24:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:55] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [18:28:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.484 seconds [18:28:58] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [18:32:52] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [18:37:58] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [18:39:48] New patchset: Mark Bergsma; "Don't udplog PURGE requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3061 [18:40:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3061 [18:40:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3061 [18:40:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3061 [18:44:11] New patchset: Mark Bergsma; "Don't use single quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3062 [18:44:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3062 [18:44:38] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3062 [18:44:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3062 [18:48:02] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [18:48:35] New patchset: Mark Bergsma; "Make Puppet automatically restart the varnish loggers on changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3063 [18:48:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3063 [18:48:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3063 [18:48:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3063 [19:00:15] hey asher [19:00:19] wanna join the hackathon? ;) [19:00:33] ASHER COME TO HACKATHON [19:00:47] binasher: I just made mediawiki twice faster!!!!111 [19:00:58] zomg! [19:01:00] 5.4! [19:01:07] and suhosin [19:01:17] and -O3 hehe [19:01:24] yeh [19:01:33] death to suhosin [19:02:14] and newer APC [19:02:19] anyway [19:02:31] domas is annoyed with current avg mediawiki request latency [19:02:36] "was 40ms in my days!" [19:02:41] 160 now [19:02:44] !!! [19:02:46] well, 100+ [19:04:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:18] * domas stares at http://svn.php.net/viewvc/pecl/apc/trunk/?sortby=date#dirlist [19:06:52] when's hphpvm going to be ready?? [19:07:29] i'm about ready to deploy varnish for upload [19:08:59] ooh. want to try the persistent backend instead of file? 
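The question just above is about the storage back-end for the new upload Varnish caches: the file back-end versus the then-experimental persistent one. A rough sketch of the difference at the varnishd command line, with the instance name, ports, paths and sizes made up for illustration and not taken from production:

    # file back-end: cache kept in a mmap'ed file, contents discarded on restart
    varnishd -n backend -a :3128 -T :6083 -f /etc/varnish/upload-backend.vcl \
        -s file,/srv/vcache/varnish.store,300G

    # persistent back-end (experimental in Varnish 3.x): cache survives a restart
    varnishd -n backend -a :3128 -T :6083 -f /etc/varnish/upload-backend.vcl \
        -s persistent,/srv/vcache/varnish.persist,300G
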
[19:09:08] I have file now [19:09:13] but I suppose we could test it [19:09:17] perhaps on just a few boxes [19:09:31] binasher: I guess when it's ready [19:09:34] upload.eqiad goes to squid in pmtpa now [19:09:42] except for swift, it contacts that directly [19:09:54] yeah, on a few boxes to compare would be good [19:10:22] but lets start with file for the first few days, since we might have enough issues [19:11:17] oh i didn't realize squid didn't send a udp packet per log entry [19:11:27] yeah [19:12:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.021 seconds [19:12:40] we could even do jumbo frames, 9000 MTU [19:12:45] that would fit quite a few requests [19:22:31] !log reslaved db1033 [19:22:35] Logged the message, Master [19:22:59] RECOVERY - mysqld processes on db1033 is OK: PROCS OK: 1 process with command name mysqld [19:26:44] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 148641 seconds [19:27:02] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 148559 seconds [19:28:08] !log set sync_binlog = 1 on all current masters and eqiad dbs [19:28:11] Logged the message, Master [19:34:20] i wonder how to deal with the varnish metrics for the two varnish instances [19:43:27] New patchset: Mark Bergsma; "Preseed remaining questions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3065 [19:43:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3065 [19:43:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3065 [19:43:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3065 [19:44:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.783 seconds [19:57:13] New patchset: Mark Bergsma; "Add Upload caches eqiad cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3066 [19:57:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3066 [19:57:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3066 [19:57:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3066 [20:25:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.893 seconds [21:06:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds [21:29:08] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [21:29:08] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [21:41:21] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:30] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:30] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:30] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:30] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:41:39] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [21:41:57] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:06] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:06] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:06] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:06] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host ssl3002 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:15] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:42:16] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:16] PROBLEM - Host wiktionary-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:42:17] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:24] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:24] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:24] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:33] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:33] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:33] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:42] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:42] PROBLEM - Host knsq25 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:42] PROBLEM - Host upload.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:42:51] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:51] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet 
loss = 100% [21:43:00] PROBLEM - Host wikiversity-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:00] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:01] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [21:43:09] PROBLEM - Host bits.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:18] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:36] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:36] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:36] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:36] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:36] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:37] PROBLEM - Host wikiquote-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:45] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [21:43:45] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:54] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:54] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:54] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:54] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:43:54] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:12] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:12] PROBLEM - Host br1-knams is DOWN: PING CRITICAL - Packet loss = 100% [21:44:21] PROBLEM - Host ssl3004 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:21] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 100% [21:44:39] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:39] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:39] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:44:48] PROBLEM - Host 91.198.174.6 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:48] PROBLEM - Host ssl3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:48] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:48] PROBLEM - Host csw2-esams is DOWN: PING CRITICAL - Packet loss = 100% [21:45:06] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:15] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:15] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:15] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:24] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:42] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:42] PROBLEM - Host upload.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:42] PROBLEM - Host text.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:45:43] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:45:43] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:51] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [21:45:51] PROBLEM - Host wikimedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:45:51] PROBLEM - Host 
wikimedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:52] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:52] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:53] PROBLEM - Host wikisource-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:45:53] PROBLEM - Host wikinews-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:46:00] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [21:46:00] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100% [21:46:09] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:46:09] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:46:10] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [21:46:18] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [21:46:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:27] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:46:45] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:03] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:03] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:24] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:48:51] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [21:50:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.855 seconds [21:51:36] wtf?? [21:51:49] (and why does this always happen midnight on my shift?) [21:55:59] americans waking up? [21:56:45] no, it's 2 pm on the west coast [21:56:50] and so 5 pm on the east [21:56:59] or maybe I'm an hour off, anyways it's definitely not morning [21:57:25] bits and upload squids in esams still seem to be acting up [22:04:33] I can't ping to a bits squid but I can ssh to it? [22:04:54] not making any sense [22:06:46] looks like bits are coming back. weird [22:07:11] and so are the uploads. 
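The puzzle above, a bits squid that answers ssh but not ping, is the kind of case where ICMP and TCP reachability are worth checking separately: ICMP can be dropped or rate-limited somewhere along the path while TCP still gets through. A quick sketch, with the host name as a hypothetical stand-in for whichever squid is being looked at:

    SQUID=knsq16.knams.wikimedia.org   # hypothetical example host
    ping -c 5 "$SQUID"                 # ICMP echo
    nc -zv -w 5 "$SQUID" 22            # TCP connect to the ssh port
    mtr -r -c 10 "$SQUID"              # per-hop loss report, to see where ICMP dies
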
[22:11:30] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [22:20:03] RECOVERY - Host wikipedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 16%, RTA = 109.81 ms [22:20:04] RECOVERY - Host wikimedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 16%, RTA = 110.06 ms [22:20:05] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 16%, RTA = 109.21 ms [22:20:05] RECOVERY - Host knsq25 is UP: PING WARNING - Packet loss = 66%, RTA = 109.38 ms [22:20:05] RECOVERY - Host amssq43 is UP: PING OK - Packet loss = 16%, RTA = 110.10 ms [22:20:05] RECOVERY - Host amssq40 is UP: PING OK - Packet loss = 16%, RTA = 109.46 ms [22:20:05] RECOVERY - Host amssq37 is UP: PING OK - Packet loss = 16%, RTA = 109.96 ms [22:20:05] RECOVERY - Host wikibooks-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 16%, RTA = 109.34 ms [22:20:06] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 16%, RTA = 109.30 ms [22:20:06] RECOVERY - Host amssq31 is UP: PING WARNING - Packet loss = 66%, RTA = 111.00 ms [22:20:07] RECOVERY - Host amssq38 is UP: PING WARNING - Packet loss = 66%, RTA = 109.54 ms [22:20:07] RECOVERY - Host amssq32 is UP: PING WARNING - Packet loss = 66%, RTA = 110.61 ms [22:20:08] RECOVERY - Host knsq19 is UP: PING WARNING - Packet loss = 66%, RTA = 109.24 ms [22:20:08] RECOVERY - Host knsq21 is UP: PING WARNING - Packet loss = 66%, RTA = 109.89 ms [22:20:09] RECOVERY - Host amssq59 is UP: PING WARNING - Packet loss = 66%, RTA = 109.73 ms [22:20:09] RECOVERY - Host amssq56 is UP: PING OK - Packet loss = 0%, RTA = 109.30 ms [22:20:10] RECOVERY - Host amssq50 is UP: PING OK - Packet loss = 0%, RTA = 109.96 ms [22:20:10] RECOVERY - Host amssq33 is UP: PING OK - Packet loss = 0%, RTA = 110.71 ms [22:20:11] RECOVERY - Host knsq18 is UP: PING WARNING - Packet loss = 28%, RTA = 110.13 ms [22:20:12] RECOVERY - Host amssq52 is UP: PING OK - Packet loss = 0%, RTA = 109.42 ms [22:20:12] RECOVERY - Host amssq62 is UP: PING OK - Packet loss = 0%, RTA = 108.84 ms [22:20:12] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 109.16 ms [22:20:13] RECOVERY - Host amssq46 is UP: PING OK - Packet loss = 0%, RTA = 109.33 ms [22:20:13] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms [22:20:14] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 109.52 ms [22:20:14] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 109.50 ms [22:20:15] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 110.07 ms [22:20:15] RECOVERY - Host ssl3003 is UP: PING WARNING - Packet loss = 80%, RTA = 110.84 ms [22:20:16] RECOVERY - Host ms6 is UP: PING WARNING - Packet loss = 80%, RTA = 110.52 ms [22:20:16] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.43 ms [22:20:21] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.86 ms [22:20:22] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 109.26 ms [22:20:22] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 109.39 ms [22:20:22] RECOVERY - Host ssl3002 is UP: PING OK - Packet loss = 0%, RTA = 109.65 ms [22:20:22] RECOVERY - Host amssq41 is UP: PING OK - Packet loss = 0%, RTA = 110.47 ms [22:20:22] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.88 ms [22:20:22] RECOVERY - Host amssq35 is UP: PING OK - Packet loss = 0%, RTA = 110.71 ms [22:20:22] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA 
= 109.83 ms
[22:20:23] RECOVERY - Host amssq53 is UP: PING OK - Packet loss = 0%, RTA = 109.45 ms
[22:20:23] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 109.79 ms
[22:20:24] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 109.56 ms
[22:20:24] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 80%, RTA = 112.29 ms
[22:20:25] RECOVERY - Host amssq42 is UP: PING WARNING - Packet loss = 86%, RTA = 112.22 ms
[22:20:25] RECOVERY - Host wikiversity-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 112.63 ms
[22:20:26] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 109.41 ms
[22:20:26] RECOVERY - Host wikinews-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 110.43 ms
[22:20:30] RECOVERY - Host knsq29 is UP: PING WARNING - Packet loss = 73%, RTA = 112.43 ms
[22:20:30] RECOVERY - Host ssl3004 is UP: PING OK - Packet loss = 0%, RTA = 109.57 ms
[22:20:30] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 109.28 ms
[22:20:30] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 109.19 ms
[22:20:30] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.27 ms
[22:20:39] RECOVERY - Host amssq61 is UP: PING OK - Packet loss = 0%, RTA = 110.13 ms
[22:20:39] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 109.92 ms
[22:20:39] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 109.43 ms
[22:20:39] RECOVERY - Host cp3002 is UP: PING WARNING - Packet loss = 93%, RTA = 110.08 ms
[22:20:39] RECOVERY - Host amssq54 is UP: PING WARNING - Packet loss = 80%, RTA = 112.40 ms
[22:20:39] RECOVERY - Host amssq39 is UP: PING WARNING - Packet loss = 86%, RTA = 112.24 ms
[22:20:39] RECOVERY - Host amssq34 is UP: PING OK - Packet loss = 0%, RTA = 109.24 ms
[22:20:48] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 108.98 ms
[22:20:48] RECOVERY - Host bits.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.10 ms
[22:20:49] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 109.44 ms
[22:20:49] RECOVERY - Host knsq20 is UP: PING WARNING - Packet loss = 93%, RTA = 110.06 ms
[22:20:49] RECOVERY - Host amslvs3 is UP: PING WARNING - Packet loss = 66%, RTA = 113.23 ms
[22:20:57] RECOVERY - Host wiktionary-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 58%, RTA = 112.17 ms
[22:20:57] RECOVERY - Host amssq57 is UP: PING OK - Packet loss = 0%, RTA = 109.14 ms
[22:20:58] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 109.09 ms
[22:20:58] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 111.74 ms
[22:20:58] RECOVERY - Host cp3001 is UP: PING OK - Packet loss = 0%, RTA = 109.24 ms
[22:20:58] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 108.93 ms
[22:20:58] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 110.20 ms
[22:20:58] RECOVERY - Host upload.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.49 ms
[22:21:15] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms
[22:21:15] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 109.38 ms
[22:21:15] RECOVERY - Host br1-knams is UP: PING OK - Packet loss = 0%, RTA = 109.10 ms
[22:21:15] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 109.82 ms
[22:21:15] RECOVERY - Host wikiquote-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 110.24 ms
[22:21:24] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 109.18 ms
[22:22:09] RECOVERY - Host amssq45 is UP: PING OK - Packet loss = 0%, RTA = 110.71 ms
[22:23:21] RECOVERY - Host text.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 109.92 ms
[22:23:21] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.34 ms
[22:23:22] RECOVERY - Host upload.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.60 ms
[22:23:30] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.13 ms
[22:23:30] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.21 ms
[22:23:31] RECOVERY - Host wikisource-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.86 ms
[22:23:31] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.63 ms
[22:23:32] RECOVERY - Host wikinews-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.56 ms
[22:23:57] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 110.44 ms
[22:23:57] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.71 ms
[22:24:06] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.10 ms
[22:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:26:03] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 109.78 ms
[22:30:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.993 seconds
[22:59:30] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[23:07:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:09:24] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[23:10:27] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours
[23:10:27] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours
[23:11:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.029 seconds
[23:11:30] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours
[23:11:30] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:11:30] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:57] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:57] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:57] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:57] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[23:16:09] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours
[23:18:06] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[23:18:06] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[23:28:09] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours
[23:46:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:48:06] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[23:48:06] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[23:52:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.444 seconds
[23:54:06] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:06] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:06] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:06] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:06] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:06] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:07] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:07] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:08] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:08] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours
[23:54:09] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours
[23:57:33] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[23:58:00] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds