[01:41:28] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 220 seconds
[01:42:36] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 257 seconds
[01:49:03] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 643s
[01:51:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds
[01:55:48] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 3s
[01:56:24] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 3 seconds
[02:46:32] New patchset: Krinkle; "Update default/index.html in Apache configuration." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16241
[03:00:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:24:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:58:05] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[06:01:05] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[06:03:02] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[06:03:38] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[06:07:32] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[06:30:56] PROBLEM - SSH on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:56] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.8:11000 (Connection timed out) 10.0.8.10:11000 (timeout)
[06:31:32] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:33:56] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time
[06:38:07] RECOVERY - SSH on srv260 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[06:39:01] PROBLEM - Memcached on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:41:34] RECOVERY - Memcached on srv260 is OK: TCP OK - 2.999 second response time on port 11000
[06:42:19] PROBLEM - SSH on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:50:07] PROBLEM - Memcached on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:51:28] RECOVERY - Memcached on srv260 is OK: TCP OK - 2.998 second response time on port 11000
[06:58:01] PROBLEM - Memcached on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:02:04] RECOVERY - Memcached on srv260 is OK: TCP OK - 2.999 second response time on port 11000
[07:19:28] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[07:19:55] RECOVERY - SSH on srv260 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[07:20:13] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time
[09:25:02] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[09:25:20] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused
[09:29:50] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[10:23:01] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-5/2/1 (FPL/Level3, CV71028) [10Gbps wave]
[12:26:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:30:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[13:00:47] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[13:02:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:11:08] New patchset: Alex Monk; "(bug 36104) Enable Narayam on bnwikisource." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16252
[13:11:50] Uh oh...
[13:13:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.053 seconds
[13:25:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:36:00] New review: Alex Monk; "Have you forgotten about this change?" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/12556
[13:45:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:57:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.769 seconds
[14:31:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds
[15:16:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds
[15:58:45] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[16:01:54] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[16:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:12:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.770 seconds
[16:46:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:56:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.260 seconds
[17:31:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:42:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[18:15:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:28:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[18:38:09] New patchset: Alex Monk; "(bug 37885) Enable FlaggedRevs on trwikiquote." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16264
[19:00:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:14:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[19:29:28] New patchset: Alex Monk; "(bug 27918) Enable wgUseRCPatrol on hewikibooks." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16267
[19:30:39] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[19:46:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:56:15] New patchset: Alex Monk; "Remove swwiki's ridiculously high account creation throttle." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16268
[19:57:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[20:29:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.916 seconds
[21:07:46] New review: Thehelpfulone; "+1'ing the change, but I would probably put this on hold for a couple more days in case others comme..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/16237
[21:13:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:24:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.460 seconds
[21:46:33] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[21:59:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:07:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.867 seconds
[22:42:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:53:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.585 seconds
[22:56:27] RECOVERY - LDAP on sanger is OK: TCP OK - 0.003 second response time on port 389
[22:56:45] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.005 second response time on port 636
[22:56:56] !log restarted opendj on sanger. The process OOM'd due to heap size.
[22:57:05] Logged the message, Master
[23:01:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:01:30] ryan_lane - thanks!
[23:01:36] mails are flowing again
[23:26:18] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[23:26:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:37:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.462 seconds
[23:38:56] woosters: yw