[00:01:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.410 seconds [00:10:33] wikizprav /wg ncs [00:18:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:18:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:37:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [01:08:52] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [01:14:41] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 269 seconds [01:16:21] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [01:18:17] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:19:56] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [01:23:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:38] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:28:47] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:29:05] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:30:17] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [01:30:17] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:30:44] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.079 seconds [01:35:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.687 seconds [01:37:56] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [01:37:56] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [01:37:56] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [01:37:56] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [01:37:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:37:57] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [01:42:26] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:47:14] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:23] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:47:42] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:49:29] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [01:52:20] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.072 seconds [01:52:29] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:52:38] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [02:07:02] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:08:32] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.334 seconds [02:09:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [02:25:20] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:29] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:00] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.079 seconds [02:28:22] !log LocalisationUpdate completed (1.21wmf6) at Fri Dec 28 02:28:21 UTC 2012 [02:28:34] Logged the message, Master [02:28:54] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:36:06] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:33] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:40:45] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:40:54] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:30] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:18] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [02:44:27] New patchset: Hoo man; "Fix wgCentralAuthCookieDomain" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41018 [02:45:11] Could someone please merge and deploy this? It's causing trouble on meta [02:46:36] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:46:36] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:47:24] New review: Hoo man; "This change broke it: https://gerrit.wikimedia.org/r/32167" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/41018 [02:47:39] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 1.820 seconds [02:47:57] TimStarling: ^ [02:47:57] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.084 seconds [02:53:24] apergos: ^ [02:55:37] ok [02:56:47] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41018 [02:57:30] !log tstarling synchronized wmf-config/CommonSettings.php [02:57:40] Logged the message, Master [02:58:00] works now... 
thanks, TimStarling ;) [03:27:06] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:57] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:24] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:31:36] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 4.895 seconds [03:31:54] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:33:33] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:36:06] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [03:42:24] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:47:21] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:47:21] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:49:54] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:49:54] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:51:33] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [03:51:33] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:54:15] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.080 seconds [03:54:15] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.592 seconds [04:06:42] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:09:06] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:10:36] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [04:11:48] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 3.516 seconds [04:17:12] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:18:42] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 6.030 seconds [04:21:24] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:24] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:22:54] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [04:22:54] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [04:28:27] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:29:21] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [04:33:42] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [04:34:36] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: HTTP CRITICAL - No data received from host [04:38:15] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.074 seconds [04:40:00] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:40:56] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:42:08] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:42:08] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [04:42:35] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [04:46:20] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [04:51:26] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [05:00:53] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:02:33] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:11] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.073 seconds [05:04:20] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:11:32] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:41] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:44] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:13:11] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:20] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:16:11] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [05:16:38] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.078 seconds [05:16:56] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.079 seconds [05:21:44] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:23:23] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [05:27:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:33:44] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 185 seconds [05:34:47] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 203 seconds [05:39:44] PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:39:44] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:39:45] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:40:36] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.077 seconds [05:41:38] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:45:41] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:45:59] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:46:26] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.071 seconds [05:49:26] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 5.854 seconds [05:50:29] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:50:38] PROBLEM - SSH on cp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:52:08] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [05:52:17] RECOVERY - SSH on cp1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:52:35] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [05:54:32] PROBLEM - Varnish HTTP upload-backend on cp1023 is CRITICAL: Connection refused [05:54:40] !log restarted varnish front and back ends on cp1023, it was swapping [05:54:49] Logged the message, Master [05:55:54] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [05:56:20] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [05:56:20] RECOVERY - Varnish HTTP upload-backend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.073 seconds [06:13:44] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [06:40:44] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:42:14] RECOVERY - Lucene on search14 is OK: TCP OK - 0.009 second response time on port 8123 [06:52:31] PROBLEM - Lucene on search14 is CRITICAL: Connection timed out [06:55:49] RECOVERY - Lucene on search14 is OK: TCP OK - 0.002 second response time on port 8123 [06:59:34] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [06:59:34] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:48:16] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [08:56:58] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [08:56:58] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [08:56:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the 
last 10 hours [08:56:58] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [09:04:01] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 197 seconds [09:04:28] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 211 seconds [09:22:37] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [09:24:25] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.16 ms [09:29:04] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:30:31] afk for a little (early lunch with a visitor) [09:41:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [09:44:40] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 12 seconds [09:45:52] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [10:02:25] New patchset: Mark Bergsma; "Ignore case when comparing headers" [operations/software] (master) - https://gerrit.wikimedia.org/r/41026 [10:02:25] New patchset: Mark Bergsma; "Separate container creation, use cloudfiles' unicode_quote" [operations/software] (master) - https://gerrit.wikimedia.org/r/41027 [10:02:26] New patchset: Mark Bergsma; "Revert unicode changes" [operations/software] (master) - https://gerrit.wikimedia.org/r/41028 [10:03:39] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/39649 [10:04:28] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/41026 [10:04:56] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/41027 [10:05:13] Change merged: Mark Bergsma; [operations/software] (master) - https://gerrit.wikimedia.org/r/41028 [10:18:52] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 192 seconds [10:19:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:19:55] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:20:31] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:10:01] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [11:14:49] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 187 seconds [11:15:34] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 205 seconds [11:24:25] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 26 seconds [11:29:17] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [11:39:29] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [11:39:29] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [11:39:29] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [11:39:29] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [11:39:30] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:39:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [13:37:29] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [14:43:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:47:25] PROBLEM - Puppet freshness 
on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [14:52:31] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [14:57:37] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [15:25:13] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [15:28:31] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:29:05] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40983 [15:31:40] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Dec 28 15:31:23 UTC 2012 [15:57:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [16:14:25] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [16:16:12] anyone got an idea why squid/varnish would return 200 to a range request if the If-Range header is sent? (https://bugzilla.wikimedia.org/show_bug.cgi?id=43477) [16:26:42] !log authdns-update adding mc17/18 to pmtpa mgmt zonefile [16:26:55] Logged the message, Master [16:30:53] sbernardin: please do drac cfg for mc17/18 rt4211 [16:31:40] cmjohnson1: got it [16:37:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:09] !log maxsem synchronized php-1.21wmf6/extensions/PageImages/PageImages.body.php 'https://gerrit.wikimedia.org/r/#/c/40812/' [16:47:17] Logged the message, Master [16:51:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [16:58:21] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 194 seconds [16:58:30] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 195 seconds [16:58:43] cmjohnson1: mc17 & mc18 iDrac setup is complete [17:00:03] New patchset: Jgreen; "add fr-tech individuals for nagios notification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41040 [17:00:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41040 [17:01:12] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:01:12] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [17:01:48] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [17:01:57] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [17:22:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds [18:07:16] !log restart udp2log on emery to clean up 838 defunct processes [18:07:24] Logged the message, Master [18:08:16] New review: Pyoungmeister; "sorry I didn't see this earlier, but I have a comment!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40983 [18:09:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:06] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [18:11:18] New review: MaxSem; "My whole idea was that if there's a solr instance, it should be monitored. And I've always thought t..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/40983
[18:11:24] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[18:12:03] MaxSem: hey, let's talk here :)
[18:12:09] okay
[18:12:12] so, you're correct
[18:12:18] we should be using role modules, actually
[18:12:23] that incorporate other modules
[18:12:46] this can't really be done piecemeal, as we've been doing a lot of our other transitions, though
[18:13:08] I agree that all solr instances should be monitored
[18:13:12] for damn sure
[18:14:07] but, the docs for puppet modules say that each module should be as atomic as possible
[18:14:21] and while we ignore lots of things in these docs, I think it's a legit thing to try for
[18:15:01] and hopefully soon we'll transition all role classes to role modules that tie together all the smaller modules that they're comprised of
[18:15:23] as this is the direction that puppet labs is taking puppet, generally
[18:16:04] do you have a good example so that I could base on it?
[18:17:31] sure
[18:18:03] the ones that I'm the most familiar with are manifests/role/lucene.pp
[18:18:11] or applicationserver.pp
[18:18:24] applicationserver.pp pulls in a module by the same name
[18:18:31] and then adds some other stuff, like mediawiki install
[18:18:33] monitoring
[18:18:38] ganglia group
[18:18:39] etc
[18:18:47] and then is assigned to a node
[18:19:04] Ryan_Lane: I have a replacement HD for labstore3...when do you think you'll be available for me to replace it
[18:19:25] sbernardin: if you're ready to do it now, let's go for it
[18:19:42] the system needs to be brought down for this?
[18:19:50] the drives aren't hotswap?
[18:20:39] Not sure...Chris told me to get with you before I swapped it out
[18:21:15] I'd surely hope they are hot swappable
[18:23:22] notpeter, before I start, can we check if monitoring works?
[18:23:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[18:23:51] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa
[18:24:47] sbernardin: well, let me get into the office first, I guess
[18:24:54] I'll message you when I get there
[18:26:25] Ok
[18:27:01] It is hot swappable
[18:27:12] MaxSem: definitely!
[18:27:17] You want me to just replace it?
[18:27:22] sbernardin: ah, well, if that's the case, no need to wait for me
[18:27:23] yeah
[18:28:25] notpeter, so if puppet has already run on the nagios host, I guess what is needed is to stop jetty on any server other than solr1001
[18:29:37] MaxSem: I don't quite understand why. maybe I'm just unclear on exactly how you want to test
[18:30:03] oh, to see if the check crits?
[18:30:08] yes
[18:30:13] ah, sure!
[18:30:34] only solr1001 is currently used, others are just idling
[18:31:22] ok, I'll stop solr on 1002 and we can make sure it crits
[18:31:28] er, 1003
[18:33:05] ok, stopped
[18:33:11] and now we play the waiting game
[18:34:42] notpeter, btw - I've started https://wikitech.wikimedia.org/view/Solr - anything else you ops want to see there?
[18:36:09] MaxSem: that looks pretty good to me!
[18:38:36] New patchset: Jgreen; "add udp2log filter to estimate 2012 fundraiser video views" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41044
[18:39:15] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41044
[18:39:16] MaxSem: hhhmmm, it seems to be saying "connection refused" but it's still green in nagios
[18:42:22] Ryan_Lane: labstore3 disk has been replaced
[18:42:32] great, thanks
[18:43:38] notpeter, any ideas how to fix?
[18:44:10] New patchset: Pyoungmeister; "fix for solr monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41045
[18:44:17] that should do it
[18:44:29] in the nagios config it ended up looking like
[18:44:31] check_command check_http_url!{}!http://{solr1001.eqiad.wmnet}:8983/solr/select/?q=*%3A*&start=0&rows=1&ind
[18:44:34] ent=on
[18:45:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41045
[18:45:44] ok, I'm running puppet again on spence
[18:45:55] this should be done at some point before the heat death of the universe
[18:49:51] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:51:23] notpeter: I think moving role/ to a module in one go might actually work
[18:51:27] and not be completely painful
[18:51:31] paravoid: yep
[18:51:32] agreed
[18:51:44] or be totally terrible
[18:51:48] and ruin everything
[18:51:50] but whatevs
[18:52:35] I think it's more likely that it'd be fine
[18:56:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:57:57] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:57] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:57] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:58] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[19:08:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.339 seconds
[19:31:16] LeslieCarr, out of interest does anyone else other than Daniel do Mailman stuff in ops?
[19:32:26] Thehelpfulone: he usually finishes mailman tasks before any of the rest of us even see the tickets... so... not really
[19:32:58] You have to be a special sort of insane to do mailman admin
[19:33:07] heh
[19:33:15] daniel is a ticket closing machine. he's amazing.
[19:43:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:46:49] anybody have experience setting up wikidiff2? i'm trying to set it up locally - i've followed instrux on http://mediawiki.org/, but am now getting the following php error: /mnt/hgfs/testing/core/bin/ulimit4.sh: line 4: wikidiff2: command not found
[19:54:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.163 seconds
[19:59:04] do we have anything that runs at 7:30 UTC every day?
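The generated check_command quoted at 18:44 above hints at why the Solr check misbehaved: the host argument is empty and literal braces ended up inside the URL. Below is a hypothetical Puppet sketch of the intended declaration; the monitor_service define and its parameter names are assumptions for illustration only (not the contents of change 41045), and only the query URL itself is taken from the log.

class solr::monitoring($solr_port = '8983') {
    # Hypothetical sketch: "monitor_service" and its parameters are assumed.
    # The point is that host and port interpolate to concrete values, rather
    # than the empty "{}" host and braced "{solr1001.eqiad.wmnet}" seen in
    # the generated Nagios config quoted above.
    monitor_service { 'solr':
        description   => 'Solr',
        check_command => "check_http_url!${::fqdn}!http://${::fqdn}:${solr_port}/solr/select/?q=*%3A*&start=0&rows=1&indent=on",
    }
}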
[20:16:13] Thehelpfulone: yes, notpeter is right -- though others can do mailman stuffs
[20:20:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[20:20:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[20:28:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:41:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.413 seconds
[20:50:25] notpeter, is puppet still running?:P
[20:52:08] MaxSem: which host are you worried about ?
[20:52:13] if it's spence, it will take about 3 hours ;)
[20:52:19] loooooooooooooooooool
[20:52:54] is it run in IBM PC Jr emulator written in QBasic?
[20:53:39] 'ruby' for short
[20:55:02] hehehe
[20:55:03] :)
[21:10:58] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[21:14:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.626 seconds
[21:42:35] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:35] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:35] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:35] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:35] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:36] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[21:48:16] notpeter, something like http://dpaste.org/QZD08/ ?
[22:00:38] MaxSem: I was thinking a lot simpler than that
[22:01:40] like role::solr, in my mind, would just include a system_role, service monitoring, anything related to ganglia (not sure if you want to make a cluster) and class { "solr": schema => ...
[22:02:04] and then assign that to ^solr(100)?[1-3]\.(eqiad|pmtpa)\.wmnet in site.pp
[22:03:10] I'm also willing to do this, if you'd like
[22:03:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:06:03] notpeter, isn't that what I'm doing?
[22:06:57] ah, yeah, upon reading over again, yep
[22:07:00] looks great!
[22:07:11] (sorry, had a dyslexic moment)
[22:07:28] okay, lemme finish this...
[22:07:33] cool!
[22:15:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.013 seconds
[22:15:52] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[22:20:32] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[22:32:53] New patchset: MaxSem; "Refactor Solr stuff into a role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41186
[22:33:00] notpeter, ^^
[22:48:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:48:50] MaxSem: looks great! I'll +2 and deploy now
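For reference, a minimal sketch of the role-class shape described at 22:01 above, assuming the system_role define and the solr class parameter mentioned in the conversation; the schema value is a placeholder, the monitoring class name refers to the earlier hypothetical sketch, and none of this is the code actually merged in change 41186.

# manifests/role/solr.pp -- illustrative only, not the merged change
class role::solr {
    # system_role is assumed to be an existing define taking a description
    system_role { 'role::solr': description => 'Solr search server' }

    class { 'solr':
        schema => 'example-schema.xml',   # placeholder; the real schema name differs
    }

    # service monitoring, per the earlier sketch (assumed class/define names)
    include solr::monitoring
}

# site.pp would then attach the role to the Solr hosts, per the regex above:
node /^solr(100)?[1-3]\.(eqiad|pmtpa)\.wmnet$/ {
    include role::solr
}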
[22:48:57] whee
[22:49:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41186
[22:49:54] so, the bad news is that I don't think the solr check is working properly
[22:50:14] ex: https://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=solr1001
[22:50:32] it says connection refused, but still checks out as ok
[22:50:38] I'll take a closer look at that
[22:51:59] AAAAAAAAAAAAAAARGH
[22:52:05] paravoid: ?
[22:52:39] * Damianz gives paravoid his pills
[23:04:26] ori-l: what are you guys using /mnt/jdump for on the roundtrip instances?
[23:04:52] also, are you guys still using the roundtrip instances?
[23:06:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[23:07:35] LeslieCarr, if you're still on RT duty, looks like 4215 is a duplicate of 4213
[23:23:24] thanks
[23:26:13] Ryan_Lane: jdump is obsolete; i can delete it
[23:26:21] ori-l: please do :)
[23:26:44] that host is at 87% for disk
[23:26:58] and that instance is eating up nearly 100GBN
[23:26:59] yeah if anyone fills up / we're kinda screwed
[23:27:00] *GB
[23:27:04] Damianz: yes
[23:27:05] 100gb? meh
[23:27:06] very
[23:27:12] Damianz: out of 1TB
[23:27:18] it's almost 1/10th of the disk ;)
[23:27:19] I just provisioned like 6tb at w0rk for new servers
[23:27:35] not made logstash work sexyish yet though :(
[23:27:38] most of the large disk space users seem to be on virt6
[23:27:44] no problem
[23:28:03] I wonder when we'll need nice 10gb connections to the storage servers or nice fc sans :D
[23:28:05] it would be helpful to use logstash to eliminate log clutter on the vms
[23:28:07] Ryan_Lane: don't mind me seemingly brute-forcing my own password in the logs :|
[23:28:08] meh gluster will break before then
[23:28:10] i'm just an idiot
[23:28:14] ori-l: :D
[23:28:27] Damianz: we have 64TB in gluster
[23:28:29] Well... we should
[23:28:36] and yes, it will likely break well before then
[23:28:38] From an auditing point of view anyway
[23:28:51] Like if we need to prove the zombies got in and did bad things
[23:29:28] bleh I just switched channels for no reason then -.- damn irc client
[23:31:45] hm. will kvm actually shrink a disk once it's used a certain amount of space?
[23:32:12] ori-l: Did you try all the passwords?
[23:32:17] bleh. seems not
[23:32:24] because its disk is still huge :(
[23:38:34] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[23:38:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:41:38] Reedy: all of (1234, 12345, 123456, abcdefg, 'password')
[23:41:41] usually does the trick.
[23:54:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds