[00:03:55] New patchset: Lcarr; "removing lvs monitors :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4598 [00:04:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4598 [00:04:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4598 [00:04:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4598 [00:04:21] New patchset: Bhartshorne; "moving iron to the new generic mysql class and throwing up a firewall." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4599 [00:04:28] binasher: ^^^ [00:04:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4599 [00:04:59] working. \o/ [00:09:15] New patchset: Bhartshorne; "moving iron to the new generic mysql class and throwing up a firewall." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4599 [00:09:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4599 [00:16:21] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4599 [00:16:23] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4599 [00:20:28] PROBLEM - Puppet freshness on search1004 is CRITICAL: Puppet has not run in the last 10 hours [00:28:34] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [00:36:31] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [00:44:28] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [00:59:43] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [00:59:43] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [01:14:43] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [01:14:43] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [01:30:13] New patchset: Tim Starling; "Enable mod_auth on lists.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4603 [01:30:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4603 [01:31:34] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4603 [01:31:36] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4603 [01:33:16] !log on sodium: enabling mod_auth on lists.wikimedia.org by running puppet [01:33:19] Logged the message, Master [01:36:16] New review: Tim Starling; "Did you test this? I'm just wondering how long it was broken for." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/503 [03:49:11] oooh, it's a paravoid! 
good flight?
[03:50:44] any swift ppl around? no maplebed
[03:54:21] the swift issue is at https://commons.wikimedia.org/wiki/User_talk:Zscout370#File:Flag_of_Belarus_2012.svg
[03:58:35] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[03:58:59] !b 34755
[03:58:59] https://bugzilla.wikimedia.org/34755
[03:59:17] (discussion going about this now in #wikimedia)
[04:05:09] New review: Krinkle; "Do you still want to remove this, now that https://gerrit.wikimedia.org/r/#change,4366 is in?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4364
[04:08:39] maplebed is SF?
[04:21:45] ok, mailed ben+aaron
[04:55:54] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[05:16:54] PROBLEM - MySQL Slave Running on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:26:08] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:26:29] New review: Hashar; "Krinkle wrote:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4364
[09:06:52] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 281519 seconds
[09:13:28] RECOVERY - mysqld processes on es1003 is OK: PROCS OK: 1 process with command name mysqld
[09:28:37] PROBLEM - MySQL Slave Delay on es1003 is CRITICAL: CRIT replication delay 390763 seconds
[09:30:25] !log restarted slaving on es1003, it will be a bit before it catches up.
patience, young nagios [09:30:28] Logged the message, Master [10:02:06] RECOVERY - MySQL Slave Delay on es1003 is OK: OK replication delay 0 seconds [10:02:18] υαυ [10:02:19] er [10:02:21] yay :-P [10:21:45] PROBLEM - Puppet freshness on search1004 is CRITICAL: Puppet has not run in the last 10 hours [10:29:42] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [10:37:39] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours [10:45:51] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [11:00:51] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [11:00:51] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [11:15:52] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [11:15:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [11:16:12] New review: Dzahn; "yeah, appears to work for me. i do get a 401 - Unauthorized on " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/503 [11:20:22] New patchset: Mark Bergsma; "Allow 3.0.2-2wm3 in the repository without upgrading existing installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4611 [11:20:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4611 [11:21:54] New patchset: Mark Bergsma; "Allow 3.0.2-2wm3 in the repository without upgrading existing installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4611 [11:22:10] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4611 [11:23:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4611 [11:23:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4611 [11:24:58] !log Imported varnish 3.0.2-2wm3 into the Wikimedia APT repository [11:25:00] Logged the message, Master [11:29:05] New review: Dzahn; "how come every single line is diff here?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4367 [11:32:04] New review: Reedy; "I actually have no idea..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4367 [11:34:04] New patchset: Mark Bergsma; "Configure cp1029-1036 as eqiad Varnish servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4612 [11:34:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4612 [11:34:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4612 [11:34:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4612 [11:39:21] New review: Dzahn; "looks good though anyways. tested the new https links. the only thing left is the certificate warnin..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4367 [11:39:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4367 [11:39:40] New patchset: Mark Bergsma; "cp1029-1036 were installed with the wrong partitioning template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4613 [11:39:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4613 [11:40:08] daniel? 
[11:40:14] hi mark
[11:40:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4613
[11:40:30] hi
[11:40:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4613
[11:40:36] do you have time to reinstall 8 servers?
[11:40:43] cp1029-1036 were installed with the wrong partitioning
[11:41:21] yea, well, at least i can start
[11:41:41] thanks
[11:41:49] it should be fully automatic
[11:42:09] ok, cool, so does not involve partman recipes?;)
[11:42:12] nope
[11:42:26] you just need to do some handholding to get them to pxe boot
[11:42:45] ok
[11:42:57] they're not in production, you can do them all at once
[11:43:05] we got your and my merge on sockpuppet right now, was at "fetch && diff" , right before merge
[11:43:11] ok
[11:43:15] i just merged it
[11:43:18] cool
[11:45:13] New patchset: Mark Bergsma; "Snapshot build" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4614
[11:45:13] New patchset: Mark Bergsma; "Compilation fixes" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4615
[11:45:14] New patchset: Mark Bergsma; "VCL support for the new chash director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4616
[11:45:15] New patchset: Mark Bergsma; "Ran automake" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4617
[11:45:15] New patchset: Mark Bergsma; "Added chash director parser to vcc_compile.h" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4618
[11:45:16] New patchset: Mark Bergsma; "Add chash director to include/vrt.h" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4619
[11:45:17] New patchset: Mark Bergsma; "Fix wrong struct size being passed to qsort" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4620
[11:45:18] New patchset: Mark Bergsma; "Correct chash director author, styling and additional
comments" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4621 [11:45:33] crap [11:45:52] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4614 [11:46:04] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4615 [11:46:15] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4616 [11:46:26] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4617 [11:46:37] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4618 [11:46:48] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4619 [11:46:57] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4620 [11:47:07] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4621 [11:48:06] New patchset: Mark Bergsma; "New version 3.0.2-2wm3: * Correct chash director author, styling and additional comments" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4622 [11:49:13] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4622 [11:50:18] !log pxe boot / reinstall cp1029 - cp1036 [11:50:20] Logged the message, Master [11:51:06] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4622 [11:51:08] Change merged: Mark Bergsma; [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4622 [11:52:02] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:35] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [12:01:29] PROBLEM 
- SSH on cp1029 is CRITICAL: Connection refused [12:02:23] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:08] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [12:04:56] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [12:05:50] RECOVERY - SSH on cp1029 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:05:59] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [12:07:20] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:56] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [12:08:41] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [12:10:11] PROBLEM - Host cp1035 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:32] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:41] PROBLEM - SSH on cp1033 is CRITICAL: Connection refused [12:12:17] PROBLEM - SSH on cp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:12:53] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [12:15:17] RECOVERY - SSH on cp1032 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:15:44] RECOVERY - Host cp1035 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [12:16:02] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:47] PROBLEM - SSH on cp1034 is CRITICAL: Connection refused [12:17:05] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [12:17:41] RECOVERY - SSH on cp1033 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:17:50] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [12:19:29] PROBLEM - SSH on cp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:21:26] PROBLEM - SSH on cp1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:22:11] RECOVERY - SSH on cp1035 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:24:08] RECOVERY - SSH on 
cp1036 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [12:26:14] PROBLEM - NTP on cp1029 is CRITICAL: NTP CRITICAL: Offset unknown [12:29:23] PROBLEM - NTP on cp1032 is CRITICAL: NTP CRITICAL: No response from NTP server [12:34:47] RECOVERY - NTP on cp1029 is OK: NTP OK: Offset 0.04394471645 secs [12:35:39] mark: cp1029 - puppet run: /Role::Cache::Upload/Varnish::Setup_filesystem[sdb3]/Mount[/srv/sdb3]: Could not evaluate: Execution of '/bin/mount -o noatime,nodiratime,nobarrier,logbufs=8 /srv/sdb3' returned 32: mount: /dev/sdb3: can't read superblock [12:35:48] cp1030: Role::Cache::Upload/Varnish::Instance[upload-frontend]/Exec[load-new-vcl-file-frontend]: Failed to call refresh: /usr/share/varnish/reload-vcl -n frontend returned 1 instead of one of [0] at /var/lib/git/operations/puppet/manifests/varnish.pp:128 [12:39:42] puppet runs finish though [12:40:00] PROBLEM - NTP on cp1033 is CRITICAL: NTP CRITICAL: No response from NTP server [12:41:48] PROBLEM - Varnish HTTP upload-frontend on cp1034 is CRITICAL: Connection refused [12:41:48] PROBLEM - Varnish HTCP daemon on cp1035 is CRITICAL: Connection refused by host [12:41:48] PROBLEM - Varnish HTCP daemon on cp1033 is CRITICAL: Connection refused by host [12:41:48] PROBLEM - Varnish HTTP upload-frontend on cp1036 is CRITICAL: Connection refused [12:42:15] PROBLEM - Varnish HTTP upload-backend on cp1029 is CRITICAL: Connection refused [12:42:15] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: Connection refused by host [12:42:15] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [12:42:15] PROBLEM - Varnish HTTP upload-backend on cp1035 is CRITICAL: Connection refused [12:42:24] PROBLEM - Varnish HTTP upload-backend on cp1033 is CRITICAL: Connection refused [12:42:42] PROBLEM - Varnish HTTP upload-frontend on cp1035 is CRITICAL: Connection refused [12:42:42] PROBLEM - Varnish HTTP upload-frontend on cp1029 is CRITICAL: Connection refused [12:42:51] PROBLEM - 
Varnish HTTP upload-frontend on cp1033 is CRITICAL: Connection refused [12:42:51] PROBLEM - Varnish HTCP daemon on cp1036 is CRITICAL: Connection refused by host [12:42:51] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: Connection refused by host [12:43:00] PROBLEM - Varnish HTTP upload-backend on cp1036 is CRITICAL: Connection refused [12:43:00] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: Connection refused by host [12:43:00] PROBLEM - Varnish HTTP upload-backend on cp1034 is CRITICAL: Connection refused [12:44:12] PROBLEM - NTP on cp1035 is CRITICAL: NTP CRITICAL: Offset unknown [12:44:57] PROBLEM - NTP on cp1036 is CRITICAL: NTP CRITICAL: No response from NTP server [12:46:00] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 2 processes with command name varnishncsa [12:47:57] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 2 processes with command name varnishncsa [12:49:00] RECOVERY - NTP on cp1032 is OK: NTP OK: Offset 0.02929723263 secs [12:49:54] RECOVERY - NTP on cp1035 is OK: NTP OK: Offset -0.04992544651 secs [12:51:51] RECOVERY - NTP on cp1033 is OK: NTP OK: Offset -0.01880443096 secs [12:52:09] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:33] mutante: oh darn [12:55:38] that means they're installed wrong again [12:55:46] probably because puppet on brewster didn't update yet [12:57:03] ah, so maybe it updated while i was doing this. because on 1030 i did not see the "superblock" error [12:57:42] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [12:57:50] ok [12:57:51] RECOVERY - NTP on cp1036 is OK: NTP OK: Offset -0.007479310036 secs [12:59:04] redoing 1031 and 1034. one showed superblock error during install, other i got disconnected from console and then didnt get any output. 
on second attempt they appear to install fine [13:05:30] RECOVERY - SSH on cp1034 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:06:49] arr, can't login with the new_install key now.. [13:07:55] nevermind:) [13:12:33] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 2 processes with command name varnishncsa [13:19:54] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [13:25:27] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [13:25:36] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:25:54] RECOVERY - Varnish HTCP daemon on cp1033 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [13:26:12] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:28:09] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [13:29:03] PROBLEM - SSH on cp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:03] PROBLEM - Varnish HTCP daemon on cp1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:29:48] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:31:36] PROBLEM - SSH on search1005 is CRITICAL: Connection refused [13:31:54] RECOVERY - SSH on cp1029 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:32:48] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [13:33:15] RECOVERY - SSH on search1004 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:33:51] RECOVERY - Host search1007 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [13:33:51] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [13:35:30] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:39] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 2 processes with command name varnishncsa [13:36:47] woo lookit all those search boxen come alive [13:37:37] <^demon> Did you throw a switch and lightning rushed down from the roof to fill them with surging energy? [13:37:45] RECOVERY - SSH on search1005 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:37:45] PROBLEM - SSH on search1007 is CRITICAL: Connection refused [13:37:54] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [13:38:19] PROBLEM - SSH on search1006 is CRITICAL: Connection refused [13:38:46] PROBLEM - NTP on search1005 is CRITICAL: NTP CRITICAL: No response from NTP server [13:39:49] PROBLEM - Lucene on search1005 is CRITICAL: Connection refused [13:40:02] ^demon: more like alchemy [13:40:12] i turned my blood and sweat into working servers for notpeter to install [13:40:34] <^demon> Ah, well as long as it was cooler than just pushing a button :) [13:41:19] PROBLEM - Varnish HTTP upload-frontend on cp1030 is CRITICAL: Connection refused [13:41:19] PROBLEM - Varnish HTTP upload-frontend on cp1032 is CRITICAL: Connection refused [13:41:19] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [13:41:19] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 1 process with UID = 998 
(varnishhtcpd), args varnishhtcpd worker [13:41:46] PROBLEM - Varnish HTTP upload-backend on cp1031 is CRITICAL: Connection refused [13:41:55] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused [13:42:04] PROBLEM - Varnish HTTP upload-frontend on cp1031 is CRITICAL: Connection refused [13:42:04] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [13:42:04] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 26.37 ms [13:42:04] RECOVERY - Host search1009 is UP: PING OK - Packet loss = 0%, RTA = 26.79 ms [13:42:13] RECOVERY - Host search1011 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [13:42:13] RECOVERY - Host search1010 is UP: PING OK - Packet loss = 0%, RTA = 26.39 ms [13:42:13] RECOVERY - Host search1012 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [13:42:31] PROBLEM - Varnish HTTP upload-backend on cp1032 is CRITICAL: Connection refused [13:42:31] PROBLEM - Varnish HTTP upload-backend on cp1030 is CRITICAL: Connection refused [13:43:52] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:44:32] New patchset: Mark Bergsma; "Add cp1029 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4627 [13:44:48] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4627 [13:44:55] RECOVERY - SSH on search1006 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:45:48] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [13:45:48] PROBLEM - SSH on search1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4627 [13:45:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4627 [13:45:48] PROBLEM - SSH on search1011 is CRITICAL: Connection refused [13:45:48] RECOVERY - SSH on search1007 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:45:49] PROBLEM - SSH on search1010 is CRITICAL: Connection refused [13:46:16] PROBLEM - SSH on search1012 is CRITICAL: Connection refused [13:46:25] PROBLEM - SSH on search1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:55] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [13:47:55] RECOVERY - Host search1014 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [13:47:55] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [13:47:55] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 26.41 ms [13:48:13] RECOVERY - SSH on search1008 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:50:19] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused [13:50:37] PROBLEM - Lucene on search1011 is CRITICAL: Connection timed out [13:50:46] RECOVERY - SSH on search1009 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:50:55] PROBLEM - Lucene on search1010 is CRITICAL: Connection timed out [13:51:22] PROBLEM - Lucene on search1012 is CRITICAL: Connection timed out [13:51:31] PROBLEM - SSH on search1016 is CRITICAL: Connection refused [13:51:31] RECOVERY - SSH on search1011 is OK: SSH OK - 
OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:51:40] RECOVERY - SSH on search1010 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:51:40] PROBLEM - SSH on search1015 is CRITICAL: Connection refused [13:51:40] PROBLEM - SSH on search1014 is CRITICAL: Connection refused [13:52:25] PROBLEM - SSH on search1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:52] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [13:53:19] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused [13:53:37] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [13:53:37] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [13:54:49] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 2 processes with command name varnishncsa [13:54:58] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [13:54:58] RECOVERY - Varnish HTTP upload-backend on cp1029 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.053 seconds [13:54:58] RECOVERY - SSH on search1012 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:55:16] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:43] RECOVERY - Varnish HTCP daemon on cp1035 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [13:56:01] PROBLEM - NTP on search1007 is CRITICAL: NTP CRITICAL: No response from NTP server [13:56:19] RECOVERY - SSH on search1014 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:56:28] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [13:56:28] RECOVERY - Varnish HTCP daemon on cp1036 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [13:56:37] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [13:56:46] RECOVERY - SSH on search1013 is OK: SSH OK - 
OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:56:46] PROBLEM - SSH on search1018 is CRITICAL: Connection refused [13:56:55] PROBLEM - SSH on search1017 is CRITICAL: Connection refused [13:56:55] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [13:57:13] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [13:59:46] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [14:00:58] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [14:01:34] PROBLEM - Lucene on search1017 is CRITICAL: Connection timed out [14:02:10] PROBLEM - Lucene on search1018 is CRITICAL: Connection timed out [14:02:37] PROBLEM - Lucene on search1014 is CRITICAL: Connection refused [14:02:37] PROBLEM - NTP on search1008 is CRITICAL: NTP CRITICAL: No response from NTP server [14:03:22] PROBLEM - NTP on search1011 is CRITICAL: NTP CRITICAL: No response from NTP server [14:03:22] PROBLEM - NTP on search1010 is CRITICAL: NTP CRITICAL: No response from NTP server [14:04:07] RECOVERY - SSH on search1018 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:04:16] RECOVERY - SSH on search1016 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:05:28] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused [14:05:37] RECOVERY - SSH on search1017 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:06:40] PROBLEM - NTP on search1006 is CRITICAL: NTP CRITICAL: No response from NTP server [14:08:55] RECOVERY - SSH on search1015 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:09:22] PROBLEM - NTP on search1016 is CRITICAL: NTP CRITICAL: No response from NTP server [14:09:49] PROBLEM - NTP on search1014 is CRITICAL: NTP CRITICAL: No response from NTP server [14:09:58] PROBLEM - NTP on search1015 is CRITICAL: NTP CRITICAL: No response from NTP server [14:10:07] PROBLEM - NTP on search1009 is CRITICAL: 
NTP CRITICAL: No response from NTP server [14:15:13] PROBLEM - NTP on search1017 is CRITICAL: NTP CRITICAL: No response from NTP server [14:15:13] PROBLEM - NTP on search1018 is CRITICAL: NTP CRITICAL: No response from NTP server [14:16:34] PROBLEM - NTP on search1012 is CRITICAL: NTP CRITICAL: No response from NTP server [14:17:22] !log search in eqiad is being reinstalled, no need to be alarmed (thats a pun!) [14:17:24] Logged the message, Master [14:18:04] PROBLEM - NTP on search1013 is CRITICAL: NTP CRITICAL: No response from NTP server [14:24:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.6225215 (gt 8.0) [14:33:31] RECOVERY - Host search1019 is UP: PING OK - Packet loss = 0%, RTA = 26.40 ms [14:36:31] PROBLEM - SSH on search1019 is CRITICAL: Connection refused [14:38:46] RECOVERY - Varnish HTTP upload-frontend on cp1030 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [14:39:13] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [14:39:13] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [14:39:14] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [14:39:14] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms [14:39:14] RECOVERY - Host search1022 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [14:39:40] RECOVERY - Varnish HTTP upload-backend on cp1030 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [14:40:36] PROBLEM - Disk space on search1021 is CRITICAL: Connection refused by host [14:40:36] PROBLEM - SSH on search1021 is CRITICAL: Connection refused [14:40:36] PROBLEM - SSH on search1024 is CRITICAL: Connection refused [14:40:36] PROBLEM - NTP on search1022 is CRITICAL: NTP CRITICAL: No response from NTP server [14:40:36] PROBLEM - NTP on search1023 is CRITICAL: NTP CRITICAL: No response from NTP server [14:40:54] PROBLEM - SSH on search1023 is CRITICAL: Connection refused [14:40:54] 
PROBLEM - NTP on search1020 is CRITICAL: NTP CRITICAL: No response from NTP server [14:41:21] PROBLEM - Disk space on search1022 is CRITICAL: Connection refused by host [14:41:21] PROBLEM - SSH on search1022 is CRITICAL: Connection refused [14:41:39] PROBLEM - DPKG on search1021 is CRITICAL: Connection refused by host [14:41:48] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.79167176471 [14:42:06] RECOVERY - Varnish HTTP upload-backend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [14:42:24] PROBLEM - RAID on search1022 is CRITICAL: Connection refused by host [14:42:24] PROBLEM - DPKG on search1022 is CRITICAL: Connection refused by host [14:42:33] RECOVERY - Varnish HTTP upload-frontend on cp1031 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [14:42:42] PROBLEM - SSH on search1020 is CRITICAL: Connection refused [14:42:42] PROBLEM - Lucene on search1019 is CRITICAL: Connection timed out [14:42:51] RECOVERY - Varnish HTTP upload-backend on cp1032 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [14:43:00] PROBLEM - RAID on search1021 is CRITICAL: Connection refused by host [14:43:09] RECOVERY - Varnish HTTP upload-frontend on cp1032 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [14:43:27] RECOVERY - Varnish HTTP upload-backend on cp1033 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.053 seconds [14:43:54] RECOVERY - Varnish HTTP upload-frontend on cp1033 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [14:44:30] RECOVERY - Varnish HTTP upload-frontend on cp1034 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [14:44:57] RECOVERY - Varnish HTTP upload-backend on cp1035 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.053 seconds [14:45:24] RECOVERY - Varnish HTTP upload-frontend on cp1035 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [14:45:51] RECOVERY - SSH on search1019 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) 
[14:45:51] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:00] RECOVERY - Varnish HTTP upload-frontend on cp1036 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [14:46:54] PROBLEM - Lucene on search1021 is CRITICAL: Connection timed out [14:47:03] PROBLEM - Lucene on search1024 is CRITICAL: Connection refused [14:47:03] RECOVERY - Varnish HTTP upload-backend on cp1036 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.054 seconds [14:47:21] PROBLEM - Lucene on search1023 is CRITICAL: Connection refused [14:47:35] New patchset: Mark Bergsma; "Add cp1030-1032 to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4635 [14:47:39] PROBLEM - Lucene on search1022 is CRITICAL: Connection timed out [14:47:48] RECOVERY - SSH on search1021 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:47:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4635 [14:47:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4635 [14:48:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4635 [14:48:24] RECOVERY - SSH on search1020 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:48:33] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [14:49:54] RECOVERY - SSH on search1022 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:51:24] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100% [14:53:21] RECOVERY - SSH on search1024 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:54:06] RECOVERY - SSH on search1023 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:54:15] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [14:56:21] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused [14:56:30] PROBLEM - 
Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [14:59:49] PROBLEM - NTP on search1021 is CRITICAL: NTP CRITICAL: No response from NTP server [15:00:06] PROBLEM - NTP on search1024 is CRITICAL: NTP CRITICAL: No response from NTP server [15:07:00] PROBLEM - NTP on search1019 is CRITICAL: NTP CRITICAL: No response from NTP server [15:12:32] New patchset: Mark Bergsma; "Add the remaining upload servers to the pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4639 [15:12:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4639 [15:12:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4639 [15:12:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4639 [15:13:27] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:15:33] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:15:42] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [15:19:18] RECOVERY - Puppet freshness on search1001 is OK: puppet ran at Tue Apr 10 15:19:08 UTC 2012 [15:20:32] hi hexmode, do you have access to the bugzilla apache access or error log? [15:21:12] drdee: I don't have access to that, but maybe I can help you figure out what the problem is [15:21:33] ^demon: do you have access to the bz machine? [15:21:52] <^demon> I can get to it, but I don't have root, what's up? [15:22:15] ^demon: drdee wants to see what is in the error logs [15:22:26] he is trying to set up a bz client [15:22:30] and it isn't working [15:22:54] so maybe the xmlrpc endpoint is spewing errors or something [15:23:05] <^demon> Apache logs? [15:23:07] exactly [15:23:17] assuming that bugzilla runs on apache [15:23:31] drdee: what do you have in your set up for the endpoint? 
https://bugzilla.wikimedia.org/ ? [15:23:36] yes [15:24:06] RECOVERY - Puppet freshness on search1002 is OK: puppet ran at Tue Apr 10 15:23:37 UTC 2012 [15:24:11] drdee: Did you try https://bugzilla.wikimedia.org/xmlrpc.cgi as well? [15:24:32] <^demon> Hmm, well apache logs seem to be owned by root:adm, can't touch them :\ [15:24:33] yes but pivotal tracker says that you have to exclude xmlrpci.cgi [15:24:52] it could be that pivotal tracker does not support https [15:25:19] i have looked at the web console to look at the ajax requests but i only see the json error response [15:25:25] so that doesn't really help [15:26:15] drdee: is pivotal running on a machine you have access to? [15:26:35] no, it's a 3rd party service [15:27:28] <^demon> hexmode, drdee: I can't access /var/log/apache/ or /srv/org/wikimedia/bugzilla/, can't help I'm afraid :\ [15:27:51] drdee: so, maybe you can ask pivotal to tell you what is wrong? [15:28:00] ^demon: ty anyway :) [15:28:44] <^demon> I used pivotal before. Was not a fan ;-) [15:29:12] RECOVERY - NTP on search1001 is OK: NTP OK: Offset 0.0014564991 secs [15:29:21] RECOVERY - Puppet freshness on search1003 is OK: puppet ran at Tue Apr 10 15:29:11 UTC 2012 [15:29:51] demon: no? i am just looking for a decent scrum pm tool and it seemed okay [15:30:21] ^demon:thanks for trying, who does have access to those files? [15:30:35] <^demon> I think you'll need to track down a root. [15:30:35] FYI, puppet camp now scheduled for may 19 in LAX [15:30:43] <^demon> In the airport? Yuck. [15:30:52] ^demon: not afaik [15:31:17] Geneva and Dublin prolly conflict with wikimania [15:32:12] all of the ones i just mentioned are free. 
and as previously mentioned a few times, NYC is 17 days from now (27th) [15:32:30] RECOVERY - Puppet freshness on search1004 is OK: puppet ran at Tue Apr 10 15:32:09 UTC 2012 [15:34:09] RECOVERY - NTP on search1002 is OK: NTP OK: Offset 0.004679083824 secs [15:38:12] RECOVERY - NTP on search1003 is OK: NTP OK: Offset -0.005516648293 secs [15:42:06] RECOVERY - NTP on search1004 is OK: NTP OK: Offset -0.04663825035 secs [15:51:23] RECOVERY - NTP on search1005 is OK: NTP OK: Offset 0.08301877975 secs [15:51:59] RECOVERY - NTP on search1007 is OK: NTP OK: Offset -0.01503574848 secs [15:52:35] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.8609471654 (gt 8.0) [15:53:38] RECOVERY - NTP on search1006 is OK: NTP OK: Offset -0.04437673092 secs [15:54:41] RECOVERY - NTP on search1008 is OK: NTP OK: Offset 0.03617584705 secs [15:54:50] RECOVERY - NTP on search1019 is OK: NTP OK: Offset 0.004051208496 secs [15:55:53] RECOVERY - NTP on search1017 is OK: NTP OK: Offset -0.04206681252 secs [15:56:38] RECOVERY - Puppet freshness on search1021 is OK: puppet ran at Tue Apr 10 15:56:26 UTC 2012 [15:58:26] RECOVERY - Disk space on search1021 is OK: DISK OK [15:59:20] RECOVERY - DPKG on search1021 is OK: All packages OK [16:00:59] RECOVERY - NTP on search1018 is OK: NTP OK: Offset 0.04142403603 secs [16:01:08] RECOVERY - Puppet freshness on search1022 is OK: puppet ran at Tue Apr 10 16:00:57 UTC 2012 [16:02:38] RECOVERY - NTP on search1009 is OK: NTP OK: Offset 0.07021224499 secs [16:03:05] RECOVERY - DPKG on search1022 is OK: All packages OK [16:03:14] RECOVERY - Disk space on search1022 is OK: DISK OK [16:03:14] RECOVERY - NTP on search1020 is OK: NTP OK: Offset 0.04404604435 secs [16:03:41] RECOVERY - NTP on search1010 is OK: NTP OK: Offset -0.04792940617 secs [16:07:35] RECOVERY - NTP on search1021 is OK: NTP OK: Offset 0.01049733162 secs [16:09:14] RECOVERY - NTP on search1011 is OK: NTP OK: Offset 0.003487229347 secs [16:14:02] RECOVERY - 
NTP on search1022 is OK: NTP OK: Offset -0.01966917515 secs [16:14:38] RECOVERY - NTP on search1012 is OK: NTP OK: Offset -0.01556062698 secs [16:15:14] RECOVERY - NTP on search1024 is OK: NTP OK: Offset 0.02426540852 secs [16:18:23] RECOVERY - NTP on search1023 is OK: NTP OK: Offset -0.0172342062 secs [16:19:35] RECOVERY - NTP on search1014 is OK: NTP OK: Offset 0.02981960773 secs [16:19:44] RECOVERY - NTP on search1013 is OK: NTP OK: Offset 0.03611135483 secs [16:20:02] RECOVERY - NTP on search1016 is OK: NTP OK: Offset 0.0552418232 secs [16:22:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.9943365354 (gt 8.0) [16:26:29] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.81705503937 [16:27:14] RECOVERY - RAID on search1022 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:27:32] RECOVERY - RAID on search1021 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:28:53] RECOVERY - NTP on search1015 is OK: NTP OK: Offset 0.06716740131 secs [16:34:45] New patchset: Mark Bergsma; "Add some eqiad LVS services to Nagios monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4645 [16:35:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4645 [16:35:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4645 [16:35:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4645 [16:58:25] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (16039) [16:59:28] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (16028) [17:00:39] this is a monthly issue with frwiktionary? ;) i have some vague memory [17:00:45] Is it? 
[17:03:55] ok, i'll look it up [17:05:27] 19 21:46:45 <+nagios-wm> PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (25984) [17:05:33] 19 21:52:24 < binasher> re: frwikitionary jobqueue alert, they are all of htmlCacheUpdate and refreshLinks2 variety [17:05:36] 19 22:01:40 < AaronSchulz> binasher: why do those silly users have to change templates? :p [17:05:54] not necessarily the same but seems likely [17:06:01] Reedy: [17:06:15] lol, I bet so too [17:06:19] I'll check when I come back [17:11:55] RECOVERY - Lucene on search1004 is OK: TCP OK - 0.032 second response time on port 8123 [17:12:04] RECOVERY - Lucene on search1003 is OK: TCP OK - 0.026 second response time on port 8123 [17:12:31] RECOVERY - Lucene on search1006 is OK: TCP OK - 0.026 second response time on port 8123 [17:12:31] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [17:12:31] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [17:12:49] RECOVERY - Lucene on search1005 is OK: TCP OK - 0.028 second response time on port 8123 [17:13:16] RECOVERY - Lucene on search1007 is OK: TCP OK - 0.028 second response time on port 8123 [17:13:25] RECOVERY - Lucene on search1008 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:01] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.027 second response time on port 8123 [17:14:01] RECOVERY - Lucene on search1014 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:10] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:10] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.027 second response time on port 8123 [17:14:37] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:37] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:37] RECOVERY - Lucene on 
search1011 is OK: TCP OK - 0.026 second response time on port 8123 [17:14:55] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.026 second response time on port 8123 [17:15:13] RECOVERY - Lucene on search1021 is OK: TCP OK - 0.026 second response time on port 8123 [17:15:31] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.027 second response time on port 8123 [17:15:40] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.026 second response time on port 8123 [17:15:49] RECOVERY - Lucene on search1023 is OK: TCP OK - 0.027 second response time on port 8123 [17:15:58] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [17:15:58] RECOVERY - Lucene on search1019 is OK: TCP OK - 0.026 second response time on port 8123 [17:16:43] RECOVERY - Lucene on search1024 is OK: TCP OK - 0.027 second response time on port 8123 [17:16:43] RECOVERY - Lucene on search1022 is OK: TCP OK - 0.026 second response time on port 8123 [17:20:41] wheeeee [17:21:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.7864485039 (gt 8.0) [17:26:37] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours [17:34:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4430 [17:34:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4430 [18:02:44] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay seconds [18:03:02] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay seconds [18:06:29] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.50685779528 [18:12:11] PROBLEM - mysqld processes on db42 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:19:14] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [18:21:20] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT 
replication delay 80611 seconds [18:21:29] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 80609 seconds [18:26:23] New patchset: Pyoungmeister; "adding bellin and blondel to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4650 [18:26:28] binasher: ^ [18:26:32] is that what you want? [18:26:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4650 [18:28:40] "want" [18:29:05] ok, I'll merge [18:29:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4650 [18:29:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4650 [18:29:21] oop just did [18:29:36] cool [18:30:23] did you merge on sockpuppet? [18:30:25] binasher: [18:30:44] no [18:31:00] just with mr. gerrit [18:31:10] kk, mergin on teh puppetz [18:33:27] New review: saper; "As pointed out in https://bugzilla.wikimedia.org/show_bug.cgi?id=35709 a different option should be ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4334 [18:34:07] help requested - none of the nrpe commands look like they're reporting back and/or happening on neon … neon's ip address *is* in the allowed hosts on nrpe_local.cfg on all the machines … any ideas what it could be/what to check ? [18:37:56] maplebed: also, nrpe is not in the swift role .. which is needed for the nrpe swift checks ? [18:38:37] I'm not sure; mutante set them up. [18:39:23] mutante: any specific reason (other than forgetting?) ;) [18:44:50] New patchset: Lcarr; "fixing up icinga's nrpe_local.cfg file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4651 [18:45:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4651 [18:45:34] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4651 [18:45:37] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4651 [18:47:55] notpeter: merged your changes [18:53:42] New patchset: Catrope; "Fix /usr/local/refreshWikiVersionsCDB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4652 [18:53:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4652 [18:56:00] mark: robh: or anybody else; do you know how oxygen is set up and whether it is currently getting log data? [18:57:40] well, it has 4 2TB disks [18:57:47] and the base OS [18:57:50] beyond that i have no idea [18:58:02] it's my understanding that logs are unicast to locke and emery. [18:58:06] is that correct? [18:58:19] LeslieCarr: thanks! [18:58:25] got halfway there.... [18:58:31] https://rt.wikimedia.org/Ticket/Display.html?id=2430 [18:58:55] yeah, that was the ticket I was reading. [18:59:09] It doesn't actually say there whether it is getting logs and can actually behave as a log parser. [18:59:14] just that it has puppet stuff, etc. [18:59:31] i can see in the site.pp it has the udp2 log puppet stuff and multicast [18:59:34] but beyond that no idea [18:59:34] (and I haven't read through enough of the puppet stuff to figure it out) [18:59:42] so the udp2log is the receiving process, [18:59:48] but it can only receive if stuff is sending it data. [18:59:55] hrmph. [19:00:18] I can't find where (on, say, sq86) the config exists telling it where and how to send its logs. [19:00:28] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4652 [19:01:20] ah. I think I did. [19:02:26] who handles creating/fiddling with projects in gerrit? 
[19:02:33] Ryan_Lane [19:02:35] Ryan_Lane, is it you? [19:02:36] kk [19:02:40] so sadly, while multicast works quite well in eqiad, not so much in pmtpa (old old foundries) [19:02:52] dschoon: ask ^demon I think [19:03:09] robh: thanks for talking through it with me; I think I've found my answer. [19:03:54] glad to claim credit for doing little to nothing at all ;] [19:03:55] <^demon> hashar: Thanks for volunteering me. [19:03:59] <^demon> dschoon: What's up? [19:04:18] ^demon: sorry. Do you know anyone able to do that ? or can you grant me access ?:-D [19:04:19] hiya ^demon -- we're looking to rename a project [19:04:22] and create a new one [19:04:22] ^demon: will be glad to help [19:04:33] <^demon> dschoon: Can't rename. [19:04:39] ... [19:04:41] <^demon> Can create a new one and copy it over. [19:04:50] hence why we usually don't do that ;] [19:04:54] i find new reasons to love gerrit all the time. [19:04:55] New patchset: Pyoungmeister; "adding bellin and blondel and shard m1 to mysql.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4653 [19:04:56] <^demon> Man, you're the 4th person to ask me about that today :) [19:05:02] * robh is glad we have so many devs who can do these things these days [19:05:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4653 [19:05:26] can someone put an emergency patch on bugzilla for me? https://bug731219.bugzilla.mozilla.org/attachment.cgi?id=601276 [19:05:33] this is for drdee [19:05:39] <^demon> In 2.3, we'll be able to work around the no-renaming thing nicer. For now I'm asking any people wanting renames to please hang on for a bit unless it's actively breaking something. [19:05:51] binasher: ^ [19:05:53] sure. [19:06:16] do subprojects in gerrit actually have special meaning? [19:06:24] thanks hexmode! [19:06:25] <^demon> Nope, just naming conventions. [19:06:26] or is it just a naming convention? [19:06:28] aiight. 
[19:06:32] maplebed/mutante know why there's both role/swift.pp and swift.pp ? [19:06:50] drdee: going to file an RT ticket [19:06:53] and i'm under the impression that creating a new project has to go through ops, yes? [19:06:53] in theory, swift.pp is generic and role.pp applies to our specific installation. [19:07:00] ah [19:07:06] but I'm not sure it really matches that closely. [19:07:07] hexmode: cool, can you add me as CC? [19:07:10] <^demon> dschoon: What matters is what's actually set as the parent project [19:07:28] New patchset: Lcarr; "adding in nrpe to our role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4654 [19:07:29] maplebed ^^ [19:07:37] <^demon> By convention, mediawiki/core inherits from mediawiki which inherits from All-Projects. [19:07:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4654 [19:07:49] LeslieCarr: i.e. we should be able to give swift.pp away and have it be useful, other people would then need their own swift role to set the variables right for their installation. [19:07:54] <^demon> But you could make mediawiki/foo inherit from bar/baz if you wanted. [19:08:32] ^demon hm. well, um. here's the issue. "analytics/reportcard" currently exists. but it should really be "analytics/reportcard/scripts". because there needs to be an "analytics/reportcard/site", and they are totally separate repos. [19:08:53] hexmode: devs with deploy access can create new projects. [19:09:16] so we don't so much need to rename "analytics/reportcard" as copy the data to a new project, "analytics/reportcard/scripts", and then create "analytics/reportcard/site" [19:09:24] LeslieCarr: I think I want to -1 that change. [19:09:32] robh: I don't know what you mean... I'm looking for a patch to a file in bugzilla [19:09:35] nrpe is supposed to already be included if network_zone=internal. 
[19:09:44] <^demon> dschoon: You could make analytics/reportcard/site without affecting analytics/reportcard at all. [19:09:45] oh, i thought you meant cluster wiki projects [19:09:47] sorry, [19:09:59] (see base.pp:base::standard-packages) [19:10:00] ^demon yeah, but it would be confusing. [19:10:13] ^demon as analytics/reportcard is not actually what it says it is. [19:10:23] drdee: email for rt? [19:10:26] LeslieCarr: I think it'd be better to find out why swift hosts don't have network_zone set to internal, and fix that instead. [19:10:31] ah… [19:10:32] yeah [19:11:06] ^demon but yes. to start with, if we could get "analytics/reportcard/site" to exist, that'd be great. [19:11:19] ahha … well one thing is it doesn't recognize bonded interfaces ... [19:11:38] ^demon there's also a sizable history to import from another git repo at git.less.ly/?p=kraken-ui.git [19:11:45] Change abandoned: Lcarr; "figuring out why network_zone is not set as internal instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4654 [19:12:03] <^demon> dschoon: Can you send me an e-mail about it? I'm very busy today. [19:12:08] sure. [19:12:12] ty [19:12:44] New review: Bhartshorne; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4654 [19:13:15] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2804 [19:13:23] woosters: for drdee [19:13:33] also zinc and copper are trying to do swift checks but they're external hosts .... [19:13:47] grumblegrumble. [19:14:32] we can have nrpe running on them since they have a firewall as well, but I would put it in the definition only for that cluster (or remove monitoring). [19:14:52] zinc/copper/magnesium are a test cluster, and not really worth monitoring. [19:15:36] New patchset: Lcarr; "Fixing realm.pp to check the main ip address, not necessarily eth0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4655 [19:15:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4655 [19:17:19] maplebed can you check this out (won't fix zinc, etc, i'm thinking of making it set a variable and then checking the variable before setting the checks) [19:18:31] LeslieCarr: is main_ipaddress set anywhere? [19:18:59] I don't see it in the output of facter. [19:19:19] it should be set in the lines before in realm.pp [19:19:29] ah, so it is. [19:20:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4655 [19:20:44] +1 [19:21:37] !log restarting gmond on db1004 after removing its 5gig log [19:21:39] Logged the message, Mistress of the network gear. [19:22:24] doh, need to rebase to remove the dependency … sigh [19:22:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [19:22:36] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [19:22:45] RECOVERY - MySQL disk space on db1004 is OK: DISK OK [19:22:47] whew, not a real problem [19:22:54] RECOVERY - Disk space on db1004 is OK: DISK OK [19:23:48] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 19, down: 0, shutdown: 1 [19:23:57] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [19:26:57] New patchset: Lcarr; "fixing realm.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4655 [19:27:13] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4655 [19:27:23] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4655 [19:27:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4655 [19:28:35] hexmode - noted [19:32:03] RECOVERY - MySQL disk space on neon is OK: DISK OK [19:37:35] woosters: how long do you think it will take? Sounds like drdee needs this asap [19:43:09] New patchset: Bhartshorne; "e3 filters for faulkner RT-2805" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4657 [19:43:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4657 [19:45:55] drdee: does ^^^ look ok to you? [19:46:38] drdee: one question - do we need to surround the path in single quotes to prevent shell expansion of the &? [19:47:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [19:47:06] maplebed: give me a sec [19:47:21] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.8.20:11000 (Connection timed out) [19:47:33] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4653 [19:47:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4653 [19:47:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4653 [19:48:15] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 19, down: 0, shutdown: 1 [19:48:22] maplebed: the 2nd filter is missing a '/' before the 'w' not sure if that is super important [19:48:32] for the rest should be okay [19:49:22] maplebed: can I merge your realm.pp change? [19:49:24] drdee: also, they're /w/index.php, not /wiki/index.php. 
[19:49:31] notpeter: it's LeslieCarr's. [19:49:32] one sec. [19:49:38] drdee: does that matter? [19:49:43] LeslieCarr: can I merge your realm.pp change? [19:50:02] maplebed: that's ryan's call :) [19:50:03] notpeter: yes please [19:50:18] kk [19:50:33] grumblegrumbleryan'snotinircgrumblegrumble. [19:50:51] drdee: thanks for the /w catch. [19:51:31] New patchset: Bhartshorne; "e3 filters for faulkner RT-2805" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4657 [19:51:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4657 [19:52:09] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [19:52:54] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:53:03] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [19:53:21] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 [19:54:24] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 16%, RTA = 0.46 ms [19:55:27] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [19:56:06] srv197 is full [19:56:53] <^demon> binasher: Would you mind dropping a note one way or the other on https://gerrit.wikimedia.org/r/#change,4602 ? Tim wants to drop request_with_session|request_without_session and hashar says you're logging it on graphite. [19:57:15] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.11.39:11000 (Connection timed out) [19:57:31] binasher: ^demon: I suspect graphite to just log everything which uses wfIncrStats() [19:57:51] <^demon> Well then is dropping a stat gonna harm anything? 
[19:58:45] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:00:21] ^demon: it will just harm any potential graphite report using that metric [20:00:26] I have not find any though [20:02:30] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [20:03:15] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.8.17:11000 (Connection timed out) [20:03:42] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 [20:05:11] notpeter: do you know anything about the ubuntu mirror check on watchmouse? [20:06:29] maplebed: nein [20:06:33] false positive for the ubuntu mirror watchmouse alert; the mirror is working fine. [20:06:33] it was there when I got here [20:07:11] also, fenari is toast. [20:07:22] yeah, wtf is going on with it? [20:07:41] New patchset: Pyoungmeister; "adding a log files size for m1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4659 [20:07:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4659 [20:08:43] New patchset: Pyoungmeister; "adding a log files size for m1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4659 [20:08:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4659 [20:09:11] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4659 [20:09:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4659 [20:30:12] New review: Krinkle; "Yes, we no longer need them here. 
They're not used anymore, and we'll re-use them in the future but ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4364 [20:35:59] New review: Demon; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4415 [20:36:01] Change merged: Demon; [operations/software] (master) - https://gerrit.wikimedia.org/r/4415 [20:38:29] any roots around to help Reedy with a permissions issue? [20:41:57] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused [20:53:56] New patchset: RobH; "removed myself from contacts for my vacation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4662 [20:54:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4662 [20:54:48] New review: RobH; "small change, self reviewed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4662 [20:55:05] New patchset: Jgreen; "fixed typo in api_sweep_test, replaced socket-based IP validation test with simple regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4663 [20:55:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4663 [20:55:48] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4663 [20:55:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4663 [20:57:07] notpeter: bunch of search file changes unmerged on sockpuppet [20:57:09] these you? [20:57:27] wha? [20:57:57] cvhanges to diff --git a/files/searchqa/lib/searchqa.pm b/files/searchqa/lib/searchqa.pm [20:57:59] oh, that's Jeff_Green [20:58:08] ahh, Jeff_Green can I merge these? [20:58:15] my nagios change isnt showing up, wtf... [20:58:23] did we just collide? 
[20:58:27] i just got this: [20:58:27] ahh, damn it, i need rebase [20:58:33] error: Ref refs/remotes/origin/production is at 5175f6d6fac5704e08633bcefa5ae7fad8724809 but expected 9e665a3f4fa092f5c9aa0f1c00e52f474b2b1958 [20:58:54] Change abandoned: RobH; "meh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4662 [20:58:57] im abandoning my change [20:59:03] i think i fubared you up, so try again now [20:59:10] i'm not sure what to do? [20:59:19] i'll retry the fetch/diff [20:59:31] would try to recommit in gerrit if that failed [20:59:51] btw "dammit, I need to rebase" sounds like something out of some multi-player first person shooter [20:59:59] i have no idea how to do so [21:00:01] =P [21:00:08] seems to have worked for me anyway [21:00:10] my local copy is ahead of the nonlocal, even though i stashed [21:00:15] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 4903 bytes in 0.005 seconds [21:00:18] so how do i rebase this crap? [21:00:38] I'm not advanced enough to be truly helpful but I did see mention of that in the gerrit documentation on wikitech earlier [21:01:03] in the past I've just punted and started from a fresh clone or checkout or whatever the proper term is [21:01:14] yea i just wanna abadon my crap [21:01:18] i can just cherrypick it later. [21:01:27] but i amnot gonna redo my checkout all over [21:01:33] with new branch and the like, i wanna fix it proper [21:01:36] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [21:01:36] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [21:01:37] Ryan_Lane: how i rebase? 
[21:01:53] git rebase origin [21:01:53] cannot rebase: you have unstaged changes [21:02:00] http://codeutopia.net/blog/2009/12/10/git-interactive-rebase-tips/ [21:02:35] thing is, when I get to this point I'm always confused about the state my local copy is in, and have fear that Things Will Get Worse [21:03:42] sigh [21:04:06] someone without a fubar'd local repo wanna remove me from nagios contact list please? [21:04:15] sure [21:04:18] thanks! [21:04:35] in files/nagios/contactgroups.cfg [21:04:39] thx [21:04:47] just pull my name out and I will review your commit and merge it =] [21:05:18] this is why i need to get in the habit of making changes in an entirely different local branch [21:05:30] then merge them back into my local production, leaving it in a normal state for quick things like this. [21:05:49] my local production is a few changes ahead due to my working on it =P [21:06:07] New patchset: Jgreen; "removing robh from nagios pager list so he can enjoy the silence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4664 [21:06:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4664 [21:06:37] robh: yes--I dislike working in branches for production config [21:07:06] New review: RobH; "bwahhahaa, no paging for me!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4664 [21:07:08] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4664 [21:07:17] I prefer to do a git pull, make a change, and push right away [21:07:46] HUZZAH [21:07:53] change merged. [21:07:54] you are free! [21:08:05] its everything i hoped it would be. [21:08:26] don't go into the light [21:08:53] but i can see nan and poppop [21:09:03] TURN BACK TURN BACK! 
[21:09:11] (whereever they are they would think thats hilarious ;) [21:09:21] :-) [21:09:36] now the only person annoying me on the phone will be my mom [21:09:44] she fears i will be eaten by bears [21:09:47] or snakebit [21:10:06] when i turn my cell back on i expect to have a lot of texts [21:10:23] hmm [21:10:24] she goes weeks without hearing from me [21:10:34] but when i travel or hike and she knows it, she gets all worried [21:11:03] 'call me when you land' 'Why, you wouldn't have known I was flying if you would just stop reading my facebook status' [21:11:23] thx for making the change for me Jeff_Green [21:11:27] perhaps you can set your phone to post fake statuses while you're hiking [21:11:43] rob has just arrived at $coffeeshop [21:11:45] I explained she can have me have my phone on, and it dies two days into hike or less [21:11:57] or it can be off, in pack, so if a bear mauls one of us or something i have battery to call for help [21:12:31] there is no cell service in the valleys, and constantly searching for a tower runs the battery down faster than anything except constant talking [21:12:34] you need an old-school nokia with the 8 days of battery life [21:12:45] I miss those phones, sorta [21:12:50] but my iphone can give me a gps coordinate readout for said emergency calls [21:13:07] but yea, if i ever did multiweek hiking i would swap over to a ruggedized swappable battery phone [21:13:13] oh DHS or the NSA can find you [21:13:28] if it has no gps they have to do it via triangulation [21:13:34] if i only hit a single tower its harder [21:13:35] just use your old-school phone to make a threatening call and you'll have black choppers with tasers overhead in minutes [21:13:52] but i wouldnt hear them coming, the choppers are silent! [21:13:52] or a drone! :-) [21:13:57] exactly [21:14:33] hrmm, i wish i could whitelist autoresponder for ops list. [21:14:42] i dont want to pull myself off it, its got stuff i need to read when i get back. 
[21:15:47] are you using gmail or our own imap foo [21:16:12] gmail, if it was imapfoo i could do this. [21:16:17] yeah [21:16:19] hack at the rob.filter file and be done. [21:16:44] you'd think you could filter before the responder even with gmail [21:17:01] well, i can setup a filter and use the canned reply lab feature [21:17:11] but its not quite as smart. [21:17:19] right [21:17:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [21:17:39] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [21:18:12] plus i see no way for the filters in gmail to say 'do this to all messages EXCEPT these' [21:18:17] so even the canned reply wont work [21:18:23] oh well, ops list is gonna get my autoresponder [21:18:30] much like we all had to read binasher's ;] [21:18:36] yeah that sucked [21:19:14] then again, i told the one vendor about it [21:19:19] i guess i could just have no autoresponder [21:19:45] PROBLEM - mysqld processes on bellin is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:19:54] PROBLEM - mysqld processes on blondel is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:19:57] that seems less annoying.... [21:20:09] i dunno, i think about it [21:21:43] the office network continues to suck it seems. [21:21:51] is 1.19wmf1 (r114429) version 1.19 in bugzilla ? [21:22:34] robh: hi blondel page! [21:22:39] /msg RoanKattouw_away hey ... [21:22:45] what page? [21:22:51] oops [21:22:54] LeslieCarr: msg fail [21:23:22] I just got paged for mysqld on blondel by nagios [21:24:03] yeah [21:24:05] binasher ... [21:24:14] oh wait [21:24:16] it's notpeter [21:24:17] i think [21:24:37] I didnt get paged! [21:24:43] bwahahahahahahaha [21:24:49] so awesome. [21:25:00] wow paging works! [21:27:41] New review: Rfaulk; "Looks good." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4657 [21:28:26] \maplebed the new e3 filters llok good Ben. 
many thanks! [21:28:43] rfaulk: I noticed the URLs there are /w/index.php, not /wiki/index.php. [21:28:45] is that correct? [21:30:07] my irc sucks ..haha... yeah that should be fine [21:30:20] it simply takes one to a login page [21:30:24] ok. [21:30:38] are any queries supposed to match that right now (so that we can test it after putting it in place)? [21:32:17] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4657 [21:32:19] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4657 [21:34:59] drdee: my previous question about whether the path must be quoted has been answered empirically. It does. (aka my changeset didn't work.) [21:35:01] ::sigh:: [21:36:34] New patchset: Bhartshorne; "quoting necessary for path arguments to udp-filter to avoid shell interpretation of apersands" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4665 [21:36:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4665 [21:36:58] maplebed: thanks, i was not aware of that [21:37:08] maplebed: i'll add it to the docucs [21:37:14] thanks. [21:37:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4665 [21:37:36] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4665 [21:38:14] robh: you could have a service poll your mailbox and send auto responses instead of having gmail send them automatically [21:39:02] sounds like somethign im not gonna solve in two hours or less [21:39:07] i wonder if it also doesn't send auto responses for spam [21:39:15] not saying it's worth it! [21:39:17] yea, it will [21:39:28] its why i pulled myself off all the nonlist aliases [21:39:34] like dns, root, postmaster, etc... 
[21:39:38] cuz those get spammed a ton [21:40:29] google's autoresponder has an "only to wikimedia.org" option [21:40:48] orly [21:41:54] grumblegrumble. [21:43:58] New patchset: Bhartshorne; "correcting dir typo in e3 filter for emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4666 [21:44:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4666 [21:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4666 [21:44:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4666 [21:46:59] rfaulk: your filters are now running. [21:47:10] rfaulk: did you see my question about whether there is currently traffic flowing to them? [21:47:16] I would like to test and verify they work. [22:00:15] New patchset: Lcarr; "Allowing self access in nrpe.cfg, as with spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4667 [22:00:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4667 [22:00:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4667 [22:00:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4667 [22:11:09] New patchset: Lcarr; "adding in check_nrpe command (turns out, not default)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4669 [22:11:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4669 [22:16:26] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4669 [22:16:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4669 [22:32:13] New patchset: Lcarr; "moving check from icinga.pp to nagios.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4670 [22:32:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4670 [22:33:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4670 [22:33:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4670