[00:02:18] New patchset: Ryan Lane; "Applying LDAP fix to all instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:04:04] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:04] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [00:13:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2697 [00:13:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2697 [00:14:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2611 [00:14:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2611 [00:18:14] I'd like to join the ops team [00:19:48] !log test [00:19:50] Logged the message, Master [00:20:03] New patchset: Diederik; "IP range filtering and regular expression now work." [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2698 [00:20:37] !log
buttsecks
[00:20:39] Logged the message, Master [00:21:34] New patchset: Ryan Lane; "Adding in nslcd.conf.erb, to avoid awkward cherry-pick" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:21:44] Ryan_Lane ^^ [00:22:04] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2699 [00:22:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2699 [00:22:49] Joan I will rape you [00:22:59] Well, [00:23:05] no. [00:23:08] * Ryan_Lane groans [00:23:11] what a lame troll [00:23:18] I think the cat's out of the bag on !log. ;-) [00:23:20] I guess I forgot to ban him in here [00:23:25] nah. it's the same troll [00:23:37] I forgot to ban him in this channel [00:23:46] I blame Reedy. [00:24:01] no. it's likely my fauly [00:24:03] *fault [00:24:05] Ryan_Lane time for some surprise buttsecks [00:24:16] It's not really a suprise [00:24:20] You just said it was going to happen [00:24:26] Dammit [00:24:33] It's always a bit of a surprise. [00:24:41] Joan ;) [00:24:55] * Ryan_Lane waves [00:24:59] dick [00:25:53] Looks like "buttsecks" was truncated. :-( [00:25:55] https://twitter.com/#!/wikimediatech [00:26:18] New patchset: Ryan Lane; "We don't want to give people a shell, except in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:27:00] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:11] RAWR lint check [00:27:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2700 [00:27:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2700 [00:29:51] New patchset: Ottomata; "Removing launcher.py, moved multiprocessing support to pipeline/__main__.py" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:30:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:35:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.660 seconds [00:42:48] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2701 [00:42:49] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2701 [00:45:34] cleaned up identica [00:49:29] I like how people discuss this on wikitech-l [00:49:32] keeps trolling lower [00:51:23] I talk about this in talks [00:51:28] * Ryan_Lane shrugs [00:51:36] if it gets bad, I'll lock it down [00:51:43] I'd prefer not to [00:57:52] New patchset: Ottomata; "Adding __main__.py - meant for this to go with the last commit." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [00:58:22] LeslieCarr: warning: Could not load fact file /var/lib/puppet/lib/facter/default_interface.rb: ./default_interface.rb:43: syntax error, unexpected kELSE, expecting kEND [00:58:28] I'm seeing that on some instances [00:59:09] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:00:54] New review: Diederik; "Ok." 
[analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2702 [01:00:55] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2702 [01:01:30] New patchset: Lcarr; "commenting out aggregator Attempt to make puppet compile the directory before timeout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:02:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2703 [01:03:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2703 [01:08:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:35] PROBLEM - MySQL Idle Transactions on db22 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [01:16:47] PROBLEM - RAID on db22 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [01:19:29] RECOVERY - MySQL Idle Transactions on db22 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:20:13] New patchset: Lcarr; "Only pushing standard package as stafford is overloaded" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:20:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2704 [01:21:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2704 [01:21:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2704 [01:26:05] RECOVERY - Disk space on neon is OK: DISK OK [01:26:15] New patchset: Lcarr; "Revert "Only pushing standard package as stafford is overloaded"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:26:23] RECOVERY - DPKG on neon is OK: All packages OK [01:26:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2705 [01:27:26] RECOVERY - RAID on neon is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:28:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2705 [01:29:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2705 [01:30:44] RECOVERY - NTP on neon is OK: NTP OK: Offset 0.009791016579 secs [01:47:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.581 seconds [01:55:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 601s [01:56:50] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 663s [01:58:11] New patchset: Lcarr; "Fixing nagios service to nagios3 in newmonitor class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [01:58:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2706 [01:59:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2706 [01:59:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2706 [02:16:47] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [02:25:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.977 seconds [02:37:47] PROBLEM - RAID on srv194 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:08] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [02:39:27] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:39:35] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.2.227:11000 (Connection timed out) [02:40:03] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [02:40:21] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:40:21] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:40:29] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:40:29] PROBLEM - BGP status on cr2-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - BGP status on cr1-eqiad is CRITICAL: (Service Check Timed Out) [02:40:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:53] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:43:29] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [02:43:47] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 [02:44:05] PROBLEM - DPKG on nfs1 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:44:24] PROBLEM - Router interfaces on br1-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.245 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [02:44:41] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:50] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [02:45:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.433 seconds [02:45:26] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:27] RECOVERY - DPKG on nfs1 is OK: All packages OK [02:45:35] PROBLEM - RAID on mw40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:45:44] RECOVERY - Router interfaces on br1-knams is OK: OK: host 91.198.174.245, interfaces up: 10, down: 0, dormant: 0, excluded: 0, unused: 0 [02:45:44] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:46:20] PROBLEM - Puppetmaster HTTPS on sockpuppet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:26] !log reset the drac console for spence [02:46:28] Logged the message, Mistress of the network gear. [02:47:14] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [02:47:33] RECOVERY - Puppetmaster HTTPS on sockpuppet is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.433 seconds [02:47:41] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:47:42] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:48:18] !log rebooted fenari, nonresponsive [02:48:20] Logged the message, Master [02:48:53] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [02:48:53] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 9, down: 0, shutdown: 0 [02:49:02] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [02:49:02] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 99, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:11] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [02:49:20] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [02:49:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:56] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [02:49:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [02:49:56] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:50:14] New patchset: Lcarr; "decreasing number of simultaneous checks for nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2707 [02:50:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2707 [02:50:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2707 [02:50:52] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.005 seconds [02:51:17] New patchset: Ottomata; "Created DygraphLoader for generic transformation of observation aggregations into dygraphs csv format." 
[analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [02:51:17] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:51:44] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.015 seconds [02:52:20] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.009 seconds [02:53:38] !log manually lowering nagios max checks to 300 [02:53:41] Logged the message, Mistress of the network gear. [02:54:23] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:32] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:40] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:54:51] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2708 [02:55:01] New review: Ottomata; "(no comment)" [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2708 [02:55:01] Change merged: Ottomata; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2708 [03:04:51] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [03:05:01] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [03:05:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [03:05:36] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [03:06:04] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [03:06:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: (Service Check Timed Out) [03:06:30] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:06:39] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: No response from remote host 10.1.2.3 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [03:07:33] PROBLEM - Certificate expiration on nfs1 is CRITICAL: (Service Check Timed Out) [03:10:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [03:10:57] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 9, down: 0, shutdown: 0 [03:11:06] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 87, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR [03:11:24] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [03:11:51] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [03:12:00] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 9, down: 0, shutdown: 0 [03:12:09] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [03:12:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:12:36] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 9, down: 0, shutdown: 0 [03:15:27] 
New patchset: Catrope; "Don't let l10nupdate write to /home directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2709 [03:15:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2709 [03:42:36] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [04:25:39] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours [05:20:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 10 seconds [05:21:45] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [06:12:11] New patchset: Tim Starling; "Support l10n manual recache in scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:12:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2710 [06:12:49] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2710 [06:12:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2710 [06:15:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:17:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.538 seconds [06:39:36] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [06:51:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.619 seconds [07:03:09] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [07:03:36] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:09:36] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [07:09:36] PROBLEM - Puppet freshness on db46 is CRITICAL: Puppet has not run in the last 10 hours [07:11:33] RECOVERY - Lucene on search9 is OK: TCP OK - 8.993 second response time on port 8123 [07:23:42] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [07:31:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.452 seconds [08:09:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:15:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.026 seconds [08:46:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:51:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.513 seconds [09:24:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:25:39] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.014 seconds [09:28:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.718 seconds [09:39:36] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [09:45:27] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4252 bytes in 0.020 seconds [09:45:36] RECOVERY - Lucene on search9 is OK: TCP OK - 2.997 second response time on port 8123 
[09:57:54] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [10:02:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [10:06:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.382 seconds [10:11:42] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [10:11:42] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [10:37:30] RECOVERY - Lucene on search9 is OK: TCP OK - 2.995 second response time on port 8123 [10:42:25] New patchset: ArielGlenn; "initial commit: tool for managing dump uploads to archive.org" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2711 [10:42:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:42:27] New review: gerrit2; "Lint check passed." [operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2711 [10:43:19] now there (lint message) is a waste of cpu cycles [10:46:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.386 seconds [10:49:39] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [11:00:18] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [11:02:42] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [11:20:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [11:26:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:58:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.055 seconds [12:18:36] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Puppet has not run in the last 10 hours [12:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[12:35:25] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:25] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.093 seconds [12:40:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:50:34] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:33] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Wed Feb 22 12:53:11 UTC 2012 [12:54:00] PROBLEM - RAID on db40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:55:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:45] RECOVERY - RAID on db40 is OK: OK: 1 logical device(s) checked [13:00:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:24] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:07:12] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [13:10:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:15:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.162 seconds [13:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:21:09] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [13:21:38] New patchset: Demon; "Adding .gitreview" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2712 [13:21:40] New review: gerrit2; "Lint check passed." 
[test/mediawiki/extensions/examples] (master); V: 1 - https://gerrit.wikimedia.org/r/2712 [13:21:52] New review: Demon; "(no comment)" [test/mediawiki/extensions/examples] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2712 [13:21:52] Change merged: Demon; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2712 [13:24:36] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [13:25:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:25:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:24] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:39] PROBLEM - Puppet freshness on fenari is CRITICAL: Puppet has not run in the last 10 hours [13:45:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:54] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2559 [13:49:55] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2559 [13:50:13] New review: Diederik; "Ok." 
[analytics/udp-filters] (refactoring); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2560 [13:50:13] Change merged: Diederik; [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2560 [13:50:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.451 seconds [14:00:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:00:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:05:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:30] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:33] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments4 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.80473678261 (gt 8.0) [14:29:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:33:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.907 seconds [14:35:33] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:33] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:33] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:35:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:40:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:45:36] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:34] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:30] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:31] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:36] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.96843736842 [15:00:27] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [15:00:27] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:32] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:29] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:29] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:30] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:30] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.062 seconds [15:15:26] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:26] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:23] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:29] PROBLEM - check_minfraud1 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:38] RobH: are you going to be able to look at dimms on search1008 and search 1014 today? [15:29:17] I plan to yep [15:29:28] I want to clear out eqiad queue today [15:30:25] sweet [15:32:05] RECOVERY - check_minfraud1 on payments3 is OK: OK [15:32:05] RECOVERY - check_minfraud1 on payments2 is OK: OK [15:32:06] RECOVERY - check_minfraud1 on payments1 is OK: OK [15:32:06] RECOVERY - check_minfraud1 on payments4 is OK: OK [15:34:17] !log extending database user grants to eqiad private subnets [15:34:19] Logged the message, and now dispaching a T1000 to your position to terminate you. 
[15:35:59] PROBLEM - Disk space on srv285 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=56%): /var/lib/ureadahead/debugfs 277 MB (3% inode=56%): [15:47:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.365 seconds [15:50:50] PROBLEM - Host db1026 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:41] RECOVERY - Lucene on search15 is OK: TCP OK - 0.008 second response time on port 8123 [15:59:14] RECOVERY - Lucene on search3 is OK: TCP OK - 0.012 second response time on port 8123 [16:02:59] RECOVERY - Lucene on search9 is OK: TCP OK - 0.001 second response time on port 8123 [16:04:02] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:11] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:59] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.863 second response time [16:06:08] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [16:06:26] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:23] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.499 second response time [16:08:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:50] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:50] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:17] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:24] !log restarted lsearchd on search15, was not running [16:09:26] Logged the message, Master [16:09:35] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:38] !log restarted lsearchd on search3 and search9, was running but nonresponsive [16:09:40] Logged the message, Master [16:10:02] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:02] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:02] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:38] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [16:10:38] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.838 second response time [16:10:47] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:14] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [16:11:32] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [16:11:50] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:59] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.980 second response time [16:12:08] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [16:12:17] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.455 second response time [16:12:26] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:44] RECOVERY - 
Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.573 second response time [16:12:45] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.691 second response time [16:13:56] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [16:14:23] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 251 MB (3% inode=62%): /var/lib/ureadahead/debugfs 251 MB (3% inode=62%): [16:14:23] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:23] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.539 second response time [16:15:53] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:16:11] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.891 second response time [16:17:41] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out [16:18:17] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:29] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [16:19:38] RECOVERY - Disk space on srv285 is OK: DISK OK [16:20:23] RECOVERY - Disk space on srv219 is OK: DISK OK [16:22:02] RECOVERY - Lucene on search15 is OK: TCP OK - 8.997 second response time on port 8123 [16:22:02] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:02] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:02] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:59] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [16:24:00] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.074 second response time [16:24:08] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:08] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:35] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.241 second response time [16:24:44] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:56] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:56] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:05] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.173 second response time [16:26:32] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:08] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.314 seconds [16:27:44] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:53] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [16:27:53] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[16:27:53] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 57727 bytes in 4.041 seconds [16:28:02] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.705 second response time [16:28:11] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.602 second response time [16:28:56] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.592 second response time [16:29:23] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:41] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [16:31:11] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.852 second response time [16:31:20] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.928 second response time [16:32:14] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.747 second response time [16:32:14] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.625 second response time [16:32:23] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.220 second response time [16:32:41] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:53] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:02] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [16:34:29] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:34:38] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:38] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:41] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.815 second response time [16:36:35] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.209 second response time [16:36:35] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [16:36:44] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.403 second response time [16:37:47] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.041 second response time [16:38:14] RECOVERY - Lucene on search3 is OK: TCP OK - 0.001 second response time on port 8123 [16:38:23] RECOVERY - Lucene on search9 is OK: TCP OK - 0.000 second response time on port 8123 [16:38:23] RECOVERY - Lucene on search15 is OK: TCP OK - 0.002 second response time on port 8123 [16:38:57] rainman-sr: you there? [16:39:01] yes? [16:39:09] java.io.IOException: Error constructing searcher for [enwiki.nspart1.sub2, enwiki.nspart1.sub1] [16:39:17] I get that froevermany times on search3 [16:39:33] this is causing the search nodes to have lots of problems, and tying up apaches [16:41:05] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [16:43:27] notpeter, hmm, why are they multiple rsyncs going on? including normal rsync and rsync-no-pagecache ? [16:44:14] the second is a wrapper for rsync that makes it less resource intensive. 
asher put it there because boxes kept dying when they were rsyncing over new indexes [16:45:06] so that is one and the same rsync process, right [16:45:13] not multiple at once [16:45:37] yeah, it was a shell script to call the rsync version patched with posix_fadvise [16:46:14] mark: should be [16:48:04] notpeter, well, not sure, other errors in log seem to indicate that the process is running out of memory [16:48:24] that's a very reasonable explanation [16:48:26] New patchset: Sumanah; "Additional author for test commit" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2713 [16:48:31] currently it is run with -Xmx3000m, which I guess is suitable for 32bit java [16:48:41] not sure how much more we can increase it without java complaining [16:49:04] switching to 64bit java would be a bad idea, since java is a bit stupid, and in 64bit java the amount of memory needed is essentially 2x [16:49:08] rainman-sr: the process on the search* host or on the searchidx host? [16:49:30] no, on search3 and others that are frequently dying [16:49:35] k [16:50:59] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [16:52:35] New review: Sumanah; "I love this change! So rockin'!" [test/mediawiki/extensions/examples] (master) C: 1; - https://gerrit.wikimedia.org/r/2713 [16:54:04] notpeter, I think you can try increasing it to -Xmx3300, but probably not much beyond that [16:54:41] 3 GB? aren't 32 bit processes limited to 2 GB? [16:55:24] rainman-sr: ok [16:55:29] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [16:55:29] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [16:56:26] rainman-sr: what would the 64 mem limit be? [16:56:30] *64 bit [16:56:32] PROBLEM - check_gcsip on payments1 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [16:56:48] as much mem as the box has [16:57:13] notpeter, 2x, so at least 6gb, but then we would have more I/O as less is cached by linux [16:57:29] ah, gotcha [16:57:31] still better than not having search due to the process dying [16:57:47] let's get eqiad search cluster up, and with more mem [16:58:13] search3 seems to like the extra 300 megs [16:58:20] going to do the same on search15 [16:58:26] mark, well, then it would die because it's using too much I/O and the search is not fast enough [16:58:52] i'm not sure we can squeeze much more out of these boxes [16:58:55] how much memory do those boxes have? [16:59:08] the old ones have 16 gigs [16:59:39] 1-10 has 16, 11-20 has 32, the eqiad ones have 48 [17:00:35] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.725 second response time [17:00:35] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.598 second response time [17:00:35] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.583 second response time [17:00:47] New review: Sumanah; "Guybrush is so great and a substantive contributor to our community." [test/mediawiki/extensions/examples] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2713 [17:00:47] Change merged: Sumanah; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2713 [17:00:57] and the en.wp indexes are about 10gb in two parts.. during index warmup you might have 3 parts in memory though [17:01:35] what still needs to happen on the eqiad cluster?
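The rsync-no-pagecache wrapper mentioned just above was, per the discussion, a shell script that invoked an rsync binary patched with posix_fadvise, so that copying a multi-gigabyte search index would not evict the hot, in-use index from the Linux page cache. As a rough illustration of that idea only (not the actual production wrapper), a minimal Python sketch might look like the following; the chunk size, script name, and command-line usage are assumptions:

import os
import sys

CHUNK = 8 * 1024 * 1024  # copy in 8 MB chunks (arbitrary choice)

def copy_without_caching(src_path, dst_path):
    """Copy src_path to dst_path while advising the kernel that the copied
    pages will not be reused, so they can be dropped from the page cache.
    Sketch of the idea behind rsync-no-pagecache; NOT the real script."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        copied = 0
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
            # Flush dirty pages first so POSIX_FADV_DONTNEED can discard them.
            dst.flush()
            os.fsync(dst.fileno())
            os.posix_fadvise(src.fileno(), 0, copied, os.POSIX_FADV_DONTNEED)
            os.posix_fadvise(dst.fileno(), 0, copied, os.POSIX_FADV_DONTNEED)

if __name__ == "__main__":
    # Usage (hypothetical): python3 copy_nocache.py SRC DST
    copy_without_caching(sys.argv[1], sys.argv[2])

Dropping the copied pages matters here because, as noted in the conversation, search latency on these boxes depends on the live index staying cached by Linux.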
[17:01:38] RECOVERY - Lucene on search15 is OK: TCP OK - 0.001 second response time on port 8123 [17:01:51] and I vote for only using 64 bit java there :P [17:02:08] if that doesn't suffice, we need to get more memory [17:02:33] mark: eqiad has 64 bit java :) [17:02:39] good [17:02:47] also, those extra 300 megs helped search3 and search15 [17:02:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:00] I'm going to add that to the init scripts on all the pmtpa search boxes and restart them [17:03:21] also also, all search nodes in eqiad are up, I just need to have searchidx1001 start building indexes [17:03:36] which is why I needed those new mysql grants for 10.64 [17:03:43] alright [17:04:00] btw what happened with pmtpa search group on ganglia? [17:04:27] not sure, but the search boxes run an old version of ubuntu and that may have something to do with it [17:04:33] our automatic puppet ganglia stuff may have broken on it [17:04:38] !log increasing mem for java to 3300 on pmtpa search hosts [17:04:40] Logged the message, and now dispatching a T1000 to your position to terminate you. [17:06:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.950 seconds [17:07:31] fantastic... http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=stafford.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [17:08:02] mark: jesus. what is that from? [17:08:11] New patchset: Sumanah; "thinking seriously about our future" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2714 [17:08:19] that seems to happen when puppet has died on all hosts for some reason [17:08:22] and cron restarts it [17:11:05] PROBLEM - Puppet freshness on db46 is CRITICAL: Puppet has not run in the last 10 hours [17:11:05] PROBLEM - Puppet freshness on mw1002 is CRITICAL: Puppet has not run in the last 10 hours [17:11:41] New patchset: Demon; "Evil plans!" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2715 [17:12:21] Change abandoned: Sumanah; "I do not like your plans, Evil Chad!" [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2715 [17:16:01] chrismcmahon: do you think you'll be able to address http://rt.wikimedia.org/Ticket/Display.html?id=2476 today? [17:16:13] maplebed: looking [17:16:20] you mean chris johnson? [17:16:25] moops. [17:16:26] yes. [17:16:28] sorry chrismcmahon [17:16:34] that was autocomplete. [17:16:38] :) [17:16:39] cmjohnson1_: ^^^^ [17:16:44] thank you, mark. [17:16:45] :P [17:16:50] :) [17:17:03] rainman-sr: where were you able to see indication that we were hitting the memory limit? [17:17:24] Jeff_Green, in logs it said at some point GC limit exceeded [17:17:48] ah, ok. thanks [17:19:24] maplebed: regarding 2476...which one are you having an issue with? [17:19:39] LeslieCarr's final comment - ms-be2 is missing. [17:19:58] ah yeah, i couldn't find where it was plugged in :( can you tell me the one below it ? [17:20:06] slight problem w/ that rack...the mrjp-a2 is full [17:20:09] the port where i think it is has something else, so i didn't want to reassign it :) [17:20:33] up to port 24 is connected [17:20:50] i sent an email to Lesliecarr, mark and robh to see where they want it moved [17:21:00] oh you did ? and did i miss it ? [17:21:17] it was about an hour ago [17:24:13] cmjohnson1_: how about sdtpa C3? [17:25:02] there is plenty of space there [17:25:09] let's move it there.
[17:25:58] k...lesliecarr, i will ping you with the network changes once I am finished. [17:29:29] thanks cmjohnson1_! [17:35:32] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [17:35:32] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [17:36:35] PROBLEM - check_gcsip on payments2 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [17:40:16] thank you cmjohnson1_ (sorry, keep going afk) [17:40:29] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 1.328 second response time [17:40:29] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.603 second response time [17:40:29] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 3.586 second response time [17:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:43] yay flap [17:44:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.009 seconds [17:46:15] New patchset: Lcarr; "removing defunct ganglia1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2716 [17:46:44] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2716 [17:46:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2716 [17:47:14] yay [17:50:32] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [17:50:32] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [17:51:04] lesliecarr: port 5 mrjp-c3-sdtpa [17:51:46] cool :) [17:52:49] oohh so I can ask you guys this [17:53:10] how do you map from ge-x/y/z to physical port that chris might plug something into? [17:55:29] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 4.329 second response time [17:55:29] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [18:00:35] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.164 second response time [18:01:02] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [18:02:21] apergos: honestly, it's a semi black art in tampa [18:02:36] usually i find the machine above/below it :) [18:02:40] how about eqiad? [18:03:05] cause if I allocate a port... I have no idea how to tell anyone which one it is :-D [18:03:14] eqiad is super easy, the numbering is asw-$ROW-eqiad.mgmt ge-$RACKNUMBER/0/$PORTNUMBER [18:03:32] oh coooool [18:03:34] lesliecarr: do you need the machine below ms-be2? [18:03:42] cmjohnson1_: that would be great [18:03:54] labstore2 [18:04:05] yeah, the eqiad architecture kicks ass [18:04:06] thanks mark :) [18:04:47] RECOVERY - Lucene on search15 is OK: TCP OK - 2.996 second response time on port 8123 [18:04:58] oh apergos if it's a new machine i'll also do a "show log messages | last 20 " and see if a port is going up/down to double check the port [18:05:06] if it was recently plugged in [18:05:31] smart [18:05:39] :) [18:05:43] ok I'm saving that in my useful notes pile [18:06:23] or...
look at observium [18:06:28] which will tell you this for all network devices [18:06:33] in its activity log [18:06:49] I need to spend more quality time with observium [18:07:08] if you need to spend a lot of quality time with observium, then something is wrong [18:07:17] but I like knowing how observium gets its info [18:07:22] it should be fairly straightforward and quick ;-) [18:07:25] observium [18:07:34] well so far I haven't spent any quality time with it [18:07:45] hence spending any time will be more time :-P [18:17:14] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [18:18:44] ...on the eqiad architecture [18:19:05] the one exception for that is of course the EX4500 for the memcached machines, which isn't even hooked up yet [18:19:21] notpeter: so search1008 dimm is actually bad, so i will put in a replacement case with dell [18:19:25] LeslieCarr: so juniper sells VC expansion modules for the EX4500 now, and we could hook it up into the stacks [18:19:41] i'm not sure I like it, since it's different from the EX4200s [18:19:57] hrm, i think i'd rather not :( [18:19:58] it's supposed to work, not sure how well it would work in practice [18:20:01] yeah [18:20:17] i've had experience being a juniper guinea pig before… there was much yelling [18:20:23] I believe ya ;) [18:20:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:32] mark: LeslieCarr speaking of that, where should the memcached access switch be plugged into? [18:20:59] we can either hook it into the EX4200 stack with a few 10G modules... or hook it up to the core directly [18:21:14] with not so high traffic I would prefer the former, since it's a lot easier also with subnets and such [18:21:25] I assumed I would be running two fibers, one to each cr [18:21:47] but within the rack is a lot easier ;] [18:22:23] if it's not going to be pushing a lot, plug it into two different switches i'd say in the stack :) [18:22:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.239 seconds [18:22:33] looking if 1 & 8 have spare ports [18:22:40] or else we can just order some more 10g cards [18:22:46] New patchset: Bhartshorne; "adding in partman configuration for ms-be hosts. also whitespace retabbing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2717 [18:23:56] yeah [18:23:59] we can change it later [18:24:03] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2717 [18:24:05] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2717 [18:24:10] although we may have to change the subnets on the servers then too, but whatever [18:24:42] 1 & 8 should not have spare ports, as there are 2 connections to each cr [18:25:00] we may have a spare 10G module or two [18:25:04] and if we don't, we should get some spares [18:26:15] LeslieCarr: I'm catching your ganglia1001 removal in my puppet diff. cool if i check it in? [18:26:56] oh yes please [18:27:33] done. thanks! [18:28:48] mark/RobH yep no spares - rob can you order some more ? [18:29:07] you talking about the 4200 fiber module ? [18:29:10] i have a single spare on site. [18:29:17] i mean no extras in the switches [18:29:18] hey RobH I couldn't find a wikitech page about ipmi_mgmt. did you write one? if so, could you link to it from the IPMI page? [18:29:54] RobH: order 2 more so we can still have a spare?
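The eqiad interface-naming rule mentioned a little earlier (switch stack asw-$ROW-eqiad, interface ge-$RACKNUMBER/0/$PORTNUMBER) is regular enough to script; a tiny sketch, where the row letter, rack number and port number are example inputs rather than values taken from this log:

    # Build the switch and interface name for a server port in eqiad, following
    # the convention described above. The inputs below are illustrative examples.
    eqiad_port() {
        local row=$1 rack=$2 port=$3
        printf 'asw-%s-eqiad ge-%s/0/%s\n' "$row" "$rack" "$port"
    }

    eqiad_port c 2 17    # prints: asw-c-eqiad ge-2/0/17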
[18:30:14] maplebed: has no docs, and the in-script docs need cleaning, but run it without any arguments [18:30:24] but yea, i need to finish polishing it and wikitech it [18:30:55] I've figured out how to use it, this time, but it took some work (things like discovering it's only installed on sockpuppet, etc.) [18:31:18] and I know I'll forget before the next time I need to use it. [18:31:25] :P [18:31:47] is it checked into the software git repo ? (loaded question) [18:31:55] lol [18:32:32] LeslieCarr: the script is in puppet so its in that git. [18:32:57] ok :) [18:34:04] ryan_lane: did you get labstore1 installed?..dell requires a DSET test and i see there is an OS login now [18:34:27] cmjohnson1_: I did, but feel free to bring it down [18:34:33] I can do a shutdown if you'd like [18:35:21] yes plz... [18:35:54] RobH: do you know if it's possible to get the main NIC's MAC address info from IPMI? I couldn't find it yesterday (Leslie got me MACs from the switch instead). [18:36:39] !log shutting down labstore1 [18:36:41] Logged the message, Master [18:36:51] maplebed: if sysinfo doesnt give it, then you have to drop to serial console in bios [18:37:06] that i know of. [18:37:16] aka reboot the box and go into setup? [18:37:56] that's too bad. (though I suppose if the box is up, you can get it from the OS...) [18:38:00] ok, tnx. [18:38:12] it may have a very specific ipmi command, but i have not found it [18:39:26] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:13] cmjohnson1_: it should be down now, or very soon [18:41:20] it is down...thx [18:47:51] hey RobH IIRC you had trouble with grub installing on ms-be1. [18:47:58] do you have notes or recall the solution? [18:48:02] I'm hitting the same thing on ms-be3. [18:52:17] OMG, slowest puppet run ever goes to spence -- 42234.76 seconds (aka 11 hours, 45ish minutes) [18:52:36] daaaamnnn... [18:52:55] maybe it's time to delete all of the puppet_checksd/* again. [18:53:13] owwwww [18:53:21] puppet_checksd ? [18:54:19] considering i've had major issues trying to get neon to get up, anything we can do to possibly speed up would be good IMO [18:54:36] no, nevermind. [18:54:47] they don't seem to have a bazillion copies of each check in there today. [18:54:50] maplebed: you are installing on 2tb disks [18:54:59] RobH: yes. [18:55:04] maplebed: so whatever disks you have the /boot data on you need to create a 1mb bios partition first [18:55:16] its a partition type, make it and thats all ya gotta do [18:55:22] can partman make it? [18:55:27] in the installer, yep [18:55:40] * maplebed goes to look for an example. [18:55:57] (and here I thought I'd be able to get away with the same partman recipe for both the fe and be hosts. ::sigh::) [18:56:15] nope, and the thing isnt in a partman recipe of course [18:56:18] that would be too easy ;] [18:56:41] oh, you mean I have to make it by hand; partman won't?
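For readers who have not hit this before: disks this size get a GPT label, and a BIOS-booted machine then needs a tiny partition flagged bios_grub for GRUB to embed its core image in, which is the "1mb bios partition" being described. Doing it by hand looks roughly like the following sketch; the device name and sizes are illustrative, not the actual ms-be layout:

    # Illustrative sketch only; /dev/sda and the offsets are assumptions.
    parted -s /dev/sda mklabel gpt
    parted -s /dev/sda mkpart primary 1MiB 2MiB   # ~1 MB partition for GRUB
    parted -s /dev/sda set 1 bios_grub on         # flag it so grub-install can embed there
    # ...then create /boot, the RAID members, etc. as usual.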
[18:56:52] s/partman/partman-controlled-by-recipes/ [18:58:17] New patchset: Pyoungmeister; "eqiad != pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2718 [18:58:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2718 [18:59:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2718 [19:00:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.156 seconds [19:11:23] maplebed: correct, it was by hand since it was the initial build [19:11:38] sorry for delay, was on call with vendor [19:11:39] but you think I will be able to build a recipe to do it? [19:11:41] np. [19:12:00] yea its identical to other partman stuff, just add the initial 1mb bios part [19:12:03] bios partition [19:15:35] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Wed Feb 22 19:15:12 UTC 2012 [19:16:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Feb 22 19:16:27 UTC 2012 [19:20:56] hey ben, can you help the analytics team out with installing some software on stat1? (see mediawiki.org/wiki/Analytics/Infrastructure/Stat1) [19:21:35] drdee: it's on my list, but I've got 3 annoying servers that need to get out asap. [19:22:08] drdee: did you get signoff from everyone involved that the doc we made describing stat1 (http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1) is correct? [19:22:30] (I'm specifically thinking of any consumers of bayes) [19:22:55] or at least if not 100% correct, that there aren't any issues with the private IP assignment? [19:23:05] (that's the only thing that's really hard to change after the fact) [19:24:03] oh wait. it is aimed at a public IP for mediawiki, right? [19:24:06] I had forgotten. [19:24:12] (yay having it written down) [19:24:20] !log removing old wap (mobile) site from ekrem as it hasn't been accessed in a day [19:24:22] Logged the message, Mistress of the network gear. [19:27:52] New patchset: Ryan Lane; "Changing smtp host, on Reedy's request" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2719 [19:28:02] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [19:28:45] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2719 [19:28:50] Ryan_Lane: looks good [19:29:00] ty [19:29:08] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2719 [19:29:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2719 [19:29:29] maplebed: yes stat1 remains public [19:29:56] new requests do not interfere with bayes users [19:31:47] is anyone planning on starting an important puppet job any time in the next hour ? [19:32:03] dunno [19:32:05] why? [19:32:08] i'm tired of neon not being able to build, thinking of doing an iptables rule [19:32:50] blocking pretty much everything else on port 8140 (leaving established?
) [19:32:59] * Ryan_Lane nods [19:33:02] that's fine [19:33:07] I'm running it on formey right now, though [19:33:10] New patchset: Bhartshorne; "adding a new partman config for ms-be hosts to create a tiny bios partition for grub" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2720 [19:33:28] i'll wait until after you're done before i do the drop on 8140 rule :) [19:33:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2720 [19:33:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2720 [19:35:59] LeslieCarr: you're going to block puppet from running on any host except neon? [19:36:10] yep [19:36:17] ok. thanks for the headsup. [19:36:31] gonna allow existing connections [19:36:35] that ok with you ? [19:36:40] since you're building the ms's ? [19:36:41] any estimate on how long it'll take? [19:36:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:54] hopefully after that it'll be a 15 minute thing :) [19:36:59] hopefully.... [19:37:06] the part of the build I'm working on right now is all pxe and formatting, so I don't think they hit puppet. [19:37:28] (except that in order to change the partitioning stuff on brewster I need to use puppet, but I'll just do it on the host for now and backport my changes when I'm done.) [19:39:25] !log blocking all new puppet connections on all hosts except neon [19:39:27] Logged the message, Mistress of the network gear. [19:39:35] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Wed Feb 22 19:39:07 UTC 2012 [19:42:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.048 seconds [19:45:26] New patchset: Pyoungmeister; "fqdns: not so much. oh well, doesn't really matter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2721 [19:45:44] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [19:48:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2721 [19:48:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2721 [19:52:57] New patchset: Lcarr; "Making tweaks for nagios3 installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2722 [19:53:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2722 [19:54:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2722 [19:54:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2722 [19:57:30] robh: removing row c in pmtpa...scs-c1...can that be disconnected? [20:00:35] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:35] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:11] RECOVERY - Lucene on search15 is OK: TCP OK - 2.992 second response time on port 8123 [20:01:32] oh now minfraud2 eh. meh. 
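The temporary measure being described above — refuse new connections to the puppetmaster port so only neon and already-established runs can reach it — comes down to a few iptables rules; a rough sketch, where the 10.x address standing in for neon is made up for illustration:

    # Sketch of the temporary port-8140 block; 10.0.0.53 is a placeholder for
    # neon's real address, not a value taken from this log.
    iptables -A INPUT -p tcp --dport 8140 -m state --state ESTABLISHED -j ACCEPT
    iptables -A INPUT -p tcp --dport 8140 -s 10.0.0.53 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8140 -j DROP

    # Undo it afterwards (cf. the later "flushed iptables on stafford" entry):
    iptables -F INPUT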
[20:01:38] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123 [20:02:59] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:05:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:11] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:29] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:29] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:13:11] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [20:13:11] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [20:13:37] cmjohnson1_: it can, but you need to relocate it [20:13:54] cmjohnson1_: put it in d1 pmtpa, drop a ticket to rename and update it [20:13:54] RobH: did you say search1008 or search1014 was good to go? [20:14:05] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [20:14:08] notpeter: both have bad parts, 1008 bad dimm, 1014 bad mainboard [20:14:16] ah, ok [20:14:18] thanks! [20:14:23] so i will place the RMA today, will swap out parts on Friday [20:15:17] sweeeet [20:15:26] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:26] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:27] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:27] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:44] RECOVERY - Lucene on search1004 is OK: TCP OK - 0.034 second response time on port 8123 [20:17:14] RECOVERY - Lucene on search1009 is OK: TCP OK - 0.026 second response time on port 8123 [20:17:23] RECOVERY - Lucene on search1006 is OK: TCP OK - 0.027 second response time on port 8123 [20:17:50] RECOVERY - Lucene on search1005 is OK: TCP OK - 0.031 second response time on port 8123 [20:17:50] RECOVERY - Lucene on search1012 is OK: TCP OK - 0.027 second response time on port 8123 [20:17:59] RECOVERY - Lucene on search1011 is OK: TCP OK - 0.031 second response time on port 8123 [20:18:08] RECOVERY - Lucene on search1010 is OK: TCP OK - 0.027 second response time on port 8123 [20:18:08] RECOVERY - Lucene on search1013 is OK: TCP OK - 0.029 second response time on port 8123 [20:18:35] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.026 second response time on port 8123 [20:19:11] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.032 second response time on port 8123 [20:19:20] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [20:19:29] RECOVERY - Lucene on search1020 is OK: TCP OK - 0.027 second response time on port 8123 [20:19:38] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.026 second response time on port 8123 [20:19:47] RECOVERY - 
Lucene on search1019 is OK: TCP OK - 0.031 second response time on port 8123 [20:20:32] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:06] hrm, so i am getting err: /Stage[main]/Nagios::Monitor/Nagios_service[ms-fe2 ntp]: Could not evaluate: Puppet::Util::FileType::FileTypeFlat could not write /etc/nagios3/puppet_checks.d/neon.cfg: No such file or directory - /etc/nagios3/puppet_checks.d/neon.cfg -- on spence when running puppet, i'm not sure why it's trying to write to nagios3/neon.cfg .... [20:23:32] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [20:25:29] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:30] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:43] joy [20:30:26] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:26] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:27] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:27] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:43] New patchset: Lcarr; "Changing puppet agent timeout to 960 since 480 is sometimes not enough for puppet server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2723 [20:34:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2723 [20:34:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2723 [20:35:23] PROBLEM - check_minfraud2 on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:23] PROBLEM - check_minfraud2 on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:56] !log flushed iptables on stafford - all puppet runs should now work [20:35:58] Logged the message, Mistress of the network gear.
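The change merged at 20:34 (r2723) is a single agent-side setting; done by hand on one box it would amount to something like the sketch below. The puppet.conf path is the stock location and an assumption here, not quoted from the change, and in reality the setting is managed through the operations/puppet repo rather than edited locally:

    # Sketch of the agent timeout bump behind r2723, assuming puppet.conf does
    # not already contain an [agent] section.
    printf '[agent]\n    configtimeout = 960\n' >> /etc/puppet/puppet.conf

    # Or, for a single manual run:
    puppet agent --test --configtimeout=960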
[20:36:02] maplebed: fyi :) [20:36:17] maplebed: the fyi is for the flushed iptables on stafford [20:40:29] RECOVERY - check_minfraud2 on payments2 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.159 second response time [20:40:30] RECOVERY - check_minfraud2 on payments3 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.229 second response time [20:40:30] RECOVERY - check_minfraud2 on payments4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.158 second response time [20:40:30] RECOVERY - check_minfraud2 on payments1 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.159 second response time [20:46:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:05] thanks LeslieCarr [20:50:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.454 seconds [20:50:48] i also doubled the timeout on the puppet clients so they'll wait 960 seconds before deciding to cancel their request (since some catalogs take 500s+ to compile) [21:02:09] LeslieCarr: do you remember enough partman stuff to tell me why https://gerrit.wikimedia.org/r/#patch,sidebyside,2720,1,files/autoinstall/raid1-2TB-1partition.cfg won't work? [21:02:44] oh, nm. I'm missing a period. [21:23:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds [21:32:40] RECOVERY - Lucene on search15 is OK: TCP OK - 8.998 second response time on port 8123 [21:35:23] have any of you seen the error "Debootstrap Error: Couldn't retrieve dists/lucid/main/binary-amd64/Packages" when trying to build a new server? [21:36:33] RobH: maybe? [21:43:49] fyi....reattempting commons deploy momentarily [21:44:57] RECOVERY - DPKG on erzurumi is OK: All packages OK [21:46:23] could someone take a quick skim of the logs on db22 for anything that should stop our deploy? [21:46:30] woosters: ^ [21:46:32] maplebed: have not seen that nope [21:46:54] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [21:47:06] I don't know if it's an artifact of a failed partitioning scheme or some other error. [21:48:32] ya, robla [21:48:59] will let u know shortly [21:52:40] !log dataset1001 eth1 connected [21:52:42] apergos: ^ [21:52:42] Logged the message, RobH [21:53:30] yay thanks [21:53:54] tomorrow I'll try to do the rest :-) [21:54:10] the info you gave me was good enough [21:54:40] for finding the port =] [21:54:50] well I had to have leslie tell me the secret mapping algorithm from interface number to port :-) [22:03:15] RECOVERY - Lucene on search15 is OK: TCP OK - 8.995 second response time on port 8123 [22:03:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:29] do we not normally include storman or some other RAID CLI tool on the aacraid boxes? [22:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.187 seconds [22:15:24] PROBLEM - Lucene on search15 is CRITICAL: Connection timed out [22:19:09] RECOVERY - Lucene on search15 is OK: TCP OK - 0.006 second response time on port 8123 [22:23:26] zomg does puppet normally crap 200M of cache crap on a host? [22:24:06] rainman-sr: you got a minute? [22:28:55] RobH: I found out why it couldn't retrieve the packages. no space left on device.
because it was trying to install to the 10M partition I had made for grub. ::sigh:: [22:29:11] maplebed: :-( [22:32:44] oh man [22:35:15] well, I think it was my mistake. I failed to say the RAID partition should be /dev/sd{a,b}2 (instead of 1) after adding the 10M partition. [22:35:19] trying again now. [22:35:24] (after much head pounding) [22:35:32] there has got to be a better way. [22:35:40] install netbsd! [22:35:57] sorry, I think I must have typoed. I meant *better* way. [22:36:03] :-P [22:36:10] maplebed: 10m? [22:36:11] 1m [22:36:40] bios part only has to be 1, did i make it 10? (won't matter) [22:37:11] I think I just didn't follow your instructions correctly. [22:37:33] yea the bios partition only needs to be 1mb [22:37:56] I also called it 'grub' instead of 'bios'. [22:38:10] * maplebed is rebellious. [22:44:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:48:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.742 seconds [22:49:08] New review: Hashar; "Please note the test/mediawiki repo will be destroyed and that commit will be lost :-D" [test/mediawiki/extensions/examples] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2714 [22:49:08] Change merged: Hashar; [test/mediawiki/extensions/examples] (master) - https://gerrit.wikimedia.org/r/2714 [22:51:50] we've been running 1.19 on commons for 45 min now. any weird spikes we should be investigating? [22:52:54] hmm, https://graphite.wikimedia.org/dashboard/ is down [22:57:26] nvm [23:01:01] ok, tired, going home. [23:13:11] New patchset: Lcarr; "Trying to move exported resources in new nagios host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2724 [23:13:49] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2724 [23:13:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2724 [23:16:13] New patchset: Pyoungmeister; "need the quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2725 [23:16:38] let's hope that my attempt at overriding resources works :) [23:16:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2725 [23:16:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2725 [23:24:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.539 seconds [23:32:58] New patchset: Lcarr; "Revert "Trying to move exported resources in new nagios host"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2726 [23:34:25] New patchset: Lcarr; "Revert "Trying to move exported resources in new nagios host"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2727 [23:35:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2726 [23:35:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2726 [23:35:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2726 [23:35:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2727 [23:35:26] Change merged: Lcarr; [operations/puppet]
(production) - https://gerrit.wikimedia.org/r/2727 [23:39:14] New patchset: Lcarr; "fixing reference to nagios3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2728 [23:39:47] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2728 [23:39:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2728 [23:42:33] !log restarting opendj on virt0 [23:42:35] Logged the message, Master [23:43:01] this makes absolutely no sense [23:45:04] www-data can run php... [23:46:55] why would php not run? [23:47:48] php sucks ;) [23:48:50] php sucks less every release [23:49:21] -_- [23:49:27] apache stop, then start worked [23:49:49] I guess something screwed apache up, and it wasn't actually restarting properly [23:50:20] the crappy part of only having one apache node ;) [23:50:28] and no load balancing/health checks [23:56:08] doh, i just realized, neon = internal host right now :( [23:56:15] need to change it/reinstall yet again [23:56:19] heh [23:58:15] New patchset: Lcarr; "Putting neon in decomissioned (reinstalling as public)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2729 [23:58:37] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2729 [23:58:54] LeslieCarr: did apergos tell you about the ssh key issue with neon? [23:59:09] TimStarling: yeah, cleared out the puppet config [23:59:23] using puppetstoredconfigclean.rb [23:59:57] ok
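For completeness, the cleanup mentioned at the end — making puppet forget a reinstalled host so its stale exported resources (nagios checks, ssh host keys) and old certificate go away — is roughly the following, run on the puppetmaster. The script name is the one cited above; the fully qualified hostname is an assumed example:

    # Sketch of the post-reinstall cleanup described above; the FQDN is an
    # assumption for illustration.
    ruby puppetstoredconfigclean.rb neon.wikimedia.org   # purge stored/exported resources
    puppetca --clean neon.wikimedia.org                  # drop the old client certificate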