[00:05:37] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[00:07:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:13] New patchset: Jdlrobson; "refine contact us emails to include referring page and whether from app (bug 36388)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238
[00:16:47] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238
[00:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[00:55:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.549 seconds
[01:31:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[01:40:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 259 seconds
[01:42:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:42:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 217 seconds
[01:44:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds
[01:50:09] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 649s
[01:51:39] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 19s
[01:51:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 13 seconds
[01:56:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.069 seconds
[02:20:27] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.046 seconds
[02:39:30] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[02:58:33] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[03:11:08] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[03:23:08] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:11] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:11] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[03:43:14] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:46:14] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[03:56:17] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[05:05:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[05:26:01] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Sep 28 05:25:49 UTC 2012
[05:33:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:35:10] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[05:36:04] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[05:38:28] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[05:40:07] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:41:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[05:41:47] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[05:56:37] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:38] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:38] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:39] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:39] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:41] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:41] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:42] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:42] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:43] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[06:36:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.389 seconds
[06:49:43] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[06:57:13] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080
[07:02:46] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[07:11:46] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[07:11:55] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[07:11:55] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[07:12:22] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[07:12:31] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[07:12:31] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[07:12:31] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[07:12:40] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[07:13:07] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[07:13:07] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[07:13:07] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[07:13:16] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[07:17:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.779 seconds
[07:51:11] New review: Tim Starling; "One issue, but that was there before, this is no worse." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8438
[07:56:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:11:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.018 seconds
[08:46:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:00:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[09:32:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:47:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds
[10:06:19] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[10:20:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:32:15] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235
[10:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.885 seconds
[10:33:31] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236
[10:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235
[10:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236
[11:06:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.507 seconds
[11:32:18] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:55:15] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours
[12:09:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[12:21:21] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:24] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:52:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.814 seconds
[12:59:18] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[13:11:49] !log Moved BGP transit sessions to AS1257 from cr2-eqiad (over equinix exchange) to cr1-eqiad (dedicated link)
[13:12:01] Logged the message, Master
[13:12:21] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[13:24:21] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[13:28:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:24] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[13:28:24] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[13:40:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.303 seconds
[13:44:14] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[13:47:14] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[13:57:17] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[14:13:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:21:38] hi all!
[14:21:49] mark, if you have a sec, could you help me with that netapp mount for analytics?
[14:25:31] what netapp mount for analytics?
[14:29:16] mark: #3619: Allow analytics cluster to mount fundraising archive...
[14:29:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[14:30:43] thanks Jeff, ja, that :)
[14:30:57] ok
[14:31:39] mark: while you're in there, I wonder if we should allow oxygen and emery to mount r/w as well as locke?
[14:33:47] ottomata: you can now NFS mount (ro) nas1001-a:/vol/fr_archive from analytics1003
[14:33:56] Jeff_Green: for failover you mean?
[14:34:09] danke!
[14:34:18] yeah, one less thing to do if locke blows up or overloads mid-fundraiser
[14:34:54] ok
[14:35:27] oh ho RobH *pounce*: I'm gonna reboot Aluminium, are you within an hour or two of the DC in off chance it doesn't come back?
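For context on the fundraising-archive exchange above (14:29-14:34): once the filer export is in place, the read-only NFS mount Jeff_Green describes is a one-liner on the client. A minimal sketch, using the export named in the log but an assumed /mnt/fr_archive mount point (the real mount point is not given here, and in production this would normally be managed through Puppet rather than typed by hand):

    # one-off read-only mount on analytics1003 (mount point is an assumed example)
    mkdir -p /mnt/fr_archive
    mount -t nfs -o ro,hard,intr nas1001-a:/vol/fr_archive /mnt/fr_archive

    # equivalent /etc/fstab entry for a persistent mount:
    # nas1001-a:/vol/fr_archive  /mnt/fr_archive  nfs  ro,hard,intr  0  0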
[14:35:46] Yep
[14:35:49] ok cool
[14:35:53] onsite already
[14:35:59] oh nice
[14:36:09] I was onsite yesterday and forgot to remove the topic update when I left =P
[14:36:26] emery already had access...
[14:36:43] robh we should just put a gps tracker on you :-P
[14:36:51] mark: ah good
[14:37:00] Jeff_Green: no thanks, I already have to fight to take vacation as it is.
[14:37:22] RobH: it would have to have an on-off switch for on the clock hours
[14:39:17] oxygen has been granted ro access to nas1001-a
[14:39:21] you can't write to that volume anyway
[14:39:24] !log rebooting aluminium
[14:39:33] oh is oxygen in eqiad?
[14:39:34] Logged the message, Master
[14:39:37] yes
[14:39:41] oic
[14:43:14] Jeff_Green: any element is in eqiad
[14:43:36] tampa misc names are encyclopedians
[14:43:46] esams just doesnt get misc servers ;]
[14:43:50] ah!
[14:44:04] i have a rough outline to throw on wikitech for naming conventions
[14:44:09] what happens when we run out of elements? will we add isotopes?
[14:44:18] plan to do later today so the misc servers can be named properly when im not about next week
[14:44:30] I dont see us having that many misc servers at eqiad
[14:44:39] i see
[14:44:39] but if we do, that sounds like a good plan to me ;]
[14:45:43] New patchset: Mark Bergsma; "Apply Puppet Varnish config to cp1029-1036" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25633
[14:46:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25633
[14:47:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25633
[14:49:51] jargh, Jeff_Green, ha, loading this data into hadoop is gonna be annoying
[14:49:58] because you have so many small files! :p
[14:50:09] cmjohnson1: thats so annoying then
[14:50:21] so 720 has the drac7 and the settings are the same as on the 320
[14:50:22] ottomata: ha.
[14:50:25] and not same results
[14:50:25] ?
[14:50:35] (bringing conversation in here since I was also chatting with mark)
[14:50:43] no, i was comparing the 2 last night b4 i left
[14:50:51] i didn't see any differences
[14:51:03] esams has ~ 8 misc servers
[14:51:09] what is one of the 720s?
[14:51:17] cmjohnson1: these are the two where one was puppet run and one wasnt right?
[14:51:27] im going to take over the non puppet run system if its still not used to compare settings.
[14:51:46] yes
[14:51:48] mark: what is our naming standard for misc in esams? (just curious)
[14:51:56] whatever crap names I come up with
[14:51:58] cmjohnson1: cool, what were they, db61 and 62?
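The "so many small files" complaint at 14:49 is a standard HDFS pain point: every file consumes NameNode metadata and tends to map to its own task. One common workaround, shown here only as a hedged illustration with made-up paths (the log does not say how the data was actually loaded), is to pack the small files into a Hadoop archive or merge them before re-uploading:

    # pack a directory of small files into a single .har archive (paths are illustrative)
    hadoop archive -archiveName fr_logs.har -p /user/otto fr_archive_raw /user/otto/archives

    # or concatenate the directory into one local file, to re-upload as a single large file
    hadoop fs -getmerge /user/otto/fr_archive_raw ./fr_archive_merged.log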
[14:52:10] yes, 62 is no puppet run
[14:52:11] for the most recent batch I used some famous dutch people
[14:52:11] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Fri Sep 28 14:51:56 UTC 2012
[14:52:54] just check rack oe12 in racktables
[14:53:18] !log db62 being pulled for drac7 work
[14:53:28] Logged the message, RobH
[14:54:22] cmjohnson1: im going to make a wikitech page later today, but fyi
[14:54:22] http://support.dell.com/support/edocs/software/smdrac3/idrac7/index.htm
[14:54:27] drac7 manual
[14:54:37] cool ^ thx
[14:56:40] so after some reads and writes on ms-be6 with the new controller we have a disk failure apparently
[14:56:49] (reporting for anyone following along)
[14:57:09] megacli reports it as firmware failed, ls on the partition gives an i/o error
[15:01:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:03:01] er on the filesystem rather :-P
[15:04:04] cmjohnson1: On all the 320s, you need to disable virtualization tech on the cpu in bios
[15:04:19] basically you have to confirm thats off on all hosts all the time ;]
[15:04:22] (except labs)
[15:04:41] okay
[15:04:46] (i dunno if you already did, just mentioning it)
[15:05:19] i usually do...but haven't done anything with the 320's
[15:06:17] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[15:06:17] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[15:06:20] why does that need to be disabled?
[15:07:18] also PLEASE set up all new boxes WITHOUT serial console redirection after boot, thanks :)
[15:08:06] hahahaha
[15:08:23] mark: ryan pointed out awhile ago a few exploits that require that, and that we should just disable
[15:08:26] it was in an ops meeting
[15:08:37] and everyone thought it was a good idea (who actually replied in meeting)
[15:08:51] should we leave it on for some reason?
[15:08:58] cmjohnson1: so its a default change in bios on the 320s
[15:09:45] Serial Port Address: on the 320s defaults to Serial Device1=COM2,Serial Device2=COM1
[15:09:59] should be Serial Device1=COM1,Serial Device2=COM2
[15:10:06] which is what it defaulted to on everything else =P
[15:10:13] easy to overlook but when i had both up i noticed
[15:10:18] so they werent quite identical ;]
[15:10:27] now the 320 redirects fine.
[15:10:36] RobH: ok
[15:10:42] no it's fine, I was wondering
[15:10:44] I'll be updating a platform specific page later today once i have all these remotely accessible
[15:10:53] indeed, security is the only reason I could think of, was wondering if there were others
[15:11:03] mark: yea I honestly dont recall the full explanation, it was months ago though
[15:11:12] ok
[15:12:07] so all the eqiad based misc servers should be mgmt accessible later today. I will drop a network ticket with all the port info for labeling
[15:12:25] and then i want to spend my afternoon with the allocations that are pending so nothing is waiting on me when im gone next week
[15:14:05] are all memcached servers connected?
[15:14:09] I saw only 8 up a few days ago
[15:14:54] mark: up to 1014
[15:15:03] 1015/16 are not as i want 14's cable to confirm working
[15:15:11] i dont wanna open more cables so we can return them (the dell ones)
[15:15:24] notpeter tried to use the 1009+ yesterday and couldnt
[15:15:34] seems networking isnt up on them, link is on them once i reseated.
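For anyone reproducing the ms-be6 diagnosis at 14:56-14:57: the controller's view of each physical drive can be dumped with LSI's MegaCLI tool, which is what reports states such as "firmware failed". A minimal sketch; the binary is variously packaged as megacli, MegaCli or MegaCli64:

    # list every physical drive with its slot and firmware state (Online, Failed, Unconfigured(good), ...)
    megacli -PDList -aALL | egrep 'Slot Number|Firmware state|Inquiry Data'

    # summary of the logical drives built on top of those disks
    megacli -LDInfo -Lall -aALL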
[15:15:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[15:17:53] well we need that up before you're gone
[15:18:13] lemme check now
[15:18:47] Xcvr 8 NON-JNPR 174410000014 UNKNOWN
[15:18:47] Xcvr 9 NON-JNPR 174410000005 UNKNOWN
[15:18:47] Xcvr 10 NON-JNPR 174410000027 SFP+-10G-CU1M
[15:18:47] Xcvr 11 NON-JNPR 174410000030 SFP+-10G-CU1M
[15:18:47] Xcvr 12 NON-JNPR 174410000026 SFP+-10G-CU1M
[15:18:47] Xcvr 13 NON-JNPR MOC16262718 SFP-CX
[15:18:48] Xcvr 39 NON-JNPR 174540000036 SFP+-10G-CU1M
[15:18:56] so some seem to work, others not?
[15:19:36] 1014 doesn't work it looks like
[15:25:11] RECOVERY - Puppet freshness on cp1032 is OK: puppet ran at Fri Sep 28 15:24:51 UTC 2012
[15:28:20] mark: damn, thats the new dell cable
[15:28:29] =/
[15:28:36] so i will do a return for those later.
[15:28:44] what brand are they?
[15:29:29] i dont recall, will go look in a moment
[15:29:38] these were the dell item you linked me to, but we didnt know if it would work
[15:30:03] yes
[15:30:06] apparently they don't :(
[15:30:26] annoying
[15:30:32] probably juniper did this on purpose
[15:30:32] cisco.
[15:30:41] cisco cables? hm.
[15:46:02] RECOVERY - NTP on cp1032 is OK: NTP OK: Offset -0.04797828197 secs
[15:46:11] !log done with db62. it was in the ubuntu installer when I took it over, so its just sitting now
[15:46:21] Logged the message, RobH
[15:48:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:17] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:18] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:18] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:19] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:19] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:20] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:20] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:21] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:21] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:22] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:22] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:23] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[15:59:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.416 seconds
[16:07:13] PROBLEM - Varnish HTTP upload-frontend on cp1032 is CRITICAL: Connection refused
[16:25:22] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100%
[16:25:49] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[16:35:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:48:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.866 seconds
[16:50:34] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[16:57:28] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[16:57:47] and it's not going to either
[17:03:28] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[17:03:43] apergos: i am going to replace disk 7 and start the copy...is that still okay?
[17:03:50] uh
[17:03:55] can you wait about 5 mins
[17:04:01] sure
[17:04:03] (for some copies to complete
[17:04:10] I'll holler as soon as these ones are ready
[17:04:12] also updated ticket for dell w/your comments from this morning
[17:04:14] cool
[17:04:16] yeah I saw
[17:04:29] good news w/ the setup we have now...the failure led shows
[17:04:37] yay
[17:04:44] they've emailed me 2x and called once
[17:04:52] really? asking what?
[17:05:10] the status on whether or not we were seeing better results w/ the new card
[17:05:13] ah
[17:05:39] yeah I don't know. on the one hand a disk failure didn't cause catastrophic collapse but otoh we haven't tried to replace it yet either
[17:11:22] hmm maybe about 5 more mins, sorry
[17:16:02] apergos: no problem..let me know I am updating the misc servers here
[17:16:07] ok
[17:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:26:36] cmjohnson1: all yours
[17:26:47] I'll be back later, I'll check in and see what happened
[17:29:19] ok
[17:36:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds
[17:38:29] PROBLEM - Varnish HTTP upload-frontend on cp1030 is CRITICAL: Connection refused
[17:38:56] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: Connection refused by host
[17:39:14] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host
[17:39:41] PROBLEM - Varnish HTTP upload-backend on cp1030 is CRITICAL: Connection refused
[17:41:11] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:41:11] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[17:41:11] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:41:20] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:41:20] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:41:20] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:41:47] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:41:47] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:41:47] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:42:05] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:42:05] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:42:05] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:49:35] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:57:50] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[18:04:01] !log stopping mysql on db1047 for upgrades
[18:04:11] Logged the message, Master
[18:09:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:11:10] <^demon> binasher: Are you doing upgrades on other dbs in that range?
[18:11:17] * ^demon wonders because db1048 is important to him
[18:11:22] nope
[18:11:31] <^demon> Okie dokie, carry on.
[18:11:31] db1047 is for enwiki analytics
[18:12:36] they're getting a shiny new storage array! and are going to be our mariadb guinea pig
[18:20:36] New review: Nemo bis; "Bug closed, change to be abandoned." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/25599
[18:21:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.908 seconds
[18:30:32] RECOVERY - Host ms-be6 is UP: PING WARNING - Packet loss = 58%, RTA = 43.78 ms
[18:30:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876
[18:31:08] New review: Reedy; "Nice." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23935
[18:31:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24326
[18:32:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24561
[18:32:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25493
[18:33:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23076
[18:33:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25196
[18:33:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25231
[18:33:59] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: Connection refused by host
[18:34:08] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host
[18:34:35] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:34:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24671
[18:34:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24672
[18:34:53] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:35:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16935
[18:35:11] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused
[18:35:11] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: Connection refused by host
[18:35:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12556
[18:39:05] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:39:09] New review: Reedy; "Needs rebasing! :(" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23059
[18:39:25] New review: Reedy; "Needs rebasing! :(" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23935
[18:45:03] Change abandoned: Dereckson; "The logo has been changed on the wiki by a CSS modification. This configuration change is so unneces..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25599
[18:49:37] Reedy: I suppose such a big diff is bound to be a nightmare to rebase?
[18:50:23] aka perhaps one change should wait for the other to be merged
[18:51:50] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:51:59] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[18:52:08] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[18:52:08] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:52:11] New patchset: Reedy; "(bug 29902) Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059
[18:52:17] ^ that was pretty easy to rebase
[18:52:26] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:52:26] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:52:26] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[18:52:26] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:52:26] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[18:52:35] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[18:52:35] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[18:52:35] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[18:52:35] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[18:52:39] Reedy: does this mean that you'll do the other one as well? :)
[18:52:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059
[18:52:55] I'm gonna have a looook
[18:52:59] oki
[18:53:02] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:54:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18125
[18:55:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:56:05] so if anyone is following along w/ms-be6 issues...replaced disk 7 had to re-do the raid config and have several disks not mounting http://p.defau.lt/?KtPq9kPsWBz7uwIRCM61HA
[18:59:46] New patchset: Reedy; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[19:00:01] New patchset: Catrope; "Fix typo in variable name" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25694
[19:00:16] Nemo_bis: I think there's a few new extra additions that need tidying up..
[19:00:31] Reedy: ok, can you merge this in the meanwhile?
[19:00:34] I'll do another commit
[19:00:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25694
[19:01:04] Reedy: also, does it require interwiki update or not?
[19:01:26] What do you mean?
[19:02:26] Reedy: I mean that on some of those wikis now e.g. Wikipedia: links to a local page while previously it was an interwiki
[19:02:41] does this work automatically or does the interwiki cache need to be updated?
[19:03:02] No, it doesn't need updating
[19:03:05] they all use the same cache
[19:03:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[19:03:35] good
[19:06:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.745 seconds
[19:07:44] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Fri Sep 28 19:07:15 UTC 2012
[19:21:48] New patchset: Dereckson; "Removing proteins@msu.edu rate limiter exemption rule." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25696
[19:34:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[19:35:52] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:01] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:41:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:51:12] !log Destroyed snapmirror relationship between nas1001-a:images -> nas1-a:images and deleted related snapshots
[19:51:26] Logged the message, Master
[19:54:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.031 seconds
[19:55:54] !log Destroyed volume nas1-a:images
[19:56:04] Logged the message, Master
[20:02:15] !log Destroyed test0 aggregate on nas1-a, zeroing disks
[20:02:25] Logged the message, Master
[20:04:23] !log Destroyed nas1001-a:images volume, containing aggregate test0, and started zeroing drives
[20:04:33] Logged the message, Master
[20:05:43] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:07:22] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[20:27:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:43:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[20:44:27] New patchset: Ottomata; "Installing udp-filter on analytics machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25703
[20:45:23] New patchset: Reedy; "Revert "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25704
[20:45:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25704
[20:45:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25703
[20:47:40] New review: Catrope; "This was reverted because it broke the sidebar on ptwiki. Specifically, Wikipedia: was no longer an ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[20:57:55] New patchset: Jgreen; "add root@indium to fundraising archive user backupmover's auth keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25705
[20:58:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25705
[21:06:30] New patchset: preilly; "add Dialog Sri Lanka configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25520
[21:07:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25520
[21:10:54] notpeter: you around?
[21:11:13] paravoid: you around?
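The swift PROBLEM/RECOVERY lines scattered through this log ("PROCS CRITICAL: 0 processes with regex args ...") are process-count checks on the swift daemons. The exact plugin invocation is not shown in the log, but output in that format is what the stock Nagios check_procs plugin produces when matching on a daemon's argument list; per daemon, the check would look roughly like:

    # hypothetical check for one swift daemon: critical if fewer than 1 matching process
    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-replicator'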
[21:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:34] no cmjohnson but here's the news from ms-be6: I now see (I guess after his disk replacement)
[21:15:42] Firmware state: Unconfigured(good), Spun Up
[21:15:49] for four drives over there (MegaCli output)
[21:28:03] !log disabled puppet and stopped swift processes on ms-be6 again. noticed four drives in "Unconfigured" state in megacli output after disk replacement, don't know more details about how that went.
[21:28:09] ah woops there he is
[21:28:14] Logged the message, Master
[21:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds
[21:30:07] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[21:30:16] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[21:30:16] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[21:30:34] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[21:30:43] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[21:30:43] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[21:30:43] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[21:31:01] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[21:31:01] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[21:31:01] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[21:31:19] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[21:31:28] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[21:31:30] preilly: still need something?
[21:31:55] notpeter yeah can you merge https://gerrit.wikimedia.org/r/#/c/25708/
[21:33:25] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:53:58] preilly: can you rebase, plox
[21:54:48] notpeter: no need
[21:54:55] notpeter: just merge this too https://gerrit.wikimedia.org/r/#/c/25520/2
[21:55:06] kk
[21:55:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25708
[21:55:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25520
[21:56:22] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours
[21:56:25] notpeter: thanks!
[21:56:57] yep. sorry for being slow. at conference
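"Firmware state: Unconfigured(good), Spun Up" (21:15 above) means the controller sees healthy drives that are not part of any logical drive, which is why their filesystems could not mount. On these MegaRAID controllers the usual remedy is to re-add each such drive as a single-disk RAID-0 logical drive and then repartition and format it; a hedged sketch with placeholder enclosure:slot values (the real values come from the PDList output):

    # find enclosure and slot numbers of the unconfigured drives
    megacli -PDList -aALL | egrep 'Enclosure Device ID|Slot Number|Firmware state'

    # re-add one drive as a single-disk RAID0 logical drive (32:7 is a placeholder)
    megacli -CfgLdAdd -r0 '[32:7]' -a0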
[22:02:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:14:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.082 seconds
[22:22:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:52] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[22:48:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:55:40] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Fri Sep 28 22:55:21 UTC 2012
[23:01:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.567 seconds
[23:02:52] RECOVERY - Varnish HTTP upload-backend on cp1030 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.053 seconds
[23:03:01] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa
[23:03:55] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[23:21:28] RECOVERY - NTP on cp1030 is OK: NTP OK: Offset -0.04479074478 secs
[23:25:49] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[23:29:52] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[23:29:52] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:51] New patchset: Asher; "remove db1047 from mysql::packages for mariadb testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25717
[23:36:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25717
[23:39:15] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25717
[23:44:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:46:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.772 seconds
[23:47:52] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[23:49:43] New patchset: Asher; "also exempt db1047 from mysql::conf while testing mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25718
[23:50:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25718
[23:51:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25718
[23:54:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25696
[23:54:50] !log rebooting db1047 to new kernel
[23:55:01] Logged the message, Master
[23:58:43] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
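A closing note on the recurring "Puppet freshness ... Puppet has not run in the last 10 hours" alerts throughout this log: they come from Nagios freshness checking of puppet runs, and the exact production check is not shown here. As a rough local approximation only, the age of the agent's last applied catalog can be read from its state file (path as on Puppet 2.7-era Ubuntu hosts):

    # when did the puppet agent last apply a catalog on this host?
    ls -l /var/lib/puppet/state/last_run_summary.yaml

    # flag the host if that file is older than 10 hours (36000 seconds)
    age=$(( $(date +%s) - $(stat -c %Y /var/lib/puppet/state/last_run_summary.yaml) ))
    [ "$age" -gt 36000 ] && echo "CRITICAL: Puppet has not run in the last 10 hours"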