[00:05:37] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[00:07:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:14:13] New patchset: Jdlrobson; "refine contact us emails to include referring page and whether from app (bug 36388)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238
[00:16:47] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24238
[00:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[00:55:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:07:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.549 seconds
[01:31:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[01:40:24] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 259 seconds
[01:42:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:42:57] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 217 seconds
[01:44:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds
[01:50:09] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 649s
[01:51:39] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 19s
[01:51:57] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 13 seconds
[01:56:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.069 seconds
[02:20:27] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.046 seconds
[02:39:30] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[02:58:33] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[03:11:08] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[03:23:08] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:11] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[03:27:11] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[03:43:14] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:46:14] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[03:56:17] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[05:05:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[05:05:08] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[05:26:01] RECOVERY - Puppet freshness on locke is OK: puppet ran at Fri Sep 28 05:25:49 UTC 2012
[05:33:58] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:35:10] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[05:36:04] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[05:38:28] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out
[05:40:07] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:41:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[05:41:47] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123
[05:56:37] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:37] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:38] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:38] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:39] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:39] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:41] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:41] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:42] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:42] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:43] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[06:36:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.389 seconds
[06:49:43] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[06:57:13] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080
[07:02:46] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[07:11:46] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[07:11:55] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[07:11:55] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[07:12:22] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[07:12:31] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[07:12:31] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[07:12:31] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[07:12:40] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[07:13:07] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[07:13:07] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[07:13:07] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[07:13:16] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[07:17:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.779 seconds
[07:51:11] New review: Tim Starling; "One issue, but that was there before, this is no worse." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8438
[07:56:58] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:11:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.018 seconds
[08:46:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:00:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[09:32:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:47:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds
[10:06:19] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[10:20:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:32:15] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235
[10:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.885 seconds
[10:33:31] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236
[10:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235
[10:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236
[11:06:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.507 seconds
[11:32:18] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:55:15] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours
[12:09:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[12:21:21] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:24] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:52:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.814 seconds
[12:59:18] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours
[13:11:49] !log Moved BGP transit sessions to AS1257 from cr2-eqiad (over equinix exchange) to cr1-eqiad (dedicated link)
[13:12:01] Logged the message, Master
[13:12:21] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours
[13:24:21] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[13:28:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:28:24] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[13:28:24] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[13:40:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.303 seconds
[13:44:14] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[13:47:14] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[13:57:17] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
[14:13:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:21:38] hi all!
[14:21:49] mark, if you have a sec, could you help me with that netapp mount for analytics?
[14:25:31] what netapp mount for analytics?
[14:29:16] mark: #3619: Allow analytics cluster to mount fundraising archive...
[14:29:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[14:30:43] thanks Jeff, ja, that :)
[14:30:57] ok
[14:31:39] mark: while you're in there, I wonder if we should allow oxygen and emery to mount r/w as well as locke?
[14:33:47] ottomata: you can now NFS mount (ro) nas1001-a:/vol/fr_archive from analytics1003
[14:33:56] Jeff_Green: for failover you mean?
[14:34:09] danke!
[14:34:18] yeah, one less thing to do if locke blows up or overloads mid-fundraiser
[14:34:54] ok
[14:35:27] oh ho RobH *pounce*: I'm gonna reboot Aluminium, are you within an hour or two of the DC in off chance it doesn't come back?
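For context on the fundraising-archive exchange above (14:29-14:34): once the filer export is in place, the read-only NFS mount Jeff_Green describes is a one-liner on the client. A minimal sketch, using the export named in the log but an assumed /mnt/fr_archive mount point (the real mount point is not given here, and in production this would normally be managed through Puppet rather than typed by hand):

    # one-off read-only mount on analytics1003 (mount point is an assumed example)
    mkdir -p /mnt/fr_archive
    mount -t nfs -o ro,hard,intr nas1001-a:/vol/fr_archive /mnt/fr_archive

    # equivalent /etc/fstab entry for a persistent mount:
    # nas1001-a:/vol/fr_archive  /mnt/fr_archive  nfs  ro,hard,intr  0  0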
[14:35:46] Yep
[14:35:49] ok cool
[14:35:53] onsite already
[14:35:59] oh nice
[14:36:09] I was onsite yesterday and forgot to remove the topic update when I left =P
[14:36:26] emery already had access...
[14:36:43] robh we should just put a gps tracker on you :-P
[14:36:51] mark: ah good
[14:37:00] Jeff_Green: no thanks, I already have to fight to take vacation as it is.
[14:37:22] RobH: it would have to have an on-off switch for on the clock hours
[14:39:17] oxygen has been granted ro access to nas1001-a
[14:39:21] you can't write to that volume anyway
[14:39:24] !log rebooting aluminium
[14:39:33] oh is oxygen in eqiad?
[14:39:34] Logged the message, Master
[14:39:37] yes
[14:39:41] oic
[14:43:14] Jeff_Green: any element is in eqiad
[14:43:36] tampa misc names are encyclopedians
[14:43:46] esams just doesnt get misc servers ;]
[14:43:50] ah!
[14:44:04] i have a rough outline to throw on wikitech for naming conventions
[14:44:09] what happens when we run out of elements? will we add isotopes?
[14:44:18] plan to do later today so the misc servers can be named properly when im not about next week
[14:44:30] I dont see us having that many misc servers at eqiad
[14:44:39] i see
[14:44:39] but if we do, that sounds like a good plan to me ;]
[14:45:43] New patchset: Mark Bergsma; "Apply Puppet Varnish config to cp1029-1036" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25633
[14:46:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25633
[14:47:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25633
[14:49:51] jargh, Jeff_Green, ha, loading this data into hadoop is gonna be annoying
[14:49:58] because you have so many small files! :p
[14:50:09] cmjohnson1: thats so annoying then
[14:50:21] so 720 has the drac7 and the settings are the same as on the 320
[14:50:22] ottomata: ha.
[14:50:25] and not same results
[14:50:25] ?
[14:50:35] (bringing conversation in here since I was also chatting with mark)
[14:50:43] no, i was comparing the 2 last night b4 i left
[14:50:51] i didn't see any differences
[14:51:03] esams has ~ 8 misc servers
[14:51:09] what is one of the 720s?
[14:51:17] cmjohnson1: these are the two where one was puppet run and one wasnt right?
[14:51:27] im going to take over the non puppet run system if its still not used to compare settings.
[14:51:46] yes
[14:51:48] mark: what is our naming standard for misc in esams? (just curious)
[14:51:56] whatever crap names I come up with
[14:51:58] cmjohnson1: cool, what were they, db61 and 62?
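The "so many small files" complaint at 14:49 is a standard HDFS pain point: every file consumes NameNode metadata and tends to map to its own task. One common workaround, shown here only as a hedged illustration with made-up paths (the log does not say how the data was actually loaded), is to pack the small files into a Hadoop archive or merge them before re-uploading:

    # pack a directory of small files into a single .har archive (paths are illustrative)
    hadoop archive -archiveName fr_logs.har -p /user/otto fr_archive_raw /user/otto/archives

    # or concatenate the directory into one local file, to re-upload as a single large file
    hadoop fs -getmerge /user/otto/fr_archive_raw ./fr_archive_merged.log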
[14:52:10] yes, 62 is no puppet run
[14:52:11] for the most recent batch I used some famous dutch people
[14:52:11] RECOVERY - Puppet freshness on cp1029 is OK: puppet ran at Fri Sep 28 14:51:56 UTC 2012
[14:52:54] just check rack oe12 in racktables
[14:53:18] !log db62 being pulled for drac7 work
[14:53:28] Logged the message, RobH
[14:54:22] cmjohnson1: im going to make a wikitech page later today, but fyi
[14:54:22] http://support.dell.com/support/edocs/software/smdrac3/idrac7/index.htm
[14:54:27] drac7 manual
[14:54:37] cool ^ thx
[14:56:40] so after some reads and writes on ms-be6 with the new controller we have a disk failure apparently
[14:56:49] (reporting for anyone following along)
[14:57:09] megacli reports it as firmware failed, ls on the partition gives an i/o error
[15:01:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:03:01] er on the filesystem rather :-P
[15:04:04] cmjohnson1: On all the 320s, you need to disable virtualization tech on the cpu in bios
[15:04:19] basically you have to confirm thats off on all hosts all the time ;]
[15:04:22] (except labs)
[15:04:41] okay
[15:04:46] (i dunno if you already did, just mentioning it)
[15:05:19] i usually do...but haven't done anything with the 320's
[15:06:17] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[15:06:17] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[15:06:20] why does that need to be disabled?
[15:07:18] also PLEASE set up all new boxes WITHOUT serial console redirection after boot, thanks :)
[15:08:06] hahahaha
[15:08:23] mark: ryan pointed out awhile ago a few exploits that require that, and that we should just disable
[15:08:26] it was in an ops meeting
[15:08:37] and everyone thought it was a good idea (who actually replied in meeting)
[15:08:51] should we leave it on for some reason?
[15:08:58] cmjohnson1: so its a default change in bios on the 320s
[15:09:45] Serial Port Address: on the 320s defaults to Serial Device1=COM2,Serial Device2=COM1
[15:09:59] should be Serial Device1=COM1,Serial Device2=COM2
[15:10:06] which is what it defaulted to on everything else =P
[15:10:13] easy to overlook but when i had both up i noticed
[15:10:18] so they werent quite identical ;]
[15:10:27] now the 320 redirects fine.
[15:10:36] RobH: ok
[15:10:42] no it's fine, I was wondering
[15:10:44] I'll be updating a platform specific page later today once i have all these remotely accessible
[15:10:53] indeed, security is the only reason I could think of, was wondering if there were others
[15:11:03] mark: yea I honestly dont recall the full explanation, it was months ago though
[15:11:12] ok
[15:12:07] so all the eqiad based misc servers should be mgmt accessible later today. I will drop a network ticket with all the port info for labeling
[15:12:25] and then i want to spend my afternoon with the allocations that are pending so nothing is waiting on me when im gone next week
[15:14:05] are all memcached servers connected?
[15:14:09] I saw only 8 up a few days ago
[15:14:54] mark: up to 1014
[15:15:03] 1015/16 are not as i want 14's cable to confirm working
[15:15:11] i dont wanna open more cables so we can return them (the dell ones)
[15:15:24] notpeter tried to use the 1009+ yesterday and couldnt
[15:15:34] seems networking isnt up on them, link is on them once i reseated.
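For anyone reproducing the ms-be6 diagnosis at 14:56-14:57: the controller's view of each physical drive can be dumped with LSI's MegaCLI tool, which is what reports states such as "firmware failed". A minimal sketch; the binary is variously packaged as megacli, MegaCli or MegaCli64:

    # list every physical drive with its slot and firmware state (Online, Failed, Unconfigured(good), ...)
    megacli -PDList -aALL | egrep 'Slot Number|Firmware state|Inquiry Data'

    # summary of the logical drives built on top of those disks
    megacli -LDInfo -Lall -aALL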
[15:15:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[15:17:53] well we need that up before you're gone
[15:18:13] lemme check now
[15:18:47] Xcvr 8 NON-JNPR 174410000014 UNKNOWN
[15:18:47] Xcvr 9 NON-JNPR 174410000005 UNKNOWN
[15:18:47] Xcvr 10 NON-JNPR 174410000027 SFP+-10G-CU1M
[15:18:47] Xcvr 11 NON-JNPR 174410000030 SFP+-10G-CU1M
[15:18:47] Xcvr 12 NON-JNPR 174410000026 SFP+-10G-CU1M
[15:18:47] Xcvr 13 NON-JNPR MOC16262718 SFP-CX
[15:18:48] Xcvr 39 NON-JNPR 174540000036 SFP+-10G-CU1M
[15:18:56] so some seem to work, others not?
[15:19:36] 1014 doesn't work it looks like
[15:25:11] RECOVERY - Puppet freshness on cp1032 is OK: puppet ran at Fri Sep 28 15:24:51 UTC 2012
[15:28:20] mark: damn, thats the new dell cable
[15:28:29] =/
[15:28:36] so i will do a return for those later.
[15:28:44] what brand are they?
[15:29:29] i dont recall, will go look in a moment
[15:29:38] these were the dell item you linked me to, but we didnt know if it would work
[15:30:03] yes
[15:30:06] apparently they don't :(
[15:30:26] annoying
[15:30:32] probably juniper did this on purpose
[15:30:32] cisco.
[15:30:41] cisco cables? hm.
[15:46:02] RECOVERY - NTP on cp1032 is OK: NTP OK: Offset -0.04797828197 secs
[15:46:11] !log done with db62. it was in the ubuntu installer when I took it over, so its just sitting now
[15:46:21] Logged the message, RobH
[15:48:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:17] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:17] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:18] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:18] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:19] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:19] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:20] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:20] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:21] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:21] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:22] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:22] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours
[15:57:23] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours
[15:59:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.416 seconds
[16:07:13] PROBLEM - Varnish HTTP upload-frontend on cp1032 is CRITICAL: Connection refused
[16:25:22] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100%
[16:25:49] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[16:35:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:48:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.866 seconds
[16:50:34] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[16:57:28] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours
[16:57:47] and it's not going to either
[17:03:28] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours
[17:03:43] apergos: i am going to replace disk 7 and start the copy...is that still okay?
[17:03:50] uh
[17:03:55] can you wait about 5 mins
[17:04:01] sure
[17:04:03] (for some copies to complete
[17:04:10] I'll holler as soon as these ones are ready
[17:04:12] also updated ticket for dell w/your comments from this morning
[17:04:14] cool
[17:04:16] yeah I saw
[17:04:29] good news w/ the setup we have now...the failure led shows
[17:04:37] yay
[17:04:44] they've emailed me 2x and called once
[17:04:52] really? asking what?
[17:05:10] the status on whether or not we were seeing better results w/ the new card
[17:05:13] ah
[17:05:39] yeah I don't know. on the one hand a disk failure didn't cause catastrophic collapse but otoh we haven't tried to replace it yet either
[17:11:22] hmm maybe about 5 more mins, sorry
[17:16:02] apergos: no problem..let me know I am updating the misc servers here
[17:16:07] ok
[17:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:26:36] cmjohnson1: all yours
[17:26:47] I'll be back later, I'll check in and see what happened
[17:29:19] ok
[17:36:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds
[17:38:29] PROBLEM - Varnish HTTP upload-frontend on cp1030 is CRITICAL: Connection refused
[17:38:56] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: Connection refused by host
[17:39:14] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host
[17:39:41] PROBLEM - Varnish HTTP upload-backend on cp1030 is CRITICAL: Connection refused
[17:41:11] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:41:11] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[17:41:11] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[17:41:20] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[17:41:20] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[17:41:20] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[17:41:47] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:41:47] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[17:41:47] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[17:42:05] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[17:42:05] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[17:42:05] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[17:49:35] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:57:50] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[18:04:01] !log stopping mysql on db1047 for upgrades
[18:04:11] Logged the message, Master
[18:09:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:11:10] <^demon> binasher: Are you doing upgrades on other dbs in that range?
[18:11:17] * ^demon wonders because db1048 is important to him
[18:11:22] nope
[18:11:31] <^demon> Okie dokie, carry on.
[18:11:31] db1047 is for enwiki analytics
[18:12:36] they're getting a shiny new storage array! and are going to be our mariadb guinea pig
[18:20:36] New review: Nemo bis; "Bug closed, change to be abandoned." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/25599
[18:21:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.908 seconds
[18:30:32] RECOVERY - Host ms-be6 is UP: PING WARNING - Packet loss = 58%, RTA = 43.78 ms
[18:30:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876
[18:31:08] New review: Reedy; "Nice." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23935
[18:31:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24326
[18:32:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24561
[18:32:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25493
[18:33:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23076
[18:33:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25196
[18:33:55] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25231
[18:33:59] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: Connection refused by host
[18:34:08] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host
[18:34:35] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:34:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24671
[18:34:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24672
[18:34:53] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:34:53] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: Connection refused by host
[18:35:02] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: Connection refused by host
[18:35:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16935
[18:35:11] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused
[18:35:11] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: Connection refused by host
[18:35:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/12556
[18:39:05] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:39:09] New review: Reedy; "Needs rebasing! :(" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23059
[18:39:25] New review: Reedy; "Needs rebasing! :(" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23935
[18:45:03] Change abandoned: Dereckson; "The logo has been changed on the wiki by a CSS modification. This configuration change is so unneces..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25599
[18:49:37] Reedy: I suppose such a big diff is bound to be a nightmare to rebase?
[18:50:23] aka perhaps one change should wait for the other to be merged
[18:51:50] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:51:59] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[18:52:08] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[18:52:08] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:52:11] New patchset: Reedy; "(bug 29902) Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059
[18:52:17] ^ that was pretty easy to rebase
[18:52:26] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[18:52:26] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[18:52:26] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[18:52:26] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[18:52:26] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[18:52:35] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[18:52:35] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[18:52:35] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[18:52:35] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[18:52:39] Reedy: does this mean that you'll do the other one as well? :)
[18:52:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059
[18:52:55] I'm gonna have a looook
[18:52:59] oki
[18:53:02] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:54:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18125
[18:55:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:56:05] so if anyone is following along w/ms-be6 issues...replaced disk 7 had to re-do the raid config and have several disks not mounting http://p.defau.lt/?KtPq9kPsWBz7uwIRCM61HA
[18:59:46] New patchset: Reedy; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[19:00:01] New patchset: Catrope; "Fix typo in variable name" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25694
[19:00:16] Nemo_bis: I think there's a few new extra additions that need tidying up..
[19:00:31] Reedy: ok, can you merge this in the meanwhile?
[19:00:34] I'll do another commit
[19:00:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25694
[19:01:04] Reedy: also, does it require interwiki update or not?
[19:01:26] What do you mean?
[19:02:26] Reedy: I mean that on some of those wikis now e.g. Wikipedia: links to a local page while previously it was an interwiki
[19:02:41] does this work automatically or does the interwiki cache need to be updated?
[19:03:02] No, it doesn't need updating
[19:03:05] they all use the same cache
[19:03:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[19:03:35] good
[19:06:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.745 seconds
[19:07:44] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Fri Sep 28 19:07:15 UTC 2012
[19:21:48] New patchset: Dereckson; "Removing proteins@msu.edu rate limiter exemption rule." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25696
[19:34:22] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:22] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[19:35:52] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100%
[19:36:01] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[19:41:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:51:12] !log Destroyed snapmirror relationship between nas1001-a:images -> nas1-a:images and deleted related snapshots
[19:51:26] Logged the message, Master
[19:54:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.031 seconds
[19:55:54] !log Destroyed volume nas1-a:images
[19:56:04] Logged the message, Master
[20:02:15] !log Destroyed test0 aggregate on nas1-a, zeroing disks
[20:02:25] Logged the message, Master
[20:04:23] !log Destroyed nas1001-a:images volume, containing aggregate test0, and started zeroing drives
[20:04:33] Logged the message, Master
[20:05:43] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:07:22] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[20:27:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:43:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[20:44:27] New patchset: Ottomata; "Installing udp-filter on analytics machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25703
[20:45:23] New patchset: Reedy; "Revert "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25704
[20:45:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25704
[20:45:24] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25703
[20:47:40] New review: Catrope; "This was reverted because it broke the sidebar on ptwiki. Specifically, Wikipedia: was no longer an ..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23935
[20:57:55] New patchset: Jgreen; "add root@indium to fundraising archive user backupmover's auth keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25705
[20:58:49] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25705
[21:06:30] New patchset: preilly; "add Dialog Sri Lanka configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25520
[21:07:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25520
[21:10:54] notpeter: you around?
[21:11:13] paravoid: you around?
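The swift PROBLEM/RECOVERY lines scattered through this log ("PROCS CRITICAL: 0 processes with regex args ...") are process-count checks on the swift daemons. The exact plugin invocation is not shown in the log, but output in that format is what the stock Nagios check_procs plugin produces when matching on a daemon's argument list; per daemon, the check would look roughly like:

    # hypothetical check for one swift daemon: critical if fewer than 1 matching process
    /usr/lib/nagios/plugins/check_procs -c 1: \
        --ereg-argument-array='^/usr/bin/python /usr/bin/swift-object-replicator'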
[21:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:34] no cmjohnson but here's the news from ms-be6: I now see (I guess after his disk replacement)
[21:15:42] Firmware state: Unconfigured(good), Spun Up
[21:15:49] for four drives over there (MegaCli output)
[21:28:03] !log disabled puppet and stopped swift processes on ms-be6 again. noticed four drives in "Unconfigured" state in megacli output after disk replacement, don't know more details about how that went.
[21:28:09] ah woops there he is
[21:28:14] Logged the message, Master
[21:29:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds
[21:30:07] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[21:30:16] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[21:30:16] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[21:30:34] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[21:30:43] PROBLEM - swift-object-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[21:30:43] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[21:30:43] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[21:31:01] PROBLEM - swift-container-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[21:31:01] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[21:31:01] PROBLEM - swift-account-server on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[21:31:19] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[21:31:28] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[21:31:30] preilly: still need something?
[21:31:55] notpeter yeah can you merge https://gerrit.wikimedia.org/r/#/c/25708/
[21:33:25] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:53:58] preilly: can you rebase, plox
[21:54:48] notpeter: no need
[21:54:55] notpeter: just merge this too https://gerrit.wikimedia.org/r/#/c/25520/2
[21:55:06] kk
[21:55:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25708
[21:55:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25520
[21:56:22] PROBLEM - Puppet freshness on mw22 is CRITICAL: Puppet has not run in the last 10 hours
[21:56:25] notpeter: thanks!
[21:56:57] yep. sorry for being slow. at conference
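"Firmware state: Unconfigured(good), Spun Up" (21:15 above) means the controller sees healthy drives that are not part of any logical drive, which is why their filesystems could not mount. On these MegaRAID controllers the usual remedy is to re-add each such drive as a single-disk RAID-0 logical drive and then repartition and format it; a hedged sketch with placeholder enclosure:slot values (the real values come from the PDList output):

    # find enclosure and slot numbers of the unconfigured drives
    megacli -PDList -aALL | egrep 'Enclosure Device ID|Slot Number|Firmware state'

    # re-add one drive as a single-disk RAID0 logical drive (32:7 is a placeholder)
    megacli -CfgLdAdd -r0 '[32:7]' -a0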
[22:02:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:14:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.082 seconds
[22:22:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:52] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours
[22:48:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:55:40] RECOVERY - Puppet freshness on cp1030 is OK: puppet ran at Fri Sep 28 22:55:21 UTC 2012
[23:01:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.567 seconds
[23:02:52] RECOVERY - Varnish HTTP upload-backend on cp1030 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.053 seconds
[23:03:01] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa
[23:03:55] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker
[23:21:28] RECOVERY - NTP on cp1030 is OK: NTP OK: Offset -0.04479074478 secs
[23:25:49] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[23:29:52] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours
[23:29:52] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:51] New patchset: Asher; "remove db1047 from mysql::packages for mariadb testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25717
[23:36:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25717
[23:39:15] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25717
[23:44:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[23:46:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.772 seconds
[23:47:52] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours
[23:49:43] New patchset: Asher; "also exempt db1047 from mysql::conf while testing mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25718
[23:50:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25718
[23:51:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25718
[23:54:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25696
[23:54:50] !log rebooting db1047 to new kernel
[23:55:01] Logged the message, Master
[23:58:43] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours
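A closing note on the recurring "Puppet freshness ... Puppet has not run in the last 10 hours" alerts throughout this log: they come from Nagios freshness checking of puppet runs, and the exact production check is not shown here. As a rough local approximation only, the age of the agent's last applied catalog can be read from its state file (path as on Puppet 2.7-era Ubuntu hosts):

    # when did the puppet agent last apply a catalog on this host?
    ls -l /var/lib/puppet/state/last_run_summary.yaml

    # flag the host if that file is older than 10 hours (36000 seconds)
    age=$(( $(date +%s) - $(stat -c %Y /var/lib/puppet/state/last_run_summary.yaml) ))
    [ "$age" -gt 36000 ] && echo "CRITICAL: Puppet has not run in the last 10 hours"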