[00:09:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:10:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.962 seconds
[00:40:44] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[00:45:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:54:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds
[01:27:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:31:38] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours
[01:42:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 295 seconds
[01:42:35] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 288 seconds
[01:42:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[01:43:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds
[01:45:53] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 13 seconds
[02:14:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:30:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[02:44:32] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[02:44:32] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[02:44:32] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[02:50:32] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:28:36] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Sat Sep 8 03:28:20 UTC 2012
[03:29:39] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Sat Sep 8 03:29:15 UTC 2012
[04:16:12] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[04:16:12] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[04:56:58] New patchset: Catrope; "(bug 39432) Enable Narayam on ka.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23085
[04:56:58] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23085
[04:59:52] lolwut
[05:00:01] I didn't do that, Gerrit
[05:00:04] I just pushed that change in
[05:00:29] Oh, I suppose it's got some polling cleverness there
[05:07:54] :o
[05:08:26] Catrope: Change has been successfully pushed.
seems like the magic polling that sometimes causes a spam of updates
[05:11:37] Yeah I guess so
[05:12:01] That commit was dangling on manganese, I managed to pull it down to my local machine and pushed it from there
[05:20:15] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[07:21:24] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[07:25:54] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 198 seconds
[07:26:12] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 207 seconds
[07:29:03] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds
[07:29:21] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[07:43:27] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 219 seconds
[07:44:03] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 243 seconds
[07:54:02] !log restarted Jenkins to fix up a plugin
[07:54:04] poor bot
[07:54:13] Logged the message, Master
[08:05:30] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds
[08:06:42] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[08:13:54] New patchset: Hashar; "jenkins test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23163
[08:22:33] New patchset: Hashar; "test: fix 'do not remove lines in Hostsbyname'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23163
[08:23:21] New patchset: Hashar; "test: fix 'do not remove lines in Hostsbyname'" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23163
[08:24:14] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23163
[08:46:21] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:21] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[08:46:21] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:22] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:23] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:23] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[08:46:24] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[10:26:39] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:28:09] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[10:41:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:32:26] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours
[12:45:27] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[12:45:27] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[12:45:27] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
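The remark at 05:12 above describes rescuing a commit that was dangling (unreachable from any ref) on manganese by pulling it to a local clone and pushing it back. A minimal sketch of that kind of recovery follows; the repository path, the rescue branch name, and the assumption of shell access to the Gerrit host are all hypothetical, and pushing straight to master only works with direct-push rights on a Gerrit-managed repo (pushing to refs/for/master for review is the safer route):

    # On the Gerrit host (path is hypothetical): find the dangling commit
    cd /srv/gerrit/git/operations/mediawiki-config.git
    git fsck | grep 'dangling commit'
    # Point a temporary ref at it so a remote clone can fetch it
    git branch rescue/narayam-ka <commit-sha>

    # On a local clone: fetch the rescue ref and push it back where it belongs
    git fetch origin rescue/narayam-ka
    git push origin rescue/narayam-ka:master   # or rescue/narayam-ka:refs/for/master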
[12:51:27] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[13:10:30] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.14:11000 (Connection timed out)
[13:11:42] PROBLEM - SSH on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:00] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:16:21] RECOVERY - SSH on srv264 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[13:17:24] PROBLEM - Memcached on srv264 is CRITICAL: Connection refused
[13:18:18] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.136 second response time
[13:20:06] PROBLEM - Backend Squid HTTP on knsq24 is CRITICAL: Connection refused
[13:24:54] RECOVERY - Memcached on srv264 is OK: TCP OK - 0.008 second response time on port 11000
[13:25:57] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[13:39:57] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.19:11000 (timeout)
[13:41:26] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[14:17:35] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[14:17:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[14:32:17] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:02] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.8:11000 (Connection timed out)
[14:33:38] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.029 second response time
[14:34:32] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[15:20:54] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours
[15:36:34] apergos: going to power down ms-be6 now
[15:36:40] hello
[15:36:43] ok
[15:36:52] thx for getting online on a saturday!
[15:36:55] sure
[15:37:09] this should not need a reinstall
[15:37:19] so we'll just see if it comes up well afterwards
[15:37:19] let's hope
[15:37:24] ok
[15:40:06] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:03] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:51:12] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:52:24] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time
[15:52:42] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time
[15:52:43] apergos: controller is in
[15:52:51] ok great
[15:53:00] i am going to power on now... if you wanna console in
[15:53:04] er backplane right?
[15:53:06] ok just a sec
[15:53:09] yep
[15:53:15] sorry... controller on the brain
[15:54:22] I'm on it
[15:55:00] ok... it's on
[15:55:28] yep, I'm sitting here watching it
[15:55:40] yeah well
[15:55:54] i see... some red flashes
[15:56:21] well I am skipping the things it can't mount again
[15:56:50] I'll log on as soon as it comes up and see if we have the usual driver errors but I expect we do
[15:56:54] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[15:57:19] i have extra drives here... do you wanna swap them out and see if the problems persist
[15:57:25] no
[15:57:28] it's 4 disks
[15:58:12] we know that we don't have 4 disks on this, 5 or 6 on the next, and another 5 or 6 on the third box all go bad
[16:00:46] same old same old, looking at the mpt2sas errors
[16:01:05] sd 7:0:9:0: [sdj] Unhandled error code
[16:01:12] mpt2sas0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
[16:01:21] sd 7:0:9:0: [sdj] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[16:01:27] sd 7:0:9:0: [sdj] CDB: Write(10): 2a 00 3a 47 4e 48 00 00 08 00
[16:01:33] end_request: I/O error, dev sdj, sector 977751624
[16:01:36] so that's about it
[16:01:46] well due diligence
[16:01:54] yep
[16:02:00] needed to be done for sure
[16:02:13] absolutely
[16:02:38] well meh wanna log it on the ticket? I guess there's not much else for you to do at this point
[16:02:58] I sent dell a pile of logs (as you'll see from the ticket) and they probably won't get back to us again til monday
[16:03:10] can you update the ticket plz
[16:03:14] ah
[16:03:15] sure
[16:04:04] thx for getting on... enjoy the rest of your day
[16:04:29] see you tomorrow or Monday
[16:04:30] you too, thanks for going in to the dc
[16:04:32] yep
[16:05:46] crap forgot to cc everybody and his brother on the comment
[16:05:48] grr
[16:06:59] there are quite a few ppl to cc
[16:07:23] does mediawiki-config work for you?
[16:08:38] forwarded it separately
[16:09:31] cool
[16:10:17] apergos, can you pull mediawiki-config repo?
[16:10:30] New patchset: Platonides; "Point glwiki logo back to Galipedia. They want to use a commemorative logo for passing 90k articles." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23175
[16:12:26] yes
[16:12:31] I just did so and it seemed fine
[16:12:46] fatal: git fetch_pack: expected ACK/NAK, got 'ERR want 52e7dd5c76cd4c6e673a9aa6cc7cadee9ad77fb7 not valid'
[16:12:52] fatal: internal server error
[16:14:02] 81298b380762bd286081a9cb3d5a73903a33a83b that's the latest commit I see via git log on my nw local copy
[16:14:11] Sat Sep 8 08:24:14 is the timestamp
[16:14:20] so seems to be getting everything
[16:14:40] I'm at 4d12ee39dcbc32ab32dc0bf514af1ccdc2a53e3b 2012-09-07 17:57:04
[16:15:07] I do seem to have a 52e7dd5c76cd4c6e673a9aa6cc7cadee9ad77fb7 revision
[16:15:24] but doesn't seem merged
[16:17:31] a local commit maybe?
[16:18:09] no, it's one from hashar
[16:18:18] hmm... I seem to have a corrupted local copy
[16:18:19] well it's not in git log for me so I don't have it
[16:19:13] maybe not, I had been doing some tests...
[16:20:05] it's patchset 7 from here: https://gerrit.wikimedia.org/r/#/c/12185/
[16:20:49] how do you have an intermediate patchset in there?
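The mpt2sas/sdj kernel messages quoted at 16:01 above are the "usual driver errors" being checked for on ms-be6; DID_NO_CONNECT generally means the kernel could not reach the drive at all rather than a media error. A minimal triage sketch for that situation, assuming standard tools (dmesg, lsblk, and smartmontools) are on the box; the device name follows the log:

    # Pull recent controller and sdj errors out of the kernel log
    dmesg | grep -E 'mpt2sas|sdj' | tail -50
    # Check whether the kernel still sees the drive at all
    lsblk | grep sdj
    # If it is still present, query SMART health and the drive's own error log
    smartctl -H -l error /dev/sdj

The gerrit/mediawiki-config discussion below (from 16:07) is a separate thread and continues after this point.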
[16:21:19] actually that seems to be a change I sent
[16:21:32] (author Hashar, committer Platonides)
[16:21:36] I see
[16:21:41] but I have fetch = +refs/changes/*:refs/changes/*
[16:22:22] yep, it's git fetch origin +refs/changes/*:refs/changes/* that fails
[16:22:41] yeah I didn't add that, it's just the usual fetch for me
[16:23:34] I'm guessing gerrit was upgraded
[16:23:41] I think it was demon looking at the repo issues earlier, you might want to bring it up
[16:23:44] and it's now rejecting that commit
[16:24:01] despite that it was accepted before
[16:24:02] he specifically mentioned there might be a few lingering problems with this repo
[16:24:29] oh huh he didn't mail this to the world so I'm gonna quote a little of it in here
[16:24:59] everything's back up except for mysqlatfacebook, which is very very broken. mediawiki-config may end up having a few remaining errors (it's got some detached objects), but should mostly work
[16:25:21] that's likely the source of your errors
[16:30:09] I opened him a bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=40107
[16:31:21] great
[16:53:20] * jeremyb has the same problem as Platonides
[17:22:48] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[17:25:48] PROBLEM - SSH on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:25:57] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (Connection timed out)
[17:27:18] RECOVERY - SSH on srv273 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[17:30:45] PROBLEM - Memcached on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:31:03] PROBLEM - Apache HTTP on srv279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:32:15] RECOVERY - Memcached on srv279 is OK: TCP OK - 0.002 second response time on port 11000
[17:32:24] RECOVERY - Apache HTTP on srv279 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.327 second response time
[17:35:25] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:36:45] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.334 second response time
[17:38:24] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[18:47:24] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:24] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:24] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:24] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[18:47:25] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:25] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:26] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:47:26] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:07:41] New review: Dereckson; "Shellpolicy -> Shell" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/22999
[19:31:26] New patchset: Dereckson; "(bug 39206) Namespaces configuration for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222
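The "ERR want ... not valid" failure discussed between 16:12 and 16:53 (and filed as bug 40107) happens while fetching Gerrit's per-patchset refs with the refspec quoted at 16:21. A minimal sketch of how one might narrow such a failure down on the client side; the refspec and the commit/change numbers are the ones quoted in the log, everything else is standard git:

    # Mirror every Gerrit patchset locally (the refspec quoted at 16:21)
    git fetch origin '+refs/changes/*:refs/changes/*'
    # Check whether the commit the server complains about already exists locally
    git cat-file -e 52e7dd5c76cd4c6e673a9aa6cc7cadee9ad77fb7 && echo 'object present locally'
    # See whether any branch actually reaches it
    git branch -a --contains 52e7dd5c76cd4c6e673a9aa6cc7cadee9ad77fb7
    # Fetch just the single patchset mentioned at 16:20 (change 12185, patchset 7)
    git fetch origin refs/changes/85/12185/7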
[19:37:38] New patchset: Dereckson; "(bug 39206) Namespaces configuration for se.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23222
[19:51:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:54:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.685 seconds
[20:27:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:40:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds
[20:42:48] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:12:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds
[21:33:45] PROBLEM - Puppet freshness on oxygen is CRITICAL: Puppet has not run in the last 10 hours
[22:00:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:12:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.587 seconds
[22:46:02] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[22:46:02] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[22:46:02] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[22:46:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:52:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[22:59:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.897 seconds
[23:33:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:38] mutante: mailman 3.0b2 was just released
[23:47:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.949 seconds
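The "Puppetmaster HTTPS on stafford" alerts that flap throughout the day are a plain HTTP(S) probe: it goes CRITICAL on a 10-second socket timeout and counts a 400 Bad Request as a recovery, presumably because an unauthenticated request to the puppetmaster gets rejected with a 400 but still proves the service is answering. A minimal sketch of an equivalent manual probe follows; the fully qualified hostname, the port (8140 is puppet's usual master port), and the plugin path are assumptions, not taken from the log:

    # Roughly what the Nagios check does: any HTTP status line matching the
    # expected string within 10 seconds passes
    /usr/lib/nagios/plugins/check_http -H stafford.pmtpa.wmnet -p 8140 -S -t 10 -e 'HTTP/1.1 400'
    # Or by hand with curl; -k because no puppet client certificate is presented
    curl -sk -o /dev/null -w '%{http_code}\n' https://stafford.pmtpa.wmnet:8140/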