[00:03:10] AaronSchulz: the db connection problems look like they started yesterday: Wed Aug 29 15:49:09 UTC 2012, and really started in earnest Wed Aug 29 19:51:30 UTC 2012 [00:05:20] are you doing cat dberror.log | cut -b -16 | uniq --count or something? [00:06:26] I forgot about the --count switch, but that's pretty close to it [00:07:06] zgrep 'Error connecting to 10.0.6.73' dberror.log-20120830.gz | cut -b1-28 | uniq --count [00:12:01] New review: RobLa; "Verified web page looks kosher and imported Chris's key from the page." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/22158 [00:12:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22158 [00:13:33] https://secure.wikimedia.org/keys.html [00:14:21] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [00:15:40] sure lots of job runner OOMs [00:21:01] win 2 [00:24:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [00:24:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:24:43] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:44] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [00:24:44] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [00:24:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [00:24:46] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [00:24:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:25:32] TimStarling: I wonder how much memory getBacklinkCache() on a bunch of titles takes up when running jobs [00:26:30] though the title object (and hence the cache member) should be freed between jobs, meh [00:26:39] there's a cache of title objects [00:26:48] const CACHE_MAX = 1000; [00:27:17] maybe we could clear it after each job [00:27:20] yeah, other refs [00:27:22] Title::clearCache(); [00:27:38] doesn't exist, just saying we could make it [00:28:49] makes sense [00:41:50] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Fri Aug 31 00:41:18 UTC 2012 [00:42:17] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Fri Aug 31 00:42:04 UTC 2012 [00:45:17] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Fri Aug 31 00:45:15 UTC 2012 [00:59:50] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Fri Aug 31 00:59:27 UTC 2012 [01:30:26] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:41:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 263 seconds [01:42:08] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 271 seconds [01:48:44] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - 
Seconds_Behind_Master : 668s [01:57:18] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [01:58:57] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [01:59:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:59:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [02:01:22] what the hell [02:05:42] mutante: around? [02:10:12] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:20:24] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [02:26:06] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [02:32:48] paravoid: re [02:33:34] I have bigger problems but would you perhaps know why puppet_freshness/SNMP wouldn't work? [02:33:47] you're one of the people that might know :) [02:33:54] on prod nagios? [02:34:12] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:29] yes [02:34:30] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:30] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:33] paravoid: i guess because spence has been rebooted and the deamon isnt running, hold on [02:34:39] PROBLEM - swift-object-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:39] PROBLEM - swift-container-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - SSH on ms-be8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:06] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - swift-account-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:24] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:29] 4 swift boxes down in tampa [02:35:30] out of 9 [02:35:33] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:39] ugh :/ [02:38:33] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [02:44:49] paravoid: so, freshness monitoring is via snmp, the daemons are running and i see snmp traps incoming in tcpdump [02:45:08] paravoid: just not working for the swift boxes? [02:45:50] seems so. not important right now... 
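A minimal sketch of the dberror.log bucketing pipeline quoted above, assuming log lines start with a fixed-width timestamp like "Wed Aug 29 15:49:09 UTC 2012"; the grep pattern and filename are taken from the channel, while the exact byte range for cut is an assumption that keeps day, hour and minute:

    # Count connection errors per timestamp prefix; the log is chronological,
    # so uniq -c (which only collapses adjacent duplicates) is sufficient.
    zgrep 'Error connecting to 10.0.6.73' dberror.log-20120830.gz \
        | cut -b1-16 \
        | uniq -c \
        | sort -rn | head    # worst minutes first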
[02:46:01] sigh [02:49:39] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [02:49:39] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:49:39] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [02:49:48] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:49:48] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:49:58] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [02:49:58] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [02:50:06] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [02:50:15] RECOVERY - SSH on ms-be8 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:50:33] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [02:50:33] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [02:50:33] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [02:50:51] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:50:51] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [02:59:36] !log swift: removing ms-be8 sdc1, sdg1, sdk1, sdl1, sdn1 from ring; broken hardware (again) [02:59:48] Logged the message, Master [03:04:49] * Jeff_Green is sorta not totally failing with pbuilder . . . finally [03:05:22] 'bug' turned out to be bad command syntax in the manifest I cherrypicked from production puppet [03:06:09] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:10:06] grumble grumble grumble [03:15:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (26323) [03:15:58] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Fri Aug 31 03:15:47 UTC 2012 [03:17:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (26483) [03:20:01] paravoid: i looked through older mail when i talked to Ben about ms-be5 going down in March. I just forwarded one to you, it contains a Bash script i once wrote to add devices to swift rings.. 
fwiw [03:21:43] eh, pasted here: http://wikitech.wikimedia.org/view/User_talk:Dzahn [03:30:31] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [03:30:31] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [03:31:43] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [03:32:01] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [03:32:01] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [03:35:46] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [03:43:48] another disk broken [03:43:50] seriously, WTH [03:45:22] mutante: thanks [03:47:10] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:48:31] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [03:49:25] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:49:34] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [03:53:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:56:08] paravoid: http://lists.us.dell.com/pipermail/linux-poweredge/2012-May/046375.html [04:02:32] added to the new ticket.. [04:02:41] ms-be8 is going down for power off NOW [04:03:37] I know [04:03:38] that's me [04:03:53] even without the i/o errored disk umounted, it's broken [04:04:03] multiple processes in D [04:04:15] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [04:05:09] oh man.. 
just quoting "starting to wonder whether [04:05:11] this model of machine is a bit of a lemon" [04:06:30] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [04:07:33] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [04:08:00] ACKNOWLEDGEMENT - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn Dell C2100 [04:11:18] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [04:12:45] hahahaha fucking crazy machine [04:13:02] I disabled those disks in fstab, rebooted it and now it produces i/o errors for other disks [04:13:21] yeah, it's randomized on every reboot [04:13:51] thats why we kept thinking it must be the controller [04:14:07] after several disks had been replaced/checked by Chris [04:14:14] no it wasn't randomized, it kept being the same disks [04:14:18] until I disabled these [04:14:22] ooh [04:14:27] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:14:27] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [04:14:45] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:14:45] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:15:03] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [04:15:30] PROBLEM - swift-object-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [04:15:39] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:15:57] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:15:57] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:16:06] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:16:51] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [04:18:48] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [04:21:57] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:22:15] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:22:24] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:22:24] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:22:24] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:22:33] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [04:22:33] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [04:22:33] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [04:22:42] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [04:23:09] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:23:09] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:23:09] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [04:24:17] on ms-be1012 it was .."Spinning up disk .. not responding" . and "Add. Sense: Logical unit not ready, cause not reportable" [04:24:23] bbl [04:24:30] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [04:24:30] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:24:30] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [04:26:18] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [04:27:39] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:27:39] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:27:57] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:28:15] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:28:24] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:28:24] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:28:24] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:28:33] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-object-server on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: Connection refused by host [04:28:51] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
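As an aside on the "multiple processes in D" observation during the ms-be8 debugging above, a generic way to spot processes stuck in uninterruptible I/O sleep and the kernel errors behind them (a sketch, not tied to any particular host):

    # Processes in state D are blocked inside the kernel, usually on I/O to a bad disk.
    ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
    # Recent kernel I/O errors usually name the offending device.
    dmesg | grep -i 'i/o error' | tail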
[04:30:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [04:31:33] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [04:35:36] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:35:45] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [04:40:07] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:47:18] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [05:05:11] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:26] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (38499) [05:44:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (37921) [05:57:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:57:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:08:28] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:09:22] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [06:12:40] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:46:06] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [06:52:33] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:54:30] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [06:56:18] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [06:58:19] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [06:59:49] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:03:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [07:22:37] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:22:55] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [07:30:16] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:30:52] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [07:35:13] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:39:48] New patchset: Petrb; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [08:09:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [08:22:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [08:33:03] PROBLEM - NTP on mw8 is CRITICAL: NTP CRITICAL: Offset unknown [08:36:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [08:37:33] RECOVERY - NTP on mw8 is OK: NTP OK: Offset 0.03597307205 secs [09:21:49] New patchset: Petrb; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [09:43:02] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [09:45:44] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (25016) [09:46:29] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 
9,999 jobs: , zhwiki (23878) [10:01:20] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [10:02:05] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [10:11:41] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [10:19:20] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [10:25:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [10:25:48] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [10:25:48] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:25:49] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [10:25:49] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [10:42:31] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (46492) [10:43:16] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (46678) [11:16:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [11:31:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:05:25] New patchset: Hashar; "disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [12:05:42] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [12:08:15] New patchset: Hashar; "$wgDBerrorLogInUTC -> $wgDBerrorLogTZ" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15634 [12:08:27] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15634 [12:08:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15634 [12:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.956 seconds [12:27:09] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [12:31:48] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [12:34:12] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:39] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [12:37:57] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:02:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:58] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [13:07:48] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [13:08:49] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:09:34] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [13:11:51] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [13:12:41] New review: Hashar; "PS6: run git pull in extensions directory" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [13:12:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [13:15:16] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [13:15:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [13:22:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:43:55] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:47:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:11] hey paravoid, would you maybe have some time to look at https://rt.wikimedia.org/Ticket/Display.html?id=2970 (Redirect all .mobile requests to .m)? or are you busy with swift? [14:01:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [14:08:44] New patchset: Ottomata; "Access to stat1 for Ori and S Page - RT 3451" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22196 [14:09:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22196 [14:20:14] drdee__: I'm generally busy, but I'll have a look [14:20:56] ty paravoid [14:23:06] New patchset: Ottomata; "Access on stat1 for giovanni and halfak - RT 3460" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22199 [14:23:23] apergos: all ok with the rebalance? [14:23:36] apergos: the puppet taking forever is a sign of broken controller... [14:23:40] except for thw two hosts that will be running forever [14:23:44] yes, it's why I ran em last [14:23:50] I figured they might never complete [14:23:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22199 [14:23:58] (also why I ran em in screen :-P) [14:23:59] we should copy the object.* manually [14:24:43] and we should remove them completely from the rings [14:24:47] are they still in there? [14:24:49] * paravoid checks [14:25:22] well neither of those was even listed in the (live) ring by the time I looked at it [14:25:33] ms-be7 is [14:25:35] so I expect that they will be dropped off it again if they were back in [14:25:42] 210 10 10.0.6.206 6000 sdc1 31.00 844 0.04 [14:25:43] oh. sorry, be6 and be10 aren't. 
my bad [14:25:45] 211 10 10.0.6.206 6000 sdd1 31.00 844 0.04 [14:25:46] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [14:25:46] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [14:25:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:25:49] 213 10 10.0.6.206 6000 sdf1 31.00 844 0.04 [14:25:52] 214 10 10.0.6.206 6000 sdg1 0.00 0 0.00 [14:25:54] 7 is the one that's in there [14:25:55] 219 10 10.0.6.206 6000 sdl1 31.00 844 0.04 [14:25:58] yeah, those two are not [14:26:00] for its four measly disks [14:26:08] yeah [14:26:31] it seems to me that when we have boxes dying that way, with half of their disks with I/O errors [14:26:34] we should remove them completely [14:26:43] since the rest don't have I/O errors but don't work properly either [14:26:53] anyways, except for ms-be11 which I didn't try to adjust (sdi which now shows up as something else but it's still listed in the ring) [14:27:07] there's nothing of note, and there were (when I looked) no *new* broken items [14:27:20] oh, ms-be11 sdi, argh [14:27:22] we should remove that too [14:27:30] well it can get removed next round [14:28:14] I certainly have no objections with pulling ms-be7 out [14:30:02] ok, I removed ms-be7 and ms-be11's sdi [14:30:07] haven't rebalanced yet [14:30:22] when did you rebalance/push? [14:30:32] around 13:30 [14:30:42] for everything but the two broken ones [14:31:10] okay [14:31:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:31:55] I hunted the linux-scsi list for stuff on the mpt2sas driver in the meantime [14:32:01] came up negative [14:32:24] I mean there were a lot of things (though bear in mind we are running a really recent version on one host and a very old version on another) [14:32:40] but none of them, after I looked at em, seemed to be the guilty party [14:32:58] in the meantime there's also this: [14:32:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:59] http://linux.slashdot.org/story/11/02/03/0136252/dell-releases-ubuntu-powered-cloud-servers [14:33:12] which, ok it's slashdot, who knows, but a few of the comments are quite interesting [14:33:23] stop pinging me! >:( [14:34:13] you should set your client to only ping if the line starts with the name [14:34:16] that's what mark does [14:34:31] apergos and paravoid: so Dell thinks it could be the controller firmware... not sure I am buying it but I will be updating ms-be6 today to see if it works [14:34:34] especially for a name that's a very common word... [14:34:47] We purchased 16 C2100s in August. If you like being a Dell beta tester, have at it. The LSI RAID controllers they have in these things are, for a lack of a better word, complete crap. Technically, it's probably the drivers ... but until they have a working driver for linux that doesn't lose its mind and reset the card randomly (thus making your volumes disappear for a minute or two), I suggest staying away. Far away.
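For context on the "removed ms-be7 and ms-be11's sdi / haven't rebalanced yet" exchange above, this is roughly the shape of the ring edit being described, assuming the standard account/container/object builder files and the usual <ip>/<device> search-value form; the 10.0.6.206/sdc1 value is only illustrative, taken from the device listing pasted above:

    # Drop a failed device from each ring and rebalance; the regenerated
    # *.ring.gz files still have to be pushed out to the storage and proxy
    # nodes afterwards (the "rebalance/push" step mentioned above).
    for ring in account container object; do
        swift-ring-builder ${ring}.builder remove 10.0.6.206/sdc1
        swift-ring-builder ${ring}.builder rebalance
    done
    swift-ring-builder object.builder    # list devices, weights and balance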
[14:34:50] going to the latest firmware is a great idea [14:34:54] hahaha [14:34:59] (yep, that's one) [14:34:59] cmjohnson1: we run the latest firmware in ms-be6 iirc [14:35:16] they are going to want us to be on it anyways [14:35:29] I think the only sane way of using these machines is getting new controllers [14:35:53] I read a few reports where people ditched the h200s and put in something else [14:35:57] and their problems went away [14:36:02] it would be worth trying on one of these [14:36:36] because if it's a combo of driver/firmware/controller, well, swapping the controller for something else eliminates all those [14:36:40] in one fell swoop [14:37:15] once the fw update fails..i will push for the controllers but they are going to send the same LSI controller that we already have [14:38:00] that's gonna suck [14:38:38] oh well, gotta jump through the hoops [14:39:49] * apergos will soon be snacking on breaded zucchini slices [14:47:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [15:04:40] cmjohnson1: with the proprietary tools I've kinda installed we can check for the firmware version from within linux [15:05:53] cool... that would help... we should check the fw versions for the borked servers and the c2100's that are working [15:06:04] curious if there is a difference [15:06:19] doing that now [15:20:35] paravoid: is the firmware the same? [15:21:03] looking [15:21:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:05] cmjohnson1: BIOS 7.07.00.00, firmware 6.00.00.00 [15:22:08] everywhere on pmtpa [15:22:22] except ms-be8 which I didn't test since it's powered off, but I'm pretty sure that's the case too [15:22:28] PROBLEM - Host msfe1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:34] can you check that against eqiad [15:33:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.371 seconds [15:37:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (42594) [15:42:10] New patchset: Ottomata; "Adding parameter $server_admin to webserver::apache::site define." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22205 [15:43:05] New patchset: Ottomata; "Now hosting community-analytics.wikimedia.org from stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [15:43:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22205 [15:43:54] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22206 [15:47:32] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (41817) [15:58:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:58:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:07:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:11] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2588* [16:13:11] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2400 [16:21:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.060 seconds [16:39:43] cmjohnson1: re: srv281. do you know if it's under warranty? should I make a ticket for you to have a look at it? [16:41:01] notpeter: haven't checked but go ahead and put a ticket in so we can track the issue... there may be an old ticket out there [16:41:29] ok, cool [16:43:03] cmjohnson1: ok, reopened old ticket [16:53:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22205 [17:09:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [17:10:47] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [17:12:14] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [17:38:56] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (44842) [17:39:41] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (47280) [17:42:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [18:03:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19397 [18:06:38] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:07:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21675 [18:07:28] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22215 [18:08:56] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:09:29] I'm not sure if this is the right place to ask - how many of wikipedia servers are CPU bound? [18:09:36] and how many of those are CPU bound due to PHP? [18:09:46] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215 [18:13:56] fijal: look at ganglia - http://ganglia.wikimedia.org/latest/ and then check out app servers [18:14:00] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:14:43] * cmjohnson1 gonna get some food [18:14:46] LeslieCarr: ok, but is it all PHP? [18:14:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215 [18:15:21] fijal: well those run mediawiki so they are also running apache [18:15:40] right [18:15:44] fijal: you can check out the cpu_user stat to get a slightly more detailed idea - http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_user&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:15:57] thanks [18:15:58] there's lots of metrics to filter by, sort of in the upper lefthand side [18:17:31] yeah, playing, thanks [18:23:20] PROBLEM - Apache HTTP on srv237 is CRITICAL: Connection refused [18:30:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:51] hiya [18:34:55] notpeter, if you have a sec: [18:34:56] https://gerrit.wikimedia.org/r/#/c/22196/ [18:34:59] https://gerrit.wikimedia.org/r/#/c/22199/ [18:35:00] New patchset: Demon; "Only set up SMTP for gerrit production host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217 [18:35:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22217 [18:37:42] ottomata: lookin' [18:38:55] New patchset: Ottomata; "Puppetizing /etc/init.d/rsync and /etc/default/rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [18:39:27] +2ing [18:39:31] danke [18:39:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22219 [18:39:53] all of your paperwork looks to be in order. very good, comrade. [18:39:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22196 [18:40:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22199 [18:40:36] danke! [18:40:42] thx notpeter [18:40:45] this is a little bit more to read, but could you check this one too? 
[18:40:45] https://gerrit.wikimedia.org/r/22219 [18:41:20] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [18:42:05] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [18:42:23] notpeter: you are my most consistent merger :) [18:42:24] https://gerrit.wikimedia.org/r/22219 [18:42:42] most available, perhaps [18:43:14] easy to +2 things when it's exactly what it says it is an all approval is present :) [18:44:43] ottomata: I feel like to give any kind of legitimate review, I will need to actually read all yr codes for that module [18:44:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.113 seconds [18:45:44] aye, mark and paravoid read those yesterday [18:45:47] and they are not my code [18:45:50] it is a puppetlabs module [18:45:54] that the had me commit to our repo [18:46:20] the variables it provided by default for things like rsyncd.conf were not consistent with the init.d scripts that the ubuntu rsync package comes with [18:47:22] so this commit just brings in the two init.d relevant files for rsync [18:47:32] the init.d script and the defaults file [18:47:37] and sets the variables to make them consistent [18:48:29] these are the same configs that rsync daemons already use in the poorly puppetized generic::rsyncd class [18:48:50] gotcha. sweet! I shall take a looksy [18:48:57] danke! [18:49:21] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [18:50:49] oo, found a comment boo boo, fixing that [18:51:25] New patchset: Ottomata; "Puppetizing /etc/init.d/rsync and /etc/default/rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [18:52:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22219 [18:58:57] New patchset: Ottomata; "Doh, accounts::, not admins::" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22222 [18:59:21] notpeter, made a booboo with the stat1 access tickets ^ [18:59:52] ottomata: ah, kk [19:00:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22222 [19:00:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22222 [19:00:27] thank you [19:01:48] PROBLEM - Apache HTTP on srv239 is CRITICAL: Connection refused [19:12:34] notpeter, let me know if you are looking at and will or will not feel comfortable approving the rsync stuff [19:12:38] if not i'm going to work on other things [19:12:56] erik z needs this to deploy new stuff to stats.wikimedia.org [19:13:05] ottomata: am looking [19:13:10] what all is it being used on right now? [19:13:14] all rsync for cluster? [19:13:38] New patchset: Aaron Schulz; "Make sure private wikis create private backend containers." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22223 [19:15:24] nothign right now at all [19:15:37] this is a better rsync puppetization than the generic::rsyncd stuff that we wrote [19:15:59] this allows you to puppetized separate rsyncd modules without having to create a different rsyncd.conf file for every server [19:16:15] this all looks reasonable to me! [19:16:18] once this is merged, I will be doing this: [19:16:18] https://gerrit.wikimedia.org/r/#/c/21745/1/manifests/misc/statistics.pp [19:16:31] which will allow rsyncing between /a on statistic serverds [19:16:36] sure. 
I'll merge this up [19:16:49] and then if there are any fixes you need once you're deploying it, just let me know [19:16:54] danke danke [19:17:06] i will commit the stat server changes directly after [19:17:10] and try it out on stat1 and stat1001 [19:17:19] if I have problems i will get you to approve my fixes [19:17:20] thank youuuu [19:17:26] perhaps get mar_k/faidon to look over it before replacing existing functionality on the cluster, though [19:17:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] yeah totally [19:17:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [19:17:45] i'm not going to mess with things unless I need too [19:18:01] sure [19:18:02] I might add a note in the generic::rsyncd class that says this module should be used instead for new things [19:18:22] ottomata: "parameter $server_admin to webserver::apache::site" is also in there, fyi [19:18:34] sure, I'd just like more eyes before things start transitioing overall [19:18:57] for sure [19:19:02] Bwa [19:19:07] in that last commit? [19:19:19] ? [19:19:46] oh merged? [19:19:47] ottomata: in 22205 [19:19:48] mutante? [19:19:50] yes [19:19:55] i merged it earlier [19:19:57] thank you! [19:20:02] so this depended on that [19:20:03] https://gerrit.wikimedia.org/r/#/c/22206/ [19:20:04] it will affect stat1001 [19:20:06] can you do that for me too? [19:20:44] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:21:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:22:23] k notpeter ^, that is the rsyncd for erik z [19:22:25] ottomata: you do have trailing whitespace, which is frowned up. do you want to fix that? or do we decidedly not care enough? [19:22:27] ottomata: sorry for being picky about it, but do you wanna remove the red stuff (whitespace)? [19:22:29] oooop [19:22:32] i'll fix [19:22:35] lulz [19:22:38] heh [19:22:41] picky is good, no pologies [19:23:36] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22223 [19:24:23] New patchset: Ottomata; "Now hosting community-analytics.wikimedia.org from stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [19:25:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22206 [19:26:14] it turns out my planet stuff does not have localization for the date and few words in the menu, Subscribe/Subscriptions/Last Updated. But it's basically just these 3 words, but still i would a separate template for each language, including Russian etc.. or users would have to live with English .. hrmm.. hrmm [19:26:39] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:27:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:27:39] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:28:05] ottomata: woo! looks good [19:28:28] jaa danke [19:28:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:29:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:30:36] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:02] Jamesofur: what are your thoughts on l10n/i18n in the new planet? I still have those 3 words Subscripe/Subscriptions/Last updated and the date format all in English and it looks like i would have to write separate templates for each language, unless users would be ok with that [19:33:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [19:34:12] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [19:35:15] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [19:36:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [19:39:21] mutante: yeah, if at all possible we probably want to try and do that, they are unlikely to like it in English if we can avoid it. If we don't have the translations then putting it as english until we do is one thing but otherwise we probably want to keep it [19:39:54] mutante: I've had a couple questions as well because some of the languages aren't being fully displayed correctly. [19:40:25] Jamesofur: there is an issue with some French feeds? [19:40:41] they get 500 when planet tries to fetch them but work fine in a browser [19:41:26] http://hyperboree-apollon.blogspot.com/feeds/posts/default/-/wikipédia [19:41:51] this is one example. there are a few others [19:42:24] did we try with / ? [19:42:25] they all have in common that there are special chars in the URL. I already wrote to the planet mailing list about it .. but no reply [19:42:31] ahh [19:42:47] yeah, there are some special character issues with Polish that someone poked me about as eell [19:42:50] *well [19:42:51] other example: http://wikiźródła.pl/blog/?feed=rss2 from Polish [19:42:57] exactly [19:43:53] can these be written in another format? that alternative IDN format [19:44:47] Punycode, that was it [19:45:00] mutante: Example of the other issue that may be related http://zirconium.wikimedia.org/planet/pl/ look at August 11 [19:45:13] that title is perfectly fine on the actual site but we're not coding it right [19:46:03] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (62590) [19:46:16] Jamesofur: sigh.. and this https://bugzilla.wikimedia.org/show_bug.cgi?id=39725 [19:46:48] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (63285) [19:48:41] mutante: :-/ Yeah, and the old planet is starting to get more complaints as well .. apparently many of the other languages haven't been updating for ages [19:49:03] I've had 3 emails and a talk page post about it in the past week (god knows why so many people noticed it all of a sudden but I'm glad they did) [19:49:09] Jamesofur: is that really still current? i fixed quite a few languages [19:49:27] Jamesofur: it is usually due to missing locales.. these are easy fixes since locales are puppetized [19:49:28] hmm, it was when they pointed me to it, you may have done it since then? let me look [19:49:36] !log dropping testswarm database on gallium. 
It is no more used [19:49:48] they were always up to date on zirconium [19:49:59] the last one i fixed was "gmq" [19:50:06] on the old planet that is [19:50:26] mutante: have you seen my two bugs? [19:50:28] http://pl.planet.wikimedia.org/ was another one [19:50:36] eh yeah, the old planet just stopped working if the configured locales was gone [19:50:42] i was not sure about the best place to give feeback about the new planet [19:50:43] the new one does not rely on that in the same way [19:50:59] Nemo_bis: i think i just mentioned at least one of them, the iframe one [19:51:14] lemme check pl.planet, brb [19:51:20] mutante: ok, was still reading the scrollback [19:52:31] !log running pl.planet update manually [19:52:54] Jamesofur: heh, pl.planet actually runs, it just does not update because every single feed in it fails :p [19:53:13] ouch…. wonder what's wrong… the feeds work on zirconium [19:53:19] 500 or 404 or AttributeError: object has no attribute 'type' [19:53:19] I'll compare the two setups [19:53:31] could you shoot me the errors in a PM? [19:53:36] sure [19:59:53] so the conclusion is: both planets have their own issues, what works on one does not on the other and vice versa :p [20:00:08] venus is still better though:) [20:04:43] Jamesofur: eh.. i changed the log level to DEBUG instead of just ERROR and ran it again and it worked?:P [20:05:06] sierpień 31, 2012 [20:05:21] weird..... [20:05:30] !log fixed pl.planet updates without knowing how:) [20:05:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:54] New patchset: Tpt; "(bug 37483) Add a list of Page and Index namespaces ids in order to use the new namespace configuration system included into Proofread Page (change: https://gerrit.wikimedia.org/r/#/c/17643/ )" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876 [20:07:59] well, the parser has some issues with "raise AttributeError, "object has no attribute '%s'" % key" (feedparser.py) [20:08:24] but it did before and after, for _some_ feeds, and it did not stop it from updating other feeds [20:19:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [20:21:56] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [20:22:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22259 [20:22:53] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [20:23:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22259 [20:23:48] phew, ok, notpeter [20:23:53] there was a prob, fixed! 
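A quick sanity check for the stat1/stat1001 rsync daemon change being worked on above; the module name "a" is hypothetical (whatever the puppetized rsyncd.conf ends up exporting for /a), the rest is stock rsync daemon usage:

    # List the modules the daemon on stat1001 exports.
    rsync stat1001.wikimedia.org::
    # Dry-run a sync of /a into the (assumed) "a" module before doing it for real.
    rsync -avn /a/ stat1001.wikimedia.org::a/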
[20:23:54] ^ [20:26:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [20:26:30] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [20:26:30] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [20:26:31] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:26:31] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [20:27:59] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:25] !log deleting /var/lib/testswarm on gallium (testswarm is gone) [20:28:33] Logged the message, Master [20:30:32] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [20:31:53] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [20:32:23] it doesn't matter but... [20:32:40] !log ms-be6 down for firmware upgrade [20:32:47] Logged the message, Master [20:36:59] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:13] hey mutante, coudl you merge this one? [20:38:14] https://gerrit.wikimedia.org/r/#/c/22206/ [20:41:11] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:47] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:44:29] RECOVERY - SSH on virt1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:44:38] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [20:44:56] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:05] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host [20:45:23] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:23] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:31] ottomata: yeah, you are gonna test it on stat1001 though right? 
i haven't used webserver::apache::site , especially with the "custom" part [20:45:32] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host [20:45:32] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: Connection refused by host [20:45:32] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host [20:46:17] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: Connection refused by host [20:46:17] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: Connection refused by host [20:46:29] ottomata: looks like just torrus uses it so far [20:46:40] the others just put templates in /apache/templates/sites [20:46:45] stat1001 uses it yeah [20:46:47] already [20:46:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [20:46:58] stat1001 uses it for stats.wikimedia.org [20:47:00] done [20:47:15] danke! [20:47:21] de nada [20:48:08] eh, don't see it on sockpuppet [20:49:02] ok, somebody already merged [20:49:28] i did it! [20:49:37] running puppet on stat1001 now [20:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:35] mr mutante, would you mind doing this one too? [20:53:35] https://gerrit.wikimedia.org/r/#/c/22259/ [20:53:41] notpeter would be i think he is away [20:57:37] dont want to use hostnames? [20:57:41] or can't [20:59:11] sup? [20:59:21] i could [20:59:22] should I ? [20:59:40] i guess it's better to not have IPs in puppet manifests [20:59:42] ottomata: merge things on sockpuppet? [20:59:46] what? [20:59:49] I'm confused [20:59:56] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:17] naw, need this approved [21:00:23] but mutante says I should use names rather than IPs [21:00:25] so gonna patch that [21:00:36] notpeter: there were 2 changes, one already merged, then another one [21:00:53] do you agree on the IPs? dunno [21:01:19] I think that if IPs are documented it's fine either way [21:01:22] but that's just me [21:01:51] i put config options in the role class and the rest in the actual class [21:01:59] and you just moved it from role class into the other [21:02:06] but that is also just me [21:02:16] yeah, they weren't in scope? [21:02:33] i wanted to put them in the same place that included the rsyncd class [21:02:34] so I could pass them [21:02:52] not sure why, maybe because the role is inherited [21:02:55] by those subclasses? [21:03:08] ::web and ::cruncher [21:03:17] ::www and ::cruncher [21:04:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.731 seconds [21:05:02] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [21:05:22] ok notpeter, mutante ^ [21:06:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [21:07:44] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [21:09:05] yaya, danke [21:12:10] apergos1: r u still around? [21:14:39] yes [21:14:48] barely (it's after midnight on a friday night :-P) [21:14:50] what's up? 
[21:15:27] yeah..i know it's late... so I just updated the firmware on the controller for ms-be6 [21:15:31] um for anyone keeping track, I listed three commons containers and then listed the same dirs on ms7, ms7 has significantly *fewer* files than swift. I haven't done the comparison yet [21:15:35] ok great [21:15:38] it is in post now [21:17:47] ah, now that I remove the 'archive/x/y' files from the lists on swift, the numbers are very very close [21:18:14] apergos1: just wondering, is there a maximum for thumbnail sizes? [21:18:57] well we have ulimits [21:19:06] so at some point we will cut you off anyways [21:19:25] people can upload very large files (and that size will likely expand over the years) so... [21:19:26] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:19:44] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [21:21:01] gotcha. just thought if there were any thumbnails larger than the actual image [21:21:13] and if yes, get rid of them? [21:21:41] apergos1: can you tell if there is a significant difference? [21:22:31] a simple (sort uniq) comparison of the first of the ms7 dirs and the swift containers (s 1/256 of commons images) look great [21:23:05] cmjohnson1: give me a minute to finish this, then I'll look [21:23:09] did it boot up all the way? [21:23:38] no..ssh is denying me and the OS is not loaded [21:23:52] but it went through post [21:23:56] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [21:24:47] where did it stop, what's the last message you see? [21:25:01] so the container lists are on the money, all three of them [21:25:31] next q is whether any of the objects in them result in 507s (because there are no partitions with a viable copy). that's going to be harder [21:25:44] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [21:25:53] cmjohnson1: last messages on the console? [21:26:29] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [21:26:44] i didn't see... i am at the DC and have it connected to the KVM. [21:26:58] we could reboot again [21:27:00] cmjohnson1: let me try mgmt on ms-be6 .. [21:27:22] ah well if the mgmt console is free I can just look at it [21:27:34] that should be fine [21:27:35] I bet it's hung trying to mount something [21:27:35] [SOL Session operational. but then ehm... frozen? [21:27:41] try S [21:27:52] good one!
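A sketch of the kind of (sort | uniq) comparison described above, for one shard of the commons originals. It assumes the swift CLI with credentials already in the environment; the container name and the ms7 path are placeholders, not the values actually used:

    # list one swift container, dropping the archive/x/y entries as noted above,
    # and keep only the bare filenames so the two lists are comparable
    swift list wikipedia-commons-local-public.00 | grep -v '^archive/' | awk -F/ '{print $NF}' | sort -u > /tmp/swift.00
    # list the corresponding directory on ms7
    ssh ms7 'ls /export/upload/wikipedia/commons/0/00' | sort -u > /tmp/ms7.00
    # names present on only one side
    comm -3 /tmp/swift.00 /tmp/ms7.00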
[21:27:56] An error occurred while mounting /srv/swift-storage/sdn1 [21:27:56] Press S to skip mounting or M for manual recovery [21:27:57] see if that bit scrolled off the screen [21:28:10] so if it stops again repeat with another S [21:28:18] see which ones they are [21:28:24] sdn1, sdm1, sdl1 all had to be skipped [21:28:26] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:28:31] now it is at login again [21:28:33] three of em [21:28:35] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:28:35] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:28:35] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:28:35] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:28:35] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:28:36] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:28:36] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:28:36] meeehh [21:28:44] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:28:44] wait, even more [21:28:52] oh [21:28:55] sdn1, sdl1, sdi1, sdf1 [21:28:59] 4 [21:29:01] ok [21:29:04] well it was um [21:29:08] * apergos1 goes to look at their notes [21:29:20] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:29:38] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:29:38] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:29:47] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:29:51] sdf1, sdi1, sdl1, sdm1, sdn1 [21:30:10] if you type mount, [21:30:21] do you .. ah yer not logged in and anyways I can ssh on there now [21:30:30] i am [21:30:38] through mgmt [21:30:56] yeah sdf1 isn't in there either [21:31:05] sda3,sdb3,sde1,sdh1,sdc1,sdd1,sdj1,sdg1,sdk1 [21:31:16] same as before, no difference [21:31:21] I have a record from earlier today [21:31:45] so, firmware did not help, but at least they can now rule that out [21:31:51] okay..i am going to copy the notes from this and update ticket... apergos1 ... thx for looking and sticking around so late [21:31:58] sure [21:32:07] well I happened to be passing through [21:32:10] woosters^^ [21:32:20] from last night: [21:32:21] good timing! [21:32:21] 21:13 < paravoid> hahahaha fucking crazy machine [21:32:22] 21:14 < paravoid> I disabled those disks in fstab, rebooted it and now it produces i/o errors for other disks [21:32:25] ya [21:32:29] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [21:32:37] cmjohnson1 - wassup?
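The by-hand check above ("if you type mount" and see which swift partitions are missing) can be written as a single comparison. A sketch, assuming the /srv/swift-storage/* mount points shown in the boot message:

    # swift-storage mount points listed in fstab but not currently mounted
    comm -23 <(awk '$2 ~ /^\/srv\/swift-storage\// {print $2}' /etc/fstab | sort) \
             <(mount | awk '$3 ~ /^\/srv\/swift-storage\// {print $3}' | sort)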
[21:32:42] so firmware update was not successful in fixing any of the issues [21:32:46] ms-be6 but not really ;-) [21:33:02] f%$# [21:33:43] cmjohnson1 - pls escalate [21:33:55] k [21:34:08] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [21:35:09] !log ms-be6 - just for the record, still could not mount same disks (sdn1, sdl1, sdi1, sdf1) after firmware upgrade [21:35:18] Logged the message, Master [21:38:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:09] guys, thanks for your help, mutante, notpeter [21:40:16] i got some fixing to do for that community analytics site [21:40:19] buuut, i'll have to do that monday [21:40:23] (needs python modules or sumpin) [21:40:27] so one thing about this is that we really need to check that there are still errors below the filesystem level [21:40:29] the apache stuff looks good [21:40:31] on these drives after reboot [21:40:41] i'm out, seriously, I stomped on some RT tickets today [21:40:43] thanks for your help! [21:44:23] for sdl and sdn it looks like it [21:45:50] ottomata: it doesn't break your existing site though, right? [21:45:58] not clear about the other two drives [21:46:00] ottomata: have a great weekend.. Monday is a holiday !! [21:46:49] * apergos tries a mount of sdf1 to see how it fails [21:47:11] nono, dns hasn't changed yet [21:47:12] so its cool [21:47:16] oh, notpeter... [21:47:27] sup? [21:48:18] https://www.dropbox.com/s/pvcgvwlswonnz84/notpeter.jpg [21:49:02] ottomata: awesome!!!! thank you!!!! [21:49:22] I owe you many beers [21:50:23] yep it fails the same old way cmjohnson1 [21:50:31] I could check the last drive but I won't [21:50:33] I/O error in filesystem ("sdf1") [21:50:44] I will save these errors on bast1001 so we have em [21:51:04] how do i quit the ipmi console again? [21:51:04] apergos: thx... most likely but we'll see [21:51:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.189 seconds [21:51:38] !log running rm -rf /var/lib/testswarm on gallium in a screen ..takes forever [21:51:46] Logged the message, Master [21:52:00] tilde period [21:52:22] thanks [21:52:53] it eventually sends a diag reset [21:53:05] apergos: enjoy the weekend, you should stop working now :) [21:53:06] mpt2sas0: LSISAS2008: FWVersion(11.00.00.00), ChipRevision(0x03), BiosVersion(07.21.00.00) [21:53:26] sends port enable [21:53:36] then there are a bunch of [21:53:41] handle changed from blah blah [21:54:35] yea.. mpt2sas0: attempting task abort [21:54:42] and eventually [21:54:44] Aug 31 21:47:41 ms-be6 kernel: [ 1947.175473] sd 0:0:15:0: Device offlined - not ready after error recovery [21:55:07] especially nice is "Unhandled error code" [21:55:09] so the mount starts at Aug 31 21:46 [21:55:18] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [21:55:54] there's a copy of that kern log in my home dir on bast1001 [21:56:10] with a couple of notes about where to find the stuff related to the mount [21:56:13] I'm off now [21:56:15] toooooo tired [21:56:24] thx!!! have a nice weekend [21:56:43] good luck! (half my weekend will be spent travelling) [21:57:39] have a good flight [21:58:54] thanks [22:00:52] cu soon apergos. by chris.
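Two small how-tos buried in the exchange above, written out as a sketch; the management hostname is a placeholder and these are not the commands actually typed:

    # leaving an IPMI serial-over-LAN console: press Enter, then ~ .  (the "tilde period" above)
    # if the session is wedged it can also be torn down from another shell:
    ipmitool -I lanplus -H ms-be6.mgmt.example -U root sol deactivate

    # keeping a copy of the kernel errors around the failed sdf1 mount, as described:
    ssh ms-be6 'grep -C2 "I/O error" /var/log/kern.log' > ~/ms-be6-sdf1-mount-errors.log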
also out [22:12:41] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [22:14:38] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:53] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [22:26:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.679 seconds [22:54:32] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [23:12:14] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:14:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:45] New review: Siebrand; "Still got plans with this Ariel, or to be abandoned?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9118 [23:26:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.480 seconds [23:27:39] New review: Siebrand; "Still have any plans with this, Jon? I0bf7981d was merged months ago." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11963 [23:29:05] hello again. [23:29:27] retrying question from yesterday: Who can tell me what an e-mail alias on the lists server (*-owner@lists) points to? [23:30:05] The wikimediake list is receiving browser-hijacking spam, and the one list-owner I know has disengaged. [23:39:11] well i was going to say can't you just ask the listinfo footer... [23:39:18] but obviously that's not helpful [23:39:39] > If possible, try this install on a separate machine for testing first. If possible, try this install on a separate machine for testing first [23:39:43] gah [23:39:53] > WikimediaKE list run by WikimediaKE-owner (at) lists (dot) wikimedia (dot) org [23:41:01] abartov: do you have a new owner to propose if the one you're thinking of is really the only one? [23:41:30] also you might want to talk to Thehelpfulone [23:43:28] ah I know that list, one sec [23:43:51] yes there's only one owner [23:43:56] see https://lists.wikimedia.org/mailman/roster/wikimediake [23:45:17] Thehelpfulone: why would we be able to see that? [23:45:19] ;) [23:47:01] New review: Siebrand; "You should probably add one or two reviewers if you still would like this change to be merged." [operations/debs/lucene-search-2] (master) C: 0; - https://gerrit.wikimedia.org/r/12944 [23:47:09] jeremyb, look at the bottom :P [23:47:38] ohhhh, haha [23:47:50] that's a bug surely [23:48:02] but very useful ;) [23:48:46] New review: Siebrand; "Roan, is this change still current?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15561 [23:49:47] New review: Catrope; "I believe so, it's just not getting reviewed." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15561 [23:50:05] Ryan_Lane: ----^^ That change has been sitting in the review queue since Wikimania [23:50:20] heh [23:50:20] ok [23:50:23] I'll review it [23:50:34] Change abandoned: Diederik; "(no reason)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/12944 [23:50:44] it will definitely need to be rebased [23:50:54] New patchset: Catrope; "Set $wgVisualEditorParsoidPrefix correctly" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18125 [23:51:01] * RoanKattouw rebases [23:51:05] Ryan_Lane: ircecho? ;-) [23:53:47] jeremyb: thanks. 
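For the record, the ownership question above is answerable from a shell on the lists server. A sketch, assuming Mailman 2.1 under Debian's default paths (both assumptions); the address in the last step is a placeholder:

    # addresses that wikimediake-owner@lists.wikimedia.org ultimately reaches
    /usr/lib/mailman/bin/list_owners wikimediake
    # full owner/moderator configuration for the list
    /usr/lib/mailman/bin/config_list -o - wikimediake | grep -E '^(owner|moderator)'
    # setting a new owner address (needs root on the lists host; this replaces the existing owner list)
    echo "owner = ['new.owner@example.org']" > /tmp/newowner.py
    /usr/lib/mailman/bin/config_list -i /tmp/newowner.py wikimediake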
I can find a new owner, yes, and in the meantime, I'm willing to admin it myself for a few days. [23:53:58] who has the power to do this? [23:55:49] abartov: only roots [23:56:51] abartov: you can file a ticket or i can for you