[00:03:10] AaronSchulz: the db connection problems look like they started yesterday: Wed Aug 29 15:49:09 UTC 2012, and really started in earnest Wed Aug 29 19:51:30 UTC 2012 [00:05:20] are you doing cat dberror.log | cut -b -16 | uniq --count or something? [00:06:26] I forgot about the --count switch, but that's pretty close to it [00:07:06] zgrep 'Error connecting to 10.0.6.73' dberror.log-20120830.gz | cut -b1-28 | uniq --count [00:12:01] New review: RobLa; "Verified web page looks kosher and imported Chris's key from the page." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/22158 [00:12:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22158 [00:13:33] https://secure.wikimedia.org/keys.html [00:14:21] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [00:15:40] sure lots of job runner OOMs [00:21:01] win 2 [00:24:42] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [00:24:42] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [00:24:43] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:24:43] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:44] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [00:24:44] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [00:24:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [00:24:45] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [00:24:46] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [00:24:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:25:32] TimStarling: I wonder how much memory getBacklinkCache() on a bunch of titles takes up when running jobs [00:26:30] though the title object (and hence the cache member) should be freed between jobs, meh [00:26:39] there's a cache of title objects [00:26:48] const CACHE_MAX = 1000; [00:27:17] maybe we could clear it after each job [00:27:20] yeah, other refs [00:27:22] Title::clearCache(); [00:27:38] doesn't exist, just saying we could make it [00:28:49] makes sense [00:41:50] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Fri Aug 31 00:41:18 UTC 2012 [00:42:17] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Fri Aug 31 00:42:04 UTC 2012 [00:45:17] RECOVERY - Puppet freshness on ms-be1002 is OK: puppet ran at Fri Aug 31 00:45:15 UTC 2012 [00:59:50] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Fri Aug 31 00:59:27 UTC 2012 [01:30:26] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:41:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 263 seconds [01:42:08] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 271 seconds [01:48:44] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - 
Seconds_Behind_Master : 668s [01:57:18] RECOVERY - swift-account-reaper on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [01:58:57] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [01:59:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [01:59:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [02:01:22] what the hell [02:05:42] mutante: around? [02:10:12] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [02:20:24] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [02:26:06] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [02:32:48] paravoid: re [02:33:34] I have bigger problems but would you perhaps know why puppet_freshness/SNMP wouldn't work? [02:33:47] you're one of the people that might know :) [02:33:54] on prod nagios? [02:34:12] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:29] yes [02:34:30] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:30] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:33] paravoid: i guess because spence has been rebooted and the deamon isnt running, hold on [02:34:39] PROBLEM - swift-object-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:39] PROBLEM - swift-container-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - SSH on ms-be8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:06] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - swift-account-server on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:06] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:24] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:29] 4 swift boxes down in tampa [02:35:30] out of 9 [02:35:33] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:39] ugh :/ [02:38:33] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [02:44:49] paravoid: so, freshness monitoring is via snmp, the daemons are running and i see snmp traps incoming in tcpdump [02:45:08] paravoid: just not working for the swift boxes? [02:45:50] seems so. not important right now... 
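A minimal sketch of the dberror.log bucketing pipeline quoted above, assuming log lines start with a fixed-width timestamp like "Wed Aug 29 15:49:09 UTC 2012"; the grep pattern and filename are taken from the channel, while the exact byte range for cut is an assumption that keeps day, hour and minute:

    # Count connection errors per timestamp prefix; the log is chronological,
    # so uniq -c (which only collapses adjacent duplicates) is sufficient.
    zgrep 'Error connecting to 10.0.6.73' dberror.log-20120830.gz \
        | cut -b1-16 \
        | uniq -c \
        | sort -rn | head    # worst minutes first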
[02:46:01] sigh [02:49:39] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [02:49:39] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:49:39] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [02:49:48] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:49:48] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [02:49:58] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [02:49:58] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [02:50:06] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [02:50:15] RECOVERY - SSH on ms-be8 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [02:50:33] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [02:50:33] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [02:50:33] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [02:50:51] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:50:51] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [02:59:36] !log swift: removing ms-be8 sdc1, sdg1, sdk1, sdl1, sdn1 from ring; broken hardware (again) [02:59:48] Logged the message, Master [03:04:49] * Jeff_Green is sorta not totally failing with pbuilder . . . finally [03:05:22] 'bug' turned out to be bad command syntax in the manifest I cherrypicked from production puppet [03:06:09] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:10:06] grumble grumble grumble [03:15:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (26323) [03:15:58] RECOVERY - Puppet freshness on palladium is OK: puppet ran at Fri Aug 31 03:15:47 UTC 2012 [03:17:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (26483) [03:20:01] paravoid: i looked through older mail when i talked to Ben about ms-be5 going down in March. I just forwarded one to you, it contains a Bash script i once wrote to add devices to swift rings.. 
fwiw [03:21:43] eh, pasted here: http://wikitech.wikimedia.org/view/User_talk:Dzahn [03:30:31] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [03:30:31] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [03:31:43] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [03:32:01] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [03:32:01] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [03:35:46] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [03:43:48] another disk broken [03:43:50] seriously, WTH [03:45:22] mutante: thanks [03:47:10] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:48:31] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [03:49:25] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:49:34] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [03:53:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [03:56:08] paravoid: http://lists.us.dell.com/pipermail/linux-poweredge/2012-May/046375.html [04:02:32] added to the new ticket.. [04:02:41] ms-be8 is going down for power off NOW [04:03:37] I know [04:03:38] that's me [04:03:53] even without the i/o errored disk umounted, it's broken [04:04:03] multiple processes in D [04:04:15] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [04:05:09] oh man.. 
just quoting "starting to wonder whether [04:05:11] this model of machine is a bit of a lemon" [04:06:30] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [04:07:33] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [04:08:00] ACKNOWLEDGEMENT - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn Dell C2100 [04:11:18] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [04:12:45] hahahaha fucking crazy machine [04:13:02] I disabled those disks in fstab, rebooted it and now it produces i/o errors for other disks [04:13:21] yeah, it's randomized on every reboot [04:13:51] thats why we kept thinking it must be the controller [04:14:07] after several disks had been replaced/checked by Chris [04:14:14] no it wasn't randomized, it kept being the same disks [04:14:18] until I disabled these [04:14:22] ooh [04:14:27] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:14:27] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [04:14:45] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:14:45] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:15:03] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [04:15:30] PROBLEM - swift-object-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [04:15:39] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:15:57] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:15:57] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:16:06] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:16:51] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [04:18:48] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [04:21:57] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:22:15] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:22:24] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:22:24] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:22:24] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:22:33] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [04:22:33] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [04:22:33] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [04:22:42] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [04:23:09] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:23:09] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:23:09] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [04:24:17] on ms-be1012 it was .."Spinning up disk .. not responding" . and "Add. Sense: Logical unit not ready, cause not reportable" [04:24:23] bbl [04:24:30] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [04:24:30] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:24:30] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [04:26:18] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [04:27:39] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [04:27:39] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [04:27:57] PROBLEM - swift-account-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [04:28:15] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [04:28:24] PROBLEM - swift-container-server on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [04:28:24] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:28:24] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:28:33] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-object-server on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: Connection refused by host [04:28:33] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: Connection refused by host [04:28:51] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
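As an aside on the "multiple processes in D" observation during the ms-be8 debugging above, a generic way to spot processes stuck in uninterruptible I/O sleep and the kernel errors behind them (a sketch, not tied to any particular host):

    # Processes in state D are blocked inside the kernel, usually on I/O to a bad disk.
    ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
    # Recent kernel I/O errors usually name the offending device.
    dmesg | grep -i 'i/o error' | tail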
[04:30:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [04:31:33] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [04:35:36] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:35:45] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [04:40:07] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:47:18] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [05:05:11] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:26] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (38499) [05:44:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (37921) [05:57:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [05:57:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:08:28] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:09:22] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [06:12:40] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [06:46:06] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [06:52:33] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [06:54:30] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [06:56:18] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [06:58:19] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [06:59:49] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:03:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [07:22:37] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:22:55] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [07:30:16] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [07:30:52] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [07:35:13] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:39:48] New patchset: Petrb; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [08:09:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [08:22:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [08:33:03] PROBLEM - NTP on mw8 is CRITICAL: NTP CRITICAL: Offset unknown [08:36:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [08:37:33] RECOVERY - NTP on mw8 is OK: NTP OK: Offset 0.03597307205 secs [09:21:49] New patchset: Petrb; "Deploying OSB to beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22172 [09:43:02] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [09:45:44] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (25016) [09:46:29] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 
9,999 jobs: , zhwiki (23878) [10:01:20] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [10:02:05] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [10:11:41] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [10:19:20] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [10:25:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [10:25:47] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [10:25:48] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [10:25:48] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:25:49] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [10:25:49] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [10:42:31] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (46492) [10:43:16] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (46678) [11:16:07] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [11:31:07] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:05:25] New patchset: Hashar; "disallow robots on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [12:05:42] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21602 [12:08:15] New patchset: Hashar; "$wgDBerrorLogInUTC -> $wgDBerrorLogTZ" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15634 [12:08:27] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/15634 [12:08:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/15634 [12:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.956 seconds [12:27:09] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [12:31:48] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [12:34:12] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [12:34:39] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [12:37:57] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:02:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:06:58] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [13:07:48] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [13:08:49] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:09:34] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [13:11:51] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [13:12:41] New review: Hashar; "PS6: run git pull in extensions directory" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [13:12:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [13:15:16] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [13:15:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [13:22:46] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:43:55] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [13:47:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:55:11] hey paravoid, would you maybe have some time to look at https://rt.wikimedia.org/Ticket/Display.html?id=2970 (Redirect all .mobile requests to .m)? or are you busy with swift? [14:01:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [14:08:44] New patchset: Ottomata; "Access to stat1 for Ori and S Page - RT 3451" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22196 [14:09:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22196 [14:20:14] drdee__: I'm generally busy, but I'll have a look [14:20:56] ty paravoid [14:23:06] New patchset: Ottomata; "Access on stat1 for giovanni and halfak - RT 3460" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22199 [14:23:23] apergos: all ok with the rebalance? [14:23:36] apergos: the puppet taking forever is a sign of broken controller... [14:23:40] except for thw two hosts that will be running forever [14:23:44] yes, it's why I ran em last [14:23:50] I figured they might never complete [14:23:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22199 [14:23:58] (also why I ran em in screen :-P) [14:23:59] we should copy the object.* manually [14:24:43] and we should remove them completely from the rings [14:24:47] are they still in there? [14:24:49] * paravoid checks [14:25:22] well neither of those was even listed in the (live) ring by the time I looked at it [14:25:33] ms-be7 is [14:25:35] so I expect that they will be dropped off it again if they were back in [14:25:42] 210 10 10.0.6.206 6000 sdc1 31.00 844 0.04 [14:25:43] oh. sorry, be6 and be10 aren't. 
my bad [14:25:45] 211 10 10.0.6.206 6000 sdd1 31.00 844 0.04 [14:25:46] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [14:25:46] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [14:25:46] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:25:49] 213 10 10.0.6.206 6000 sdf1 31.00 844 0.04 [14:25:52] 214 10 10.0.6.206 6000 sdg1 0.00 0 0.00 [14:25:54] 7 is the one that's in there [14:25:55] 219 10 10.0.6.206 6000 sdl1 31.00 844 0.04 [14:25:58] yeah, those two are not [14:26:00] for its four measly disks [14:26:08] yeah [14:26:31] it seems to me that when we have boxes dying that way, with half of their disks with I/O errors [14:26:34] we should remove them completely [14:26:43] since the rest don't have I/O errors but don't work properly either [14:26:53] anyways, except for ms-be11 which I didn't try to adjust (sdi which now shows up as something else but it's still listed in the ring) [14:27:07] there's nothing of note, and there were (when I looked) no *new* broken items [14:27:20] oh, ms-be11 sdi, argh [14:27:22] we should remove that too [14:27:30] well it can get removed next round [14:28:14] I certainly have no objections with pulling ms-be7 out [14:30:02] ok, I removed ms-be7 and ms-be11's sdi [14:30:07] haven't rebalanced yet [14:30:22] when did you rebalance/push? [14:30:32] around 13:30 [14:30:42] for everything but the two broken ones [14:31:10] okay [14:31:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:31:55] I hunted the linux-scsi list for stuff on the mpt2sas driver in the meantime [14:32:01] came up negative [14:32:24] I mean there were a lot of things (though bear in mind we are running a really recent version on one host and a very old version on another) [14:32:40] but none of them, after I looked at em, seemed to be the guilty party [14:32:58] in the meantime there's also this: [14:32:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:59] http://linux.slashdot.org/story/11/02/03/0136252/dell-releases-ubuntu-powered-cloud-servers [14:33:12] which, ok it's slashdot, who knows, but a few of the comments are quite interesting [14:33:23] stop pinging me! >:( [14:34:13] you should set your client to only ping if the line starts with the name [14:34:16] that's what mark does [14:34:31] apergos and paravoid: so Dell thinks it could be the controller firmware... not sure I am buying it but I will be updating ms-be6 today to see if it works [14:34:34] especially for a name that's a very common word... [14:34:47] We purchased 16 C2100s in August. If you like being a Dell beta tester, have at it. The LSI RAID controllers they have in these things are, for a lack of a better word, complete crap. Technically, it's probably the drivers ... but until they have a working driver for linux that doesn't lose its mind and reset the card randomly (thus making your volumes disappear for a minute or two), I suggest staying away. Far away.
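For context on the "removed ms-be7 and ms-be11's sdi / haven't rebalanced yet" exchange above, this is roughly the shape of the ring edit being described, assuming the standard account/container/object builder files and the usual <ip>/<device> search-value form; the 10.0.6.206/sdc1 value is only illustrative, taken from the device listing pasted above:

    # Drop a failed device from each ring and rebalance; the regenerated
    # *.ring.gz files still have to be pushed out to the storage and proxy
    # nodes afterwards (the "rebalance/push" step mentioned above).
    for ring in account container object; do
        swift-ring-builder ${ring}.builder remove 10.0.6.206/sdc1
        swift-ring-builder ${ring}.builder rebalance
    done
    swift-ring-builder object.builder    # list devices, weights and balance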
[14:34:50] going to the latest firmware is a great idea [14:34:54] hahaha [14:34:59] (yep, that's one) [14:34:59] cmjohnson1: we run the latest firmware in ms-be6 iirc [14:35:16] they are going to want us to be on it anyways [14:35:29] I think the only sane way of using these machines is getting new controllers [14:35:53] I read a few reports where people ditched the h200s and put in something else [14:35:57] and their problems went away [14:36:02] it would be worth trying on one of these [14:36:36] because if it's a combo of driver/firmware/controller, well, swapping the controller for something else eliminates all those [14:36:40] in one fell swoop [14:37:15] once the fw update fails..i will push for the controllers but they are going to send the same LSI controller that we already have [14:38:00] that's gonna suck [14:38:38] oh well, gotta jump through the hoops [14:39:49] * apergos will soon be snacking on breaded zucchini slices [14:47:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [15:04:40] cmjohnson1: with the proprietary tools I've kinda installed we can check for the firmware version from within linux [15:05:53] cool... that would help... we should check the fw versions for the borked servers and the c2100's that are working [15:06:04] curious if there is a difference [15:06:19] doing that now [15:20:35] paravoid: is the firmware the same? [15:21:03] looking [15:21:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:05] cmjohnson1: BIOS 7.07.00.00, firmware 6.00.00.00 [15:22:08] everywhere on pmtpa [15:22:22] except ms-be8 which I didn't test since it's powered off, but I'm pretty sure that's the case too [15:22:28] PROBLEM - Host msfe1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:34] can you check that against eqiad [15:33:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.371 seconds [15:37:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (42594) [15:42:10] New patchset: Ottomata; "Adding parameter $server_admin to webserver::apache::site define." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22205 [15:43:05] New patchset: Ottomata; "Now hosting community-analytics.wikimedia.org from stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [15:43:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22205 [15:43:54] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22206 [15:47:32] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (41817) [15:58:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [15:58:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:07:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:11] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2588* [16:13:11] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 2400 [16:21:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.060 seconds [16:39:43] cmjohnson1: re: srv281. do you know if it's under warranty? should I make a ticket for you to have a look at it? [16:41:01] notpeter: haven't checked but go ahead and put a ticket in so we can track the issue... there may be an old ticket out there [16:41:29] ok, cool [16:43:03] cmjohnson1: ok, reopened old ticket [16:53:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22205 [17:09:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [17:10:47] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [17:12:14] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [17:38:56] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (44842) [17:39:41] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (47280) [17:42:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [18:03:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19397 [18:06:38] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:07:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21675 [18:07:28] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/22215 [18:08:56] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:09:29] I'm not sure if this is the right place to ask - how many of wikipedia servers are CPU bound? [18:09:36] and how many of those are CPU bound due to PHP? [18:09:46] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215 [18:13:56] fijal: look at ganglia - http://ganglia.wikimedia.org/latest/ and then check out app servers [18:14:00] New patchset: Demon; "Trying to puppetize replication config -- WORK IN PROGRESS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22215 [18:14:43] * cmjohnson1 gonna get some food [18:14:46] LeslieCarr: ok, but is it all PHP? [18:14:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22215 [18:15:21] fijal: well those run mediawiki so they are also running apache [18:15:40] right [18:15:44] fijal: you can check out the cpu_user stat to get a slightly more detailed idea - http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_user&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:15:57] thanks [18:15:58] there's lots of metrics to filter by, sort of in the upper lefthand side [18:17:31] yeah, playing, thanks [18:23:20] PROBLEM - Apache HTTP on srv237 is CRITICAL: Connection refused [18:30:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:51] hiya [18:34:55] notpeter, if you have a sec: [18:34:56] https://gerrit.wikimedia.org/r/#/c/22196/ [18:34:59] https://gerrit.wikimedia.org/r/#/c/22199/ [18:35:00] New patchset: Demon; "Only set up SMTP for gerrit production host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22217 [18:35:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22217 [18:37:42] ottomata: lookin' [18:38:55] New patchset: Ottomata; "Puppetizing /etc/init.d/rsync and /etc/default/rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [18:39:27] +2ing [18:39:31] danke [18:39:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22219 [18:39:53] all of your paperwork looks to be in order. very good, comrade. [18:39:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22196 [18:40:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22199 [18:40:36] danke! [18:40:42] thx notpeter [18:40:45] this is a little bit more to read, but could you check this one too? 
[18:40:45] https://gerrit.wikimedia.org/r/22219 [18:41:20] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [18:42:05] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [18:42:23] notpeter: you are my most consistent merger :) [18:42:24] https://gerrit.wikimedia.org/r/22219 [18:42:42] most available, perhaps [18:43:14] easy to +2 things when it's exactly what it says it is an all approval is present :) [18:44:43] ottomata: I feel like to give any kind of legitimate review, I will need to actually read all yr codes for that module [18:44:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.113 seconds [18:45:44] aye, mark and paravoid read those yesterday [18:45:47] and they are not my code [18:45:50] it is a puppetlabs module [18:45:54] that the had me commit to our repo [18:46:20] the variables it provided by default for things like rsyncd.conf were not consistent with the init.d scripts that the ubuntu rsync package comes with [18:47:22] so this commit just brings in the two init.d relevant files for rsync [18:47:32] the init.d script and the defaults file [18:47:37] and sets the variables to make them consistent [18:48:29] these are the same configs that rsync daemons already use in the poorly puppetized generic::rsyncd class [18:48:50] gotcha. sweet! I shall take a looksy [18:48:57] danke! [18:49:21] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [18:50:49] oo, found a comment boo boo, fixing that [18:51:25] New patchset: Ottomata; "Puppetizing /etc/init.d/rsync and /etc/default/rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [18:52:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22219 [18:58:57] New patchset: Ottomata; "Doh, accounts::, not admins::" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22222 [18:59:21] notpeter, made a booboo with the stat1 access tickets ^ [18:59:52] ottomata: ah, kk [19:00:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22222 [19:00:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22222 [19:00:27] thank you [19:01:48] PROBLEM - Apache HTTP on srv239 is CRITICAL: Connection refused [19:12:34] notpeter, let me know if you are looking at and will or will not feel comfortable approving the rsync stuff [19:12:38] if not i'm going to work on other things [19:12:56] erik z needs this to deploy new stuff to stats.wikimedia.org [19:13:05] ottomata: am looking [19:13:10] what all is it being used on right now? [19:13:14] all rsync for cluster? [19:13:38] New patchset: Aaron Schulz; "Make sure private wikis create private backend containers." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22223 [19:15:24] nothign right now at all [19:15:37] this is a better rsync puppetization than the generic::rsyncd stuff that we wrote [19:15:59] this allows you to puppetized separate rsyncd modules without having to create a different rsyncd.conf file for every server [19:16:15] this all looks reasonable to me! [19:16:18] once this is merged, I will be doing this: [19:16:18] https://gerrit.wikimedia.org/r/#/c/21745/1/manifests/misc/statistics.pp [19:16:31] which will allow rsyncing between /a on statistic serverds [19:16:36] sure. 
I'll merge this up [19:16:49] and then if there are any fixes you need once you're deploying it, just let me know [19:16:54] danke danke [19:17:06] i will commit the stat server changes directly after [19:17:10] and try it out on stat1 and stat1001 [19:17:19] if I have problems i will get you to approve my fixes [19:17:20] thank youuuu [19:17:26] perhaps get mar_k/faidon to look over it before replacing existing functionality on the cluster, though [19:17:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] yeah totally [19:17:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22219 [19:17:45] i'm not going to mess with things unless I need too [19:18:01] sure [19:18:02] I might add a note in the generic::rsyncd class that says this module should be used instead for new things [19:18:22] ottomata: "parameter $server_admin to webserver::apache::site" is also in there, fyi [19:18:34] sure, I'd just like more eyes before things start transitioing overall [19:18:57] for sure [19:19:02] Bwa [19:19:07] in that last commit? [19:19:19] ? [19:19:46] oh merged? [19:19:47] ottomata: in 22205 [19:19:48] mutante? [19:19:50] yes [19:19:55] i merged it earlier [19:19:57] thank you! [19:20:02] so this depended on that [19:20:03] https://gerrit.wikimedia.org/r/#/c/22206/ [19:20:04] it will affect stat1001 [19:20:06] can you do that for me too? [19:20:44] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:21:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:22:23] k notpeter ^, that is the rsyncd for erik z [19:22:25] ottomata: you do have trailing whitespace, which is frowned up. do you want to fix that? or do we decidedly not care enough? [19:22:27] ottomata: sorry for being picky about it, but do you wanna remove the red stuff (whitespace)? [19:22:29] oooop [19:22:32] i'll fix [19:22:35] lulz [19:22:38] heh [19:22:41] picky is good, no pologies [19:23:36] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22223 [19:24:23] New patchset: Ottomata; "Now hosting community-analytics.wikimedia.org from stat1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [19:25:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22206 [19:26:14] it turns out my planet stuff does not have localization for the date and few words in the menu, Subscribe/Subscriptions/Last Updated. But it's basically just these 3 words, but still i would a separate template for each language, including Russian etc.. or users would have to live with English .. hrmm.. hrmm [19:26:39] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:27:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:27:39] New patchset: Ottomata; "Setting up rsync daemons for /a on stat1 and stat1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:28:05] ottomata: woo! looks good [19:28:28] jaa danke [19:28:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22224 [19:29:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22224 [19:30:36] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:02] Jamesofur: what are your thoughts on l10n/i18n in the new planet? I still have those 3 words Subscripe/Subscriptions/Last updated and the date format all in English and it looks like i would have to write separate templates for each language, unless users would be ok with that [19:33:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [19:34:12] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.68 ms [19:35:15] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [19:36:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21961 [19:39:21] mutante: yeah, if at all possible we probably want to try and do that, they are unlikely to like it in English if we can avoid it. If we don't have the translations then putting it as english until we do is one thing but otherwise we probably want to keep it [19:39:54] mutante: I've had a couple questions as well because some of the languages aren't being fully displayed correctly. [19:40:25] Jamesofur: there is an issue with some French feeds? [19:40:41] they get 500 when planet tries to fetch them but work fine in a browser [19:41:26] http://hyperboree-apollon.blogspot.com/feeds/posts/default/-/wikipédia [19:41:51] this is one example. there are a few others [19:42:24] did we try with / ? [19:42:25] they all have in common that there are special chars in the URL. I already wrote to the planet mailing list about it .. but no reply [19:42:31] ahh [19:42:47] yeah, there are some special character issues with Polish that someone poked me about as eell [19:42:50] *well [19:42:51] other example: http://wikiźródła.pl/blog/?feed=rss2 from Polish [19:42:57] exactly [19:43:53] can these be written in another format? that alternative IDN format [19:44:47] Punycode, that was it [19:45:00] mutante: Example of the other issue that may be related http://zirconium.wikimedia.org/planet/pl/ look at August 11 [19:45:13] that title is perfectly fine on the actual site but we're not coding it right [19:46:03] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (62590) [19:46:16] Jamesofur: sigh.. and this https://bugzilla.wikimedia.org/show_bug.cgi?id=39725 [19:46:48] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , zhwiki (63285) [19:48:41] mutante: :-/ Yeah, and the old planet is starting to get more complaints as well .. apparently many of the other languages haven't been updating for ages [19:49:03] I've had 3 emails and a talk page post about it in the past week (god knows why so many people noticed it all of a sudden but I'm glad they did) [19:49:09] Jamesofur: is that really still current? i fixed quite a few languages [19:49:27] Jamesofur: it is usually due to missing locales.. these are easy fixes since locales are puppetized [19:49:28] hmm, it was when they pointed me to it, you may have done it since then? let me look [19:49:36] !log dropping testswarm database on gallium. 
It is no more used [19:49:48] they were always up to date on zirconium [19:49:59] the last one i fixed was "gmq" [19:50:06] on the old planet that is [19:50:26] mutante: have you seen my two bugs? [19:50:28] http://pl.planet.wikimedia.org/ was another one [19:50:36] eh yeah, the old planet just stopped working if the configured locales was gone [19:50:42] i was not sure about the best place to give feeback about the new planet [19:50:43] the new one does not rely on that in the same way [19:50:59] Nemo_bis: i think i just mentioned at least one of them, the iframe one [19:51:14] lemme check pl.planet, brb [19:51:20] mutante: ok, was still reading the scrollback [19:52:31] !log running pl.planet update manually [19:52:54] Jamesofur: heh, pl.planet actually runs, it just does not update because every single feed in it fails :p [19:53:13] ouch…. wonder what's wrong… the feeds work on zirconium [19:53:19] 500 or 404 or AttributeError: object has no attribute 'type' [19:53:19] I'll compare the two setups [19:53:31] could you shoot me the errors in a PM? [19:53:36] sure [19:59:53] so the conclusion is: both planets have their own issues, what works on one does not on the other and vice versa :p [20:00:08] venus is still better though:) [20:04:43] Jamesofur: eh.. i changed the log level to DEBUG instead of just ERROR and ran it again and it worked?:P [20:05:06] sierpień 31, 2012 [20:05:21] weird..... [20:05:30] !log fixed pl.planet updates without knowing how:) [20:05:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:54] New patchset: Tpt; "(bug 37483) Add a list of Page and Index namespaces ids in order to use the new namespace configuration system included into Proofread Page (change: https://gerrit.wikimedia.org/r/#/c/17643/ )" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/20876 [20:07:59] well, the parser has some issues with "raise AttributeError, "object has no attribute '%s'" % key" (feedparser.py) [20:08:24] but it did before and after, for _some_ feeds, and it did not stop it from updating other feeds [20:19:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [20:21:56] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [20:22:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22259 [20:22:53] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [20:23:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22259 [20:23:48] phew, ok, notpeter [20:23:53] there was a prob, fixed! 
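A quick sanity check for the stat1/stat1001 rsync daemon change being worked on above; the module name "a" is hypothetical (whatever the puppetized rsyncd.conf ends up exporting for /a), the rest is stock rsync daemon usage:

    # List the modules the daemon on stat1001 exports.
    rsync stat1001.wikimedia.org::
    # Dry-run a sync of /a into the (assumed) "a" module before doing it for real.
    rsync -avn /a/ stat1001.wikimedia.org::a/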
[20:23:54] ^ [20:26:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:26:29] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [20:26:30] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [20:26:30] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [20:26:31] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:26:31] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [20:27:59] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:25] !log deleting /var/lib/testswarm on gallium (testswarm is gone) [20:28:33] Logged the message, Master [20:30:32] PROBLEM - SSH on virt1002 is CRITICAL: Connection refused [20:31:53] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 27.00 ms [20:32:23] it doesn't matter but... [20:32:40] !log ms-be6 down for firmware upgrade [20:32:47] Logged the message, Master [20:36:59] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:13] hey mutante, coudl you merge this one? [20:38:14] https://gerrit.wikimedia.org/r/#/c/22206/ [20:41:11] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:41:47] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:44:29] RECOVERY - SSH on virt1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:44:38] RECOVERY - Host virt1002 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [20:44:56] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:05] PROBLEM - swift-account-server on ms-be6 is CRITICAL: Connection refused by host [20:45:23] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:23] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: Connection refused by host [20:45:31] ottomata: yeah, you are gonna test it on stat1001 though right? 
i haven't used webserver::apache::site , especially with the "custom" part [20:45:32] PROBLEM - swift-object-server on ms-be6 is CRITICAL: Connection refused by host [20:45:32] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: Connection refused by host [20:45:32] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-container-server on ms-be6 is CRITICAL: Connection refused by host [20:45:41] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: Connection refused by host [20:46:17] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: Connection refused by host [20:46:17] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: Connection refused by host [20:46:29] ottomata: looks like just torrus uses it so far [20:46:40] the others just put templates in /apache/templates/sites [20:46:45] stat1001 uses it yeah [20:46:47] already [20:46:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22206 [20:46:58] stat1001 uses it for stats.wikimedia.org [20:47:00] done [20:47:15] danke! [20:47:21] de nada [20:48:08] eh, don't see it on sockpuppet [20:49:02] ok, somebody already merged [20:49:28] i did it! [20:49:37] running puppet on stat1001 now [20:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:35] mr mutante, would you mind doing this one too? [20:53:35] https://gerrit.wikimedia.org/r/#/c/22259/ [20:53:41] notpeter would be i think he is away [20:57:37] dont want to use hostnames? [20:57:41] or can't [20:59:11] sup? [20:59:21] i could [20:59:22] should I ? [20:59:40] i guess it's better to not have IPs in puppet manifests [20:59:42] ottomata: merge things on sockpuppet? [20:59:46] what? [20:59:49] I'm confused [20:59:56] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:17] naw, need this approved [21:00:23] but mutante says I should use names rather than IPs [21:00:25] so gonna patch that [21:00:36] notpeter: there were 2 changes, one already merged, then another one [21:00:53] do you agree on the IPs? dunno [21:01:19] I think that if IPs are documented it's fine either way [21:01:22] but that's just me [21:01:51] i put config options in the role class and the rest in the actual class [21:01:59] and you just moved it from role class into the other [21:02:06] but that is also just me [21:02:16] yeah, they weren't in scope? [21:02:33] i wanted to put them in the same place that included the rsyncd class [21:02:34] so I could pass them [21:02:52] not sure why, maybe because the role is inherited [21:02:55] by those subclasses? [21:03:08] ::web and ::cruncher [21:03:17] ::www and ::cruncher [21:04:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.731 seconds [21:05:02] New patchset: Ottomata; "Moving hosts_allow to parameter for misc::statistics::rsyncd." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [21:05:22] ok notpeter, mutante ^ [21:06:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22259 [21:07:44] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [21:09:05] yaya, danke [21:12:10] apergos1: r u still around? [21:14:39] yes [21:14:48] barely (it's after midnight on a friday night :-P) [21:14:50] what's up? 
[21:15:27] yeah..i know it's late... so I just updated the firmware on the controller for ms-be6 [21:15:31] um for anyone keeping track, I listed three commons containers and then listed the same dirs on ms7, ms7 has significantly *fewer* files than swift. I haven't done the comparison yet [21:15:35] ok great [21:15:38] it is in post now [21:17:47] ah, now that I remove the 'archive/x/y' files from the lists on swift, the numbers are very very close [21:18:14] apergos1: just wondering, is there a maximum for thumbnail sizes? [21:18:57] well we have ulimits [21:19:06] so at some point we will cut you off anyways [21:19:25] people can upload very large files (and that size will likely expand over the years) so... [21:19:26] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:19:44] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [21:21:01] gotcha. just thought if there were any thumbnails larger than the actual image [21:21:13] and if yes, get rid of them? [21:21:41] apergos1: can you tell if there is a significant difference? [21:22:31] a simple (sort uniq) comparison of the first of the ms7 dirs and the swift containers (s 1/256 of commons images) look great [21:23:05] cmjohnson1: give me a minute to finish this, then I'll look [21:23:09] did it boot up all the way? [21:23:38] no..ssh is denying me and the OS is not loaded [21:23:52] but it went through post [21:23:56] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [21:24:47] where did it stop, what's the last message you see? [21:25:01] so the container lists are on the money, all three of them [21:25:31] next q is whether any of the objects in them result in 507s (because there are no partitions with a viable copy). that's going to be harder [21:25:44] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [21:25:53] cmjohnson1: last messages on the console? [21:26:29] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [21:26:44] i didn't see... i am at the DC and have it connected to the KVM. [21:26:58] we could reboot again [21:27:00] cmjohnson1: let me try mgmt on ms-be6 .. [21:27:22] ah well if the mgmt console is free I can just look at it [21:27:34] that should be fine [21:27:35] I bet it's hung trying to mount something [21:27:35] [SOL Session operational. but then ehm... frozen? [21:27:41] try S [21:27:52] good one!
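A sketch of the kind of (sort | uniq) comparison described above, for one shard of the commons originals. It assumes the swift CLI with credentials already in the environment; the container name and the ms7 path are placeholders, not the values actually used:

    # list one swift container, dropping the archive/x/y entries as noted above,
    # and keep only the bare filenames so the two lists are comparable
    swift list wikipedia-commons-local-public.00 | grep -v '^archive/' | awk -F/ '{print $NF}' | sort -u > /tmp/swift.00
    # list the corresponding directory on ms7
    ssh ms7 'ls /export/upload/wikipedia/commons/0/00' | sort -u > /tmp/ms7.00
    # names present on only one side
    comm -3 /tmp/swift.00 /tmp/ms7.00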
[21:27:56] An error occurred while mounting /srv/swift-storage/sdn1 [21:27:56] Press S to skip mounting or M for manual recovery [21:27:57] see if that bit scrolled off the screen [21:28:10] so if it stops again repeat with another S [21:28:18] see which ones they are [21:28:24] sdn1, sdm1, sdl1 all had to be skipped [21:28:26] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:28:31] now it is at login again [21:28:33] three of em [21:28:35] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:28:35] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:28:35] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:28:35] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:28:35] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:28:36] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:28:36] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:28:36] meeehh [21:28:44] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [21:28:44] wait, even more [21:28:52] oh [21:28:55] sdn1, sdl1, sdi1, sdf1 [21:28:59] 4 [21:29:01] ok [21:29:04] well it was um [21:29:08] * apergos1 goes to look at their notes [21:29:20] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [21:29:38] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:29:38] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:29:47] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [21:29:51] sdf1, sdi1, sdl1, sdm1, sdn1 [21:30:10] if you type mount, [21:30:21] do you .. ah yer not logged in and anyways I can ssh on there now [21:30:30] i am [21:30:38] through mgmt [21:30:56] yeah sdf1 isn't in there either [21:31:05] sda3,sdb3,sde1,sdh1,sdc1,sdd1,sdj1,sdg1,sdk1 [21:31:16] same as before, no difference [21:31:21] I have a record from earlier today [21:31:45] so, firmware did not help, but at least they can now rule that out [21:31:51] okay..i am going to copy the notes from this and update ticket... apergos1 ... thx for looking and sticking around so late [21:31:58] sure [21:32:07] well I happened to be passing through [21:32:10] woosters^^ [21:32:20] from last night: [21:32:21] good timing! [21:32:21] 21:13 < paravoid> hahahaha fucking crazy machine [21:32:22] 21:14 < paravoid> I disabled those disks in fstab, rebooted it and now it produces i/o errors for other disks [21:32:25] ya [21:32:29] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [21:32:37] cmjohnson1 - wassup?
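The by-hand check above ("if you type mount" and see which swift partitions are missing) can be written as a single comparison. A sketch, assuming the /srv/swift-storage/* mount points shown in the boot message:

    # swift-storage mount points listed in fstab but not currently mounted
    comm -23 <(awk '$2 ~ /^\/srv\/swift-storage\// {print $2}' /etc/fstab | sort) \
             <(mount | awk '$3 ~ /^\/srv\/swift-storage\// {print $3}' | sort)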
[21:32:42] so firmware update was not successful in fixing any of the issues [21:32:46] ms-be6 but not really ;-) [21:33:02] f%$# [21:33:43] cmjohnson1 - pls escalate [21:33:55] k [21:34:08] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [21:35:09] !log ms-be6 - just for the record, still could not mount same disks (sdn1, sdl1, sdi1, sdf1) after firmware upgrade [21:35:18] Logged the message, Master [21:38:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:09] guys, thanks for your help, mutante, notpeter [21:40:16] i got some fixing to do for that community analytics site [21:40:19] buuut, i'll have to do that monday [21:40:23] (needs python modules or sumpin) [21:40:27] so one thing about this is that we really need to check that there are still errors below the filesystem level [21:40:29] the apache stuff looks good [21:40:31] on these drives after reboot [21:40:41] i'm out, seriously, I stomped on some RT tickets today [21:40:43] thanks for your help! [21:44:23] for sdl and sdn it looks like it [21:45:50] ottomata: it doesn't break your existing site though, right? [21:45:58] not clear about the other two drives [21:46:00] ottomata: have a great weekend.. Monday is a holiday !! [21:46:49] * apergos tries a mount of sdf1 to see how it fails [21:47:11] nono, dns hasn't changed yet [21:47:12] so its cool [21:47:16] oh, notpeter... [21:47:27] sup? [21:48:18] https://www.dropbox.com/s/pvcgvwlswonnz84/notpeter.jpg [21:49:02] ottomata: awesome!!!! thank you!!!! [21:49:22] I owe you many beers [21:50:23] yep it fails the same old way cmjohnson1 [21:50:31] I could check the last drive but I won't [21:50:33] I/O error in filesystem ("sdf1") [21:50:44] I will save these errors on bast1001 so we have em [21:51:04] how do i quit the ipmi console again? [21:51:04] apergos: thx... most likely but we'll see [21:51:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.189 seconds [21:51:38] !log running rm -rf /var/lib/testswarm on gallium in a screen ..takes forever [21:51:46] Logged the message, Master [21:52:00] tilde period [21:52:22] thanks [21:52:53] it eventually sends a diag reset [21:53:05] apergos: enjoy the weekend, you should stop working now :) [21:53:06] mpt2sas0: LSISAS2008: FWVersion(11.00.00.00), ChipRevision(0x03), BiosVersion(07.21.00.00) [21:53:26] sends port enable [21:53:36] then there are a bunch of [21:53:41] handle changed from blah blah [21:54:35] yea.. mpt2sas0: attempting task abort [21:54:42] and eventually [21:54:44] Aug 31 21:47:41 ms-be6 kernel: [ 1947.175473] sd 0:0:15:0: Device offlined - not ready after error recovery [21:55:07] especially nice is "Unhandled error code" [21:55:09] so the mount starts at Aug 31 21:46 [21:55:18] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [21:55:54] there's a copy of that kern log in my home dir on bast1001 [21:56:10] with a couple of notes about where to find the stuff related to the mount [21:56:13] I'm off now [21:56:15] toooooo tired [21:56:24] thx!!! have a nice weekend [21:56:43] good luck! (half my weekend will be spent travelling) [21:57:39] have a good flight [21:58:54] thanks [22:00:52] cu soon apergos. by chris.
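Two small how-tos buried in the exchange above, written out as a sketch; the management hostname is a placeholder and these are not the commands actually typed:

    # leaving an IPMI serial-over-LAN console: press Enter, then ~ .  (the "tilde period" above)
    # if the session is wedged it can also be torn down from another shell:
    ipmitool -I lanplus -H ms-be6.mgmt.example -U root sol deactivate

    # keeping a copy of the kernel errors around the failed sdf1 mount, as described:
    ssh ms-be6 'grep -C2 "I/O error" /var/log/kern.log' > ~/ms-be6-sdf1-mount-errors.log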
also out [22:12:41] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [22:14:38] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [22:16:53] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [22:26:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.679 seconds [22:54:32] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [23:12:14] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:14:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:45] New review: Siebrand; "Still got plans with this Ariel, or to be abandoned?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9118 [23:26:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.480 seconds [23:27:39] New review: Siebrand; "Still have any plans with this, Jon? I0bf7981d was merged months ago." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11963 [23:29:05] hello again. [23:29:27] retrying question from yesterday: Who can tell me what an e-mail alias on the lists server (*-owner@lists) points to? [23:30:05] The wikimediake list is receiving browser-hijacking spam, and the one list-owner I know has disengaged. [23:39:11] well i was going to say can't you just ask the listinfo footer... [23:39:18] but obviously that's not helpful [23:39:39] > If possible, try this install on a separate machine for testing first. If possible, try this install on a separate machine for testing first [23:39:43] gah [23:39:53] > WikimediaKE list run by WikimediaKE-owner (at) lists (dot) wikimedia (dot) org [23:41:01] abartov: do you have a new owner to propose if the one you're thinking of is really the only one? [23:41:30] also you might want to talk to Thehelpfulone [23:43:28] ah I know that list, one sec [23:43:51] yes there's only one owner [23:43:56] see https://lists.wikimedia.org/mailman/roster/wikimediake [23:45:17] Thehelpfulone: why would we be able to see that? [23:45:19] ;) [23:47:01] New review: Siebrand; "You should probably add one or two reviewers if you still would like this change to be merged." [operations/debs/lucene-search-2] (master) C: 0; - https://gerrit.wikimedia.org/r/12944 [23:47:09] jeremyb, look at the bottom :P [23:47:38] ohhhh, haha [23:47:50] that's a bug surely [23:48:02] but very useful ;) [23:48:46] New review: Siebrand; "Roan, is this change still current?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15561 [23:49:47] New review: Catrope; "I believe so, it's just not getting reviewed." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15561 [23:50:05] Ryan_Lane: ----^^ That change has been sitting in the review queue since Wikimania [23:50:20] heh [23:50:20] ok [23:50:23] I'll review it [23:50:34] Change abandoned: Diederik; "(no reason)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/12944 [23:50:44] it will definitely need to be rebased [23:50:54] New patchset: Catrope; "Set $wgVisualEditorParsoidPrefix correctly" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18125 [23:51:01] * RoanKattouw rebases [23:51:05] Ryan_Lane: ircecho? ;-) [23:53:47] jeremyb: thanks. 
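For the record, the ownership question above is answerable from a shell on the lists server. A sketch, assuming Mailman 2.1 under Debian's default paths (both assumptions); the address in the last step is a placeholder:

    # addresses that wikimediake-owner@lists.wikimedia.org ultimately reaches
    /usr/lib/mailman/bin/list_owners wikimediake
    # full owner/moderator configuration for the list
    /usr/lib/mailman/bin/config_list -o - wikimediake | grep -E '^(owner|moderator)'
    # setting a new owner address (needs root on the lists host; this replaces the existing owner list)
    echo "owner = ['new.owner@example.org']" > /tmp/newowner.py
    /usr/lib/mailman/bin/config_list -i /tmp/newowner.py wikimediake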
I can find a new owner, yes, and in the meantime, I'm willing to admin it myself for a few days. [23:53:58] who has the power to do this? [23:55:49] abartov: only roots [23:56:51] abartov: you can file a ticket or i can for you