[00:04:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [00:10:30] New patchset: Asher; "pulling db11" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33498 [00:10:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33498 [00:11:25] !log asher synchronized wmf-config/db.php 'pulling db11' [00:11:33] Logged the message, Master [00:12:06] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay seconds [00:13:09] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay seconds [00:19:45] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 24217 seconds [00:20:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:20:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:20:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:35:14] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [00:35:20] Logged the message, Master [00:35:36] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [00:35:42] Logged the message, Master [00:37:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [00:47:44] Hey binasher: For the next step of my cleanup on dewikivoyage, I've got 14k UPDATE statements for the revision table.. [00:47:54] They all look like UPDATE `revision` SET `rev_user_text`='MediaWiki default', `rev_user`=0 WHERE rev_id IN ( 1, 4, 5, 6, 8, 61, 136, 160, 214, 237, 564, 806, 900, 1095, 1096, 1097, 1110, 1115, 7832, 8901, 8924, 9016, 9067 ); [00:48:14] Is that something I can run on db34? 
Or would I need to update each slave individually? [00:49:03] go ahead and run that on db34, one statement at a time (one txn per statement) [00:49:43] Cool. If I just do `sql dewikivoyage < updates.sql` and it's just the update statements, it defaults to txn per statement, right? [00:50:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.089 seconds [00:50:23] csteipp: yup, it defaults to autocommit mode [00:53:03] New patchset: Dzahn; "move stuff (racktables & RT) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:54:23] New patchset: Dzahn; "move stuff (racktables & RT) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:56:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:57:06] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 26442 seconds [00:58:41] New patchset: Faidon; "swift: use WSGIContext properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [01:00:13] AaronSchulz: I added you as a reviewer in all of the above, have a look when you have some time [01:00:31] AaronSchulz: they're now live on the depooled ms-fe1 [01:03:49] New patchset: Dzahn; "move stuff (misc::download::*) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33511 [01:07:44] New patchset: Asher; "adding db66 to s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33513 [01:08:24] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33513 [01:17:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33511 [01:18:01] New patchset: Faidon; "secure.wikimedia.org: add no-escape to redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:18:18] Change merged: Faidon; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:21:16] New patchset: Dzahn; "move stuff (IRC-related) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33517 [01:22:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33517 [01:24:01] New patchset: Dzahn; "move stuff (Etherpad) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:24:30] New patchset: Dzahn; "move stuff (Etherpad) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:24:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [01:31:00] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [01:36:42] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:39:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 245 seconds [01:39:33] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 251 seconds [01:40:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.439 seconds [01:42:51] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:44:52] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [01:44:58] Logged the message, Master [01:45:12] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [01:45:14] New patchset: Cmjohnson; "removing storage3 from netboot.cfg and will manually set up partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33524 
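Note on the 00:47 to 00:50 exchange about running the batched `UPDATE` statements in autocommit mode: the following is a minimal sketch of what "one txn per statement" means in practice. It is an illustration only. SQLite stands in for MySQL, the `revision` table is reduced to three columns, and `apply_batches` is a hypothetical helper; the actual run in the log was `sql dewikivoyage < updates.sql`.

```python
import sqlite3


def apply_batches(conn, batches):
    """Run one UPDATE per batch of rev_ids and return the total rows changed.

    With the connection in autocommit mode, every statement is committed on
    its own, so a failure in a later batch leaves earlier batches applied.
    """
    updated = 0
    for ids in batches:
        placeholders = ",".join("?" for _ in ids)
        cur = conn.execute(
            "UPDATE revision SET rev_user_text = 'MediaWiki default', rev_user = 0 "
            f"WHERE rev_id IN ({placeholders})",
            ids,
        )
        updated += cur.rowcount
    return updated


# isolation_level=None puts Python's sqlite3 driver in autocommit mode
# (an assumption-level stand-in for the MySQL client default in the log).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_user INTEGER, rev_user_text TEXT)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(i, 42, "someone") for i in (1, 4, 5, 6, 8)],
)

total = apply_batches(conn, [[1, 4], [5, 6, 8]])
```

Keeping each statement in its own transaction keeps the individual transactions small, which may be why it was requested here given the replication-delay alerts on the slaves; the log does not state the reason.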
[01:45:18] Logged the message, Master [01:46:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:49:22] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [01:49:28] Logged the message, Master [01:49:40] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [01:49:46] Logged the message, Master [01:50:54] New review: MZMcBride; "Sweet! Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:57:29] New patchset: Reedy; "Move keys to under mediawiki docroot, rather than secure" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33525 [02:01:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 307 seconds [02:01:04] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 307 seconds [02:04:19] New patchset: Reedy; "Kill off old secure stuff now un-needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33526 [02:05:31] New patchset: Reedy; "Kill creation of secure related symlinks" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33527 [02:11:26] New review: Faidon; "May we should keep robots.txt, just in case we're missing a redirect." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/33526 [02:12:19] New review: Reedy; "Probably a good idea" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33526 [02:13:07] Reedy, where are the keys going to be hosted now? [02:15:04] New patchset: Reedy; "Kill off old secure stuff now un-needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33526 [02:15:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:59] Krenair: https://www.mediawiki.org/keys/keys.html probably [02:19:07] hence adding them to the mediawiki docroot ;) [02:19:36] Ok. 
Didn't even know that existed until now [02:20:09] lol [02:20:18] it's linked on every tarball release email [02:22:07] Reedy, 1.20 had https://secure.wikimedia.org/keys.html [02:22:15] Yup [02:22:19] Which is where it is currently [02:22:33] Probably best to redirect that as well [02:22:49] Indeed [02:23:26] I moved it to there after paravoid asked where to redirect it to ;) [02:23:32] hehe [02:28:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.489 seconds [02:30:26] jeremyb: Hey, remember that Etherpad Lite packaging project a while back? You said we needed to package it for debian and puppetize it, right? Do you happen to know the reasons for doing those? [02:32:37] I mean, puppetizing is pretty straightforward, but is the debian packaging necessary? [02:34:15] !log LocalisationUpdate completed (1.21wmf4) at Thu Nov 15 02:34:15 UTC 2012 [02:34:22] Logged the message, Master [02:36:19] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=42133 [02:36:47] just saw that [02:36:51] marktraceur: I think packaging is for ease of deployment somewhat [02:37:12] AaronSchulz: around? [02:38:30] Reedy: So if it would be mostly easier to pull from git, then checkout a tag, would that be OK? [02:39:09] You'd have to ask ops [02:39:28] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [02:40:03] * marktraceur is here for that reason [02:40:16] I guess it might be unlikely that I get an answer tonight :) [02:40:25] probably not the best time [02:40:25] minnie_: We may have to wait until tomorrow, Pacific time [02:40:38] The europeans are still about and the yanks are AWOL ;) [02:40:52] * marktraceur is here! [02:41:04] spagewmf is here! I think. [02:41:32] hello, what up? 
[02:41:48] marktraceur that's alright, I'll show up tomorrow on time then, thanks :) [02:41:57] marktraceur: Though, with other ops people adding functionality to pull things from git in puppet [02:41:58] * Reedy shrugs [02:42:00] spagewmf: We're representing the US here :) [02:42:09] New patchset: Faidon; "swift: use WSGIContext properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [02:42:09] Reedy: Well, that bodes well. [02:42:19] Yeah [02:42:37] The main thing is it not having to be installed manually... [02:42:38] Blame Canada! Blame Canada! With all that hockey hullaballoo And that b***h Ann Murray too! It's not even a real country anyway. [02:42:46] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.008 seconds [02:42:55] minnie_: You don't need to show up, I can ask for you, but it's an unfortunate side effect of having many, many US employees :) [02:42:55] hey, I have a double citizenship w/ them [02:43:04] I'm Canadianian, home of the best national anthem evar [02:43:14] spagewmf: sgardner doesn't hang out here, does she? :P [02:43:19] Jasper_Deng_busy: dual* [02:43:34] marktraceur: there's other canadians, not just her [02:43:49] I'm sure Sue knows the lyrics of the only Academy Award-nominated national anthem [02:44:08] also, sgardner's in #wikimedia right now [02:44:38] Heh, she's also sitting ~10 feet from me [02:44:49] ohh [02:45:08] marktraceur: i see :) [02:45:15] marktraceur: not my call on what's needed to get it in prod. i personally would prefer it be packaged but it's not entirely b+w [02:45:30] Nice, we have secure.wm.o redirects now! 
[02:54:48] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 15 02:54:48 UTC 2012 [02:54:54] Logged the message, Master [03:14:07] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 243 seconds [03:14:16] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 254 seconds [03:14:34] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 271 seconds [03:14:44] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 281 seconds [03:14:52] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 291 seconds [03:15:19] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 315 seconds [03:15:19] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 316 seconds [03:17:34] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [03:17:52] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [03:18:01] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [03:18:37] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:18:37] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [03:19:13] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [03:19:58] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:48:10] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:50:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [04:33:01] PROBLEM - mysqld processes on db66 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:23:40] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.331 second response time [05:28:37] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 
MediaWiki exception [05:46:10] New patchset: Tim Starling; "Disable ULS toolbar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [06:19:55] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [06:20:42] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.331 second response time [06:21:00] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay 0 seconds [06:25:39] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [06:40:04] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [06:40:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:50:22] PROBLEM - Squid on brewster is CRITICAL: Connection refused [07:08:31] RECOVERY - Squid on brewster is OK: TCP OK - 0.002 second response time on port 8080 [08:06:01] New patchset: Hashar; "zuul: setup.py requires python-setuptools package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [08:06:21] New review: Hashar; "Minor typo in first line summary" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33475 [08:18:51] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [08:51:24] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:00:24] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:07:27] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [09:14:06] !log aaron synchronized php-1.21wmf4/maintenance/nextJobDB.php 'deployed 2e0c24df43588bd4ceba5522bde1b3f06fbd05b0' [09:14:12] Logged the message, Master [09:24:58] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [09:26:37] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:05:09] New patchset: 
J; "Enable transcoding on all wikis that allow uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33542 [10:21:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:21:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:21:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:47:38] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'syncing UploadWizard fix for Safari, part 1' [10:47:45] Logged the message, Master [10:48:22] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'syncing UploadWizard fix for Safari, part 2' [10:48:29] Logged the message, Master [11:03:56] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'rolling back my fix for Safari, part 2' [11:04:03] Logged the message, Master [11:04:35] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'rolling back my fix for Safari, part 1' [11:04:41] Logged the message, Master [11:14:37] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'refixing for Safari, part 1' [11:14:43] Logged the message, Master [11:15:08] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'refixing for Safari, part 2' [11:15:14] Logged the message, Master [11:17:50] New patchset: J; "increase number of concurrent transcoding jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33554 [11:26:27] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'refixing for Safari, part 1' [11:26:33] Logged the message, Master [11:26:50] !log kaldari synchronized 
php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'refixing for Safari, part 2' [11:26:56] Logged the message, Master [11:28:43] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.332 second response time [11:33:40] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [11:38:01] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:39:40] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [11:42:58] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [12:13:22] New patchset: Mark Bergsma; "Disable the session leak Varnish restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33555 [12:14:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33555 [13:45:25] New patchset: Pyoungmeister; "removing es2 from db.php for conversion to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33562 [13:47:30] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33562 [13:48:41] !log py synchronized wmf-config/db.php 'pulling es2 for conversion to inno' [13:48:47] Logged the message, Master [13:51:51] notpeter: can you merge this when you get a chance please https://gerrit.wikimedia.org/r/33524 [13:52:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33524 [13:53:03] cmjohnson1: done :) [13:53:19] thx [13:53:39] but of course [13:54:57] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay seconds [13:58:33] PROBLEM - mysqld processes on es2 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:02:26] that's ok, btw [14:03:25] New patchset: ArielGlenn; "script we might use with amazon, rsyncs dumps to space-limited partitions" [operations/dumps] (ariel) - 
https://gerrit.wikimedia.org/r/33566 [14:32:56] !log removing srv200-srv213 from pybal for upgrades to precise (and this time I mean it) [14:33:02] Logged the message, notpeter [14:43:49] New patchset: Pyoungmeister; "removing mw60 and mw61 from bits backends for reimage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33569 [14:44:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33569 [14:54:12] PROBLEM - Host srv201 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:12] PROBLEM - Host srv205 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:21] PROBLEM - Host srv200 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:30] PROBLEM - Host srv207 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:39] PROBLEM - Host srv204 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:06] PROBLEM - Host srv203 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:27] PROBLEM - SSH on srv202 is CRITICAL: Connection refused [14:59:54] PROBLEM - Apache HTTP on srv202 is CRITICAL: Connection refused [14:59:55] PROBLEM - Memcached on srv202 is CRITICAL: Connection refused [14:59:55] RECOVERY - Host srv201 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [14:59:55] RECOVERY - Host srv205 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [15:00:03] RECOVERY - Host srv200 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [15:00:12] RECOVERY - Host srv207 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:00:21] RECOVERY - Host srv204 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [15:00:57] RECOVERY - Host srv203 is UP: PING OK - Packet loss = 0%, RTA = 5.94 ms [15:01:01] !log reedy synchronized php-1.21wmf4/extensions/SiteMatrix/ [15:01:07] Logged the message, Master [15:03:30] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:30] PROBLEM - Memcached on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:39] PROBLEM - Memcached on srv201 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [15:03:48] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:48] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:57] PROBLEM - Memcached on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:06] PROBLEM - SSH on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:15] PROBLEM - SSH on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:24] RECOVERY - SSH on srv202 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:04:24] PROBLEM - Apache HTTP on srv204 is CRITICAL: Connection refused [15:04:33] PROBLEM - Apache HTTP on srv203 is CRITICAL: Connection refused [15:04:33] PROBLEM - SSH on srv203 is CRITICAL: Connection refused [15:04:33] PROBLEM - Apache HTTP on srv200 is CRITICAL: Connection refused [15:04:51] PROBLEM - Memcached on srv204 is CRITICAL: Connection refused [15:05:09] PROBLEM - Memcached on srv207 is CRITICAL: Connection refused [15:05:36] RECOVERY - SSH on srv207 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:05:45] RECOVERY - SSH on srv204 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:05:54] PROBLEM - Memcached on srv203 is CRITICAL: Connection refused [15:06:30] New patchset: Faidon; "Unmount /mnt/thumbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33202 [15:07:19] New patchset: Faidon; "reprepro: use the Ceph repository as an upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:07:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:08:18] PROBLEM - Host srv209 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:18] PROBLEM - Host srv213 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:18] PROBLEM - Host srv211 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:30] RECOVERY - SSH on srv203 is OK: SSH OK - 
OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:10:15] PROBLEM - Memcached on srv210 is CRITICAL: Connection refused [15:10:33] PROBLEM - SSH on srv210 is CRITICAL: Connection refused [15:12:21] PROBLEM - Apache HTTP on srv208 is CRITICAL: Connection refused [15:12:39] PROBLEM - Apache HTTP on srv210 is CRITICAL: Connection refused [15:12:48] PROBLEM - SSH on srv212 is CRITICAL: Connection refused [15:12:57] PROBLEM - Memcached on srv208 is CRITICAL: Connection refused [15:13:06] PROBLEM - Apache HTTP on srv212 is CRITICAL: Connection refused [15:13:07] PROBLEM - Memcached on srv212 is CRITICAL: Connection refused [15:13:15] PROBLEM - SSH on srv208 is CRITICAL: Connection refused [15:14:09] RECOVERY - Host srv209 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [15:14:09] RECOVERY - Host srv213 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [15:14:09] RECOVERY - Host srv211 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [15:17:45] RECOVERY - SSH on srv212 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:17:45] PROBLEM - SSH on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:54] PROBLEM - Memcached on srv211 is CRITICAL: Connection refused [15:18:12] RECOVERY - SSH on srv208 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:18:21] PROBLEM - Memcached on srv209 is CRITICAL: Connection refused [15:18:21] PROBLEM - Memcached on srv213 is CRITICAL: Connection refused [15:18:21] PROBLEM - Apache HTTP on srv213 is CRITICAL: Connection refused [15:18:30] PROBLEM - Apache HTTP on srv211 is CRITICAL: Connection refused [15:18:30] PROBLEM - Apache HTTP on srv209 is CRITICAL: Connection refused [15:19:06] RECOVERY - SSH on srv210 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:19:15] RECOVERY - SSH on srv213 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:19:51] PROBLEM - NTP on srv202 is CRITICAL: NTP CRITICAL: No response from NTP server [15:23:45] PROBLEM - NTP on srv207 is CRITICAL: 
NTP CRITICAL: No response from NTP server [15:23:54] PROBLEM - NTP on srv201 is CRITICAL: NTP CRITICAL: No response from NTP server [15:23:54] PROBLEM - NTP on srv205 is CRITICAL: NTP CRITICAL: No response from NTP server [15:24:48] PROBLEM - NTP on srv200 is CRITICAL: NTP CRITICAL: No response from NTP server [15:24:48] PROBLEM - NTP on srv204 is CRITICAL: NTP CRITICAL: No response from NTP server [15:25:24] PROBLEM - NTP on srv203 is CRITICAL: NTP CRITICAL: No response from NTP server [15:30:12] PROBLEM - NTP on srv210 is CRITICAL: NTP CRITICAL: No response from NTP server [15:33:07] PROBLEM - NTP on srv208 is CRITICAL: NTP CRITICAL: No response from NTP server [15:33:07] PROBLEM - NTP on srv212 is CRITICAL: NTP CRITICAL: No response from NTP server [15:36:54] apergos: hmm [15:37:13] 1042122 jobs in total across the wiki (from the databases) [15:37:22] New patchset: Pyoungmeister; "setting srv200-213 to use applicationserver modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33581 [15:38:03] PROBLEM - NTP on srv209 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:03] PROBLEM - NTP on srv213 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:30] PROBLEM - NTP on srv211 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33581 [15:42:25] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [15:47:12] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [15:48:15] PROBLEM - SSH on mw60 is CRITICAL: Connection refused [15:48:51] PROBLEM - Apache HTTP on mw61 is CRITICAL: Connection refused [15:49:09] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [15:49:27] PROBLEM - SSH on mw61 is CRITICAL: Connection refused [15:50:39] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds 
[15:51:24] RECOVERY - NTP on srv200 is OK: NTP OK: Offset 0.03678154945 secs [15:55:36] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [15:57:42] RECOVERY - SSH on mw61 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:58:00] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:58:54] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [15:59:03] RECOVERY - NTP on srv201 is OK: NTP OK: Offset -0.09195172787 secs [16:03:52] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [16:04:09] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [16:05:30] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:08:57] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [16:11:03] RECOVERY - NTP on srv207 is OK: NTP OK: Offset -0.03520560265 secs [16:11:57] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [16:19:45] RECOVERY - NTP on srv208 is OK: NTP OK: Offset -0.04207217693 secs [16:20:57] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.014 seconds [16:22:09] PROBLEM - NTP on mw61 is CRITICAL: NTP CRITICAL: Offset unknown [16:22:36] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: Offset unknown [16:23:03] RECOVERY - NTP on srv202 is OK: NTP OK: Offset -0.02346765995 secs [16:27:33] RECOVERY - NTP on srv209 is OK: NTP OK: Offset -0.04837656021 secs [16:30:06] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:30:15] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:31:00] RECOVERY - NTP on mw60 is OK: NTP OK: Offset -0.0104367733 secs [16:31:54] RECOVERY - NTP on mw61 is OK: NTP OK: Offset -0.00554561615 secs 
[16:35:57] RECOVERY - NTP on srv210 is OK: NTP OK: Offset -0.02769041061 secs [16:36:43] New patchset: Pyoungmeister; "Revert "removing mw60 and mw61 from bits backends for reimage"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33584 [16:37:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33584 [16:38:12] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [16:38:30] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [16:40:45] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [16:40:45] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:42:05] notpeter: srv249/250 may have the dracs crossed and I need to take one down to look at the setup on console. which would you prefer..i know 249 is a bits server? [16:43:00] hhhhmmmm, I'm about to reimage srv249 [16:43:08] can you wait about 30 minutes? [16:43:18] RECOVERY - NTP on srv203 is OK: NTP OK: Offset -0.05600476265 secs [16:43:41] (taking bits servers out of rotation takes a while....) [16:43:50] (or just do whatever you want on srv250) [16:44:03] cmjohnson1: did you have a chance to look at ms3? [16:44:21] RECOVERY - NTP on srv211 is OK: NTP OK: Offset -0.04437398911 secs [16:44:31] oh, yeah, I merged that [16:44:46] paravoid: ms3 has a known bad disk or 2...but I don't know which ones...I opened the cover hoping an amber light would appear but no such luck [16:45:33] notpeter: okay..i will need it for about 5-10 mins...to check cfg [16:46:15] paravoid rt2073 [16:46:32] cmjohnson1: ok, srv250 is out of rotation, so do as you please with it [16:46:40] let me know if you also need to do stuff to srv249 [16:46:41] cool [16:46:46] i will [16:46:47] thx [16:47:08] !log shutting down srv250 to verify drac config [16:47:10] cmjohnson1: ha! 
[16:47:15] Logged the message, Master [16:47:28] !log running 'reprepro update' to fetch ceph packages [16:47:35] Logged the message, Master [16:50:43] PROBLEM - Host srv250 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:22] RECOVERY - NTP on srv204 is OK: NTP OK: Offset -0.04629826546 secs [16:52:53] cmjohnson1: we can figure out which disk it is [16:53:52] RECOVERY - NTP on srv212 is OK: NTP OK: Offset -0.03157567978 secs [16:54:24] mark: via lom? [16:56:54] no [16:56:59] via the linux device names [16:57:13] they have names like /dev/dsk/c2t1d4 [16:57:37] and when we have that, we can map it to the drive in the chassis [16:58:10] okay..yes the wikitech page has the layout [16:59:10] notpeter: you may want to know this...in the dhcpd files srv249 has srv250's mac....so when you shutdown srv250 it is actually shutting down the physical server known as srv249 [16:59:29] the os for srv249 is on the physical server srv250 [17:01:07] mark: how do i determine the bad drive though? [17:01:19] we need to boot up the linux install that was on it [17:01:23] if that's still possible [17:01:32] paravoid re-installed [17:01:41] wonder if i can use an ubuntu recovery disk [17:01:43] ? [17:01:45] no [17:01:47] that won't work [17:01:53] solaris? [17:02:03] running puppet might work [17:02:09] it sets up udev for it [17:02:25] cmjohnson1: ok, I'll switch the two macs and it should be correct, yes? [17:02:29] to get links in /dev/dsk/by-cntrl/ [17:03:16] notpeter: correct me if I am wrong but wouldn't you need to do re-image both to correct it? 
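Note on the 16:57 remarks about device names like `/dev/dsk/c2t1d4`: the sketch below splits such a name into its numeric parts. It assumes the common `cXtYdZ` convention (controller, target, disk), which the log does not spell out, and `parse_device` and `DiskAddress` are hypothetical names. Translating the result into a physical slot still needs the chassis layout that the log says is on the wikitech page.

```python
import re
from typing import NamedTuple


class DiskAddress(NamedTuple):
    controller: int
    target: int
    disk: int


# Matches a trailing path component such as "c2t1d4".
_DEVICE_RE = re.compile(r"(?:^|/)c(\d+)t(\d+)d(\d+)$")


def parse_device(path: str) -> DiskAddress:
    """Split a cXtYdZ-style device name into controller, target and disk."""
    match = _DEVICE_RE.search(path)
    if match is None:
        raise ValueError(f"not a cXtYdZ device name: {path!r}")
    return DiskAddress(*(int(part) for part in match.groups()))


address = parse_device("/dev/dsk/c2t1d4")
```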
[17:03:38] it has been this way for a long time may not be worth it...just worth noting [17:04:29] seq=$(echo $1 | cut -d':' -f 1) [17:04:29] controller=$(($seq / 8)) [17:04:30] disk=$((seq % 8)) [17:04:36] this suggests they're simply in order [17:05:14] cmjohnson1: ah, yes, this will be a part of the reimaging of all boxes that I'm doing :) [17:05:25] s/boxes/apaches [17:05:26] would be helpful to have the serial nr of the failed disk I guess [17:05:34] bbl [17:05:47] mark: ok...will look at it more [17:06:19] notpeter: cool...i have the ticket ..will assign to you to close out once you are don [17:06:19] done [17:06:21] thx [17:06:53] ok [17:09:37] RECOVERY - Host srv250 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:12:38] !log removing srv248, srv249, and srv250 from various pools for upgrade to precise [17:12:44] Logged the message, notpeter [17:14:07] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [17:14:41] New patchset: Pyoungmeister; "removing srv248 and srv249 from bits backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33586 [17:15:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33586 [17:21:54] New patchset: Pyoungmeister; "swapping macs of srv249 and srv250" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33587 [17:37:17] robh: looking at wtp1 and the degraded array and pretty sure that the raid cfg is wrong...take a look http://p.defau.lt/?mpns_A_VS0D0uzBWZ0eIew [17:37:22] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.108 second response time [17:38:04] that's one drive missing yes [17:38:07] I am pretty sure there is supposed to be an sda1 and sdb1 [17:38:08] sdb [17:38:33]