[00:04:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [00:10:30] New patchset: Asher; "pulling db11" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33498 [00:10:45] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33498 [00:11:25] !log asher synchronized wmf-config/db.php 'pulling db11' [00:11:33] Logged the message, Master [00:12:06] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay seconds [00:13:09] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay seconds [00:19:45] PROBLEM - MySQL Replication Heartbeat on db11 is CRITICAL: CRIT replication delay 24217 seconds [00:20:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:20:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:20:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:35:14] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [00:35:20] Logged the message, Master [00:35:36] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [00:35:42] Logged the message, Master [00:37:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:45:06] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [00:47:44] Hey binasher: For the next step of my cleanup on dewikivoyage, I've got 14k UPDATE statements for the revision table.. [00:47:54] They all look like UPDATE `revision` SET `rev_user_text`='MediaWiki default', `rev_user`=0 WHERE rev_id IN ( 1, 4, 5, 6, 8, 61, 136, 160, 214, 237, 564, 806, 900, 1095, 1096, 1097, 1110, 1115, 7832, 8901, 8924, 9016, 9067 ); [00:48:14] Is that something I can run on db34? 
Or would I need to update each slave individually? [00:49:03] go ahead and run that on db34, one statement at a time (one txn per statement) [00:49:43] Cool. If I just do `sql dewikivoyage < updates.sql` and it's just the update statments, it defaults to txn per statment, right? [00:50:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.089 seconds [00:50:23] csteipp: yup, it defaults to autocommit mode [00:53:03] New patchset: Dzahn; "move stuff (racktables & RT) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:54:23] New patchset: Dzahn; "move stuff (racktables & RT) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:56:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33509 [00:57:06] PROBLEM - MySQL Slave Delay on db11 is CRITICAL: CRIT replication delay 26442 seconds [00:58:41] New patchset: Faidon; "swift: use WSGIContext properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [01:00:13] AaronSchulz: I added you as a reviewer in all of the above, have a look when you have some time [01:00:31] AaronSchulz: they're now live on the depooled ms-fe1 [01:03:49] New patchset: Dzahn; "move stuff (misc::download::*) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33511 [01:07:44] New patchset: Asher; "adding db66 to s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33513 [01:08:24] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33513 [01:17:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33511 [01:18:01] New patchset: Faidon; "secure.wikimedia.org: add no-escape to redirects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:18:18] Change merged: Faidon; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:21:16] New patchset: Dzahn; "move stuff (IRC-related) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33517 [01:22:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33517 [01:24:01] New patchset: Dzahn; "move stuff (Etherpad) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:24:30] New patchset: Dzahn; "move stuff (Etherpad) out of misc-servers.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:24:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33518 [01:25:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:21] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 181 seconds [01:31:00] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [01:36:42] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [01:39:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 245 seconds [01:39:33] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 251 seconds [01:40:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.439 seconds [01:42:51] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:44:52] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [01:44:58] Logged the message, Master [01:45:12] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [01:45:14] New patchset: Cmjohnson; "removing storage3 from netboot.cfg and will manually set up partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33524 
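Note on the 00:47 to 00:50 exchange above: running the UPDATE batch "one txn per statement" relies on autocommit, where each statement commits on its own and no transaction spans the whole file. The sketch below only illustrates that behaviour. It uses Python's standard-library sqlite3 module as a stand-in for MySQL (an assumption, chosen so the example runs anywhere), a cut-down `revision` table with just the columns named in the log, made-up row values, and a hypothetical `run_batch` helper.

```python
import sqlite3


def run_batch(conn, script):
    """Execute each ';'-terminated statement separately.

    Naive split: assumes no ';' appears inside a string literal.
    Returns the number of statements executed.
    """
    count = 0
    for stmt in script.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        conn.execute(stmt)
        # In autocommit mode no transaction is left open between statements.
        assert not conn.in_transaction
        count += 1
    return count


# isolation_level=None puts the sqlite3 connection in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_user INTEGER, rev_user_text TEXT)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, 7, "a"), (4, 7, "b"), (5, 9, "c")],
)

updates = """
UPDATE revision SET rev_user_text='MediaWiki default', rev_user=0 WHERE rev_id IN (1, 4);
UPDATE revision SET rev_user_text='MediaWiki default', rev_user=0 WHERE rev_id IN (5);
"""
executed = run_batch(conn, updates)
changed = conn.execute(
    "SELECT COUNT(*) FROM revision WHERE rev_user = 0"
).fetchone()[0]
```

If one statement in the batch failed, the statements before it would already be committed, which is the trade-off accepted when each UPDATE is its own transaction.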
[01:45:18] Logged the message, Master [01:46:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:49:22] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor 'Updating VE to master' [01:49:28] Logged the message, Master [01:49:40] !log catrope synchronized php-1.21wmf4/extensions/VisualEditor 'Updating VE to master' [01:49:46] Logged the message, Master [01:50:54] New review: MZMcBride; "Sweet! Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33516 [01:57:29] New patchset: Reedy; "Move keys to under mediawiki docroot, rather than secure" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33525 [02:01:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 307 seconds [02:01:04] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 307 seconds [02:04:19] New patchset: Reedy; "Kill off old secure stuff now un-needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33526 [02:05:31] New patchset: Reedy; "Kill creation of secure related symlinks" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33527 [02:11:26] New review: Faidon; "May we should keep robots.txt, just in case we're missing a redirect." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/33526 [02:12:19] New review: Reedy; "Probably a good idea" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33526 [02:13:07] Reedy, where are the keys going to be hosted now? [02:15:04] New patchset: Reedy; "Kill off old secure stuff now un-needed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33526 [02:15:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:59] Krenair: https://www.mediawiki.org/keys/keys.html probably [02:19:07] hence adding them to the mediawiki docroot ;) [02:19:36] Ok. 
Didn't even know that existed until now [02:20:09] lol [02:20:18] it's linked on every tarball release email [02:22:07] Reedy, 1.20 had https://secure.wikimedia.org/keys.html [02:22:15] Yup [02:22:19] Which is where it is currently [02:22:33] Probably best to redirect that as well [02:22:49] Indeed [02:23:26] I moved it to there after paravoid asked where to redirect it to ;) [02:23:32] hehe [02:28:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.489 seconds [02:30:26] jeremyb: Hey, remember that Etherpad Lite packaging project a while back? You said we needed to package it for debian and puppetize it, right? Do you happen to know the reasons for doing those? [02:32:37] I mean, puppetizing is pretty straightforward, but is the debian packaging necessary? [02:34:15] !log LocalisationUpdate completed (1.21wmf4) at Thu Nov 15 02:34:15 UTC 2012 [02:34:22] Logged the message, Master [02:36:19] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=42133 [02:36:47] just saw that [02:36:51] marktraceur: I think packaging is for ease of deployment somewhat [02:37:12] AaronSchulz: around? [02:38:30] Reedy: So if it would be mostly easier to pull from git, then checkout a tag, would that be OK? [02:39:09] You'd have to ask ops [02:39:28] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [02:40:03] * marktraceur is here for that reason [02:40:16] I guess it might be unlikely that I get an answer tonight :) [02:40:25] probably not hte best time [02:40:25] minnie_: We may have to wait until tomorrow, Pacific time [02:40:38] The europeans are still about and the yanks are AWOL ;) [02:40:52] * marktraceur is here! [02:41:04] spagewmf is here! I think. [02:41:32] hello, what up? 
[02:41:48] marktraceur that's alright, I'll show up tomorrow on time then, thanks :) [02:41:57] marktraceur: Though, with other ops people adding functionality to pull things from git in puppet [02:41:58] * Reedy shrugs [02:42:00] spagewmf: We're representing the US here :) [02:42:09] New patchset: Faidon; "swift: use WSGIContext properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33510 [02:42:09] Reedy: Well, that bodes well. [02:42:19] Yeah [02:42:37] The main thing is it not having to be installed manually... [02:42:38] Blame Canada! Blame Canada! With all that hockey hullaballoo And that b***h Ann Murray too! It's not even a real country anyway. [02:42:46] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.008 seconds [02:42:55] minnie_: You don't need to show up, I can ask for you, but it's an unfortunate side effect of having many, many US employees :) [02:42:55] hey, I have a double citizenship w/ them [02:43:04] I'm Canadianian, home of the best national anthem evar [02:43:14] spagewmf: sgardner doesn't hang out here, does she? :P [02:43:19] Jasper_Deng_busy: dual* [02:43:34] marktraceur: there's other canadians, not just her [02:43:49] I'm sure Sue knows the lyrics of the only Academy Award-nominated national anthem [02:44:08] also, sgardner's in #wikimedia right now [02:44:38] Heh, she's also sitting ~10 feet from me [02:44:49] ohh [02:45:08] marktraceur: i see :) [02:45:15] marktraceur: not my call on what's needed to get it in prod. i personally would prefer it be packaged but it's not entirely b+w [02:45:30] Nice, we have secure.wm.o redirects now! 
[02:54:48] !log LocalisationUpdate completed (1.21wmf3) at Thu Nov 15 02:54:48 UTC 2012 [02:54:54] Logged the message, Master [03:14:07] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 243 seconds [03:14:16] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 254 seconds [03:14:34] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 271 seconds [03:14:44] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 281 seconds [03:14:52] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 291 seconds [03:15:19] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 315 seconds [03:15:19] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 316 seconds [03:17:34] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [03:17:52] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [03:18:01] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [03:18:37] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:18:37] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [03:19:13] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [03:19:58] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:48:10] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:50:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [04:33:01] PROBLEM - mysqld processes on db66 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:23:40] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.331 second response time [05:28:37] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 
MediaWiki exception [05:46:10] New patchset: Tim Starling; "Disable ULS toolbar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [06:19:55] RECOVERY - MySQL Replication Heartbeat on db11 is OK: OK replication delay 0 seconds [06:20:42] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.331 second response time [06:21:00] RECOVERY - MySQL Slave Delay on db11 is OK: OK replication delay 0 seconds [06:25:39] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [06:40:04] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [06:40:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:50:22] PROBLEM - Squid on brewster is CRITICAL: Connection refused [07:08:31] RECOVERY - Squid on brewster is OK: TCP OK - 0.002 second response time on port 8080 [08:06:01] New patchset: Hashar; "zuul: setup.py requires python-setuptools package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [08:06:21] New review: Hashar; "Minor typo in first line summary" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33475 [08:18:51] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [08:51:24] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:00:24] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:07:27] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [09:14:06] !log aaron synchronized php-1.21wmf4/maintenance/nextJobDB.php 'deployed 2e0c24df43588bd4ceba5522bde1b3f06fbd05b0' [09:14:12] Logged the message, Master [09:24:58] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [09:26:37] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:05:09] New patchset: 
J; "Enable transcoding on all wikis that allow uploads" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33542 [10:21:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:21:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:21:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:47:38] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'syncing UploadWizard fix for Safari, part 1' [10:47:45] Logged the message, Master [10:48:22] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'syncing UploadWizard fix for Safari, part 2' [10:48:29] Logged the message, Master [11:03:56] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'rolling back my fix for Safari, part 2' [11:04:03] Logged the message, Master [11:04:35] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'rolling back my fix for Safari, part 1' [11:04:41] Logged the message, Master [11:14:37] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'refixing for Safari, part 1' [11:14:43] Logged the message, Master [11:15:08] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'refixing for Safari, part 2' [11:15:14] Logged the message, Master [11:17:50] New patchset: J; "increase number of concurrent transcoding jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33554 [11:26:27] !log kaldari synchronized php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'refixing for Safari, part 1' [11:26:33] Logged the message, Master [11:26:50] !log kaldari synchronized 
php-1.21wmf4/extensions/UploadWizard/resources/mw.UploadWizardUploadInterface.js 'refixing for Safari, part 2' [11:26:56] Logged the message, Master [11:28:43] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.332 second response time [11:33:40] PROBLEM - Apache HTTP on mw1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception [11:38:01] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [11:39:40] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [11:42:58] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [12:13:22] New patchset: Mark Bergsma; "Disable the session leak Varnish restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33555 [12:14:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33555 [13:45:25] New patchset: Pyoungmeister; "removing es2 from db.php for conversion to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33562 [13:47:30] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33562 [13:48:41] !log py synchronized wmf-config/db.php 'pulling es2 for conversion to inno' [13:48:47] Logged the message, Master [13:51:51] notpeter: can you merge this when you get a chance please https://gerrit.wikimedia.org/r/33524 [13:52:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33524 [13:53:03] cmjohnson1: done :) [13:53:19] thx [13:53:39] but of course [13:54:57] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay seconds [13:58:33] PROBLEM - mysqld processes on es2 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [14:02:26] that's ok, btw [14:03:25] New patchset: ArielGlenn; "script we might use with amazon, rsyncs dumps to space-limited partitions" [operations/dumps] (ariel) - 
https://gerrit.wikimedia.org/r/33566 [14:32:56] !log removing srv200-srv213 from pybal for upgrades to precise (and this time I mean it) [14:33:02] Logged the message, notpeter [14:43:49] New patchset: Pyoungmeister; "removing mw60 and mw61 from bits backends for reimage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33569 [14:44:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33569 [14:54:12] PROBLEM - Host srv201 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:12] PROBLEM - Host srv205 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:21] PROBLEM - Host srv200 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:30] PROBLEM - Host srv207 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:39] PROBLEM - Host srv204 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:06] PROBLEM - Host srv203 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:27] PROBLEM - SSH on srv202 is CRITICAL: Connection refused [14:59:54] PROBLEM - Apache HTTP on srv202 is CRITICAL: Connection refused [14:59:55] PROBLEM - Memcached on srv202 is CRITICAL: Connection refused [14:59:55] RECOVERY - Host srv201 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [14:59:55] RECOVERY - Host srv205 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [15:00:03] RECOVERY - Host srv200 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [15:00:12] RECOVERY - Host srv207 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:00:21] RECOVERY - Host srv204 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [15:00:57] RECOVERY - Host srv203 is UP: PING OK - Packet loss = 0%, RTA = 5.94 ms [15:01:01] !log reedy synchronized php-1.21wmf4/extensions/SiteMatrix/ [15:01:07] Logged the message, Master [15:03:30] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:30] PROBLEM - Memcached on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:39] PROBLEM - Memcached on srv201 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [15:03:48] PROBLEM - Apache HTTP on srv201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:48] PROBLEM - Apache HTTP on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:57] PROBLEM - Memcached on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:06] PROBLEM - SSH on srv207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:15] PROBLEM - SSH on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:04:24] RECOVERY - SSH on srv202 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:04:24] PROBLEM - Apache HTTP on srv204 is CRITICAL: Connection refused [15:04:33] PROBLEM - Apache HTTP on srv203 is CRITICAL: Connection refused [15:04:33] PROBLEM - SSH on srv203 is CRITICAL: Connection refused [15:04:33] PROBLEM - Apache HTTP on srv200 is CRITICAL: Connection refused [15:04:51] PROBLEM - Memcached on srv204 is CRITICAL: Connection refused [15:05:09] PROBLEM - Memcached on srv207 is CRITICAL: Connection refused [15:05:36] RECOVERY - SSH on srv207 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:05:45] RECOVERY - SSH on srv204 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:05:54] PROBLEM - Memcached on srv203 is CRITICAL: Connection refused [15:06:30] New patchset: Faidon; "Unmount /mnt/thumbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33202 [15:07:19] New patchset: Faidon; "reprepro: use the Ceph repository as an upstream" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:07:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33352 [15:08:18] PROBLEM - Host srv209 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:18] PROBLEM - Host srv213 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:18] PROBLEM - Host srv211 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:30] RECOVERY - SSH on srv203 is OK: SSH OK - 
OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:10:15] PROBLEM - Memcached on srv210 is CRITICAL: Connection refused [15:10:33] PROBLEM - SSH on srv210 is CRITICAL: Connection refused [15:12:21] PROBLEM - Apache HTTP on srv208 is CRITICAL: Connection refused [15:12:39] PROBLEM - Apache HTTP on srv210 is CRITICAL: Connection refused [15:12:48] PROBLEM - SSH on srv212 is CRITICAL: Connection refused [15:12:57] PROBLEM - Memcached on srv208 is CRITICAL: Connection refused [15:13:06] PROBLEM - Apache HTTP on srv212 is CRITICAL: Connection refused [15:13:07] PROBLEM - Memcached on srv212 is CRITICAL: Connection refused [15:13:15] PROBLEM - SSH on srv208 is CRITICAL: Connection refused [15:14:09] RECOVERY - Host srv209 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [15:14:09] RECOVERY - Host srv213 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [15:14:09] RECOVERY - Host srv211 is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [15:17:45] RECOVERY - SSH on srv212 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:17:45] PROBLEM - SSH on srv213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:54] PROBLEM - Memcached on srv211 is CRITICAL: Connection refused [15:18:12] RECOVERY - SSH on srv208 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:18:21] PROBLEM - Memcached on srv209 is CRITICAL: Connection refused [15:18:21] PROBLEM - Memcached on srv213 is CRITICAL: Connection refused [15:18:21] PROBLEM - Apache HTTP on srv213 is CRITICAL: Connection refused [15:18:30] PROBLEM - Apache HTTP on srv211 is CRITICAL: Connection refused [15:18:30] PROBLEM - Apache HTTP on srv209 is CRITICAL: Connection refused [15:19:06] RECOVERY - SSH on srv210 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:19:15] RECOVERY - SSH on srv213 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:19:51] PROBLEM - NTP on srv202 is CRITICAL: NTP CRITICAL: No response from NTP server [15:23:45] PROBLEM - NTP on srv207 is CRITICAL: 
NTP CRITICAL: No response from NTP server [15:23:54] PROBLEM - NTP on srv201 is CRITICAL: NTP CRITICAL: No response from NTP server [15:23:54] PROBLEM - NTP on srv205 is CRITICAL: NTP CRITICAL: No response from NTP server [15:24:48] PROBLEM - NTP on srv200 is CRITICAL: NTP CRITICAL: No response from NTP server [15:24:48] PROBLEM - NTP on srv204 is CRITICAL: NTP CRITICAL: No response from NTP server [15:25:24] PROBLEM - NTP on srv203 is CRITICAL: NTP CRITICAL: No response from NTP server [15:30:12] PROBLEM - NTP on srv210 is CRITICAL: NTP CRITICAL: No response from NTP server [15:33:07] PROBLEM - NTP on srv208 is CRITICAL: NTP CRITICAL: No response from NTP server [15:33:07] PROBLEM - NTP on srv212 is CRITICAL: NTP CRITICAL: No response from NTP server [15:36:54] apergos: hmm [15:37:13] 1042122 jobs in total across the wiki (from the databases) [15:37:22] New patchset: Pyoungmeister; "setting srv200-213 to use applicationserver modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33581 [15:38:03] PROBLEM - NTP on srv209 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:03] PROBLEM - NTP on srv213 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:30] PROBLEM - NTP on srv211 is CRITICAL: NTP CRITICAL: No response from NTP server [15:38:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33581 [15:42:25] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [15:47:12] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [15:48:15] PROBLEM - SSH on mw60 is CRITICAL: Connection refused [15:48:51] PROBLEM - Apache HTTP on mw61 is CRITICAL: Connection refused [15:49:09] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [15:49:27] PROBLEM - SSH on mw61 is CRITICAL: Connection refused [15:50:39] RECOVERY - Apache HTTP on srv201 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds 
[15:51:24] RECOVERY - NTP on srv200 is OK: NTP OK: Offset 0.03678154945 secs [15:55:36] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [15:57:42] RECOVERY - SSH on mw61 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:58:00] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:58:54] RECOVERY - Apache HTTP on srv202 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [15:59:03] RECOVERY - NTP on srv201 is OK: NTP OK: Offset -0.09195172787 secs [16:03:52] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [16:04:09] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [16:05:30] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:08:57] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [16:11:03] RECOVERY - NTP on srv207 is OK: NTP OK: Offset -0.03520560265 secs [16:11:57] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [16:19:45] RECOVERY - NTP on srv208 is OK: NTP OK: Offset -0.04207217693 secs [16:20:57] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.014 seconds [16:22:09] PROBLEM - NTP on mw61 is CRITICAL: NTP CRITICAL: Offset unknown [16:22:36] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: Offset unknown [16:23:03] RECOVERY - NTP on srv202 is OK: NTP OK: Offset -0.02346765995 secs [16:27:33] RECOVERY - NTP on srv209 is OK: NTP OK: Offset -0.04837656021 secs [16:30:06] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:30:15] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [16:31:00] RECOVERY - NTP on mw60 is OK: NTP OK: Offset -0.0104367733 secs [16:31:54] RECOVERY - NTP on mw61 is OK: NTP OK: Offset -0.00554561615 secs 
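Note on the alert lines above: the service alerts in this log share one shape, `[HH:MM:SS] PROBLEM|RECOVERY - <service> on <host> is <state>: <detail>`, so a PROBLEM can be paired with the next RECOVERY for the same service and host to estimate how long it stayed critical. The following is a rough sketch under that assumption. The regular expression and the `outage_seconds` helper are illustrative, not part of any monitoring tool, and host-level lines such as "Host srv250 is DOWN" are deliberately not handled.

```python
import re
from datetime import datetime

# Matches service alerts only; "Host X is DOWN/UP" lines have no " on " part.
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<service>.+?) on (?P<host>\S+) is (?P<state>\w+):"
)


def outage_seconds(lines):
    """Return {(service, host): seconds} for each PROBLEM later followed by a RECOVERY.

    Timestamps carry no date, so all lines are assumed to fall on the same day.
    """
    opened = {}
    durations = {}
    for line in lines:
        m = ALERT_RE.match(line)
        if not m:
            continue
        key = (m["service"], m["host"])
        when = datetime.strptime(m["time"], "%H:%M:%S")
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM if the alert repeats before recovering.
            opened.setdefault(key, when)
        elif key in opened:
            durations[key] = int((when - opened.pop(key)).total_seconds())
    return durations


# Three lines copied from the log above.
sample = [
    "[15:24:48] PROBLEM - NTP on srv200 is CRITICAL: NTP CRITICAL: No response from NTP server",
    "[15:42:25] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds",
    "[15:51:24] RECOVERY - NTP on srv200 is OK: NTP OK: Offset 0.03678154945 secs",
]
result = outage_seconds(sample)
```

In the sample, the Apache recovery is ignored because its PROBLEM line is not included, and the NTP alert on srv200 comes out at 26 minutes 36 seconds.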
[16:35:57] RECOVERY - NTP on srv210 is OK: NTP OK: Offset -0.02769041061 secs
[16:36:43] New patchset: Pyoungmeister; "Revert "removing mw60 and mw61 from bits backends for reimage"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33584
[16:37:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33584
[16:38:12] RECOVERY - Apache HTTP on srv205 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds
[16:38:30] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds
[16:40:45] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[16:40:45] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:42:05] notpeter: srv249/250 may have the drac's crossed and I need to take one down to look at the setup on console. which would you prefer..i know 249 is a bit server?
[16:43:00] hhhhmmmm, I'm about to reimage srv249
[16:43:08] can you wait about 30 minutes?
[16:43:18] RECOVERY - NTP on srv203 is OK: NTP OK: Offset -0.05600476265 secs
[16:43:41] (taking bits servers out of rotation takes a while....)
[16:43:50] (or just do whatever you want on srv250
[16:44:03] cmjohnson1: did you have a chance to look at ms3?
[16:44:21] RECOVERY - NTP on srv211 is OK: NTP OK: Offset -0.04437398911 secs
[16:44:31] oh, yeah, I merged that
[16:44:46] paravoid: ms3 has a known bad disk or 2...but I don't know which ones...I opened the cover hoping an amber light would appear but no such luck
[16:45:33] notpeter: okay..i will need it for about 5-10 mins...to check cfg
[16:46:15] paravoid rt2073
[16:46:32] cmjohnson1: ok, srv250 is our of rotation, so do as you please with it
[16:46:40] let me know if you also need to do stuff to srv249
[16:46:41] cool
[16:46:46] i will
[16:46:47] thx
[16:47:08] !log shutting down srv250 to verify drac config
[16:47:10] cmjohnson1: ha!
[16:47:15] Logged the message, Master
[16:47:28] !log running 'reprepro update' to fetch ceph packages
[16:47:35] Logged the message, Master
[16:50:43] PROBLEM - Host srv250 is DOWN: PING CRITICAL - Packet loss = 100%
[16:52:22] RECOVERY - NTP on srv204 is OK: NTP OK: Offset -0.04629826546 secs
[16:52:53] cmjohnson1: we can figure out which disk it is
[16:53:52] RECOVERY - NTP on srv212 is OK: NTP OK: Offset -0.03157567978 secs
[16:54:24] mark: via lom?
[16:56:54] no
[16:56:59] via the linux device names
[16:57:13] they have names like /dev/dsk/c2t1d4
[16:57:37] and when we have that, we can map it to the drive in the chassis
[16:58:10] okay..yes the wikitech page has the layout
[16:59:10] notpeter: you may want to know this...in the dhcpd files srv249 has srv250's mac....so when you shutdown srv250 it is actually shutting down the physical server known as srv249
[16:59:29] the os for srv249 is on the physical server srv250
[17:01:07] mark: how do i determine the bad drive though?
[17:01:19] we need to boot up the linux install that was on it
[17:01:23] if that's still possible
[17:01:32] paravoid re-installed
[17:01:41] wonder if i can use an ubuntu recovery disk
[17:01:43] ?
[17:01:45] no
[17:01:47] that won't work
[17:01:53] solaris?
[17:02:03] running puppet might work
[17:02:09] it sets up udev for it
[17:02:25] cmjohnson1: ok, I'll switch the two macs and it should be correct, yes?
[17:02:29] to get links in /dev/dsk/by-cntrl/
[17:03:16] notpeter: correct me if I am wrong but wouldn't you need to do re-image both to correct it?
[17:03:38] it has been this way for a long time may not be worth it...just worth notign [17:04:29] seq=$(echo $1 | cut -d':' -f 1) [17:04:29] controller=$(($seq / 8)) [17:04:30] disk=$((seq % 8)) [17:04:36] this suggests they're simply in order [17:05:14] cmjohnson1: ah, yes, this will be a part of the reimaging of all boxes that I'm doing :) [17:05:25] s/boxes/apaches [17:05:26] would be helpful to have the serial nr of the failed disk I guess [17:05:34] bbl [17:05:47] mark: ok...will look at it more [17:06:19] notpeter: cool...i have the ticket ..will assign to you to close out once you are don [17:06:19] done [17:06:21] thx [17:06:53] ok [17:09:37] RECOVERY - Host srv250 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:12:38] !log removing srv248, srv249, and srv250 from various pools for upgrade to precise [17:12:44] Logged the message, notpeter [17:14:07] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [17:14:41] New patchset: Pyoungmeister; "removing srv248 and srv249 from bits backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33586 [17:15:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33586 [17:21:54] New patchset: Pyoungmeister; "swapping macs of srv249 and srv250" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33587 [17:37:17] robh: looking at wtp1 and the degraided array and pretty sure that the raid cfg is wrong...take a look http://p.defau.lt/?mpns_A_VS0D0uzBWZ0eIew [17:37:22] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.108 second response time [17:38:04] that's one drive missing yes [17:38:07] I am pretty sure there is supposed to be an sda1 and sdb1 [17:38:08] sdb [17:38:33] New review: Aaron Schulz; "Faidon, this was tested on ms-fe1 right?" 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23392 [17:38:36] yep...cool...thx mark [17:38:54] AaronSchulz: yes, the whole patch series. [17:39:16] AaronSchulz: and is still running the patched rewrite.py and is depooled, in case you want to do more tests [17:40:22] !log wtp1 raid1 config is wrong; bringing server down to fix [17:40:29] Logged the message, Master [17:43:40] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [17:46:06] New review: Reedy; "https://gerrit.wikimedia.org/r/#/c/33525/" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33526 [17:53:35] New patchset: Ottomata; "Setting up analytics1011 as a Ganglia aggregator." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33590 [17:53:58] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33590 [17:54:58] New patchset: Ottomata; "Setting up analytics1011.eqiad.wmnet as a data source for ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33591 [17:55:09] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33591 [17:58:05] paravoid apergos: ms-be7 is onsite now [17:58:21] what's that? [17:58:23] a new 720? [17:58:28] yep [17:59:26] excellent [17:59:44] are we going to get them one by one now? :) [17:59:51] :-D [18:00:13] apergos: did you increase the weight btw? [18:00:16] ms-be6's [18:00:40] no [18:00:46] I left it alone [18:01:13] are we able to get stats on e.g. object hit rate as a function of age? [18:01:21] no [18:01:31] there's df though. [18:02:08] i meant GETs. 
idk where df fits in [18:02:08] paravoid: no we are supposed to get them all at once...i expect the next shipment to be the remaining 10 [18:02:14] a weight 100 disk has ~600G [18:02:44] hence a full weight 33 disk has ~200G, which is about what ms-be6 has now [18:03:08] what weights should I put on them (I'll do it tomorrow monring I guess, I'm done for the day now) [18:03:09] ? [18:03:16] I don't think it'd be a problem even if we went with 100 right away [18:03:18] do we want to jut 66% them all? [18:03:20] oh. [18:03:29] I can do that (100) [18:03:55] nah I'll for 66 and we'll go with 100 when we add ms-be7 [18:03:59] I'll just do that now. [18:04:11] ok [18:05:58] it's a cp, a single command and an scp [18:06:01] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [18:06:04] no reason to waste more time [18:06:13] so to clarify, I would redo the weights on ms-be6 to 100 and set ms-be7 weights to 100 when ms-be7 goes in? [18:06:39] well, ms-be6 to 100 and ms-be7 to 33 or 50 I'd say [18:06:54] ok [18:07:58] careful where you prepare the rings [18:08:08] uh huh [18:08:17] not on ms-fe*, not on ms-be6 or 7 [18:08:33] I have ms-be11 staked out but I would have checked the version anyways [18:08:42] no idea what you've touched or not [18:09:32] I put everything in SAL, but yeah doesn't hurt to double check [18:10:13] PROBLEM - SSH on wtp1 is CRITICAL: Connection refused [18:20:07] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [18:20:43] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:11] Right, so, there was a question last night about whether pulling software from a git repository would be a good idea on a production server. Thoughts? [18:36:38] Context: Etherpad Lite, which has a "stable" branch and a "develop" branch, we would only be using the former. 
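[editor's note] The weight arithmetic from the ms-be6 discussion above ("a weight 100 disk has ~600G, hence a full weight 33 disk has ~200G") can be sketched as a simple proportion. This is only an illustration under stated assumptions: the function name and constants are hypothetical, the ~600G figure is taken from the log, and a strictly linear relation between weight and stored data is assumed. How Swift actually places data depends on the weights of all devices in the ring, so treat the numbers as rough estimates.

```python
# Hypothetical helper, not part of Swift: estimates how much data a disk
# is expected to hold at a given ring weight, assuming fill scales
# linearly with weight and that weight 100 corresponds to ~600 GB.
FULL_WEIGHT = 100
FULL_WEIGHT_FILL_GB = 600  # "~600G" as quoted in the log


def expected_fill_gb(weight, full_fill_gb=FULL_WEIGHT_FILL_GB):
    """Return the approximate fill in GB for a disk at the given weight."""
    if weight < 0:
        raise ValueError("weight must be non-negative")
    return full_fill_gb * weight / FULL_WEIGHT


if __name__ == "__main__":
    # Weights mentioned in the conversation: 33, 50, 66 and 100.
    for w in (33, 50, 66, 100):
        print(w, expected_fill_gb(w))
```

Under these assumptions weight 33 comes out near the ~200G quoted for ms-be6, which is consistent with the reasoning that raising its weight would not be a problem.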
[18:42:01] marktraceur: well at any rate definitely not pull from github [18:42:24] 15 02:30:26 < marktraceur> jeremyb: Hey, remember that Etherpad Lite packaging project a while back? You said we needed to package it for debian and puppetize it, right? Do you happen to know the reasons for doing those? [18:42:32] 15 02:45:14 < jeremyb> marktraceur: not my call on what's needed to get it in prod. i personally would prefer it be packaged but it's not entirely b+w [18:43:08] jeremyb: What would be preferable? WMF hosting? If we were willing to create the repo that would work for me. [18:44:19] marktraceur: that's one possibility. could have a copy on gerrit that mirrored from github [18:45:06] marktraceur: but i'm assuming that it's not going to get regular deploys right? i.e. should average less than once a month? [18:45:48] marktraceur: in which case debian package is probably the best route [18:47:02] New patchset: Dereckson; "(bug 42155) Site name for ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33597 [18:48:37] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [18:48:57] jeremyb: Except that a debian package could become quickly out of date, with the rate of development on the project. Also, it would be a complicated package to create, and there are too many config decisions (in my opinion) to do it reliably. [18:49:17] marktraceur: i'm confused [18:49:28] marktraceur: 1) the config doesn't have to be part of the package [18:49:41] (or elaborate on the decisions?) [18:50:04] jeremyb: Well, I was thinking about reverse proxies, but yeah, I can see that wouldn't be part of the packag. [18:50:07] ?e [18:50:09] marktraceur: 2) out of date should be no problem. once you have a package it's not so hard to make an updated one [18:50:32] jeremyb: I guess I could test that theory with the existing attempt I made before....OK, good point. 
[18:50:44] LeslieCarr, I've got other things I need to work on atm, buuuut, I am curious about your phone call yesterday (re ganglia multicast) [18:50:59] marktraceur: check out git build package [18:52:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33587 [18:52:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [18:52:51] marktraceur: http://packages.debian.org/git-buildpackage [18:52:56] ottomata: it was fail [18:53:02] didn't get to talk to any engineers [18:53:03] sigh [18:53:45] jeremyb: OK, I'm convinced we should do the package. [18:53:59] jeremyb: When I see the person online again, I'll throw them at it :) [18:54:04] doohhhh fail [18:54:23] ok, well, fyi, not sure if this will work, but I set up a ganglia aggregator on analytics1011 today….yet to see any results though [18:54:42] multicast in the rack seems to work fine, so I had hoped that the dells could talk to an11 that way [18:54:49] and that nickel would grab stats from both aggregators [18:54:51] but meh [18:55:15] anyways, i need to fix fundraising firewall [18:55:36] sbernardin: you can view our enviro info as well as power on each powerstrip by logging into the strip. [18:55:42] example https://ps1-b3-sdtpa.mgmt.pmtpa.wmnet [18:55:49] marktraceur: ok... i wouldn't mind having some other ops type people (or actual ops for that matter) chime in [18:55:52] use the mgmt password [18:55:58] *nod* [18:56:12] jeremyb: Feel free to ping some of them, I'm not hip to the ops jive [18:56:25] marktraceur: well they're here! [18:56:39] jeremyb: And yet, no chiming as yet :) [18:56:47] right [18:58:58] !log upgrading srv{190,219,220,221,222,223,224}.pmtpa.wmnet for libtiff vulnerabilities (USN-1631-1) [18:59:06] Logged the message, Master [18:59:18] New review: Amire80; "The new string is correct." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/33597 [19:01:13] PROBLEM - Memcached on srv248 is CRITICAL: Connection refused [19:01:13] PROBLEM - Apache HTTP on srv248 is CRITICAL: Connection refused [19:01:58] PROBLEM - Memcached on srv249 is CRITICAL: Connection refused [19:01:58] PROBLEM - Memcached on srv250 is CRITICAL: Connection refused [19:01:58] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [19:02:26] PROBLEM - SSH on srv248 is CRITICAL: Connection refused [19:02:26] PROBLEM - SSH on srv250 is CRITICAL: Connection refused [19:02:34] PROBLEM - SSH on srv249 is CRITICAL: Connection refused [19:02:52] PROBLEM - Apache HTTP on srv249 is CRITICAL: Connection refused [19:02:52] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [19:05:58] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31556 [19:07:31] RECOVERY - SSH on srv250 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:07:31] RECOVERY - SSH on srv248 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:07:31] RECOVERY - SSH on srv249 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:08:52] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [19:09:42] New patchset: Reedy; "Fix space" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33598 [19:10:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33598 [19:10:36] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33597 [19:13:13] PROBLEM - NTP on wtp1 is CRITICAL: NTP CRITICAL: No response from NTP server [19:15:38] New patchset: Pyoungmeister; "new new role classes for srv248-250" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33600 [19:16:29] Change merged: Pyoungmeister; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/33600 [19:19:17] !log reedy synchronized wmf-config/InitialiseSettings.php [19:19:22] RECOVERY - Apache HTTP on srv248 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.005 seconds [19:19:23] Logged the message, Master [19:21:55] PROBLEM - NTP on srv250 is CRITICAL: NTP CRITICAL: No response from NTP server [19:22:31] PROBLEM - NTP on srv248 is CRITICAL: NTP CRITICAL: Offset unknown [19:22:58] PROBLEM - NTP on srv249 is CRITICAL: NTP CRITICAL: No response from NTP server [19:23:23] !log removing srv251-srv257 for upgrade to precise [19:23:29] Logged the message, notpeter [19:27:28] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [19:29:34] RECOVERY - mysqld processes on db66 is OK: PROCS OK: 1 process with command name mysqld [19:29:43] PROBLEM - check_apache2 on payments3 is CRITICAL: PROCS CRITICAL: 351 processes with command name apache2 [19:30:10] PROBLEM - check_apache2 on payments3 is CRITICAL: PROCS CRITICAL: 312 processes with command name apache2 [19:30:37] PROBLEM - check_apache2 on payments2 is CRITICAL: PROCS CRITICAL: 351 processes with command name apache2 [19:30:37] PROBLEM - check_apache2 on payments2 is CRITICAL: PROCS CRITICAL: 351 processes with command name apache2 [19:31:40] PROBLEM - check_apache2 on payments1 is CRITICAL: PROCS CRITICAL: 351 processes with command name apache2 [19:34:31] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 59068 seconds [19:35:07] RECOVERY - check_apache2 on payments3 is OK: PROCS OK: 15 processes with command name apache2 [19:35:08] RECOVERY - check_apache2 on payments2 is OK: PROCS OK: 64 processes with command name apache2 [19:35:08] RECOVERY - check_apache2 on payments1 is OK: PROCS OK: 132 processes with command name apache2 [19:37:31] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [19:37:45] !log repooling srv200-srv213 [19:37:50] Logged the message, 
notpeter [19:40:13] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 59099 seconds [19:44:16] RECOVERY - NTP on srv248 is OK: NTP OK: Offset -0.03458809853 secs [19:44:35] PROBLEM - Host srv251 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:35] PROBLEM - Host srv257 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:28] RECOVERY - NTP on srv250 is OK: NTP OK: Offset -0.04923892021 secs [19:45:28] PROBLEM - Host srv256 is DOWN: PING CRITICAL - Packet loss = 100% [19:47:01] PROBLEM - SSH on srv252 is CRITICAL: Connection refused [19:47:01] PROBLEM - SSH on srv253 is CRITICAL: Connection refused [19:47:01] PROBLEM - SSH on srv254 is CRITICAL: Connection refused [19:47:01] PROBLEM - SSH on srv255 is CRITICAL: Connection refused [19:47:37] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [19:47:37] PROBLEM - Apache HTTP on srv255 is CRITICAL: Connection refused [19:47:37] PROBLEM - Apache HTTP on srv253 is CRITICAL: Connection refused [19:47:55] PROBLEM - Memcached on srv253 is CRITICAL: Connection refused [19:47:55] PROBLEM - Memcached on srv255 is CRITICAL: Connection refused [19:47:55] PROBLEM - Memcached on srv254 is CRITICAL: Connection refused [19:47:55] PROBLEM - Memcached on srv252 is CRITICAL: Connection refused [19:48:31] PROBLEM - NTP on srv254 is CRITICAL: NTP CRITICAL: No response from NTP server [19:48:31] PROBLEM - NTP on srv252 is CRITICAL: NTP CRITICAL: No response from NTP server [19:48:31] PROBLEM - NTP on srv253 is CRITICAL: NTP CRITICAL: No response from NTP server [19:49:43] PROBLEM - Apache HTTP on srv252 is CRITICAL: Connection refused [19:50:43] preilly - lesliecarr can help with Vumi diagnostic when u are ready [19:50:55] RECOVERY - Host srv251 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [19:51:04] RECOVERY - Host srv256 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [19:51:04] RECOVERY - Host srv257 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [19:51:58] RECOVERY - NTP on srv249 is OK: NTP 
OK: Offset -0.03746378422 secs [19:53:19] preilly: ok, what's up ? [19:53:29] LeslieCarr: I think Asher has it [19:53:35] ok cool [19:54:32] PROBLEM - Memcached on srv256 is CRITICAL: Connection refused [19:54:32] PROBLEM - Memcached on srv257 is CRITICAL: Connection refused [19:54:32] PROBLEM - Apache HTTP on srv249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:40] PROBLEM - Memcached on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:49] PROBLEM - Apache HTTP on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:13] New patchset: Pyoungmeister; "repooling srv248 and srv249, depooling srv191 and srv192" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33603 [19:55:16] RECOVERY - SSH on srv255 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:55:16] RECOVERY - SSH on srv252 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:55:16] RECOVERY - SSH on srv254 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:55:16] RECOVERY - SSH on srv253 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:55:25] PROBLEM - SSH on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:25] PROBLEM - SSH on srv256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:52] PROBLEM - Apache HTTP on srv251 is CRITICAL: Connection refused [19:55:52] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:56:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33603 [19:56:55] RECOVERY - SSH on srv257 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:56:55] RECOVERY - SSH on srv256 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:59:56] cmjohnson1: all done with ms-be7 [20:00:07] great! 
thx [20:02:16] New patchset: preilly; "fix redis package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33606 [20:02:27] binasher: ^^ [20:04:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33606 [20:05:53] !log reedy synchronized php-1.21wmf4/ [20:05:55] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [20:05:59] Logged the message, Master [20:07:25] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [20:08:28] PROBLEM - NTP on srv255 is CRITICAL: NTP CRITICAL: No response from NTP server [20:09:19] !log reedy synchronized php-1.21wmf3/ [20:09:26] Logged the message, Master [20:11:10] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [20:12:32] sbernardin: can you reboot wtp1 and enter the raid bios for me it is ctrl-r at the prompt during post [20:12:58] cmjohnson1: ok [20:13:11] let me know when you are there [20:14:19] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.044 seconds [20:15:04] PROBLEM - NTP on srv256 is CRITICAL: NTP CRITICAL: No response from NTP server [20:15:13] PROBLEM - NTP on srv251 is CRITICAL: NTP CRITICAL: Offset unknown [20:15:13] PROBLEM - NTP on srv257 is CRITICAL: NTP CRITICAL: No response from NTP server [20:16:07] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [20:16:59] New patchset: Pyoungmeister; "making srv248 and 249 the bits appserver ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33609 [20:17:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33609 [20:17:45] cmjohnson1: I'm there... [20:18:51] go to previous page [20:18:53] ctrl-p [20:19:57] cmjohnson1: are you in there? 
[20:20:03] yes [20:20:05] that is me [20:22:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:22:16] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:22:16] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:22:43] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [20:22:43] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [20:24:20] New patchset: Pyoungmeister; "setting srv250-srv257 to use applicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33610 [20:25:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33610 [20:25:51] New patchset: Jgreen; "adding mysql_myisam to fundraisingdb's b/c that gives us 1GB key_buffer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33611 [20:26:54] Jeff_Green: why does fundraising use myisam? 
[20:27:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33611 [20:27:20] civicrm (shudder) [20:29:46] RECOVERY - NTP on srv254 is OK: NTP OK: Offset -0.023863554 secs [20:30:33] New patchset: Pyoungmeister; "making srv190-srv192 regular apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33612 [20:34:33] !log reimaging srv190-srv192 as regular apaches [20:34:34] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [20:34:39] Logged the message, notpeter [20:35:35] New patchset: Jgreen; "second attempt to set key_buffer to 1G for fundraisingdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33614 [20:37:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33614 [20:38:19] PROBLEM - SSH on srv253 is CRITICAL: Connection refused [20:38:19] PROBLEM - SSH on srv255 is CRITICAL: Connection refused [20:38:19] PROBLEM - Apache HTTP on srv250 is CRITICAL: Connection refused [20:38:19] PROBLEM - SSH on srv251 is CRITICAL: Connection refused [20:38:46] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [20:38:46] PROBLEM - Apache HTTP on srv251 is CRITICAL: Connection refused [20:39:04] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [20:39:04] PROBLEM - Apache HTTP on srv252 is CRITICAL: Connection refused [20:39:04] PROBLEM - Apache HTTP on srv255 is CRITICAL: Connection refused [20:39:13] PROBLEM - SSH on srv256 is CRITICAL: Connection refused [20:39:22] PROBLEM - Apache HTTP on srv253 is CRITICAL: Connection refused [20:39:48] AHGHGHGHG. I HATE PUPPET. 
[20:39:49] PROBLEM - SSH on srv250 is CRITICAL: Connection refused [20:39:58] PROBLEM - SSH on srv257 is CRITICAL: Connection refused [20:39:58] PROBLEM - SSH on srv254 is CRITICAL: Connection refused [20:40:07] PROBLEM - SSH on srv252 is CRITICAL: Connection refused [20:42:13] New patchset: Jgreen; "third trie to set key_buffer = 1GB for fundraisingdb, blarhg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33630 [20:42:13] PROBLEM - Host srv192 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:13] PROBLEM - Host srv191 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:46] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33630 [20:42:47] !log removing srv225-srv230 for upgrade to precise [20:42:53] Logged the message, notpeter [20:43:16] RECOVERY - SSH on srv251 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:43:16] RECOVERY - SSH on srv255 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:43:16] RECOVERY - SSH on srv257 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:43:16] RECOVERY - SSH on srv254 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:43:16] RECOVERY - SSH on srv253 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:43:34] RECOVERY - SSH on srv252 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:44:10] RECOVERY - SSH on srv256 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:44:46] RECOVERY - SSH on srv250 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:45:00] hooray, 3rd try is the charm [20:45:58] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [20:46:32] !log adjusted key_buffer_size for fundraisingdbs from 1M to 1G [20:46:38] Logged the message, Master [20:47:19] PROBLEM - Apache HTTP on srv190 is CRITICAL: Connection refused [20:47:39] New patchset: Pyoungmeister; "setting srv 225-srv230 to use applicationserver role classes" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/33650 [20:47:55] RECOVERY - Host srv192 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [20:47:55] RECOVERY - Host srv191 is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms [20:47:55] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [20:48:04] PROBLEM - SSH on srv190 is CRITICAL: Connection refused [20:48:50] New patchset: Faidon; "swift: remove unreferenced code/variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33651 [20:48:50] New patchset: Faidon; "swift: add CORS support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33652 [20:51:40] PROBLEM - Memcached on srv192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:49] PROBLEM - SSH on srv191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:58] PROBLEM - Memcached on srv191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:16] PROBLEM - Apache HTTP on srv192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:34] PROBLEM - Apache HTTP on srv191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:43] PROBLEM - SSH on srv192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:01] exception 'DBConnectionError' with message 'DB connection error: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost)' in /usr/local/apache/common-local/php-1.21wmf3/includes/db/Database.php:797 < bits for css stuff via the loader causing issues on pl. 
[20:53:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33612 [20:53:11] PROBLEM - Host srv228 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33650 [20:53:37] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:54:13] RECOVERY - SSH on srv192 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:54:24] binasher: see Damianz ^^ ? [20:54:49] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [20:55:25] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [20:55:50] (you have now! in #-tech) [20:56:28] PROBLEM - SSH on srv227 is CRITICAL: Connection refused [20:56:28] PROBLEM - Memcached on srv227 is CRITICAL: Connection refused [20:56:37] PROBLEM - Apache HTTP on srv229 is CRITICAL: Connection refused [20:56:46] PROBLEM - SSH on srv229 is CRITICAL: Connection refused [20:57:04] PROBLEM - Memcached on srv229 is CRITICAL: Connection refused [20:57:22] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [20:57:22] PROBLEM - Apache HTTP on srv226 is CRITICAL: Connection refused [20:57:22] PROBLEM - Apache HTTP on srv227 is CRITICAL: Connection refused [20:57:31] PROBLEM - Memcached on srv225 is CRITICAL: Connection refused [20:57:40] PROBLEM - SSH on srv226 is CRITICAL: Connection refused [20:57:49] PROBLEM - Apache HTTP on srv230 is CRITICAL: Connection refused [20:58:16] PROBLEM - SSH on srv225 is CRITICAL: Connection refused [20:58:16] PROBLEM - SSH on srv230 is CRITICAL: Connection refused [20:58:16] PROBLEM - NTP on srv250 is CRITICAL: NTP CRITICAL: Offset unknown [20:58:16] PROBLEM - Memcached on srv226 is CRITICAL: Connection refused [20:58:43] PROBLEM - Apache HTTP on srv225 is CRITICAL: Connection refused [20:58:52] PROBLEM - Memcached on srv230 is CRITICAL: 
Connection refused [20:58:52] RECOVERY - Host srv228 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms [20:59:13] Damianz: what url? [20:59:28] PROBLEM - NTP on srv254 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:37] http://bits.wikimedia.org/pl.wikipedia.org/load.php?debug=false&lang=pl&modules=site&only=styles&skin=vector&* [20:59:40] seems to be working again now [20:59:52] pretty much every css on the pl homepage was throwing a db error back [21:00:19] pl wikipedia? [21:00:40] New review: Faidon; "Looks good, will merge it soon." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/29768 [21:01:16] mhm [21:01:43] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [21:01:57] New patchset: Faidon; "swift: remove support for container sync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33653 [21:02:19] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [21:02:19] PROBLEM - Apache HTTP on srv228 is CRITICAL: Connection refused [21:02:19] PROBLEM - Memcached on srv228 is CRITICAL: Connection refused [21:02:37] RECOVERY - SSH on srv226 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:02:46] RECOVERY - NTP on srv254 is OK: NTP OK: Offset 0.06521999836 secs [21:02:55] RECOVERY - SSH on srv227 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:04] RECOVERY - SSH on srv225 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:04] RECOVERY - SSH on srv230 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:14] RECOVERY - SSH on srv191 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:14] RECOVERY - SSH on srv229 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:03:31] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.005 seconds [21:03:58] RECOVERY - Apache HTTP on srv255 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [21:09:07] 
New patchset: preilly; "add redis ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33656 [21:09:13] binasher: ^^ [21:11:10] RECOVERY - NTP on srv255 is OK: NTP OK: Offset 0.0590723753 secs [21:11:10] PROBLEM - NTP on srv191 is CRITICAL: NTP CRITICAL: No response from NTP server [21:11:10] PROBLEM - NTP on srv192 is CRITICAL: NTP CRITICAL: No response from NTP server [21:11:19] RECOVERY - NTP on srv251 is OK: NTP OK: Offset -0.09725081921 secs [21:11:55] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [21:12:04] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [21:12:40] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK HTTP/1.1 200 OK - 451 bytes in 0.005 seconds [21:13:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33656 [21:13:34] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds [21:13:52] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.011 seconds [21:17:19] PROBLEM - NTP on srv225 is CRITICAL: NTP CRITICAL: Offset unknown [21:17:21] no css on it.wiki [21:17:37] PROBLEM - NTP on srv227 is CRITICAL: NTP CRITICAL: No response from NTP server [21:17:37] PROBLEM - NTP on srv230 is CRITICAL: NTP CRITICAL: No response from NTP server [21:17:46] PROBLEM - NTP on srv226 is CRITICAL: NTP CRITICAL: No response from NTP server [21:17:46] PROBLEM - NTP on srv229 is CRITICAL: NTP CRITICAL: No response from NTP server [21:18:06] indeed [21:18:14] exception 'DBConnectionError' with message 'DB connection error: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost)' in /usr/local/apache/common-local/php-1.21wmf3/includes/db/Database.php:797 [21:18:21] that's what the CSS says [21:18:29] it looks ok to me..... 
[21:18:31] RECOVERY - NTP on srv250 is OK: NTP OK: Offset -0.03769361973 secs [21:18:31] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:34] it.wikipedia.org ? [21:18:45] yeah [21:18:47] cached probably [21:19:10] * Damianz blames bits in general [21:19:27] X-Cache: sq68 miss (0), cp3020 hit (3161) [21:19:52] RECOVERY - SSH on srv190 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:20:19] RECOVERY - Apache HTTP on srv192 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [21:20:28] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [21:20:34] X-Cache: sq70 miss (0), cp3021 hit (10552) [21:20:35] both URLs [21:20:37] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds [21:20:45] New patchset: Jgreen; "adding awight to restricted for bastion use" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33657 [21:20:47] notpeter:yep [21:20:55] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.005 seconds [21:21:04] RECOVERY - NTP on srv252 is OK: NTP OK: Offset -0.03627490997 secs [21:21:07] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33657 [21:21:14] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [21:21:15] paravoid: is there any info on which bits apache served that? [21:21:44] nothing should be attempting to talk to mysql on localhost, obvs [21:21:54] X-Powered-By: PHP/5.3.10-1ubuntu3.4+wmf1 [21:21:58] nothing, but it's a precise box [21:22:05] they're all precise now [21:22:18] maybe one has a bad deploy [21:22:41] I'll just sync-common on them by hand [21:22:43] PROBLEM - NTP on srv228 is CRITICAL: NTP CRITICAL: Offset unknown [21:23:04] what should we purge? [21:23:25] bits.wikimedia.org/it.wikipedia.org/.* ? :) [21:23:28] all the things! 
[21:24:43] 23:24 < wizardist> getting this in be_x_oldwiki, but I can see the same symptoms at least in enwiki and plwiki, so it doesn't seem a project-related problem to me [21:24:51] today of all days [21:25:03] At least it's not a monday [21:25:15] it's the fundraising trial [21:25:35] notpeter: running sync-common? [21:26:03] yep [21:26:05] Ah - just blame the fundrasing team :D [21:26:24] That explains the horrid blue thing on the page *opens ghostscripts* [21:27:13] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [21:27:20] ok, all bits apaches just got a sync-common [21:27:45] binasher: any ideas on what to purge? [21:28:25] srv249 is still throwing mad errors [21:28:34] RECOVERY - Apache HTTP on srv227 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [21:28:39] can depool [21:28:52] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [21:29:19] RECOVERY - NTP on srv226 is OK: NTP OK: Offset -0.03455495834 secs [21:29:26] whytf is MW returning a 200 with a page full of exceptions [21:29:32] going to graceful on 249 [21:29:42] just kill apache [21:30:05] !log ran chmod 755 /usr/local/apache/common-local/wmf-config on srv249 [21:30:12] Logged the message, Master [21:30:19] binasher: ideas on what to purge? [21:30:25] it was 0700 owned by mwdeploy and undreadable by apache [21:30:29] paravoid: no [21:30:40] RECOVERY - Apache HTTP on srv249 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [21:30:53] I can fix the it.wp bits [21:30:57] but who knows what else is broken [21:31:01] paravoid: it seems conceivable that anything can be bad for any wiki [21:31:44] I think the fact that exceptions were output with 200s might be a bits/resourceloader thing though, couldn't it? [21:32:22] doesn't it trap exceptions and add them to the page as comments? 
[21:32:29] wtf, /usr/local/apache/common-local/php-1.21wmf4/skins is 0700 on there too [21:32:32] AaronSchulz: yes [21:32:34] perhaps it could also send 500 if doesn't [21:32:35] !log stopping apache on srv249 [21:32:41] Logged the message, Master [21:32:46] PROBLEM - NTP on srv190 is CRITICAL: NTP CRITICAL: Offset unknown [21:32:47] *doesn't already [21:32:48] The problem is that will cache a blank style sheet :( [21:32:48] so now we have a CSS full of exceptions cached [21:33:08] at least two to be exact [21:33:28] binasher: could we purge all of bits? would we able to sustain that? [21:33:41] !log starting apache on srv249 [21:33:47] Logged the message, Master [21:34:05] what's the tll of most things in bits [21:34:10] bits caches fill up in like 5 seconds [21:34:17] notpeter: yeah that was my thought [21:34:28] and have a 90something hit rate [21:34:33] if oyu spaced them out by like 30, it would probably be fine [21:34:36] 99.something [21:34:44] yeah, that should be fine [21:35:37] RECOVERY - NTP on srv225 is OK: NTP OK: Offset -0.03458023071 secs [21:35:52] i wonder why the deploy directory perms on srv249 were screwed after being reimaged [21:35:55] RECOVERY - NTP on srv228 is OK: NTP OK: Offset -0.04363107681 secs [21:35:58] hm [21:36:04] so, the URLs that were breaking before are now OK [21:36:04] RECOVERY - NTP on srv191 is OK: NTP OK: Offset -0.02308547497 secs [21:36:04] RECOVERY - NTP on srv190 is OK: NTP OK: Offset 0.07767355442 secs [21:36:07] and I didn't purge anything [21:36:16] so ttl is probably set to something short [21:36:22] RECOVERY - NTP on srv256 is OK: NTP OK: Offset -0.03583824635 secs [21:36:37] (yes, I checked if it was the same cache that served that) [21:36:40] mediawiki generally doesn't spew errors in the return html, but that's based on things it the app config. i guess this is its behavior when it can't load config [21:36:47] binasher: this is my fault. 
I've found that apaches sometimes need a graceful before being put into prod. and I think I forgot on srv248 and srv249 :( [21:36:47] Cache-Control: public, max-age=300, s-maxage=300 [21:37:06] notpeter: the graceful had nothing to do with it [21:37:13] alright [21:37:31] an apachectl graceful doesn't set directory permissions [21:37:32] okay, no need to do anything [21:37:37] ah, ok [21:37:42] that's really weird.... [21:38:11] should I open a bugzilla against resourceloader so that CSS with exceptions get a ttl 0 or 500? [21:38:29] btw, the apache syslog file pointed out srv249 right away [21:38:55] okay, I will and we'll see how that goes [21:39:04] paravoid: maybe better that exceptions != 200 [21:39:24] well, in this case a 500 would be sensible [21:39:30] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [21:39:41] New review: Krinkle; "To be removed from secure in I64638e3edf6c33b3cc227d4fbb1ab8cda88db06e" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/33525 [21:39:45] paravoid: oh yeah, totally.. i misread what you said [21:39:46] but I guess the whole put exceptions in comments idea was to catch some less severe exceptions? 
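[editor's note] The root cause identified above was a deploy directory left at mode 0700, owned by mwdeploy, so the apache user could not read it. A minimal sketch of how such directories could be spotted before a host is pooled, assuming POSIX `find`; the helper name `find_unreadable_dirs` is hypothetical and is not a tool mentioned in the log:

```shell
#!/bin/sh
# Hypothetical helper (not from the log): print every directory under a
# deploy root that users other than the owner/group cannot read and traverse.
find_unreadable_dirs() {
    root="$1"
    # "-perm -005" matches when both o+r and o+x are set; "!" negates it,
    # so this lists directories missing either bit (for example mode 0700).
    find "$root" -type d ! -perm -005 -print
}

# Example call, using the path discussed in the log:
#   find_unreadable_dirs /usr/local/apache/common-local
```

This only inspects the "other" permission bits, which is enough for the case in the log (files owned by mwdeploy, read by apache); it would not catch access granted or denied through group membership or ACLs.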
[21:40:36] Imo it would make more sense to throw them to a central log and not output at all to the client [21:40:44] indeed [21:41:13] paravoid: i think it is a mediawiki config option on whether to return exceptions in the request or not [21:41:17] and it's off in our config [21:41:25] but in this case there was no config [21:41:25] but that config wasn't readable [21:41:26] hrm [21:43:33] RECOVERY - NTP on srv257 is OK: NTP OK: Offset -0.05375528336 secs [21:44:45] RECOVERY - NTP on srv192 is OK: NTP OK: Offset -0.02823865414 secs [21:44:45] RECOVERY - NTP on srv229 is OK: NTP OK: Offset -0.02194833755 secs [21:44:54] RECOVERY - NTP on srv253 is OK: NTP OK: Offset -0.03266561031 secs [21:45:40] lunch time [21:51:12] https://bugzilla.wikimedia.org/show_bug.cgi?id=42166 fwiw [21:52:51] RECOVERY - NTP on srv227 is OK: NTP OK: Offset -0.04273641109 secs [21:52:51] RECOVERY - NTP on srv230 is OK: NTP OK: Offset -0.03337621689 secs [21:55:09] paravoid: Have you had chance to add some redirects for the keys files? [21:58:31] did you move it already? didn't see that [21:59:34] i saw them live before gerrit merge [22:00:40] still says review in progress [22:00:48] !g I51da1d588146219e068f632c65a64781895f62fd [22:00:48] https://gerrit.wikimedia.org/r/#q,I51da1d588146219e068f632c65a64781895f62fd,n,z [22:01:04] http://www.mediawiki.org/keys/keys.html [22:01:19] maybe http://www.mediawiki.org/keys/ should even have dir listings enabled [22:01:58] Reedy: ^^^ [22:02:05] it's live but the commit isn't merged? [22:02:23] yeah, I put the files there for easyness [22:06:15] !log repooling srv190-srv192, srv225-srv230 [22:06:21] Logged the message, notpeter [22:07:49] New patchset: Faidon; "secure.wikimedia.org: keys.html to mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33658 [22:08:04] I'll tidy up fenari in a little bit when that's deployed! 
:) [22:08:07] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33658 [22:08:16] done [22:08:21] what's fenari got to do with that? [22:08:38] !log repooling srv25-srv257 [22:08:44] Logged the message, notpeter [22:08:46] mediawiki-config working copy [22:08:48] fenari is where they were deployed first before teh apaches? [22:09:05] they=keys [22:09:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33525 [22:09:30] PROBLEM - Apache HTTP on srv252 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [22:09:47] wha? [22:10:02] Not sure if keys-2009-08-02.txt is still actually needed [22:10:45] https://gerrit.wikimedia.org/r/#/c/33526/ is for the rest of the secure tidyup... [22:17:05] class swift::cleaner { [22:17:09] paravoid: more stuff to kill later :) [22:17:21] yes! :) [22:17:25] seen all of my commits? [22:18:20] cmjohnson1: drives have been swapped out of wtp1 [22:22:47] New review: Aaron Schulz; "Are we done with secure yet?" [operations/mediawiki-multiversion] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/33527 [22:25:31] AaronSchulz: almost, yes [22:25:42] The apache rules are in place and working [22:25:46] keys have now been moved.. [22:28:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33526 [22:29:52] !log Ran sync-docroot [22:29:58] Logged the message, Master [22:36:23] !log reedy synchronized php-1.21wmf4/extensions/WikimediaIncubator/ [22:36:30] Logged the message, Master [22:41:20] !log reedy synchronized php-1.21wmf4/extensions/ReaderFeedback/ [22:41:26] Logged the message, Master [22:46:40] Reedy: so why is it ok to remove those symlinks? [22:47:08] AaronSchulz because they aren't being used anymore [22:47:25] secure.wm.o redirects to https://... [22:52:24] AaronSchulz: did you see my earlier reply about rewrite.py being deployed on ms-fe1? 
[22:52:34] I think so [22:52:39] AaronSchulz: all of the changes in gerrit are and it's depooled, feel free to play with it if you'd like [22:53:46] Change merged: Aaron Schulz; [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/33527 [23:01:42] paravoid: where does REMOTE_USER come from exactly? [23:02:01] https://github.com/gholt/swauth/blob/master/swauth/middleware.py [23:02:04] nvm [23:02:10] yeah [23:02:16] that's why it's behind swauth in the pipeline [23:02:29] yeah I noticed it's pipeline position [23:03:06] *its [23:08:05] Simetrica? [23:13:52] * AaronSchulz can now poke tim about https://gerrit.wikimedia.org/r/#/c/33411/ [23:16:02] New patchset: Reedy; "Restore small.dblist from history..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33669 [23:16:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33669 [23:18:54] AaronSchulz: thanks for the reviews! [23:19:14] * AaronSchulz has 2 hours left to do stuff now :p [23:22:41] Ok, so I know why the special page update scripts aren't running [23:22:57] the cron log is full of "Could not open input file: MWScript.php" [23:23:20] but if I run them manually they work fine.. [23:24:08] what log file? [23:25:28] /home/wikipedia/log/norotate/updateSpecialPages.php [23:25:49] /home/wikipedia/log/norotate/updateSpecialPages.log [23:25:50] even [23:29:22] are we on hume? [23:29:49] I am [23:31:55] stealing the whole command from cron gives a different error [23:31:55] reedy@hume:~$ sudo -u apache flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /home/wikipedia/logs/norotate/updateSpecialPages.log [23:31:55] -bash: /home/wikipedia/logs/norotate/updateSpecialPages.log: Permission denied [23:35:01] reedy@hume:~$ sudo -u apache /usr/local/bin/update-special-pages [23:35:01] aawiki [23:35:01] Statistics completed in 0.01s [23:39:07] paravoid: does swift ever give errors with a 200 response? 
[23:39:28] I don't think so [23:39:29] why? [23:39:43] I'm just thinking about https://bugzilla.wikimedia.org/show_bug.cgi?id=42047 [23:40:08] this would be for authenticated requests, which rewrite doesn't touch, so that seems unlikely [23:40:11] Reedy: what you ran and pasted above is writing to the log as reedy not as apache [23:40:18] I tested this, I was telling you that yesterday [23:40:29] that's a 500 [23:40:34] oh [23:40:35] feck [23:40:42] Reedy: you could do | tee /home/wikipedia/logs/norotate/updateSpecialPages.log >/dev/null [23:40:50] err [23:40:56] no, better yet [23:40:59] wikidev doesn't have w on those files, apache does [23:41:09] All that aside [23:41:18] AaronSchulz: also, I presume you're asking about the backtrace, which is irrelevant to the rest of the bug report [23:41:21] Which aren't the errors at fault [23:41:24] Running it as apache works fine [23:41:39] just for some reason it doesn't under cron [23:49:03] New patchset: Tim Starling; "Disable ULS toolbar for anons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534
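[editor's note] The "Permission denied" seen above happens because in `sudo -u apache cmd > file` the redirection is opened by the calling shell (as reedy) before sudo runs, so only the command itself runs as apache. A minimal sketch of the `tee` approach suggested in the channel, assuming sudo is configured for the target user; `run_logged` is a hypothetical helper name, and the empty-user branch exists only so the function can be tried without sudo:

```shell
#!/bin/sh
# Hypothetical helper (not from the log).
# Usage: run_logged USER LOGFILE CMD [ARGS...]
# Runs CMD as USER and writes its stdout to LOGFILE, with the log file
# also opened as USER (via tee) instead of by the calling shell.
run_logged() {
    user="$1"
    logfile="$2"
    shift 2
    if [ -n "$user" ]; then
        sudo -u "$user" "$@" | sudo -u "$user" tee "$logfile" >/dev/null
    else
        # No user given: run directly, same pipeline shape without sudo.
        "$@" | tee "$logfile" >/dev/null
    fi
}

# Example call, using the command and path pasted in the log:
#   run_logged apache /home/wikipedia/logs/norotate/updateSpecialPages.log \
#       flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages
```

Note that the function's exit status is that of `tee`, not of the command being logged. As pointed out in the channel, this only explains the error from the manual run; it does not explain why the script fails under cron.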