[00:11:58] mutante: So, I see that Apache is upset about /srv/org/wikimedia/controller/wikis/w/index.php/ but I also see that that file is present… I don't know enough to know if that trailing / is expected or not. [00:12:27] andrewbogott: on virt0 ? [00:12:32] yep [00:13:13] The files on virt0 all look as they should be to me. [00:13:31] ok, thats a bit better, document root /var/www looked empty [00:13:46] these are Aliases..ok [00:14:19] yes, index.php is a file, the trailing / seems weird, i see you already removed it ,right [00:14:31] Wait, I'm not aware of having removed it... [00:14:33] ? [00:15:22] Alias /wiki /srv/org/wikimedia/controller/wikis/w/index.php [00:15:41] Sure, that part is right. But apache error log is saying "File does not exist: /srv/org/wikimedia/controller/wikis/w/index.php/" [00:15:51] every time I try to load the site I get a couple of those. [00:16:14] Which, I presume that error message is the same thing as the 404 [00:17:19] * andrewbogott knows very little about how php behaves. [00:22:50] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [00:22:50] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:22:50] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:23:25] New patchset: Mwalker; "CN Removing Payments Wiki for CN Reflection" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32480 [00:24:05] andrewbogott: "w" is a link to "slot1" and slot1 appears to have changed today [00:24:18] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32480 [00:25:19] but not its contents [00:25:22] !log reedy synchronized php-1.21wmf3/extensions/SiteMatrix/ [00:25:28] Logged the message, Master [00:25:38] mutante: I see that… no idea what that signifies. [00:26:29] Is Ryan running git deploy on it? 
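An aside on the "File does not exist: /srv/.../index.php/" error discussed above: with `Alias /wiki /srv/.../index.php`, a request for `/wiki/Some_Page` maps onto `index.php/Some_Page`, and Apache walks that path back to the longest existing file, treating the remainder as PATH_INFO. If nothing handles the PHP file (foreshadowing the resolution later in this log), the leftover trailing component makes the whole thing look like a missing file. A minimal sketch of that mapping — a simplified model, not Apache's actual code, with hypothetical helper names:

```python
# Simplified model of "Alias /wiki /srv/w/index.php" request mapping.
# split_path_info() mimics Apache's walk: find the longest prefix of the
# mapped filename that is an existing file, and treat the rest as
# PATH_INFO. All names here are hypothetical, for illustration only.

def apply_alias(url_path, alias, target):
    """Rewrite a URL path according to an Apache-style Alias."""
    if url_path == alias or url_path.startswith(alias + "/"):
        return target + url_path[len(alias):]
    return None

def split_path_info(fs_path, is_file):
    """Split a mapped path into (existing file, trailing PATH_INFO)."""
    parts = fs_path.split("/")
    for i in range(len(parts), 0, -1):
        candidate = "/".join(parts[:i])
        if is_file(candidate):
            return candidate, fs_path[len(candidate):]
    return None, fs_path

DOCROOT_FILES = {"/srv/w/index.php"}  # pretend filesystem

mapped = apply_alias("/wiki/Main_Page", "/wiki", "/srv/w/index.php")
script, path_info = split_path_info(mapped, DOCROOT_FILES.__contains__)
print(script, path_info)  # /srv/w/index.php /Main_Page
```

When a PHP handler is present it claims `index.php` and receives `/Main_Page` as PATH_INFO; when it is absent, Apache has no handler to stop the walk and the request surfaces as a 404 on `index.php/` even though the file exists on disk.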
[00:27:09] The git log shows the last patch from 9/4 [00:27:33] which is weird, I'm sure I've changed things since then. [00:27:50] Reedy: you deploying anything? i need to do one final sync on CommonSettings [00:28:26] oh, nm, I'm confusing mediawiki with openstackmanager [00:30:26] andrewbogott: seems like on of the last things done by root before this was related to cd w/extensions/OpenStackManager/ [00:30:44] That's me, just now :) [00:30:50] oh,ok [00:31:50] !log pgehres synchronized wmf-config [00:31:56] Logged the message, Master [00:32:12] people rarely do sync-dir on wmf-config [00:32:17] * AaronSchulz just noticed that [00:32:38] AaronSchulz: reedy told me to sync-dir it instead of sync-file [00:32:59] not saying it;s bad [00:33:08] RobH, how are your apache troubleshooting skills? [00:47:02] mutante - ping [00:47:35] woosters: pong [00:47:59] what's brewing there? [00:48:57] labs ? [00:49:24] woosters: Apache is misbehaving and… I could use the help of someone who knows how to troubleshoot apache. [00:49:35] woosters: i dont know, i could not see anything obvious either to help Andrew and really busy with those mysql imports [00:49:40] Mutante and I have looked at the obvious, and I've restarted a few things… [00:49:53] this is labs, right? [00:49:58] virt0 [00:51:21] woosters: yes, labs [00:51:24] the labs wiki [00:51:46] labsconsole.wikimedia.org == virt0 [00:52:02] and labs = redirect to labsconsole..yep [00:52:56] !log LocalisationUpdate failed: git pull of extensions failed [00:53:02] Logged the message, Master [01:01:05] Errors were encountered while processing: /var/cache/apt/archives/python-iso8601_0.1.4-1ubuntu1_all.deb [01:02:31] New patchset: Pyoungmeister; "removing es4 from db.php for transition to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32482 [01:09:34] andrewbogott: arg.. 
look this http://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg3883169.html [01:09:43] [Bug 1073289] [NEW] nova-common has an incorrect dep on python-nova (= 2012.1-0ubuntu2) [01:10:05] that seems to fit the dpkg error, no idea though how that would be related to Apache though [01:12:22] PROBLEM - HTTP on kaulen is CRITICAL: Connection refused [01:13:16] kaulen: Syntax error on line 5 of /etc/apache2/sites-enabled/codereview-proxy.wikimedia.org: [01:13:21] guys, what is going on [01:13:33] i really cant multi-task any longer :p sigh [01:16:21] Change abandoned: Asher; "the prior values *should* work from looking at the math in pybal.. looking deeper there." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471 [01:16:50] ok, dpkg upgrades are breaking Apaches [01:17:04] 2012-11-09 01:08:51 status half-configured apache2 2.2.14-5ubuntu8.10 [01:18:07] !log installing package upgrades on kaulen, half-configured apache2 brought down bugzilla ... [01:18:14] Logged the message, Master [01:18:31] mutante: do those happen automatically? [01:19:18] robla: i would not have expected it on kaulen at all, but it looks like it indeed [01:19:52] that's a treat [01:20:02] unless somebody started them via dsh [01:20:13] or puppet ensure "latest" for specific packages [01:20:27] its not like there is a cronjob to install anything it finds... [01:21:12] Invalid command 'php_admin_flag', [01:21:21] binasher: You have some apache know-how don't you? We could use a little help here. [01:21:25] where is the command above? [01:21:41] hmm? [01:22:01] Apache is freaking out on virt0… complaining that it can't find a file that is obviously there. [01:22:10] virt0 or kaulen? [01:22:18] (Which may or may not be a result of a broken upgrade.) [01:22:18] different issues? [01:22:18] binasher: both :o [01:22:24] binasher: Possibly both, possibly the same issue. 
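The `status half-configured` lines pasted above come from the `/var/log/dpkg.log` format (`date time status STATE package version`). A quick sketch of how one might scan such a log for packages stuck in a half-configured state; the input here is hard-coded sample data modeled on the entries in this log, not a real file read:

```python
# Minimal sketch: scan /var/log/dpkg.log-style lines for packages whose
# most recent state is "half-configured" (like apache2 above). Sample
# data is hard-coded; on a real host you would read the actual log.

SAMPLE_LOG = """\
2012-11-09 01:08:45 status unpacked apache2 2.2.14-5ubuntu8.10
2012-11-09 01:08:51 status half-configured apache2 2.2.14-5ubuntu8.10
2012-11-09 01:23:20 status unpacked libapache2-mod-php5 5.3.2-2wm1
2012-11-09 01:23:24 status half-configured libapache2-mod-php5 5.3.2-2wm1
2012-11-09 01:23:30 status installed libapache2-mod-php5 5.3.2-2wm1
"""

def half_configured(log_text):
    """Return packages whose latest logged state is half-configured."""
    last_state = {}
    for line in log_text.splitlines():
        fields = line.split()
        # dpkg.log status lines: date time "status" state package version
        if len(fields) >= 5 and fields[2] == "status":
            last_state[fields[4]] = fields[3]
    return sorted(p for p, s in last_state.items() if s == "half-configured")

print(half_configured(SAMPLE_LOG))  # ['apache2']
```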
[01:22:33] not sure yet if its really the same [01:22:40] on virt0 Apache runs [01:22:43] on kaulen it cant [01:23:00] both have in common that there are failed dpkg upgrades though [01:23:01] and the timing [01:23:41] kaulen is up now [01:23:46] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.002 seconds [01:23:57] yup, bz is fine [01:23:59] did you do anything? [01:24:09] Apache start failed a few seconds ago [01:24:13] installed libapache2-mod-php5 [01:24:18] missing the php_admin_flag [01:24:25] oh..duh?! [01:24:42] binasher: If you have a moment to look at virt0, I will stand back so you can get a clear picture. [01:24:55] 2012-11-09 01:23:24 status half-configured libapache2-mod-php5 5.3.2-2wm1 [01:25:07] ok, so that was half-configured as well [01:25:14] andrewbogott: what's wrong with virt0? [01:25:41] binasher: virt0 is labsconsole.wikimedia.org [01:25:41] apache is running there, unlike what was up with kaulen [01:25:47] binasher: https://labsconsole.wikimedia.org/ [01:25:52] It's 404ing with an obvious error in the error log... [01:26:00] well, obvious, except I don't know why it's happening. [01:26:21] running labsconsole, that is indeed something wrong [01:26:25] binasher: it wants to downgrade a python package [01:26:34] and that fails.. and it is also python-keystone depends on python-iso8601 [01:26:36] mutante: No, I fixed that part I think. [01:26:39] ok [01:26:43] apt-get is happy now. [01:26:45] Best I can tell. [01:27:37] ok, it's better [01:27:50] php install issue again [01:28:02] yeah,this makes me think we might get more of them soon [01:28:06] whenever puppet runs [01:28:14] so.. what'd you guys do? [01:29:25] Wait, you fixed it just like that? What'd you do? [01:29:42] binasher: We didn't do anything, stuff just started failing. Presumably because of some cron-initiated upgrade. 
[01:29:52] magic rainbow sauce [01:29:52] binasher: no idea..apparently labs was down since quite a while [01:30:11] and then bugzilla..it just happened a few minutes ago [01:30:13] True story: I just said "Maybe if I go to the bathroom Asher will have it fixed when I get back" and, sure enough. [01:30:21] let me run puppet and see if it breaks again [01:30:34] and then go thru logs on virt0 [01:30:56] andrewbogott: missing libapache2-mod-php5 was also to blame [01:31:38] binasher: How did you know? Was that in the apt-get upgrade scroll and I just missed it? [01:32:27] nope, i just saw that php code wasn't running and looked at how it was installed / configured. apt was happy :/ [01:32:51] Hmph. [01:32:59] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32482 [01:33:12] binasher: You started a puppet run already, or shall I do that? [01:34:24] i did, didn't break anything [01:34:38] Great! Then maybe that means I can finally take my Very Patient Date to dinner. [01:34:56] thanks andrewbogott! [01:35:22] binasher: but but but....any idea how libapache2-mod-php5 disappeared from those machines? [01:35:24] robla: I provided the valuable service of waiting for Asher to log in :) [01:35:45] andrewbogott: you sacrificed appropriate chickens [01:36:01] !log py synchronized wmf-config/db.php 'pulling es4' [01:36:07] Logged the message, Master [01:36:11] robla: There was definitely a busted apt-get upgrade happening, so… probably we shouldn't be doing that when no one's watching. 
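The post-mortem in this log hinges on correlating apt transactions — `/var/log/apt/history.log` records each run as a `Start-Date:` / `Commandline:` / `End-Date:` block, and entries in that format get pasted into the channel. A small sketch of pulling those transactions out for timeline reconstruction; the sample data is hard-coded and mirrors the entries quoted in the log:

```python
# Sketch: parse /var/log/apt/history.log-style blocks to see which
# transaction ran when -- e.g. relating a 23:01 "install apache2" run
# to a later "apt-get upgrade". Sample data is hard-coded.

SAMPLE_HISTORY = """\
Start-Date: 2012-11-08 23:01:23
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apache2
End-Date: 2012-11-08 23:01:40

Start-Date: 2012-11-09 00:58:21
Commandline: apt-get upgrade
End-Date: 2012-11-09 00:58:55
"""

def transactions(history_text):
    """Yield (start_date, commandline) pairs from an apt history log."""
    start, cmd = None, None
    for line in history_text.splitlines():
        if line.startswith("Start-Date:"):
            start = line.split(":", 1)[1].strip()
        elif line.startswith("Commandline:"):
            cmd = line.split(":", 1)[1].strip()
        elif line.startswith("End-Date:") and start and cmd:
            yield start, cmd
            start, cmd = None, None

for when, cmd in transactions(SAMPLE_HISTORY):
    print(when, "->", cmd)
```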
[01:37:36] we dont, it's not like we do automatic apt-get upgrade , besides on labs instances [01:37:47] but we do "ensure => latest;" in puppet manifests in some places [01:37:52] like webserver.pp [01:38:21] which has kind of the same effect , but just for a few packages [01:38:21] this: [01:38:22] Start-Date: 2012-11-08 23:01:23 [01:38:23] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apache2 [01:38:41] pulled in libapache2-mod-php5filter which is incompatible with libapache2-mod-php5 [01:38:57] so it was broken before this: [01:38:58] Start-Date: 2012-11-09 00:58:21 [01:38:59] Commandline: apt-get upgrade [01:39:11] if we are going to do anything of that variety automatically, we should probably at least have it also drop an entry into the server admin log [01:40:03] I'm guessing this sort of thing gets done automatically in cases where our track record for doing it manually is spotty [01:40:20] Yeah, it seems simultaneously wise and foolish to automate it. [01:40:43] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [01:41:27] * andrewbogott -> dinner [01:42:53] the serveradmin log isn't an appropriate place for puppet generated messages, it would become unreadable. all of this stuff is logged, or i wouldn't be pasting it here [01:44:02] it's a code problem.. if you run a php app where package apache { ensure => latest } you need to use puppet to manage / ensure the php install as well [01:46:37] New patchset: Alex Monk; "(bug 41907) Enable patrolling on wikidatawiki." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [01:47:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [01:48:34]           _      _   _            _              [01:48:38]  __ _ ___| | __ | |_| |__   ___  | | ___   __ _  [01:48:42] / _` / __| |/ / | __| '_ \ / _ \ | |/ _ \ / _` | [01:48:46] | (_| \__ \   <  | |_| | | |  __/ | | (_) | (_| | [01:48:51] \__,_|___/_|\_\  \__|_| |_|\___| |_|\___/ \__, | [01:48:55]                                           |___/  [01:49:52] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments/experiments/openTask.js [01:49:58] Logged the message, Master [01:51:04] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay seconds [01:53:28] PROBLEM - mysqld processes on es4 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:58:11] !log starting innobackupex from es3 to es4 [01:58:16] Logged the message, notpeter [01:58:23] !log converted all s3 searchindex tables to innodb [01:58:29] Logged the message, Master [02:00:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 275 seconds [02:00:49] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 302 seconds [02:23:01] !log LocalisationUpdate completed (1.21wmf3) at Fri Nov 9 02:23:01 UTC 2012 [02:23:08] Logged the message, Master [02:25:15] PROBLEM - Host search-pool3.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.13) [02:25:16] PROBLEM - Host search-pool1.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.11) [02:25:16] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.15) [02:26:09] PROBLEM - Host search-pool4.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.14) [02:26:10] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:10] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[02:26:10] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:10] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:18] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:18] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:28] PROBLEM - Apache HTTP on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:37] PROBLEM - Apache HTTP on srv250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:45] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:54] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:12] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:12] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:39] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [02:27:39] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [02:27:39] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:27:39] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:27:48] RECOVERY - Apache HTTP 
on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [02:27:48] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [02:27:57] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:28:06] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:28:06] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [02:28:07] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [02:28:07] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:28:07] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:28:07] RECOVERY - Host search-pool1.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [02:28:07] RECOVERY - Host search-prefix.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [02:28:08] RECOVERY - Host search-pool3.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 27.25 ms [02:28:15] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [02:28:24] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [02:28:33] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [02:28:42] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [02:28:42] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:29:00] RECOVERY - Host search-pool4.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 27.22 ms [02:32:37] PROBLEM - Puppet freshness 
on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:47:02] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments/Experiments.hooks.php 'Fixing OpenTask event log bug' [02:47:09] Logged the message, Master [03:34:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [03:44:00] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:50:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [04:04:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:04:33] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:04:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:53:00] New patchset: Tim Starling; "(bug 41907) Enable patrolling on wikidatawiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [04:53:07] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [05:00:00] !log tstarling synchronized wmf-config/InitialiseSettings.php 'enabling patrol on wikidatawiki' [05:00:08] Logged the message, Master [05:19:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] /away zzz [06:17:53] mutante: slap wel! [07:05:32] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [08:46:24] New review: Hashar; "Nice stefan that is good start." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32192 [08:55:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:33:13] PROBLEM - Host foundation-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host bits.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host bits-lb.pmtpa.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:31] RECOVERY - Host bits-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [09:33:31] RECOVERY - Host bits.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [09:34:07] RECOVERY - Host foundation-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:34:07] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:34:24] hmmm [10:06:30] New patchset: Hashar; "Enable AFTv5 on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32061 [10:07:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32061 [10:21:46] morning [10:21:51] lost all the fun it seems [10:22:07] didn't even hear the page [10:22:56] stupid linux route cache [10:23:55] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [10:23:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:23:56] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:32:21] I heard it but it was already cleared up (I got both sets at once) when I looked [10:32:47] hi apergos [10:32:50] yo [10:32:52] so, what's the status of ms-be6? 
[10:33:18] it has an install on it (I went to bed at midnight after getting the install to finally work) [10:33:29] all the disks are partitioned and xfsed and mounted [10:36:30] as far as I'm concerned it could go into the pool now [10:36:48] what about puppet/partman, did you fix that? [10:36:54] uh huh [10:36:55] * paravoid git pulls [10:37:08] self reviewed too :-/ [10:38:30] looks fine to me [10:38:56] need popst commit review workflow for gerrit >_< [10:39:01] *post [10:48:08] so, wait [10:48:13] where's sdm3? [10:48:27] (and sdn3) [10:50:03] hmm [10:50:08] where are those created? [10:54:26] nowhere apparently [10:54:53] maybe they were a casualty of the recent partman consolidations [10:54:55] and we can't even use swift::create_filesystem without rewriting that, as it's also changing the partition table [10:55:03] don't think so [10:55:13] hadn't touch ms-be-with-ssd yet [10:55:24] well that's irritating [10:55:52] meh, let's just run mkfs and mount now [10:56:33] at least the partition is there [10:58:41] done [10:58:41] I see you [10:58:58] I already makde them, I guess you edited the fstab [10:59:00] maybe you remade my fses [10:59:00] whatever [11:02:41] so, wanna add it to the rings? [11:02:55] sure [11:06:00] we have to find out in which rack it is [11:06:05] and add it to the same or a new zone [11:06:11] that's the only "catch" [11:06:22] oh [11:09:10] so am I translating b4 to zone 4? [11:09:51] don't remember [11:09:56] check the other zones? 
[11:10:08] they are numbers [11:10:19] duh [11:10:28] check which zones the servers in the same rack are in [11:11:57] the numbers seem to have no relation to the racks [11:12:04] well at least the first one I checked [11:12:30] they don't have to have any arithmetic relation [11:12:38] the point is that servers in the same rack should be in the same zone [11:12:41] ok [11:12:53] we have a list of servers and their zones and a list of servers and their racks [11:13:13] so, we can figure it out :) [11:15:41] the notes have "move ms-be6 to zone 8" [11:15:59] that's from before Ben left [11:16:13] and assume that the 720 is in the same rack as the C2100 I guess [11:16:15] ok well I dunno about "move" it but that's what I have added [11:16:18] and let's double check that anyway [11:16:23] I looked in rackables [11:17:29] i believe the idea was to do drop-in replacements [11:17:33] racking at the exact same locations etc [11:17:49] yeah [11:20:28] ok I have some rings sitting in root@ms-be11:~/swift-rings/swift if you wanna look em over (not rebalanced yet) [11:20:39] sure [11:21:26] so ms-be6 is in the same rack as ms-be5? [11:21:35] yes [11:21:50] according to racktables [11:22:27] looks all good to me [11:22:42] ok [11:22:48] I"m gonna rebalance and shove em out then [11:23:15] yep [11:23:52] NOTE: Balance of 33.33 indicates you should push this [11:23:53] ring, wait at least 3 hours, and rebalance/repush. [11:23:55] we need to do that? [11:25:12] if that's what it says :) [11:25:15] did you push them? [11:25:22] if not, I have a different idea [11:25:33] if you did, that's okay too [11:25:40] I have not pushed them, this is on the rebalance [11:26:22] okay, so I'd say let's start them on reduced weight [11:26:25] e.g. 
33 [11:26:34] for the object rings anyway, the others are small [11:26:45] it dosn't whine for the object rings [11:26:45] but to do that you have to start from scratch :) [11:27:20] it only complains about container and account [11:27:35] yeah, but still [11:27:45] so do them all with 33 then? [11:28:14] no account/container are basically nothing [11:28:25] so just object with 33 [11:28:26] ok [11:28:36] 35G per SSD [11:28:38] nothing [11:28:58] less actually, that's with 4 boxes with SSD, now we'll have 5 [11:32:13] ok the rebalanced rings are available if you want to lookk at em again [11:34:46] looks good [11:37:32] are owa1 and ms3 really in the swift cluster? [11:37:41] (ganglia thinks they are) [11:37:43] they're in /a/ swift cluster [11:37:57] heh [11:38:02] * apergos ignores them [11:38:27] yeah [11:38:31] we should clean this up [11:43:37] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Fri Nov 9 11:43:22 UTC 2012 [11:49:28] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Fri Nov 9 11:48:53 UTC 2012 [11:50:49] eh? [11:50:51] what? [11:51:02] apergos: did you run puppet on ms-fe3? [11:51:10] yes [11:52:28] did it change anything else? 
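Earlier in this log, placing ms-be6 came down to the rule that servers in the same rack should share a swift zone: combine a server-to-rack list (racktables) with a server-to-zone list (the ring) and look up the peers. A toy sketch of that lookup — the mappings below are hypothetical sample data, not the real racktables or ring contents:

```python
# Sketch of the zone lookup discussed above: a new swift backend should
# join the zone of whatever ring members already sit in its rack.
# Both mappings are hypothetical samples (real sources: racktables and
# the swift ring builder files).

RACK_OF = {"ms-be5": "b4", "ms-be6": "b4", "ms-be7": "c3"}
ZONE_OF = {"ms-be5": 8, "ms-be7": 5}  # servers already in the ring

def zone_for(server, rack_of, zone_of):
    """Pick a zone: reuse the zone of ring members in the same rack."""
    rack = rack_of[server]
    peers = {z for s, z in zone_of.items()
             if s != server and rack_of.get(s) == rack}
    if len(peers) == 1:
        return peers.pop()
    if not peers:
        return None  # empty rack: allocate a fresh zone instead
    raise ValueError("rack %s spans multiple zones" % rack)

print(zone_for("ms-be6", RACK_OF, ZONE_OF))  # 8
```

With ms-be5 sampled as zone 8, the lookup agrees with the "move ms-be6 to zone 8" note mentioned in the log.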
[11:52:38] I had disabled until I push the ubuntu cloud archive apt [11:53:10] /usr/lib/ganglia/python_modules/memcached.py added [11:53:22] no, removed [11:53:24] sorry [11:53:36] /tmp/puppet-file20121109-11696-v8eii4-0 this was added [11:53:59] New review: Mark Bergsma; "+20000" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/32368 [11:54:22] hahaha [11:55:33] apergos: you're lucky I was cautious enough to commit 1.7 configs without having 1.7 in the apt repo [11:55:42] :) [11:55:42] nice [11:56:02] but that should really read "*you*" are lucky (not me) :-P [11:56:13] hehe I guess [11:56:37] anyways there were no whines, just refresh of gmond [11:58:28] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Fri Nov 9 11:58:13 UTC 2012 [12:03:22] !log ms-be6 deployed and back in swift rings on shiny new 720xd [12:03:29] Logged the message, Master [12:03:34] :-) [12:05:55] how do you like the 720xd so far? [12:06:23] except for the little annoyance with the ssds, fine [12:07:04] not much to say about it, configuration is otherwise quite straightfoward [12:28:57] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:31] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:58] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:19:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [13:26:47] New patchset: Dereckson; "(bug 41912) Enable WebFonts and Narayam on betawikiversity." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32564 [13:49:03] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [13:50:41] one of these days I'll debug that [14:05:24] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:05:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:05:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:06:44] puppet apply --noop --modulepath=/home/paravoid/wikimedia/puppet/modules foo.pp [14:06:48] yay [14:26:03] New patchset: Faidon; "Remove a few spurious no-op apt pins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:26:03] New patchset: Faidon; "Remove apt::ppa-req and apt::key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:26:03] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:34:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:35:47] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:36:16] !log authodns committing mgmt ip for labsdb3 [14:36:23] Logged the message, Master [14:40:41] paravoid: hello Faidon -:] Would you be available for a few puppet merges related to continuous integration ?
[14:45:37] shoot [14:47:42] https://gerrit.wikimedia.org/r/#/c/31235/ and https://gerrit.wikimedia.org/r/#/c/31234/ [14:48:15] paravoid: the changes are made to tweak the apache configuration for http://integration.mediawiki.org/nightly/mediawiki/core/ [14:48:30] one make it so the nightly snapshots are sorted by date [14:48:39] the other simply add a link on the front page : http://integration.mediawiki.org/ [14:48:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31234 [14:48:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31235 [14:49:17] another change is https://gerrit.wikimedia.org/r/#/c/32462/ which is to install nodejs on gallium (contint server) [14:49:27] we are going to use nodejs to lint our javascript and css files [14:49:30] on any project [14:49:40] (not to serve content to internet user, that will be only be for CLI scripts) [14:49:59] it get some nodejs version which has been build for WMF parsoid [14:50:25] yeah, I wanted to warn you about that [14:50:31] we have I think two users already [14:50:36] of node, in the whole infrastructure [14:50:47] paravoid: have you had a chance to look at ms-be6 yet? [14:50:53] and this has resulted into package updates, and I'm not sure if they're keeping compability [14:51:09] cmjohnson1: yeah, we chatted about it with apergos earlier; apergos put it into the swift so-called ring files [14:51:16] so it means it gets data and traffic now [14:51:30] I think it's safe to say we can give the go-ahead to Dell [14:51:43] okay...cool! 
thx [14:51:49] I don't think there's any reason to wait through the weekend [14:52:27] ok...wanted to wait for the go ahead from you....i will let them know now [14:52:29] paravoid: talking to Krinkle about the nodejs upgrade [14:52:40] hashar: so, node evolved quickly as you say in the commit, but they might be breaking compatibility as they go [14:52:43] not much you can do [14:53:04] paravoid: not going to break our use cases. Node js mostly keep back compatibility and we will update our script from their latest version [14:53:06] I do worry that you might want a newer version than the parsoid people or the other way around though [14:53:18] ohh [14:53:24] but I think I'll just let you worry about that :P [14:53:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32462 [14:53:50] so most of the nodejs scripts would be handled by Krinkle who is also closely following what the parsoid team is doing with node [14:54:00] so I guess that would be fine. If it ever broke, we will update/fix our script [14:54:15] I am not too worried about it, but thanks for the warning! ;-]]]]] [14:54:30] all merged [14:54:37] \O: [14:54:40] \O/ [14:54:52] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:55:06] hashar yo [14:55:14] no progress yet on wiki renaming :/ [14:55:19] nodejs does sometimes break compatibility, but gruntjs will cover for that with its abstraction layer [14:55:24] can I interest you in running those tests :p [14:55:37] ShiroiNeko: that has been an issue for like 7 years. 
Don't expect anything to happen in the next weeks :-/ [14:55:37] New review: Faidon; "Mark gave it a glance too, so this is not entirely self-reviewed :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/32369 [14:55:38] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:55:53] hashar I dont expect resolution tomorow [14:56:07] I just one someone to have it on their agenda so that they will tackle it at some point [14:56:09] AltKrinkle: good to know [14:56:17] paravoid: and another one is extracting the PHP linter script we have from the main puppet scope to a module https://gerrit.wikimedia.org/r/#/c/29937/ . [14:56:40] mostly copy pasting / renaming from /manifests to /modules/wmfscripts [14:56:44] sec now [14:56:47] though that change might need some discussion [14:56:51] with tests more bugs possibly will pop up [14:56:55] I just merged something big and I want to make sure I didn't break everything [14:59:25] New review: Silke Meyer; "I would be so grateful not to talk about tabs vs. spaces any longer. Please tell me *the one way* [T..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [14:59:55] where's ^demon [15:00:02] I just found something really interesting :) [15:00:20] i gave it a +1, not just a glance ;) [15:02:30] ah, missed that [15:02:41] stupid gerrit resets +1/+2 on updated patchsets [15:03:14] it'd be nice to have a history "previous versions +1/+2/-1/-2" [15:20:48] Krinkle: https://gerrit.wikimedia.org/r/#/c/32475/ [15:21:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:02:14] ottomata: I forgot to tell you [16:02:26] but when i was unboxing them i managed to mangle my hand up (scrped skin off knuckles) [16:02:38] so these servers have been blooded, which is important [16:02:41] servers need blood. 
[16:02:45] daww [16:02:52] good to know [16:03:05] i'll edit the partman recipe to make use of your blood, thanks [16:03:14] blood lubricates the disk [16:03:25] now they are immune from gremlins and indeed, a .05% performance increase. [16:03:32] yeah, we have 12 disks on 12 servers, so we'll need a lot of blood [16:08:04] http://stackoverflow.com/users/319266/krinkle?tab=reputation [16:08:12] thanks Guest44503 [16:14:21] csteipp: woosters: still importing dumps..but we are making progress.. "en" should actually be done soon [16:14:40] progress! [16:14:43] Sweet! Thanks mutante! [16:14:59] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [16:15:09] New patchset: Hashar; "gallium:/var/lib/jenkins now belong to jenkins group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32572 [16:15:12] I missed my normal train, so I'll be in in about an hour.. [16:15:17] i would like to stay here and continue on the "de".. [16:15:27] \O/ [16:15:41] That would be very helpful [16:16:24] yea, they are like 30 min each or so [16:16:43] i will just keep going and keep you updated instead of moving to office right away [16:17:16] RobH, can you check this real quick? [16:17:16] https://gerrit.wikimedia.org/r/#/c/32571/ [16:17:24] the only bit i'm not sure about is the naming of my partman file [16:17:40] i'm not sure if anyone cares, but I didn't see any other partman recipes named after machine model [16:18:16] most tend to apply to more than a single model [16:18:28] right, its more the disk layout than the model that matters [16:18:45] ottomata: what's special about your recipe? 
[16:18:45] cause I would just say 'analytics-hadoop' or maybe 'analytics-kraken1' or something [16:18:57] I don't understand the comment at the top [16:18:58] ciscos start device lettering at sdc [16:19:04] oops [16:19:09] typo, was supposed to say sda [16:19:09] ahh [16:19:13] i did a find/replace after I wrote the comment [16:19:14] anyone mind reviewing a simple permission fix for the continuous integration server ? https://gerrit.wikimedia.org/r/#/c/32572/ it simply ensures that the jenkins user home directory has sane ownership [16:19:14] patching... [16:19:31] I'd like us to minimize the amount of recipes that we have and I've been working towards that [16:19:42] yeah [16:19:42] deduplicating them as I go, and trying to have some common content between them [16:19:46] heh, i would make it all lowercase cuz i just do that [16:19:49] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [16:19:50] the raid1-* family is like that [16:19:52] ok [16:19:56] but otherwise the vendor and model in there isnt a big deal [16:20:01] yeah, i like that, paravoid [16:20:05] this is kinda a simple recipe [16:20:15] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32572 [16:20:16] raid1 / + swap [16:20:23] md0 / and md1 swap [16:20:24] im torn if making tons of recipes is just easier [16:20:31] or if we would template this somehow [16:20:42] paravoid: you are sooo fast :-] thx [16:20:44] RobH: no it's not [16:20:53] (which would apply to the install server files) [16:21:07] we are getting to where we should template it prolly. [16:21:09] also look https://gerrit.wikimedia.org/r/#/c/32366/ which isn't ready for prime-time yet [16:21:14] but that's the idea [16:21:17] but for today, just adding another one like this is fine [16:21:45] paravoid: cool [16:21:47] ottomata: / and swap on raid and then what? [16:21:50] the whole space?
[16:22:18] ottomata: what's wrong with raid1-1partition.cfg then? [16:22:28] no, partially [16:22:33] don't want it to take up all the space [16:22:54] what will you do with the rest? [16:22:54] these disks are large (2TB) so only want it to use a few GB for / [16:23:00] what do you want to do with the rest? [16:23:05] jinx :-P [16:23:07] unsure at this time, which is why i didn't want to allocate it [16:23:12] could use the raid1 250GB file [16:23:15] we don't need it on / [16:23:20] oh, that file has a borked swap [16:23:20] hm, which one? [16:23:24] we have raid1-lvm.cfg that partitions the rest as LVM [16:23:34] or raid1-varnish.cfg that partitions the rest as XFS but does not mount it [16:23:35] the raid1-250G-1partition [16:23:36] yeah but RAID 1 [16:23:41] but it needs to be updated [16:23:44] or raid1-squid.cfg which partitions it and leaves it unformatted [16:23:44] as it doesnt raid the swap [16:23:47] which is a mistake. [16:23:47] i was thinking [16:23:58] (as the swap disk failing when not raided will result in system crash) [16:24:09] raid 1 / - xGB [16:24:09] or raid1.cfg which partitions it, formats it and mounts it on /srv [16:24:09] raid 1 swap - xGB [16:24:09] but then the rest of sda and sdb not on raid one [16:24:21] RobH: I've fixed that on all the raid1-* ones [16:24:23] so we could use them as single disks (for hadoop or similar) if we wanted to [16:24:33] paravoid: ahh, awesome, its just my local copy then [16:24:39] ottomata: sounds like raid1-squid.cfg. [16:24:51] so ottomata i think the raid1-250GB-1partition is awesome for you [16:24:51] RobH: yeah, I've done tons of changes there [16:24:52] there's no raid1-250GB anymore [16:24:54] please pull first [16:24:57] oh, ok [16:25:09] ottomata: raid1-squid.cfg? [16:25:23] or raid1-varnish.cfg [16:25:32] or raid1-lvm.cfg [16:25:42] reading..
[16:26:08] heh well now i have to read all the new ones [16:26:12] so i have no input =P [16:26:17] i'd prefer if the extra space and the remaining disks weren't formatted [16:26:27] then it's raid1-squid.cfg. [16:26:39] or -lvm, depends on what you call "formatting" ;-) [16:26:49] RobH: told ya :-) [16:27:09] so no partitions on the rest of the disks is what ottomata wants [16:27:20] right [16:27:26] just the raid1 / and raid1 swap and the / should be smaller than the rest of the disk [16:27:27] on sda and sdb [16:27:31] the rest of the disks untouched [16:27:32] just md0 / and md1 swap [16:27:55] I think raid1-squid is what he wants [16:27:59] that way they can partition and use the remainder of sda and sdb as needed in the future [16:28:06] looking [16:28:13] let's reuse stuff and be DRY [16:28:17] reading raid squid [16:28:26] # - one extended partition per disk (sda5/sdb5) for the rest, [16:28:26] # used as a squid coss disk [16:28:28] squid has [16:28:36] that part is not wanted though =[ [16:28:44] what do you mean? [16:29:07] d-i is not doing the coss formatting [16:29:14] the squid makes an ext3, a swap, then puts an extended partition across the rest [16:29:15] i don't know how to read these recipes very well, i can read the auto-raid recipe fine [16:29:15] ? [16:29:25] i dont' see the extended partition (even though the comment says so) [16:29:40] 1000 1000 -1 linux-swap \ [16:29:40] method{ keep } \ [16:29:40] indeed, im not seeing it either [16:29:42] but expert_recipe i'm not sure how to read correctly [16:29:58] expert_recipe says [16:30:24] ottomata: hashar: re the "libanon" stuff on integration. You guys know better about where it actually makes sense (role / class). I actually told him to do a role class and used that as an example to answer his question how to actually get packages on prod. servers [16:30:28] 10 G raid, then 1 G raid, then swap? 
[16:30:50] mutante: ohhh [16:30:57] mutante: so let it be in a role case :-] [16:30:58] class [16:31:03] a role for just a package? [16:31:06] first primary partition 5-10GB used as an md device, second primary partition 1GB used as an md device, third (extended) partition don't touch [16:31:06] paravoid: So the squid one seems ok except for it adding a single extended partition across the rest of sda and sdb [16:31:06] ignore the linux-swap part [16:31:06] the 1gb is swap? [16:31:08] yes [16:31:08] the linux swap part was messing me up [16:31:12] it's not so difficult, it says so 10 lines below [16:31:17] I thought that role classes were only supposed to call system_role then call a parameterized class with some settings [16:31:18] so it makes sense now [16:31:19] ottomata: hashar: so that discussion should go on, but if that is like urgent and blocking people, i mean it is just installing openssl, i could also just do that and you merge puppet stuff later [16:31:22] the linux-swap part is what the fs type in the partition table will be [16:31:41] ottomata: So, squid would work in that it just puts an unused partition across sda and sdb remainder [16:31:53] mutante: ottomata: anyway they don't need openssl, it is already installed on there. They want libssl-dev or something similar [16:31:55] which since its not used, you could easily just rip out later, and it saves you adding new one off files to the repo [16:31:55] ottomata: pretty sure it will not stay just one package, it is merely starting it [16:32:03] hashar: really? duuuh [16:32:05] :p [16:32:18] mutante: and they are definitely going to install a LOT MORE dev packages :-] [16:32:20] RobH: eh? [16:32:30] how would he use the rest of the space if there's not a partition [16:32:36] the partition is a feature, not a bug [16:32:40] i think the raid1-squid.cfg will work for the analytics1011-1022 servers.
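To make the expert_recipe reading above concrete, here is a hedged sketch of a raid1-squid-style recipe. The sizes and layout are reconstructed from the discussion, not copied from the actual raid1-squid.cfg; only the final "1000 1000 -1 linux-swap … method{ keep }" stanza is quoted verbatim earlier in the log.

```
# Hypothetical reconstruction, NOT the real raid1-squid.cfg:
# two primary partitions per disk become RAID1 members (md0 for /,
# md1 for swap), and a final partition covers the rest of the disk
# but is left untouched by d-i. "linux-swap" on the last stanza only
# sets the partition-table type byte; method{ keep } means d-i never
# formats or mounts it.
d-i partman-auto/expert_recipe string \
    multiraid ::                      \
        5000 8000 10000 raid          \
            $primary{ }               \
            method{ raid }            \
        .                             \
        1000 1000 1000 raid           \
            $primary{ }               \
            method{ raid }            \
        .                             \
        1000 1000 -1 linux-swap       \
            method{ keep }            \
        .
```

The three numbers per stanza are partman's minimum size, priority, and maximum size in MB, with -1 meaning "use the rest of the disk".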
[16:32:49] i would create it, i'm not sure what an 'unused partition' is [16:32:50] but ok [16:32:51] ottomata: hashar: i just dont like the "generic::packages:foo" and then lots of them [16:32:56] but shrug..hmmm [16:32:56] it does a bit more than needed (we dont need it to add that extended partition) [16:33:03] yeah, i think a role would be fine, if it was called something appropriate [16:33:08] role::analytics::openssl [16:33:10] or whatever it was [16:33:14] isn't right [16:33:29] its just unmounted, non filesystemed space on disk [16:33:34] so its no big deal [16:33:36] ahhh ok [16:33:40] yeah that's cool [16:33:47] ok cool [16:33:51] 10GB is kinda small for / though [16:33:56] its done in squid since it will spin that space as coss space i think. [16:33:56] ottomata: an unused partition is "there is going to be an sda5 and an sdb5 sized 2000GB-11G that won't be formatted or otherwise touched by d-i" [16:33:57] but for you it wont matter, as you have no squid process [16:33:58] sorry, being picky now, i can live with it if I must! [16:34:06] indeed, 10g is small [16:34:11] that will fill with logs quickly [16:34:15] how is 10G small? [16:34:29] he has a 2tb disk. [16:34:32] so we have two overlapping conversations ;-D [16:34:34] how is it not? [16:34:50] hashar: I am totally able to keep up ;] [16:35:03] me too! [16:35:04] hehe [16:35:05] how does / relate to the size of the disk? [16:35:18] I have people around me speaking in Dutch, German and English :-D does not help [16:35:22] swift has 24T of disks, should we give it a 1T /? :) [16:35:27] well, if / is large enough, i probably won't have to think about it [16:35:31] (per box) [16:35:33] paravoid: if we have 2tb its silly to restrict their OS to only 10G and then deal with that small constriction [16:35:38] if it is small, then I will have to think about making a special /var/log partition one day probably [16:35:41] or /home [16:35:44] etc. etc.
[16:36:03] indeed [16:36:13] ottomata: mutante: regarding the ssl package, I don't mind. I commented on the ops report that I am 120% for us to install dev package on the CI machine. I will let the analytic team lead that :-] [16:36:31] at worst case, the raid1-squid.cfg is a perfect starting point, with only tweaks needed to / size [16:36:37] hashar, mutante: should there be some role::build::xxxx classes? [16:36:52] as matching to it will be matching to the new standard that paravoid pushed [16:37:04] (and it was due for cleanup, thx for doing that) [16:37:07] yeah, i could copy and rename to raid1-large or something? [16:37:11] hashar: yea, ok, thanks. i also dont mind, i just wanted to prevent it being installed on "gallium" in site.pp, but somehow be flexible for the future [16:37:13] raid1-200GB [16:37:13] ? [16:37:15] I think using lots of gigabytes per box for / on a cluster composed of a dozen of machines is a waste tbh [16:37:22] use logrotate, or remote logging or whatever [16:37:31] and you shouldn't use /home that much [16:37:34] but whatever works for you [16:38:05] yeah, true, i mean, having these 2TB in sda and sdb is kind of a waste, since we need to use those for / [16:38:11] ottomata: it sounds reasonable, yea, if this is for building things .. build::libanon ?
[16:38:12] i get what you are saying [16:38:20] it will be convenient (more space is always convenient) but not necessary [16:38:32] i guess, but a role should be more generic i think [16:38:44] i would err to making it easier on configuration of a new cluster and make it a larger / [16:38:51] that doesnt make it the best answer [16:38:56] role::build::udp-filter (since that is what we are going for) and then classes to include whatever for deps [16:39:00] or even more generic [16:39:06] role::build::analytics [16:39:34] RobH, paravoid, ok, i'm going to make a new raid1- file from squid [16:39:38] heh, reading backlog this is indeed a confusing as shit to follow conversation(s) if you arent really used to irc [16:39:52] it amuses me [16:40:14] ottomata: also sounds good, as long as it is somehow detached from the host and not just lists of packages (or generic::packages:foo lines) right in a node .. [16:40:18] just try to follow one conversation [16:40:41] ottomata: role::build::** might be it. Honestly, I don't really care which class you will end up choosing :-] [16:41:02] these are going to be hadoop boxes basically, but not producing stats on demand for the user, i.e. they aren't user-facing in the same way the apaches and squids are, right? [16:41:12] heh, you can temp filter nicks that are not in your convo by using /ignore and then /unignore again .. hehe [16:41:20] not that i do that ..but.. [16:41:22] apergos: correct [16:41:28] ok [16:41:36] they might not all be "hadoop" boxes, but they will be doing cluster services [16:41:43] (storm, hadoop, kafka, etc etc) [16:41:48] uh huh [16:42:23] RobH, paravoid: the ciscos have ext4 for /, can I keep our stuff consistent and use that? [16:42:48] i dont think we are using ext4 anywhere else on cluster [16:43:06] does it have particular benefits over ext3 for use here?
[16:43:38] not that i know of, I think I don't really care atm [16:43:44] ciscos are ext4, is all [16:43:49] (i didn't do ciscos) [16:44:04] i would err to ext3 [16:44:32] just so its same as what other stuff is [16:44:32] ciscos are a small % of entire cluster =] [16:44:36] hoookaayyyy [16:44:58] hadoop performs better with ext4 [16:45:00] ottomata: you know how this works right? as a cross team member, we are going to slowly corrupt you into the ops method of cluster first thinking ;] [16:45:02] yeah.. ext3 is more stable [16:45:06] drdee, this is just / [16:45:12] k [16:45:12] so we can do whatever for hadoop parts [16:45:15] ottomata: the swift boxes are the same, they have a huge / [16:45:26] that we use for keeping swift logs because nobody set up logrotate [16:45:27] ext4 is pretty damn stable at this point [16:45:28] also, since I am making new file, should I remove the bit about the extra extended partition? [16:45:33] hashar: if there was another "take a picture of your desk" now ..heh, mine is currently an empty pizza box ;) [16:45:41] that we keep three times, one in daemon.log, one in syslog etc. [16:45:57] no it doesn't [16:46:00] mutante: working from home? [16:46:24] ottomata: you pick the role names, you are the analytics guy;) [16:46:53] ah, but hashar is the CI build guy! [16:46:53] RobH: yea, for now at least, been importing wikivoyage mysql dumps half of the night :p [16:47:02] hhe [16:47:27] mutante: so soon to be sleeping from home =] [16:47:41] paravoid: was "no it doesn't" your answer to my Q about removing extra extended partition from new file? [16:48:03] no, it was on needing ext4 [16:48:05] RobH: heh, yeah, some time, we are trying to launch this before the weekend ..and "de" is still to go [16:48:12] ah ok [16:48:15] notpeter: kind of, there was a corruption bug a week ago :) [16:48:21] notpeter: on some exotic feature it turned out [16:48:22] cool that's fine, i can do ext3 [16:48:28] you mean launch on a friday and leave?
[16:48:29] notpeter: but a lot of people were really worried [16:48:31] as for removing extended partition, s'ok? [16:49:08] huh, ok [16:50:09] notpeter: but in any case, we should take this decision cluster-wide I'd say [16:50:09] mutante: mine is full of heineken beers and mac laptops :-] [16:50:10] heh, it's like channel rush hour [16:50:12] I don't disagree per se [16:50:36] paravoid: agreed on all counts [16:50:45] ottomata: mutante disconnecting sorry. I am going to socialize IRL :-] [16:50:55] cya tomorrow [16:51:02] just too bad we can't get some zfs.... [16:51:03] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:42] !log ms-be7 going offline to make room for new 720 servers [16:51:48] Logged the message, Master [16:52:05] brrr [16:52:16] sbernardin: before you take anything offline you need to admin log it [16:52:21] cmjohnson1: oh btw, we were wondering earlier [16:52:27] I think I know the answer, but just in case [16:52:38] cmjohnson1 [16:52:45] you're putting 720s exactly in the place of C2100s? [16:52:47] cmjohnson1: ok...forgot about that [16:53:15] paravoid: for the most part yes...i am going to redistribute a little more evenly though [16:53:28] are you changing racks? [16:53:38] this is okay, but it matters for swift [16:53:39] !log racking new ms-be7 to row C sdtpa [16:53:45] Logged the message, Master [16:53:46] swift has the concept of "zones" [16:54:07] !log shutting down ms-be7 to row C sdtpa [16:54:07] it basically keeps three copies of each file for redundancy, but always keeps the three copies in different zones [16:54:11] ok, paravoid, which part in the file is about the extended partition? [16:54:12] Logged the message, Master [16:54:15] right..i know about the zones...the only ones that may be moving is going to be 1-4 [16:54:17] the 3rd line of the expert recipe? [16:54:25] that says linux-swap (not sure why it says swap) [16:54:26] ?
[16:54:41] so zone should be something meaningful, either rack, pdu or something similar [16:54:43] paravoid: i will send you my proposed locations for all of them and notate if the location is the same [16:54:52] perfect [16:54:56] thanks a bunch [16:55:00] ottomata: yes, that line [16:55:04] ottomata: what do you want to do with it? [16:55:37] remove it i think [16:55:46] (why does it set type to linux-swap? just for lack of something better?) [16:56:22] that's the 0x83 on the partition table [16:56:22] ignore that [16:56:39] if you remove it, how are you going to use the rest of the space when that time comes? [16:56:52] fdisk? [16:56:56] figure it out then [16:56:57] right? [16:57:14] just don't see the need to partition now if we don't know what we'll do with it yet [16:57:14] why would you prefer running fdisk on the running system rather than having it prepartitioned? [16:57:31] i'll probably want to change the type anyway, right? [16:57:36] maybe we will want raid? [16:57:38] who knows, right? [16:59:33] I'm not sure I understand your pattern, but whatever works for you [16:59:55] if you're not sure what to do with it, why not use the LVM recipe? [16:59:55] haha, i think we don't understand each others patterns, heheh [17:00:08] and assign LVs dynamically [17:00:14] rather than fdisking live systems [17:00:19] because we might not want LVM, it has to do with whatever tech we end up using the space for [17:00:21] which we don't know [17:00:25] is fdisking a live system bad? [17:01:28] you can screw it up quite badly [17:01:32] and it needs a reboot [17:01:55] since you can't make the kernel re-read the partition table on a device that it's being used for / among other things [17:02:12] ok, but, i guess, for example, if I do end up wanting raid on these partitions, i'll have to fdisk it anyway, right? [17:02:51] or if we want LVM, we probably should change the fs type (although maybe it doesn't matter)?
[17:04:58] you are setting it to linux-swap in the partman, which is for sure not what we are going to use it for, so we'll have to change it anyway, no? [17:05:14] so look, if you put the partition in the recipe and it turns out you need to change it later, you're no worse off than if you had no partition [17:05:38] if you don't put the partition in the recipe and, hey, it turns out that it would have been what you wanted, you are a little worse off [17:05:43] so... why not put it in? :-) [17:05:52] because it is being put in as linux-swap? [17:05:53] hah [17:05:56] i guess i'm cool with that [17:06:03] maybe I can change type to LVM then and leave it at that? :p [17:06:38] lvm seems fine to me [17:08:57] paravoid, i think maybe that's my main point of confusion. what's the reason for setting this extra partition to linux-swap, if it def won't be used for swap? [17:09:08] you said 0x83 on partition table, but i'm not sure what that means [17:09:18] as said before, ignore that :) [17:09:26] haha, but why?! [17:09:27] i must know why! [17:09:33] partition tables have a "type" option [17:09:36] right [17:09:45] 0x82/0x83 being the ones commonly used for Linux [17:10:03] and 0x8e iirc [17:10:40] I don't think you can set the type to Linux and tell it to ignore that space [17:10:49] so you just set it to linux-swap iirc [17:10:49] to LVM you mean? [17:10:57] no, to Linux [17:10:58] oh [17:11:00] i see [17:11:10] the types are Linux, Linux swap, Linux raid and Linux LVM [17:11:13] so swap is just like "listen man, its swap so don't touch it" [17:11:19] yeah [17:11:39] and if we did linux-lvm it would want to format it or something? [17:12:16] I think so [17:12:19] ok I am convinced and I understand! [17:12:20] thank you!
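For reference, these are the MBR partition type bytes being named in this exchange. The 0xfd RAID autodetect entry is added here for completeness; the others are the ones paravoid lists. This is a plain lookup table, not output from fdisk:

```shell
# Classic Linux-related MBR partition type IDs, as discussed above.
# 0xfd (Linux RAID autodetect) is added here for completeness.
cat <<'EOF'
0x82  Linux swap
0x83  Linux
0x8e  Linux LVM
0xfd  Linux RAID autodetect
EOF
```

So partman's "linux-swap" stanza with method{ keep } just stamps 0x82 on the slot without ever making a swap filesystem there.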
[17:12:21] haha [17:12:22] but ignore that [17:12:26] i will leave it (and write a comment) [17:12:46] I've added comments on the top of all raid1-* templates [17:12:50] to explain what they do [17:13:06] right, but for newbies like me, I was really confused by the linux-swap bit, [17:13:12] if we ever start using the squid one (for example) for other hosts we may want to rename those [17:13:14] your comment said "+ an extended partition!" [17:13:16] by what they do [17:13:31] apergos: yes [17:13:40] and i was like "what extended unused partition?! I see more swap!" [17:13:44] ottomata: extended means it'll be sda5, not sda3 [17:13:48] no particular reason for that, just legacy [17:13:58] ? [17:14:00] it's how it has been and defined in the squid machines in site.pp [17:14:15] is there a reason to make it extended then? [17:14:24] can I change that to a primary (i think that is what you are talking about) [17:14:24] so if I changed it to sda3, we wouldn't be able to do machine reinstalls without reconfiguring puppet [17:14:25] ? [17:14:35] yes you can [17:14:36] (for squids) [17:14:37] ok [17:14:41] not for squids [17:14:46] don't change it in squids [17:14:50] right [17:14:51] this is my new file [17:14:53] it'll break reinstalls of squid boxes [17:15:01] sorry i was clarifying your sentence [17:15:10] not making my own :p [17:15:14] right [17:15:24] yeah, you can make it primary [17:15:31] so I add [17:15:35] $primary{ } [17:15:38] yes [17:15:39] before the method there, right? [17:15:43] so, for bonus points [17:15:47] oo bonus! [17:15:56] our swift boxes come in two flavors [17:16:00] one with SSDs and one without [17:16:05] we don't need SSDs on all of them [17:16:28] the SSD ones need some special partitioning, as / goes into sdm1/sdn1 [17:16:39] but for the rest [17:16:42] we can reuse your template :) [17:16:57] cool!
[17:16:59] I'm a big fan of reusing as you might gather by now [17:17:05] I'm making 100G / and 10G swap [17:17:08] yea totally i'm for that [17:17:23] it'd be super cool if we could puppetize and templatize these, right? [17:17:24] 100G? [17:17:25] yikes [17:17:27] haha [17:17:28] that's quite a lot [17:17:36] that's a terabyte lost basically [17:17:40] ok ok, in the name of reuse [17:17:49] why do you want 10GB swap is my question [17:17:52] (twelve machines times 100G minus space actually used) [17:18:12] its kinda lost, but you make up for it in a bit of peace of mind [17:18:13] New review: Andrew Bogott; "Sorry to catch you in a formatting holy war :) The one thing about coding conventions I'm positive ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593 [17:18:24] and yeah [17:18:30] don't use 10G swap [17:18:33] ok [17:18:35] i'll leave that 1G [17:18:37] that's horrible [17:18:39] yep [17:18:55] once it starts to use most of 1 gb you're probably in swapdeath anyways [17:19:06] yes [17:19:29] so, here is something really annoying that I don't understand [17:19:46] the ciscos have 260GB swap [17:19:59] (don't ask me why) [17:20:00] wha?? [17:20:06] that's crazy [17:20:13] the virt* ciscos have 0GB swap [17:20:17] they have 192GB of ram [17:20:23] right [17:20:27] (I made that) [17:20:35] looks to me like the partman should give them 1-10GB swap [17:20:37] but there they are [17:20:39] with 260G [17:21:28] anyway, paravoid, in the name of partman reuse, is 30GB ok for this new file? [17:23:35] psssssh, i want 100GB :p !, ok [17:23:37] battery dying [17:23:47] need to pick up laundry and grab some food before 1 [17:23:54] be back in a little bit [17:23:56] yes!
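paravoid's "twelve machines times 100G minus space actually used" aside works out roughly like this. The 10G of actual use per box is an assumed figure for illustration, not a measured one:

```shell
# Rough estimate of the space stranded by a 100G / across the cluster,
# assuming ~10G actually used per box (assumption, not a measurement).
boxes=12
root_gb=100
used_gb=10
echo "$(( boxes * (root_gb - used_gb) )) GB effectively idle"
# prints: 1080 GB effectively idle
```

Hence "that's a terabyte lost basically", and why the recipe ended up with a much smaller / (30GB) plus 1G of raided swap.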
[17:23:57] enjoy (food) [17:30:28] average_drifter: just sent you your recommendation letter, let me know if all is good [17:30:46] wrong channel [17:57:01] New patchset: Pyoungmeister; "removing mw60 from bits apaches pool for upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32582 [17:59:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32582 [18:02:01] New review: Nikerabbit; "Is there/will there be notes of the results of these changes?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32416 [18:35:49] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [18:38:09] !log wikivoyage imports: "en" done, "it" done [18:38:15] Logged the message, Master [18:39:33] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [18:39:37] mutante: in production? [18:40:11] hashar: yes, S3 cluster [18:40:15] paravoid, how's that look? [18:40:26] mutante: nice :-] [18:40:26] would you review that for me just to be sure? [18:40:26] https://gerrit.wikimedia.org/r/32571 [18:40:34] hashar: does not mean the wiki is enabled for users though ;) [18:40:57] mutante: still, it is progressing! [18:41:03] yea [18:42:20] hashar: user login during import = breakage btw because the login creates the user (central auth) :p [18:42:38] so all disabled in Apache [18:45:50] !log wikivoyage imports: "de" (and all for now) done [18:45:55] Logged the message, Master [18:47:41] mutante: you want me to add a patch to enable the sites on Apaches? [18:48:28] sure [18:48:39] thanks, be back in 5 [18:50:31] csteipp: ..or 10. coffee break. then lets get on it;) [18:50:40] mutante: Not a problem! 
[18:54:46] !log authdns update "Adding .pmtpa.wmnet entries for labsdb1-3" [18:54:52] Logged the message, Master [18:55:40] sbernardin: once you have the disk in slot0 on labsdb1 and labsdb2 ....plz ping me [18:56:06] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:57:42] New patchset: CSteipp; "Enable Wikivoyage subdomains" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32591 [19:01:34] any performance issues or recent changes with bits? https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#bits.wikimedia.org_very_slow [19:02:05] New review: Dzahn; "yea, i agree. turn into *. after QA check" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32591 [19:02:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32591 [19:02:20] Zack and Megan were asking me yesterday about bits as well [19:05:05] dzahn is doing a graceful restart of all apaches [19:05:40] !log dzahn gracefulled all apaches [19:05:45] !log enabling wikivoyage subdomains in apache for QA check [19:05:46] csteipp: ^ [19:05:46] Logged the message, Master [19:05:51] Logged the message, Master [19:15:02] !log rebooting mw60 for upgrade to precise [19:15:08] Logged the message, notpeter [19:17:45] !log running sync-dblist [19:17:50] Logged the message, Master [19:18:36] RobH, could you review this for me? [19:18:36] https://gerrit.wikimedia.org/r/#/c/32571/ [19:19:16] sbernardin: if you need to know which drive is drive 0 go to http://www.cisco.com/en/US/docs/unified_computing/ucs/c/hw/C250M1/install/replace.html#wp1053178 [19:19:45] binasher: fyi, all the imports on s3 are done (at least the 7 initial languages and for now) [19:20:13] mutante: great! is that the last of the db work? [19:20:47] yea, unless other languages are added at a later point, but none would be as large as this.
"en" and "de" are all in there [19:21:05] binasher: http://wikitech.wikimedia.org/view/User:Dzahn/wikivoyage_pastebin [19:22:55] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [19:25:01] PROBLEM - SSH on mw60 is CRITICAL: Connection refused [19:34:55] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:40:49] !log doing a git pull in wmf-config [19:40:54] Logged the message, Master [19:43:46] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [19:46:01] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: No response from NTP server [19:53:25] !log git reset in /h/w/common/ to revert local disabling of "voy" wikis and synced dblist again [19:53:31] Logged the message, Master [20:04:56] sbernardin: moving conversation to ops [20:05:10] so yes let me know once the disk are swapped..thx [20:05:31] cmjohnson1: ok [20:12:16] RECOVERY - NTP on mw60 is OK: NTP OK: Offset -0.01435041428 secs [20:17:52] New review: Ottomata; "I talked about this new partman file with Faidon and RobH earlier. If anyone has objections we can ..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32571 [20:17:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [20:25:28] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [20:25:28] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:26:28] !log installing ssd's in labsdb1 and labsdb2 [20:26:34] Logged the message, Master [20:32:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:25] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 35.55 ms [20:34:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [20:35:06] binasher: which partman recipe do you want on labsdb1-3? 
[20:36:47] cmjohnson1: mw.cfg for these, where the x25 should be sda and the only disk touched by the install [20:37:01] PROBLEM - SSH on analytics1011 is CRITICAL: Connection refused [20:37:06] ok..yep..i recall you telling me that before [20:37:12] thx for reminding me [20:37:33] New patchset: Catrope; "Add VisualEditor namespace creation to wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31949 [20:41:45] New patchset: Pyoungmeister; "Revert "removing mw60 from bits apaches pool for upgrade to precise"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32598 [20:42:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32598 [20:43:04] !log repooling mw60 in bits apaches [20:43:10] Logged the message, notpeter [20:46:51] cmjohnson1: intel ssd's have been installed in labsdb1 & labsdb2 [20:51:52] RECOVERY - SSH on analytics1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:02:40] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:40] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms [21:02:40] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 35.60 ms [21:02:40] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:40] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 35.71 ms [21:02:41] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms [21:02:41] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 35.81 ms [21:02:42] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:42] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:43] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:43] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms 
[21:06:16] PROBLEM - SSH on analytics1012 is CRITICAL: Connection refused
[21:06:25] PROBLEM - SSH on analytics1013 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1018 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1017 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1014 is CRITICAL: Connection refused
[21:06:52] PROBLEM - SSH on analytics1022 is CRITICAL: Connection refused
[21:07:01] PROBLEM - SSH on analytics1015 is CRITICAL: Connection refused
[21:07:01] PROBLEM - SSH on analytics1020 is CRITICAL: Connection refused
[21:07:10] PROBLEM - SSH on analytics1019 is CRITICAL: Connection refused
[21:07:28] PROBLEM - SSH on analytics1021 is CRITICAL: Connection refused
[21:07:46] PROBLEM - SSH on analytics1016 is CRITICAL: Connection refused
[21:08:48] New patchset: Cmjohnson; "adding labsdb* to netboot cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32601
[21:10:10] mutante: can you check this for me ^
[21:15:34] PROBLEM - NTP on analytics1011 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:16:10] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:10] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:10] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:19] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:28] RECOVERY - SSH on analytics1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:37] RECOVERY - SSH on analytics1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:37] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[21:16:46] RECOVERY - SSH on analytics1018 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:46] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:46] RECOVERY - SSH on analytics1015 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:55] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[21:16:55] RECOVERY - SSH on analytics1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:04] RECOVERY - SSH on analytics1019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:04] RECOVERY - SSH on analytics1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:13] RECOVERY - SSH on analytics1021 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:22] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms
[21:17:40] RECOVERY - SSH on analytics1016 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:20:56] !log reedy synchronized php/cache/interwiki.cdb 'Updating interwiki cache'
[21:21:03] Logged the message, Master
[21:21:26] RobH: http://cl.ly/image/3N0o3c3E212I
[21:21:27] thank you!
[21:26:31] PROBLEM - NTP on analytics1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:26:31] PROBLEM - NTP on analytics1017 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:26:32] PROBLEM - NTP on analytics1014 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:07] PROBLEM - NTP on analytics1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:07] PROBLEM - NTP on analytics1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:08] PROBLEM - NTP on analytics1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:08] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:30:31] ottomata: iterm2 :)
[21:33:49] yup!
[21:33:50] heh
[21:34:14] accept no substitutes
[21:39:18] New patchset: Demon; "Fixing SSL issue for gerrit on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32605
[21:40:19] PROBLEM - NTP on analytics1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:19] PROBLEM - NTP on analytics1018 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:28] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[21:40:28] PROBLEM - NTP on analytics1012 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:55] notpeter: if you are not busy could you look at this for me https://gerrit.wikimedia.org/r/32601
[21:40:55] PROBLEM - NTP on analytics1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:42:29] +2'd
[21:42:31] still need lint
[21:42:34] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours
[21:44:31] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[21:46:23] notpeter: k...thx
[21:46:56] oh, and now gerrit's broken
[21:48:49] that's awesome
[21:48:59] csteipp: cmjohnson1 re, was at lunch
[21:49:12] worksforme
[21:49:18] mutante: how dare you.... ;-)
[21:49:24] gerrit has had temp. glitches earlier
[21:49:51] thought for a moment you mean to check why analytics hosts went down from context :p
[21:50:08] just NTP though
[21:50:31] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[21:51:03] lol @ topic branch name
[21:51:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32601
[21:51:44] mutante: No emergency, but we may need to update the de/it tables :(
[21:52:13] cmjohnson1: done
[21:52:21] cmjohnson1: oh ..:p
[21:52:31] csteipp: oh ..:P
[21:52:33] thx
[21:53:24] notpeter: i think yours was just missing the "Verified"
[21:53:32] yep
[21:53:51] si merged now?
[21:53:54] *is
[21:53:57] csteipp: missing "pageupdates"?
[21:54:02] notpeter: yes, and sockpuppet
[21:54:02] (sorry, splitbrain)
[21:54:22] No, an extension had a schema update, and they were running the old versions on de/it
[21:54:39] And I missed it when I ran the update.php
[21:54:44] ah, ok
[21:54:58] well, as long as it is not "en":)
[21:55:24] mutante: awesome! thank you
[21:55:24] Oh, I may need to update the revision table on en... could you do that tonight ;)
[21:55:28] cmjohnson1: sorry, for distracted
[21:55:31] *got
[21:55:59] np...i appreciate you looking at it for me
[21:58:11] So in all seriousness, if we have a .sql file that affects some small tables, is it possible to run that on s3?
[21:58:21] Or do we have to run the updates and reimport?
[21:58:39] run the updates locally, that is..
[21:59:27] I think it could be run
[21:59:48] but the update contents and how small they really are should be checked before, of course
[22:01:07] rename columns, add a column, and drop / add an index. DBs have 7949 rows each.
[22:02:06] 2 indexes that is. And create 1 table;
[22:08:25] sbernardin: are you still on 12?
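The schema update csteipp describes (rename columns, add a column, drop and re-add indexes, create one table) is small enough to sketch. A hypothetical illustration using sqlite3 as a stand-in; the table, column, and index names here are invented, not the extension's real schema:

```python
import sqlite3

# Stand-in schema, names invented for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ext_data (old_name TEXT, value INTEGER)")
cur.execute("CREATE INDEX ix_old ON ext_data (old_name)")

# Rename a column (RENAME COLUMN needs SQLite >= 3.25, bundled with modern Python).
cur.execute("ALTER TABLE ext_data RENAME COLUMN old_name TO new_name")
# Add a column.
cur.execute("ALTER TABLE ext_data ADD COLUMN flags INTEGER DEFAULT 0")
# Drop and re-add an index against the renamed column.
cur.execute("DROP INDEX ix_old")
cur.execute("CREATE INDEX ix_new ON ext_data (new_name)")
# Create one new table.
cur.execute("CREATE TABLE ext_log (id INTEGER PRIMARY KEY, note TEXT)")
conn.commit()

cols = [row[1] for row in cur.execute("PRAGMA table_info(ext_data)")]
print(cols)  # → ['new_name', 'value', 'flags']
```

On tables of only ~8000 rows each, operations like these complete almost instantly, which is why running the .sql directly on s3 was plausible; the caveat in the chat (check the update's actual contents first) still stands.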
[22:08:48] cmjohnson1: back on 10
[22:09:11] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32603
[22:09:21] okay
[22:10:53] New patchset: Cmjohnson; "Adding labsdb1 and 3 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:11:21] mutante: can you check the change ^ forgot about it earlier
[22:11:53] !log dzahn synchronized ./wmf-config/InitialiseSettings.php
[22:11:59] Logged the message, Master
[22:12:18] Danny_B|backup: Reedy : ^
[22:12:44] thank you very much
[22:14:32] cmjohnson1: has one little red thingie in 65
[22:14:35] Danny_B|backup: you're welcome
[22:15:32] mutante: not sure why..it is the correct space...odd
[22:15:55] oh i see
[22:18:25] New patchset: Pyoungmeister; "removing mw60 from bits apaches pool again :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32665
[22:19:13] New patchset: Cmjohnson; "Adding labsdb1 and 3 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:19:38] mutante: fixed
[22:20:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:20:45] cmjohnson1: on sockpuppet
[22:21:06] cool
[22:21:07] thx
[22:21:11] np
[22:21:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32665
[22:35:42] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:22] !log provisioning labsdb1
[22:41:27] Logged the message, Master
[22:41:49] New patchset: Asher; "deploy ganglia plugin for redis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32667
[22:46:39] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours
[22:47:42] Change restored: Asher; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471
[22:47:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32667
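The "little red thingie" Gerrit paints on a line is usually a whitespace lint hit: trailing spaces or a tab where spaces are expected, which is consistent with mutante's "it is the correct space" confusion. A minimal checker in that spirit (the sample config lines are invented for illustration):

```python
def whitespace_problems(lines):
    """Return (line_number, reason) pairs for common whitespace lint hits."""
    problems = []
    for n, line in enumerate(lines, start=1):
        # Trailing whitespace: stripping only the newline differs from a full rstrip.
        if line.rstrip("\n") != line.rstrip():
            problems.append((n, "trailing whitespace"))
        # Hard tabs, which many style checks flag in space-indented configs.
        if "\t" in line:
            problems.append((n, "tab character"))
    return problems

# Illustrative dhcpd-style snippet; the second line carries a trailing space.
sample = [
    "host labsdb1 {\n",
    "  hardware ethernet 00:00:00:00:00:00; \n",
]
print(whitespace_problems(sample))  # → [(2, 'trailing whitespace')]
```

Catching this locally (or letting the lint job do it, as in the chat) avoids a re-upload round trip through Gerrit.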
[22:47:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471
[22:52:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:53:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.457 seconds
[23:01:25] New patchset: Pyoungmeister; "marking mw60 to use applicaitonserver/mediawiki modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32669
[23:03:38] New patchset: Asher; "move redis ganglia module to its own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32670
[23:04:12] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32670
[23:06:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32669
[23:08:51] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused
[23:09:27] PROBLEM - SSH on mw60 is CRITICAL: Connection refused
[23:17:25] !log added some redis metrics to ganglia for mc1-16
[23:17:31] Logged the message, Master
[23:19:30] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[23:28:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.063 seconds
[23:38:33] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds
[23:42:27] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: Offset unknown