[00:11:58] mutante: So, I see that Apache is upset about /srv/org/wikimedia/controller/wikis/w/index.php/ but I also see that that file is present… I don't know enough to know if that trailing / is expected or not. [00:12:27] andrewbogott: on virt0 ? [00:12:32] yep [00:13:13] The files on virt0 all look as they should be to me. [00:13:31] ok, thats a bit better, document root /var/www looked empty [00:13:46] these are Aliases..ok [00:14:19] yes, index.php is a file, the trailing / seems weird, i see you already removed it ,right [00:14:31] Wait, I'm not aware of having removed it... [00:14:33] ? [00:15:22] Alias /wiki /srv/org/wikimedia/controller/wikis/w/index.php [00:15:41] Sure, that part is right. But apache error log is saying "File does not exist: /srv/org/wikimedia/controller/wikis/w/index.php/" [00:15:51] every time I try to load the site I get a couple of those. [00:16:14] Which, I presume that error message is the same thing as the 404 [00:17:19] * andrewbogott knows very little about how php behaves. [00:22:50] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [00:22:50] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:22:50] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:23:25] New patchset: Mwalker; "CN Removing Payments Wiki for CN Reflection" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32480 [00:24:05] andrewbogott: "w" is a link to "slot1" and slot1 appears to have changed today [00:24:18] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32480 [00:25:19] but not its contents [00:25:22] !log reedy synchronized php-1.21wmf3/extensions/SiteMatrix/ [00:25:28] Logged the message, Master [00:25:38] mutante: I see that… no idea what that signifies. [00:26:29] Is Ryan running git deploy on it? 
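An aside on the "File does not exist: /srv/.../index.php/" error discussed above: with `Alias /wiki /srv/.../index.php`, a request for `/wiki/Some_Page` maps onto `index.php/Some_Page`, and Apache walks that path back to the longest existing file, treating the remainder as PATH_INFO. If nothing handles the PHP file (foreshadowing the resolution later in this log), the leftover trailing component makes the whole thing look like a missing file. A minimal sketch of that mapping — a simplified model, not Apache's actual code, with hypothetical helper names:

```python
# Simplified model of "Alias /wiki /srv/w/index.php" request mapping.
# split_path_info() mimics Apache's walk: find the longest prefix of the
# mapped filename that is an existing file, and treat the rest as
# PATH_INFO. All names here are hypothetical, for illustration only.

def apply_alias(url_path, alias, target):
    """Rewrite a URL path according to an Apache-style Alias."""
    if url_path == alias or url_path.startswith(alias + "/"):
        return target + url_path[len(alias):]
    return None

def split_path_info(fs_path, is_file):
    """Split a mapped path into (existing file, trailing PATH_INFO)."""
    parts = fs_path.split("/")
    for i in range(len(parts), 0, -1):
        candidate = "/".join(parts[:i])
        if is_file(candidate):
            return candidate, fs_path[len(candidate):]
    return None, fs_path

DOCROOT_FILES = {"/srv/w/index.php"}  # pretend filesystem

mapped = apply_alias("/wiki/Main_Page", "/wiki", "/srv/w/index.php")
script, path_info = split_path_info(mapped, DOCROOT_FILES.__contains__)
print(script, path_info)  # /srv/w/index.php /Main_Page
```

When a PHP handler is present it claims `index.php` and receives `/Main_Page` as PATH_INFO; when it is absent, Apache has no handler to stop the walk and the request surfaces as a 404 on `index.php/` even though the file exists on disk.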
[00:27:09] The git log shows the last patch from 9/4 [00:27:33] which is weird, I'm sure I've changed things since then. [00:27:50] Reedy: you deploying anything? i need to do one final sync on CommonSettings [00:28:26] oh, nm, I'm confusing mediawiki with openstackmanager [00:30:26] andrewbogott: seems like on of the last things done by root before this was related to cd w/extensions/OpenStackManager/ [00:30:44] That's me, just now :) [00:30:50] oh,ok [00:31:50] !log pgehres synchronized wmf-config [00:31:56] Logged the message, Master [00:32:12] people rarely do sync-dir on wmf-config [00:32:17] * AaronSchulz just noticed that [00:32:38] AaronSchulz: reedy told me to sync-dir it instead of sync-file [00:32:59] not saying it;s bad [00:33:08] RobH, how are your apache troubleshooting skills? [00:47:02] mutante - ping [00:47:35] woosters: pong [00:47:59] what's brewing there? [00:48:57] labs ? [00:49:24] woosters: Apache is misbehaving and… I could use the help of someone who knows how to troubleshoot apache. [00:49:35] woosters: i dont know, i could not see anything obvious either to help Andrew and really busy with those mysql imports [00:49:40] Mutante and I have looked at the obvious, and I've restarted a few things… [00:49:53] this is labs, right? [00:49:58] virt0 [00:51:21] woosters: yes, labs [00:51:24] the labs wiki [00:51:46] labsconsole.wikimedia.org == virt0 [00:52:02] and labs = redirect to labsconsole..yep [00:52:56] !log LocalisationUpdate failed: git pull of extensions failed [00:53:02] Logged the message, Master [01:01:05] Errors were encountered while processing: /var/cache/apt/archives/python-iso8601_0.1.4-1ubuntu1_all.deb [01:02:31] New patchset: Pyoungmeister; "removing es4 from db.php for transition to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32482 [01:09:34] andrewbogott: arg.. 
look this http://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg3883169.html [01:09:43] [Bug 1073289] [NEW] nova-common has an incorrect dep on python-nova (= 2012.1-0ubuntu2) [01:10:05] that seems to fit the dpkg error, no idea though how that would be related to Apache though [01:12:22] PROBLEM - HTTP on kaulen is CRITICAL: Connection refused [01:13:16] kaulen: Syntax error on line 5 of /etc/apache2/sites-enabled/codereview-proxy.wikimedia.org: [01:13:21] guys, what is going on [01:13:33] i really cant multi-task any longer :p sigh [01:16:21] Change abandoned: Asher; "the prior values *should* work from looking at the math in pybal.. looking deeper there." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471 [01:16:50] ok, dpkg upgrades are breaking Apaches [01:17:04] 2012-11-09 01:08:51 status half-configured apache2 2.2.14-5ubuntu8.10 [01:18:07] !log installing package upgrades on kaulen, half-configured apache2 brought down bugzilla ... [01:18:14] Logged the message, Master [01:18:31] mutante: do those happen automatically? [01:19:18] robla: i would not have expected it on kaulen at all, but it looks like it indeed [01:19:52] that's a treat [01:20:02] unless somebody started them via dsh [01:20:13] or puppet ensure "latest" for specific packages [01:20:27] its not like there is a cronjob to install anything it finds... [01:21:12] Invalid command 'php_admin_flag', [01:21:21] binasher: You have some apache know-how don't you? We could use a little help here. [01:21:25] where is the command above? [01:21:41] hmm? [01:22:01] Apache is freaking out on virt0… complaining that it can't find a file that is obviously there. [01:22:10] virt0 or kaulen? [01:22:18] (Which may or may not be a result of a broken upgrade.) [01:22:18] different issues? [01:22:18] binasher: both :o [01:22:24] binasher: Possibly both, possibly the same issue. 
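The `status half-configured` lines pasted above come from the `/var/log/dpkg.log` format (`date time status STATE package version`). A quick sketch of how one might scan such a log for packages stuck in a half-configured state; the input here is hard-coded sample data modeled on the entries in this log, not a real file read:

```python
# Minimal sketch: scan /var/log/dpkg.log-style lines for packages whose
# most recent state is "half-configured" (like apache2 above). Sample
# data is hard-coded; on a real host you would read the actual log.

SAMPLE_LOG = """\
2012-11-09 01:08:45 status unpacked apache2 2.2.14-5ubuntu8.10
2012-11-09 01:08:51 status half-configured apache2 2.2.14-5ubuntu8.10
2012-11-09 01:23:20 status unpacked libapache2-mod-php5 5.3.2-2wm1
2012-11-09 01:23:24 status half-configured libapache2-mod-php5 5.3.2-2wm1
2012-11-09 01:23:30 status installed libapache2-mod-php5 5.3.2-2wm1
"""

def half_configured(log_text):
    """Return packages whose latest logged state is half-configured."""
    last_state = {}
    for line in log_text.splitlines():
        fields = line.split()
        # dpkg.log status lines: date time "status" state package version
        if len(fields) >= 5 and fields[2] == "status":
            last_state[fields[4]] = fields[3]
    return sorted(p for p, s in last_state.items() if s == "half-configured")

print(half_configured(SAMPLE_LOG))  # ['apache2']
```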
[01:22:33] not sure yet if its really the same [01:22:40] on virt0 Apache runs [01:22:43] on kaulen it cant [01:23:00] both have in common that there are failed dpkg upgrades though [01:23:01] and the timing [01:23:41] kaulen is up now [01:23:46] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.002 seconds [01:23:57] yup, bz is fine [01:23:59] did you do anything? [01:24:09] Apache start failed a few seconds ago [01:24:13] installed libapache2-mod-php5 [01:24:18] missing the php_admin_flag [01:24:25] oh..duh?! [01:24:42] binasher: If you have a moment to look at virt0, I will stand back so you can get a clear picture. [01:24:55] 2012-11-09 01:23:24 status half-configured libapache2-mod-php5 5.3.2-2wm1 [01:25:07] ok, so that was half-configured as well [01:25:14] andrewbogott: what's wrong with virt0? [01:25:41] binasher: virt0 is labsconsole.wikimedia.org [01:25:41] apache is running there, unlike what was up with kaulen [01:25:47] binasher: https://labsconsole.wikimedia.org/ [01:25:52] It's 404ing with an obvious error in the error log... [01:26:00] well, obvious, except I don't know why it's happening. [01:26:21] running labsconsole, that is indeed something wrong [01:26:25] binasher: it wants to downgrade a python package [01:26:34] and that fails.. and it is also python-keystone depends on python-iso8601 [01:26:36] mutante: No, I fixed that part I think. [01:26:39] ok [01:26:43] apt-get is happy now. [01:26:45] Best I can tell. [01:27:37] ok, it's better [01:27:50] php install issue again [01:28:02] yeah,this makes me think we might get more of them soon [01:28:06] whenever puppet runs [01:28:14] so.. what'd you guys do? [01:29:25] Wait, you fixed it just like that? What'd you do? [01:29:42] binasher: We didn't do anything, stuff just started failing. Presumably because of some cron-initiated upgrade. 
[01:29:52] magic rainbow sauce [01:29:52] binasher: no idea..apparently labs was down since quite a while [01:30:11] and then bugzilla..it just happened a few minutes ago [01:30:13] True story: I just said "Maybe if I go to the bathroom Asher will have it fixed when I get back" and, sure enough. [01:30:21] let me run puppet and see if it breaks again [01:30:34] and then go thru logs on virt0 [01:30:56] andrewbogott: missing libapache2-mod-php5 was also to blame [01:31:38] binasher: How did you know? Was that in the apt-get upgrade scroll and I just missed it? [01:32:27] nope, i just saw that php code wasn't running and looked at how it was installed / configured. apt was happy :/ [01:32:51] Hmph. [01:32:59] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32482 [01:33:12] binasher: You started a puppet run already, or shall I do that? [01:34:24] i did, didn't break anything [01:34:38] Great! Then maybe that means I can finally take my Very Patient Date to dinner. [01:34:56] thanks andrewbogott! [01:35:22] binasher: but but but....any idea how libapache2-mod-php5 disappeared from those machines? [01:35:24] robla: I provided the valuable service of waiting for Asher to log in :) [01:35:45] andrewbogott: you sacrificed appropriate chickens [01:36:01] !log py synchronized wmf-config/db.php 'pulling es4' [01:36:07] Logged the message, Master [01:36:11] robla: There was definitely a busted apt-get upgrade happening, so… probably we shouldn't be doing that when no one's watching. 
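The post-mortem in this log hinges on correlating apt transactions — `/var/log/apt/history.log` records each run as a `Start-Date:` / `Commandline:` / `End-Date:` block, and entries in that format get pasted into the channel. A small sketch of pulling those transactions out for timeline reconstruction; the sample data is hard-coded and mirrors the entries quoted in the log:

```python
# Sketch: parse /var/log/apt/history.log-style blocks to see which
# transaction ran when -- e.g. relating a 23:01 "install apache2" run
# to a later "apt-get upgrade". Sample data is hard-coded.

SAMPLE_HISTORY = """\
Start-Date: 2012-11-08 23:01:23
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apache2
End-Date: 2012-11-08 23:01:40

Start-Date: 2012-11-09 00:58:21
Commandline: apt-get upgrade
End-Date: 2012-11-09 00:58:55
"""

def transactions(history_text):
    """Yield (start_date, commandline) pairs from an apt history log."""
    start, cmd = None, None
    for line in history_text.splitlines():
        if line.startswith("Start-Date:"):
            start = line.split(":", 1)[1].strip()
        elif line.startswith("Commandline:"):
            cmd = line.split(":", 1)[1].strip()
        elif line.startswith("End-Date:") and start and cmd:
            yield start, cmd
            start, cmd = None, None

for when, cmd in transactions(SAMPLE_HISTORY):
    print(when, "->", cmd)
```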
[01:37:36] we dont, it's not like we do automatic apt-get upgrade , besides on labs instances [01:37:47] but we do "ensure => latest;" in puppet manifests in some places [01:37:52] like webserver.pp [01:38:21] which has kind of the same effect , but just for a few packages [01:38:21] this: [01:38:22] Start-Date: 2012-11-08 23:01:23 [01:38:23] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install apache2 [01:38:41] pulled in libapache2-mod-php5filter which is incompatible with libapache2-mod-php5 [01:38:57] so it was broken before this: [01:38:58] Start-Date: 2012-11-09 00:58:21 [01:38:59] Commandline: apt-get upgrade [01:39:11] if we are going to do anything of that variety automatically, we should probably at least have it also drop an entry into the server admin log [01:40:03] I'm guessing this sort of thing gets done automatically in cases where our track record for doing it manually is spotty [01:40:20] Yeah, it seems simultaneously wise and foolish to automate it. [01:40:43] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 183 seconds [01:41:27] * andrewbogott -> dinner [01:42:53] the serveradmin log isn't an appropriate place for puppet generated messages, it would become unreadable. all of this stuff is logged, or i wouldn't be pasting it here [01:44:02] it's a code problem.. if you run a php app where package apache { ensure => latest } you need to use puppet to manage / ensure the php install as well [01:46:37] New patchset: Alex Monk; "(bug 41907) Enable patrolling on wikidatawiki." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [01:47:19] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [01:48:34]           _      _   _            _              [01:48:38]  __ _ ___| | __ | |_| |__   ___  | | ___   __ _  [01:48:42] / _` / __| |/ / | __| '_ \ / _ \ | |/ _ \ / _` | [01:48:46] | (_| \__ \   <  | |_| | | |  __/ | | (_) | (_| | [01:48:51] \__,_|___/_|\_\  \__|_| |_|\___| |_|\___/ \__, | [01:48:55]                                           |___/  [01:49:52] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments/experiments/openTask.js [01:49:58] Logged the message, Master [01:51:04] RECOVERY - MySQL Slave Delay on es4 is OK: OK replication delay seconds [01:53:28] PROBLEM - mysqld processes on es4 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:58:11] !log starting innobackupex from es3 to es4 [01:58:16] Logged the message, notpeter [01:58:23] !log converted all s3 searchindex tables to innodb [01:58:29] Logged the message, Master [02:00:22] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 275 seconds [02:00:49] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 302 seconds [02:23:01] !log LocalisationUpdate completed (1.21wmf3) at Fri Nov 9 02:23:01 UTC 2012 [02:23:08] Logged the message, Master [02:25:15] PROBLEM - Host search-pool3.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.13) [02:25:16] PROBLEM - Host search-pool1.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.11) [02:25:16] PROBLEM - Host search-prefix.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.15) [02:26:09] PROBLEM - Host search-pool4.svc.eqiad.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.2.14) [02:26:10] PROBLEM - Apache HTTP on srv301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:10] PROBLEM - Apache HTTP on srv297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[02:26:10] PROBLEM - Apache HTTP on srv298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:10] PROBLEM - Apache HTTP on srv294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:18] PROBLEM - Apache HTTP on srv253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:18] PROBLEM - Apache HTTP on srv257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:28] PROBLEM - Apache HTTP on srv251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on mw65 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:36] PROBLEM - Apache HTTP on srv300 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:37] PROBLEM - Apache HTTP on srv250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:45] PROBLEM - Apache HTTP on srv252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:54] PROBLEM - Apache HTTP on srv216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:03] PROBLEM - Apache HTTP on mw74 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:12] PROBLEM - Apache HTTP on srv295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:12] PROBLEM - Apache HTTP on srv290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:39] RECOVERY - Apache HTTP on srv301 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [02:27:39] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [02:27:39] RECOVERY - Apache HTTP on srv298 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:27:39] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:27:48] RECOVERY - Apache HTTP 
on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [02:27:48] RECOVERY - Apache HTTP on srv257 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [02:27:57] RECOVERY - Apache HTTP on srv251 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:28:06] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [02:28:06] RECOVERY - Apache HTTP on srv300 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [02:28:07] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [02:28:07] RECOVERY - Apache HTTP on srv250 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:28:07] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.063 second response time [02:28:07] RECOVERY - Host search-pool1.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [02:28:07] RECOVERY - Host search-prefix.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [02:28:08] RECOVERY - Host search-pool3.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 27.25 ms [02:28:15] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [02:28:24] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [02:28:33] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.068 second response time [02:28:42] RECOVERY - Apache HTTP on srv295 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [02:28:42] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [02:29:00] RECOVERY - Host search-pool4.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 27.22 ms [02:32:37] PROBLEM - Puppet freshness 
on db62 is CRITICAL: Puppet has not run in the last 10 hours [02:47:02] !log olivneh synchronized php-1.21wmf3/extensions/E3Experiments/Experiments.hooks.php 'Fixing OpenTask event log bug' [02:47:09] Logged the message, Master [03:34:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [03:44:00] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:50:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [04:04:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:04:33] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:04:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:53:00] New patchset: Tim Starling; "(bug 41907) Enable patrolling on wikidatawiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [04:53:07] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32484 [05:00:00] !log tstarling synchronized wmf-config/InitialiseSettings.php 'enabling patrol on wikidatawiki' [05:00:08] Logged the message, Master [05:19:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] /away zzz [06:17:53] mutante: slap wel! [07:05:32] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [08:46:24] New review: Hashar; "Nice stefan that is good start." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32192 [08:55:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:33:13] PROBLEM - Host foundation-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host wikiversity-lb.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host bits.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:33:14] PROBLEM - Host bits-lb.pmtpa.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:31] RECOVERY - Host bits-lb.pmtpa.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [09:33:31] RECOVERY - Host bits.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [09:34:07] RECOVERY - Host foundation-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [09:34:07] RECOVERY - Host wikiversity-lb.pmtpa.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:34:24] hmmm [10:06:30] New patchset: Hashar; "Enable AFTv5 on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32061 [10:07:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32061 [10:21:46] morning [10:21:51] lost all the fun it seems [10:22:07] didn't even hear the page [10:22:56] stupid linux route cache [10:23:55] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [10:23:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:23:56] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:32:21] I heard it but it was already cleared up (I got both sets at once) when I looked [10:32:47] hi apergos [10:32:50] yo [10:32:52] so, what's the status of ms-be6? 
[10:33:18] it has an install on it (I went to bed at midnight after getting the install to finally work) [10:33:29] all the disks are partitioned and xfsed and mounted [10:36:30] as far as I'm concerned it could go into the pool now [10:36:48] what about puppet/partman, did you fix that? [10:36:54] uh huh [10:36:55] * paravoid git pulls [10:37:08] self reviewed too :-/ [10:38:30] looks fine to me [10:38:56] need popst commit review workflow for gerrit >_< [10:39:01] *post [10:48:08] so, wait [10:48:13] where's sdm3? [10:48:27] (and sdn3) [10:50:03] hmm [10:50:08] where are those created? [10:54:26] nowhere apparently [10:54:53] maybe they were a casualty of the recent partman consolidations [10:54:55] and we can't even use swift::create_filesystem without rewriting that, as it's also changing the partition table [10:55:03] don't think so [10:55:13] hadn't touch ms-be-with-ssd yet [10:55:24] well that's irritating [10:55:52] meh, let's just run mkfs and mount now [10:56:33] at least the partition is there [10:58:41] done [10:58:41] I see you [10:58:58] I already makde them, I guess you edited the fstab [10:59:00] maybe you remade my fses [10:59:00] whatever [11:02:41] so, wanna add it to the rings? [11:02:55] sure [11:06:00] we have to find out in which rack it is [11:06:05] and add it to the same or a new zone [11:06:11] that's the only "catch" [11:06:22] oh [11:09:10] so am I translating b4 to zone 4? [11:09:51] don't remember [11:09:56] check the other zones? 
[11:10:08] they are numbers [11:10:19] duh [11:10:28] check which zones the servers in the same rack are in [11:11:57] the numbers seem to have no relation to the racks [11:12:04] well at least the first one I checked [11:12:30] they don't have to have any arithmetic relation [11:12:38] the point is that servers in the same rack should be in the same zone [11:12:41] ok [11:12:53] we have a list of servers and their zones and a list of servers and their racks [11:13:13] so, we can figure it out :) [11:15:41] the notes have "move ms-be6 to zone 8" [11:15:59] that's from before Ben left [11:16:13] and assume that the 720 is in the same rack as the C2100 I guess [11:16:15] ok well I dunno about "move" it but that's what I have added [11:16:18] and let's double check that anyway [11:16:23] I looked in rackables [11:17:29] i believe the idea was to do drop-in replacements [11:17:33] racking at the exact same locations etc [11:17:49] yeah [11:20:28] ok I have some rings sitting in root@ms-be11:~/swift-rings/swift if you wanna look em over (not rebalanced yet) [11:20:39] sure [11:21:26] so ms-be6 is in the same rack as ms-be5? [11:21:35] yes [11:21:50] according to racktables [11:22:27] looks all good to me [11:22:42] ok [11:22:48] I"m gonna rebalance and shove em out then [11:23:15] yep [11:23:52] NOTE: Balance of 33.33 indicates you should push this [11:23:53] ring, wait at least 3 hours, and rebalance/repush. [11:23:55] we need to do that? [11:25:12] if that's what it says :) [11:25:15] did you push them? [11:25:22] if not, I have a different idea [11:25:33] if you did, that's okay too [11:25:40] I have not pushed them, this is on the rebalance [11:26:22] okay, so I'd say let's start them on reduced weight [11:26:25] e.g. 
33 [11:26:34] for the object rings anyway, the others are small [11:26:45] it dosn't whine for the object rings [11:26:45] but to do that you have to start from scratch :) [11:27:20] it only complains about container and account [11:27:35] yeah, but still [11:27:45] so do them all with 33 then? [11:28:14] no account/container are basically nothing [11:28:25] so just object with 33 [11:28:26] ok [11:28:36] 35G per SSD [11:28:38] nothing [11:28:58] less actually, that's with 4 boxes with SSD, now we'll have 5 [11:32:13] ok the rebalanced rings are available if you want to lookk at em again [11:34:46] looks good [11:37:32] are owa1 and ms3 really in the swift cluster? [11:37:41] (ganglia thinks they are) [11:37:43] they're in /a/ swift cluster [11:37:57] heh [11:38:02] * apergos ignores them [11:38:27] yeah [11:38:31] we should clean this up [11:43:37] RECOVERY - Puppet freshness on ms-fe3 is OK: puppet ran at Fri Nov 9 11:43:22 UTC 2012 [11:49:28] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Fri Nov 9 11:48:53 UTC 2012 [11:50:49] eh? [11:50:51] what? [11:51:02] apergos: did you run puppet on ms-fe3? [11:51:10] yes [11:52:28] did it change anything else? 
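Earlier in this log, placing ms-be6 came down to the rule that servers in the same rack should share a swift zone: combine a server-to-rack list (racktables) with a server-to-zone list (the ring) and look up the peers. A toy sketch of that lookup — the mappings below are hypothetical sample data, not the real racktables or ring contents:

```python
# Sketch of the zone lookup discussed above: a new swift backend should
# join the zone of whatever ring members already sit in its rack.
# Both mappings are hypothetical samples (real sources: racktables and
# the swift ring builder files).

RACK_OF = {"ms-be5": "b4", "ms-be6": "b4", "ms-be7": "c3"}
ZONE_OF = {"ms-be5": 8, "ms-be7": 5}  # servers already in the ring

def zone_for(server, rack_of, zone_of):
    """Pick a zone: reuse the zone of ring members in the same rack."""
    rack = rack_of[server]
    peers = {z for s, z in zone_of.items()
             if s != server and rack_of.get(s) == rack}
    if len(peers) == 1:
        return peers.pop()
    if not peers:
        return None  # empty rack: allocate a fresh zone instead
    raise ValueError("rack %s spans multiple zones" % rack)

print(zone_for("ms-be6", RACK_OF, ZONE_OF))  # 8
```

With ms-be5 sampled as zone 8, the lookup agrees with the "move ms-be6 to zone 8" note mentioned in the log.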
[11:52:38] I had disabled until I push the ubuntu cloud archive apt [11:53:10] /usr/lib/ganglia/python_modules/memcached.py added [11:53:22] no, removed [11:53:24] sorry [11:53:36] /tmp/puppet-file20121109-11696-v8eii4-0 this was added [11:53:59] New review: Mark Bergsma; "+20000" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/32368 [11:54:22] hahaha [11:55:33] apergos: you're lucky I was cautious enough to commit 1.7 configs without having 1.7 in the apt repo [11:55:42] :) [11:55:42] nice [11:56:02] but that should really read "*you*" are lucky (not me) :-P [11:56:13] hehe I guess [11:56:37] anyways there were no whines, just refresh of gmond [11:58:28] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Fri Nov 9 11:58:13 UTC 2012 [12:03:22] !log ms-be6 deployed and back in swift rings on shiny new 720xd [12:03:29] Logged the message, Master [12:03:34] :-) [12:05:55] how do you like the 720xd so far? [12:06:23] except for the little annoyance with the ssds, fine [12:07:04] not much to say about it, configuration is otherwise quite straightfoward [12:28:57] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:29:31] PROBLEM - Varnish HTTP mobile-backend on cp1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:58] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:19:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [13:26:47] New patchset: Dereckson; "(bug 41912) Enable WebFonts and Narayam on betawikiversity." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32564 [13:49:03] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [13:50:41] one of these days I'll debug that [14:05:24] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:05:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:05:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:06:44] puppet apply --noop --modulepath=/home/paravoid/wikimedia/puppet/modules foo.pp [14:06:48] yay [14:26:03] New patchset: Faidon; "Remove a few spurious no-op apt pins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:26:03] New patchset: Faidon; "Remove apt::ppa-req and apt::key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:26:03] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:34:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32368 [14:35:47] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32367 [14:36:16] !log authodns committing mgmt ip for labsdb3 [14:36:23] Logged the message, Master [14:40:41] paravoid: hello Faidon -:] Would you be available for a few puppet merges related to continuous integration ?
[14:45:37] shoot [14:47:42] https://gerrit.wikimedia.org/r/#/c/31235/ and https://gerrit.wikimedia.org/r/#/c/31234/ [14:48:15] paravoid: the changes are made to tweak the apache configuration for http://integration.mediawiki.org/nightly/mediawiki/core/ [14:48:30] one make it so the nightly snapshots are sorted by date [14:48:39] the other simply add a link on the front page : http://integration.mediawiki.org/ [14:48:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31234 [14:48:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31235 [14:49:17] another change is https://gerrit.wikimedia.org/r/#/c/32462/ which is to install nodejs on gallium (contint server) [14:49:27] we are going to use nodejs to lint our javascript and css files [14:49:30] on any project [14:49:40] (not to serve content to internet user, that will be only be for CLI scripts) [14:49:59] it get some nodejs version which has been build for WMF parsoid [14:50:25] yeah, I wanted to warn you about that [14:50:31] we have I think two users already [14:50:36] of node, in the whole infrastructure [14:50:47] paravoid: have you had a chance to look at ms-be6 yet? [14:50:53] and this has resulted into package updates, and I'm not sure if they're keeping compability [14:51:09] cmjohnson1: yeah, we chatted about it with apergos earlier; apergos put it into the swift so-called ring files [14:51:16] so it means it gets data and traffic now [14:51:30] I think it's safe to say we can give the go-ahead to Dell [14:51:43] okay...cool! 
thx [14:51:49] I don't think there's any reason to wait through the weekend [14:52:27] ok...wanted to wait for the go ahead from you....i will let them know now [14:52:29] paravoid: talking to Krinkle about the nodejs upgrade [14:52:40] hashar: so, node evolved quickly as you say in the commit, but they might be breaking compatibility as they go [14:52:43] not much you can do [14:53:04] paravoid: not going to break our use cases. Node js mostly keep back compatibility and we will update our script from their latest version [14:53:06] I do worry that you might want a newer version than the parsoid people or the other way around though [14:53:18] ohh [14:53:24] but I think I'll just let you worry about that :P [14:53:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32462 [14:53:50] so most of the nodejs scripts would be handled by Krinkle who is also closely following what the parsoid team is doing with node [14:54:00] so I guess that would be fine. If it ever broke, we will update/fix our script [14:54:15] I am not too worried about it, but thanks for the warning! ;-]]]]] [14:54:30] all merged [14:54:37] \O: [14:54:40] \O/ [14:54:52] New patchset: Faidon; "Initial attempt for an apt module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:55:06] hashar yo [14:55:14] no progress yet on wiki renaming :/ [14:55:19] nodejs does sometimes break compatibility, but gruntjs will cover for that with its abstraction layer [14:55:24] can I interest you in running those tests :p [14:55:37] ShiroiNeko: that has been an issue for like 7 years. 
Don't expect anything to happen in the next weeks :-/ [14:55:37] New review: Faidon; "Mark gave it a glance too, so this is not entirely self-reviewed :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/32369 [14:55:38] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32369 [14:55:53] hashar I dont expect resolution tomorow [14:56:07] I just one someone to have it on their agenda so that they will tackle it at some point [14:56:09] AltKrinkle: good to know [14:56:17] paravoid: and another one is extracting the PHP linter script we have from the main puppet scope to a module https://gerrit.wikimedia.org/r/#/c/29937/ . [14:56:40] mostly copy pasting / renaming from /manifests to /modules/wmfscripts [14:56:44] sec now [14:56:47] though that change might need some discussion [14:56:51] with tests more bugs possibly will pop up [14:56:55] I just merged something big and I want to make sure I didn't break everything [14:59:25] New review: Silke Meyer; "I would be so grateful not to talk about tabs vs. spaces any longer. Please tell me *the one way* [T..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593 [14:59:55] where's ^demon [15:00:02] I just found something really interesting :) [15:00:20] i gave it a +1, not just a glance ;) [15:02:30] ah, missed that [15:02:41] stupid gerrit resets +1/+2 on updated patchsets [15:03:14] it'd be nice to have a history "previous versions +1/+2/-1/-2" [15:20:48] Krinkle: https://gerrit.wikimedia.org/r/#/c/32475/ [15:21:06] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:02:14] ottomata: I forgot to tell you [16:02:26] but when i was unboxing them i managed to mangle my hand up (scrped skin off knuckles) [16:02:38] so these servers have been blooded, which is important [16:02:41] servers need blood. 
[16:02:45] daww [16:02:52] good to know [16:03:05] i'll edit the partman recipe to make use of your blood, thanks [16:03:14] blood lubricates the disk [16:03:25] now they are immune from gremlins and indeed, a .05% performance increase. [16:03:32] yeah, we have 12 disks on 12 servers, so we'll need a lot of blood [16:08:04] http://stackoverflow.com/users/319266/krinkle?tab=reputation [16:08:12] thanks Guest44503 [16:14:21] csteipp: woosters: still importing dumps..but we are making progress.. "en" should actually be done soon [16:14:40] progress! [16:14:43] Sweet! Thanks mutante! [16:14:59] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [16:15:09] New patchset: Hashar; "gallium:/var/lib/jenkins now belong to jenkins group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32572 [16:15:12] I missed my normal train, so I'll be in in about an hour.. [16:15:17] i would like to stay here and continue on the "de".. [16:15:27] \O/ [16:15:41] That would be very helpful [16:16:24] yea, they are like 30 min each or so [16:16:43] i will just keep going and keep you updated instead of moving to office right away [16:17:16] RobH, can you check this real quick? [16:17:16] https://gerrit.wikimedia.org/r/#/c/32571/ [16:17:24] the only bit i'm not sure about is the naming of my partman file [16:17:40] i'm not sure if anyone cares, but I didn't see any other partman recipes named after machine model [16:18:16] most tend to apply to more than a single model [16:18:28] right, its more the disk layout than the model that matters [16:18:45] ottomata: what's special about your recipe? 
[16:18:45] cause I would just say 'analytics-hadoop' or maybe 'analytics-kraken1' or something [16:18:57] I don't understand the comment at the top [16:18:58] ciscos start device lettering at sdc [16:19:04] oops [16:19:09] typo, was supposed to say sda [16:19:09] ahh [16:19:13] i did a find/replace after I wrote the comment [16:19:14] anyone mind reviewing a simple permission fix for the continuous integration server ? https://gerrit.wikimedia.org/r/#/c/32572/ it simply ensures that the jenkins user home directory has sane ownership [16:19:14] patching... [16:19:31] I'd like us to minimize the amount of recipes that we have and I've been working towards that [16:19:42] yeah [16:19:42] deduplicating them as I go, and trying to have some common content between them [16:19:46] heh, i would make it all lowercase cuz i just do that [16:19:49] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [16:19:50] the raid1-* family is like that [16:19:52] ok [16:19:56] but otherwise the vendor and model in there isnt a big deal [16:20:01] yeah, i like that, paravoid [16:20:05] this is kinda a simple recipe [16:20:15] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32572 [16:20:16] raid1 / + swap [16:20:23] md0 / and md1 swap [16:20:24] im torn if making tons of recipes is just easier [16:20:31] or if we would template this somehow [16:20:42] paravoid: you are sooo fast :-] thx [16:20:44] RobH: no it's not [16:20:53] (which would apply to the install server files) [16:21:07] we are getting to where we should template it prolly. [16:21:09] also look https://gerrit.wikimedia.org/r/#/c/32366/ which isn't ready for prime-time yet [16:21:14] but that's the idea [16:21:17] but for today, just adding another one like this is fine [16:21:45] paravoid: cool [16:21:47] ottomata: / and swap on raid and then what? [16:21:50] the whole space?
[16:22:18] ottomata: what's wrong with raid1-1partition.cfg then? [16:22:28] no, partially [16:22:33] don't want it to take up all the space [16:22:54] what will you do with the rest? [16:22:54] these disks are large (2TB) so only want it to use a few GB for / [16:23:00] what do you want to do with the rest? [16:23:05] jinx :-P [16:23:07] unsure at this time, which is why i didn't want to allocate it [16:23:12] could use the raid1 250GB file [16:23:15] we don't need it on / [16:23:20] oh, that file has a borked swap [16:23:20] hm, which one? [16:23:24] we have raid1-lvm.cfg that partitions the rest as LVM [16:23:34] or raid1-varnish.cfg that partitions the rest as XFS but does not mount it [16:23:35] the raid1-250G-1partition [16:23:36] yeah but RAID 1 [16:23:41] but it needs to be updated [16:23:44] or raid1-squid.cfg which partitions it and leaves it unformatted [16:23:44] as it doesnt raid the swap [16:23:47] which is a mistake. [16:23:47] i was thinking [16:23:58] (as the swap disk failing when not raided will result in system crash) [16:24:09] raid 1 / - xGB [16:24:09] or raid1.cfg which partitions it, formats it and mounts it on /srv [16:24:09] raid 1 swap - xGB [16:24:09] but then the rest of sda and sdb not on raid one [16:24:21] RobH: I've fixed that on all the raid1-* ones [16:24:23] so we could use them as single disks (for hadoop or similar) if we wanted to [16:24:33] paravoid: ahh, awesome, its just my local copy then [16:24:39] ottomata: sounds like raid1-squid.cfg. [16:24:51] so ottomata i think the raid1-250GB-1partition is awesome for you [16:24:51] RobH: yeah, I've done tons of changes there [16:24:52] there's no raid1-250GB anymore [16:24:54] please pull first [16:24:57] oh, ok [16:25:09] ottomata: raid1-squid.cfg? [16:25:23] or raid1-varnish.cfg [16:25:32] or raid1-lvm.cfg [16:25:42] reading..
[16:26:08] heh well now i have to read all the new ones [16:26:12] so i have no input =P [16:26:17] i'd prefer if the extra space and the remaining disks weren't formatted [16:26:27] then it's raid1-squid.cfg. [16:26:39] or -lvm, depends on what you call "formatting" ;-) [16:26:49] RobH: told ya :-) [16:27:09] so no partitions on the rest of the disks is what ottomata wants [16:27:20] right [16:27:26] just the raid1 / and raid1 swap and the / should be smaller than the rest of the disk [16:27:27] on sda and sdb [16:27:31] the rest of the disks untouched [16:27:32] just md0 / and md1 swap [16:27:55] I think raid1-squid is what he wants [16:27:59] that way they can partition and use the remainder of sda and sdb as needed in the future [16:28:06] looking [16:28:13] let's reuse stuff and be DRY [16:28:17] reading raid squid [16:28:26] # - one extended partition per disk (sda5/sdb5) for the rest, [16:28:26] # used as a squid coss disk [16:28:28] squid has [16:28:36] that part is not wanted though =[ [16:28:44] what do you mean? [16:29:07] d-i is not doing the coss formatting [16:29:14] the squid makes an ext3, a swap, then puts an extended partition across the rest [16:29:15] i don't know how to read these recipes very well, i can read the auto-raid recipe fine [16:29:15] ? [16:29:25] i dont' see the extended partition (even though the comment says so) [16:29:40] 1000 1000 -1 linux-swap \ [16:29:40] method{ keep } \ [16:29:40] indeed, im not seeing it either [16:29:42] but expert_recipe i'm not sure how to read correctly [16:29:58] expert_recipe says [16:30:24] ottomata: hashar: re the "libanon" stuff on integration. You guys know better about where it actually makes sense (role / class). I actually told him to do a role class and used that as an example to answer his question how to actually get packages on prod. servers [16:30:28] 10 G raid, then 1 G raid, then swap? 
[16:30:50] mutante: ohhh [16:30:57] mutante: so let it be in a role case :-] [16:30:58] class [16:31:03] a role for just a package? [16:31:06] first primary partition 5-10GB used as an md device, second primary partition 1GB used as an md device, third (extended) partition don't touch [16:31:06] paravoid: So the squid one seems ok except for it adding a single extended partition across the rest of sda and sdb [16:31:06] ignore the linux-swap part [16:31:06] the 1gb is swap? [16:31:08] yes [16:31:08] the linux swap part was messing me up [16:31:12] it's not so difficult, it says so 10 lines below [16:31:17] I thought that role classes were only supposed to call system_role then call a parameterized class with some settings [16:31:18] so it makes sense now [16:31:19] ottomata: hashar: so that discussion should go on, but if that is like urgent and blocking people, i mean it is just installing openssl, i could also just do that and you merge puppet stuff later [16:31:22] the linux-swap part is what the fs type in the partition table will be [16:31:41] ottomata: So, squid would work in that it just puts an unused partition across sda and sdb remainder [16:31:53] mutante: ottomata: anyway they don't need openssl, it is already installed on there. They want libssl-dev or something similar [16:31:55] which since its not used, you could easily just rip out later, and it saves you adding new one off files to the repo [16:31:55] ottomata: pretty sure it will not stay just one package, it is merely starting it [16:32:03] hashar: really? duuuh [16:32:05] :p [16:32:18] mutante: and they are definitely going to install a LOT MORE dev packages :-] [16:32:20] RobH: eh? [16:32:30] how would he use the rest of the space if there's not a partition [16:32:36] the partition is a feature, not a bug [16:32:40] i think the raid1-squid.cfg will work for the analytics1011-1022 servers.
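To make the expert_recipe reading above concrete, here is a hedged sketch of a raid1-squid-style recipe. The sizes and layout are reconstructed from the discussion, not copied from the actual raid1-squid.cfg; only the final "1000 1000 -1 linux-swap … method{ keep }" stanza is quoted verbatim earlier in the log.

```
# Hypothetical reconstruction, NOT the real raid1-squid.cfg:
# two primary partitions per disk become RAID1 members (md0 for /,
# md1 for swap), and a final partition covers the rest of the disk
# but is left untouched by d-i. "linux-swap" on the last stanza only
# sets the partition-table type byte; method{ keep } means d-i never
# formats or mounts it.
d-i partman-auto/expert_recipe string \
    multiraid ::                      \
        5000 8000 10000 raid          \
            $primary{ }               \
            method{ raid }            \
        .                             \
        1000 1000 1000 raid           \
            $primary{ }               \
            method{ raid }            \
        .                             \
        1000 1000 -1 linux-swap       \
            method{ keep }            \
        .
```

The three numbers per stanza are partman's minimum size, priority, and maximum size in MB, with -1 meaning "use the rest of the disk".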
[16:32:49] i would create it, i'm not sure what an 'unused partition' is [16:32:50] but ok [16:32:51] ottomata: hashar: i just dont like the "generic::packages:foo" and then lots of them [16:32:56] but shrug..hmmm [16:32:56] it does a bit more than needed (we dont need it to add that extended partition) [16:33:03] yeah, i think a role would be fine, if it was called something appropriate [16:33:08] role::analytics::openssl [16:33:10] or whatever it was [16:33:14] isn't right [16:33:29] its just unmounted, non filesystemed space on disk [16:33:34] so its no big deal [16:33:36] ahhh ok [16:33:40] yeah that's cool [16:33:47] ok cool [16:33:51] 10GB is kinda small for / though [16:33:56] its done in squid since it will spin that space as coss space i think. [16:33:56] ottomata: an unused partition is "there is going to be an sda5 and an sdb5 sized 2000GB-11G that won't be formatted or otherwise touched by d-i" [16:33:57] but for you it wont matter, as you have no squid process [16:33:58] sorry, being picky now, i can live with it if I must! [16:34:06] indeed, 10g is small [16:34:11] that will fill with logs quickly [16:34:15] how is 10G small? [16:34:29] he has a 2tb disk. [16:34:32] so we have two overlapping conversations ;-D [16:34:34] how is it not? [16:34:50] hashar: I am totally able to keep up ;] [16:35:03] me too! [16:35:04] hehe [16:35:05] how does / relate to the size of the disk? [16:35:18] I have people around me speaking in Dutch, German and English :-D does not help [16:35:22] swift has 24T of disks, should we give it a 1T /? :) [16:35:27] well, if / is large enough, i probably won't have to think about it [16:35:31] (per box) [16:35:33] paravoid: if we have 2tb its silly to restrict their OS to only 10G and then deal with that small constriction [16:35:38] if it is small, then I will have to think about making a special /var/log partition one day probably [16:35:41] or /home [16:35:44] etc. etc.
[16:36:03] indeed [16:36:13] ottomata: mutante: regarding the ssl package, I don't mind. I commented on the ops report that I am 120% for us to install dev package on the CI machine. I will let the analytic team lead that :-] [16:36:31] at worst case, the raid1-squid.cfg is a perfect starting point, with only tweaks needed to / size [16:36:37] hashar, mutante: should there be some role::build::xxxx classes? [16:36:52] as matching to it will be matching to the new standard that paravoid pushed [16:37:04] (and it was due for cleanup, thx for doing that) [16:37:07] yeah, i could copy and rename to raid1-large or something? [16:37:11] hashar: yea, ok, thanks. i also dont mind, i just wanted to prevent it being installed on "gallium" in site.pp, but somehow be flexible for the future [16:37:13] raid1-200GB [16:37:13] ? [16:37:15] I think using lots of gigabytes per box for / on a cluster composed of a dozen of machines is a waste tbh [16:37:22] use logrotate, or remote logging or whatever [16:37:31] and you shouldn't use /home that much [16:37:34] but whatever works for you [16:38:05] yeah, true, i mean, having these 2TB in sda and sdb is kind of a waste, since we need to use those for / [16:38:11] ottomata: it sounds reasonable, yea, if this is for building things .. build::libanon ?
[16:38:12] i get what you are saying [16:38:20] it will be convenient (more space is always convenient) but not necessary [16:38:32] i guess, but a role should be more generic i think [16:38:44] i would err to making it easier on configuration of a new cluster and make it a larger / [16:38:51] that doesnt make it the best answer [16:38:56] role::build::udp-filter (since that is what we are going for) and then classes to include whatever for deps [16:39:00] or even more generic [16:39:06] role::build::analytics [16:39:34] RobH, paravoid, ok, i'm going to make a new raid1- file from squid [16:39:38] heh, reading backlog this is indeed a confusing as shit to follow conversation(s) if you arent really used to irc [16:39:52] it amuses me [16:40:14] ottomata: also sounds good, as long as it is somehow detached from the host and not just lists of packages (or generic::packages:foo lines) right in a node .. [16:40:18] just try to follow one conversation [16:40:41] ottomata: role::build::** might be it. Honestly, I don't really care which class you will end up choosing :-] [16:41:02] these are going to be hadoop boxes basically, but not producing stats on demand for the user, i.e. they aren't user-facing in the same way the apaches and squids are, right? [16:41:12] heh, you can temp filter nicks that are not in your convo by using /ignore and then /unignore again .. hehe [16:41:20] not that i do that ..but.. [16:41:22] apergos: correct [16:41:28] ok [16:41:36] they might not all be "hadoop" boxes, but they will be doing cluster services [16:41:43] (storm, hadoop, kafka, etc etc) [16:41:48] uh huh [16:42:23] RobH, paravoid: the ciscos have ext4 for /, can I keep our stuff consistent and use that? [16:42:48] i dont think we are using ext4 anywhere else on cluster [16:43:06] does it have particular benefits over ext3 for use here?
[16:43:38] not that i know of, I think I don't really care atm [16:43:44] ciscos are ext4, is all [16:43:49] (i didn't do ciscos) [16:44:04] i would err to ext3 [16:44:32] just so its same as what other stuff is [16:44:32] ciscos are a small % of entire cluster =] [16:44:36] hoookaayyyy [16:44:58] hadoop performs better with ext4 [16:45:00] ottomata: you know how this works right? as a cross team member, we are going to slowly corrupt you into the ops method of cluster first thinking ;] [16:45:02] yeah.. ext3 is more stable [16:45:06] drdee, this is just / [16:45:12] k [16:45:12] so we can do whatever for hadoop parts [16:45:15] ottomata: the swift boxes are the same, they have a huge / [16:45:26] that we use for keeping swift logs because nobody set up logrotate [16:45:27] ext4 is pretty damn stable at this point [16:45:28] also, since I am making new file, should I remove the bit about the extra extended partition? [16:45:33] hashar: if there was another "take a picture of your desk" now ..heh, mine is currently an empty pizza box ;) [16:45:41] that we keep three times, one in daemon.log, one in syslog etc. [16:45:57] no it doesn't [16:46:00] mutante: working from home? [16:46:24] ottomata: you pick the role names, you are the analytics guy;) [16:46:53] ah, but hashar is the CI build guy! [16:46:53] RobH: yea, for now at least, been importing wikivoyage mysql dumps half of the night :p [16:47:02] hhe [16:47:27] mutante: so soon to be sleeping from home =] [16:47:41] paravoid: was "no it doesn't" your answer to my Q about removing extra extended partition from new file? [16:48:03] no, it was on needing ext4 [16:48:05] RobH: heh, yeah, some time, we are trying to launch this before the weekend ..and "de" is still to go [16:48:12] ah ok [16:48:15] notpeter: kind of, there was a corruption bug a week ago :) [16:48:21] notpeter: on some exotic feature it turned out [16:48:22] cool that's fine, i can do ext3 [16:48:28] you mean launch on a friday and leave?
[16:48:29] notpeter: but a lot of people were really worried [16:48:31] as for removing extended partition, s'ok? [16:49:08] huh, ok [16:50:09] notpeter: but in any case, we should take this decision cluster-wide I'd say [16:50:09] mutante: mine is full of heineken beers and mac laptops :-] [16:50:10] heh, it's like channel rush hour [16:50:12] I don't disagree per se [16:50:36] paravoid: agreed on all counts [16:50:45] ottomata: mutante disconnecting sorry. I am going to socialize IRL :-] [16:50:55] cya tomorrow [16:51:02] just too bad we can't get some zfs.... [16:51:03] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [16:51:42] !log ms-be7 going offline to make room for new 720 servers [16:51:48] Logged the message, Master [16:52:05] brrr [16:52:16] sbernardin: before you take anything offline you need to admin log it [16:52:21] cmjohnson1: oh btw, we were wondering earlier [16:52:27] I think I know the answer, but just in case [16:52:38] cmjohnson1 [16:52:45] you're putting 720s exactly in the place of C2100s? [16:52:47] cmjohnson1: ok...forgot about that [16:53:15] paravoid: for the most part yes...i am going to redistribute a little more evenly though [16:53:28] are you changing racks? [16:53:38] this is okay, but it matters for swift [16:53:39] !log racking new ms-be7 to row C sdtpa [16:53:45] Logged the message, Master [16:53:46] swift has the concept of "zones" [16:54:07] !log shutting down ms-be7 to row C sdtpa [16:54:07] it basically keeps three copies of each file for redundancy, but always keeps the three copies in different zones [16:54:11] ok, paravoid, which part in the file is about the extended partition? [16:54:12] Logged the message, Master [16:54:15] right..i know about the zones...the only ones that may be moving is going to be 1-4 [16:54:17] the 3rd line of the expert recipe? [16:54:25] that says linux-swap (not sure why it says swap) [16:54:26] ?
[16:54:41] so zone should be something meaningful, either rack, pdu or something similar [16:54:43] paravoid: i will send you my proposed locations for all of them and notate if the location is the same [16:54:52] perfect [16:54:56] thanks a bunch [16:55:00] ottomata: yes, that line [16:55:04] ottomata: what do you want to do with it? [16:55:37] remove it i think [16:55:46] (why does it set type to linux-swap? just for lack of something better?) [16:56:22] that's the 0x83 on the partition table [16:56:22] ignore that [16:56:39] if you remove it, how are you going to use the rest of the space when that time comes? [16:56:52] fdisk? [16:56:56] figure it out then [16:56:57] right? [16:57:14] just don't see the need to partition now if we don't know what we'll do with it yet [16:57:14] why would you prefer running fdisk on the running system rather than having it prepartitioned? [16:57:31] i'll probably want to change the type anyway, right? [16:57:36] maybe we will want raid? [16:57:38] who knows, right? [16:59:33] I'm not sure I understand your pattern, but whatever works for you [16:59:55] if you're not sure what to do with it, why not use the LVM recipe? [16:59:55] haha, i think we don't understand each others patterns, heheh [17:00:08] and assign LVs dynamically [17:00:14] rather than fdisking live systems [17:00:19] because we might not want LVM, it has to do with whatever tech we end up using the space for [17:00:21] which we don't know [17:00:25] is fdisking a live system bad? [17:01:28] you can screw it up quite badly [17:01:32] and it needs a reboot [17:01:55] since you can't make the kernel re-read the partition table on a device that it's being used for / among other things [17:02:12] ok, but, i guess, for example, if I do end up wanting raid on these partitions, i'll have to fdisk it anyway, right? [17:02:51] or if we want LVM, we probably should change the fs type (although maybe it doesn't matter)?
[17:04:58] you are setting it to linux-swap in the partman, which is for sure not what we are going to use it for, so we'll have to change it anyway, no? [17:05:14] so look, if you put the partition in the recipe and it turns out you need to change it later, you're no worse off than if you had no partition [17:05:38] if you don't put the partition in the recipe and, hey, it turns out that it would have been what you wanted, you are a little worse off [17:05:43] so... why not put it in? :-) [17:05:52] because it is being put in as linux-swap? [17:05:53] hah [17:05:56] i guess i'm cool with that [17:06:03] maybe I can change type to LVM then and leave it at that? :p [17:06:38] lvm seems fine to me [17:08:57] paravoid, i think maybe that's my main point of confusion. what's the reason for setting this extra partition to linux-swap, if it def won't be used for swap? [17:09:08] you said 0x83 on partition table, but i'm not sure what that means [17:09:18] as said before, ignore that :) [17:09:26] haha, but why?! [17:09:27] i must know why! [17:09:33] partition tables have a "type" option [17:09:36] right [17:09:45] 0x82/0x83 being the ones commonly used for Linux [17:10:03] and 0x8e iirc [17:10:40] I don't think you can set the type to Linux and tell it to ignore that space [17:10:49] so you just set it to linux-swap iirc [17:10:49] to LVM you mean? [17:10:57] no, to Linux [17:10:58] oh [17:11:00] i see [17:11:10] the types are Linux, Linux swap, Linux raid and Linux LVM [17:11:13] so swap is just like "listen man, its swap so don't touch it" [17:11:19] yeah [17:11:39] and if we did linux-lvm it would want to format it or something? [17:12:16] I think so [17:12:19] ok I am convinced and I understand! [17:12:20] thank you!
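For reference, these are the MBR partition type bytes being named in this exchange. The 0xfd RAID autodetect entry is added here for completeness; the others are the ones paravoid lists. This is a plain lookup table, not output from fdisk:

```shell
# Classic Linux-related MBR partition type IDs, as discussed above.
# 0xfd (Linux RAID autodetect) is added here for completeness.
cat <<'EOF'
0x82  Linux swap
0x83  Linux
0x8e  Linux LVM
0xfd  Linux RAID autodetect
EOF
```

So partman's "linux-swap" stanza with method{ keep } just stamps 0x82 on the slot without ever making a swap filesystem there.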
[17:12:21] haha [17:12:22] but ignore that [17:12:26] i will leave it (and write a comment) [17:12:46] I've added comments on the top of all raid1-* templates [17:12:50] to explain what they do [17:13:06] right, but for newbies like me, I was really confused by the linux-swap bit, [17:13:12] if we ever start using the squid one (for example) for other hosts we may want to rename those [17:13:14] your comment said "+ an extended partition!" [17:13:16] by what they do [17:13:31] apergos: yes [17:13:40] and i was like "what extended unused partition?! I see more swap!" [17:13:44] ottomata: extended means it'll be sda5, not sda3 [17:13:48] no particular reason for that, just legacy [17:13:58] ? [17:14:00] it's how it has been and defined in the squid machines in site.pp [17:14:15] is there a reason to make it extended then? [17:14:24] can I change that to a primary (i think that is what you are talking about) [17:14:24] so if I changed it to sda3, we wouldn't be able to do machine reinstalls without reconfiguring puppet [17:14:25] ? [17:14:35] yes you can [17:14:36] (for squids) [17:14:37] ok [17:14:41] not for squids [17:14:46] don't change it in squids [17:14:50] right [17:14:51] this is my new file [17:14:53] it'll break reinstalls of squid boxes [17:15:01] sorry i was clarifying your sentence [17:15:10] not making my own :p [17:15:14] right [17:15:24] yeah, you can make it primary [17:15:31] so I add [17:15:35] $primary{ } [17:15:38] yes [17:15:39] before the method there, right? [17:15:43] so, for bonus points [17:15:47] oo bonus! [17:15:56] our swift boxes come in two flavors [17:16:00] one with SSDs and one without [17:16:05] we don't need SSDs on all of them [17:16:28] the SSD ones need some special partitioning, as / goes into sdm1/sdn1 [17:16:39] but for the rest [17:16:42] we can reuse your template :) [17:16:57] cool!
[17:16:59] I'm a big fan of reusing as you might gather by now [17:17:05] I'm making 100G / and 10G swap [17:17:08] yea totally i'm for that [17:17:23] it'd be super cool if we could puppetize and templatize these, right? [17:17:24] 100G? [17:17:25] yikes [17:17:27] haha [17:17:28] that's quite a lot [17:17:36] that's a terabyte lost basically [17:17:40] ok ok, in the name of reuse [17:17:49] why do you want 10GB swap is my question [17:17:52] (twelve machines times 100G minus space actually used) [17:18:12] its kinda lost, but you make up for it in a bit of peace of mind [17:18:13] New review: Andrew Bogott; "Sorry to catch you in a formatting holy war :) The one thing about coding conventions I'm positive ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593 [17:18:24] and yeah [17:18:30] don't use 10G swap [17:18:33] ok [17:18:35] i'll leave that 1G [17:18:37] that's horrible [17:18:39] yep [17:18:55] once it starts to use most of 1 gb you're probably in swapdeath anyways [17:19:06] yes [17:19:29] so, here is something really annoying that I don't understand [17:19:46] the ciscos have 260GB swap [17:19:59] (don't ask me why) [17:20:00] wha?? [17:20:06] that's crazy [17:20:13] the virt* ciscos have 0GB swap [17:20:17] they have 192GB of ram [17:20:23] right [17:20:27] (I made that) [17:20:35] looks to me like the partman should give them 1-10GB swap [17:20:37] but there they are [17:20:39] with 260G [17:21:28] anyway, paravoid, in the name of partman reuse, is 30GB ok for this new file? [17:23:35] psssssh, i want 100GB :p !, ok [17:23:37] battery dying [17:23:47] need to pick up laundry and grab some food before 1 [17:23:54] be back in a little bit [17:23:56] yes!
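paravoid's "twelve machines times 100G minus space actually used" aside works out roughly like this. The 10G of actual use per box is an assumed figure for illustration, not a measured one:

```shell
# Rough estimate of the space stranded by a 100G / across the cluster,
# assuming ~10G actually used per box (assumption, not a measurement).
boxes=12
root_gb=100
used_gb=10
echo "$(( boxes * (root_gb - used_gb) )) GB effectively idle"
# prints: 1080 GB effectively idle
```

Hence "that's a terabyte lost basically", and why the recipe ended up with a much smaller / (30GB) plus 1G of raided swap.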
[17:23:57] enjoy (food) [17:30:28] average_drifter: just sent you your recommendation letter, let me know if all is good [17:30:46] wrong channel [17:57:01] New patchset: Pyoungmeister; "removing mw60 from bits apaches pool for upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32582 [17:59:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32582 [18:02:01] New review: Nikerabbit; "Is there/will there be notes of the results of these changes?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32416 [18:35:49] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [18:38:09] !log wikivoyage imports: "en" done, "it" done [18:38:15] Logged the message, Master [18:39:33] New patchset: Ottomata; "The new Analytics Dells are here!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [18:39:37] mutante: in production? [18:40:11] hashar: yes, S3 cluster [18:40:15] paravoid, how's that look? [18:40:26] mutante: nice :-] [18:40:26] would you review that for me just to be sure? [18:40:26] https://gerrit.wikimedia.org/r/32571 [18:40:34] hashar: does not mean the wiki is enabled for users though ;) [18:40:57] mutante: still, it is progressing! [18:41:03] yea [18:42:20] hashar: user login during import = breakage btw because the login creates the user (central auth) :p [18:42:38] so all disabled in Apache [18:45:50] !log wikivoyage imports: "de" (and all for now) done [18:45:55] Logged the message, Master [18:47:41] mutante: you want me to add a patch to enable the sites on Apaches? [18:48:28] sure [18:48:39] thanks, be back in 5 [18:50:31] csteipp: ..or 10. coffee break. then lets get on it;) [18:50:40] mutante: Not a problem! 
[18:54:46] !log authdns update "Adding .pmtpa.wmnet entries for labsdb1-3" [18:54:52] Logged the message, Master [18:55:40] sbernardin: once you have the disk in slot0 on labsdb1 and labsdb2 ....plz ping me [18:56:06] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [18:57:42] New patchset: CSteipp; "Enable Wikivoyage subdomains" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32591 [19:01:34] any performance issues or recent changes with bits? https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#bits.wikimedia.org_very_slow [19:02:05] New review: Dzahn; "yea, i agree. turn into *. after QA check" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32591 [19:02:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32591 [19:02:20] Zack and Megan were asking me yesterday about bits as well [19:05:05] dzahn is doing a graceful restart of all apaches [19:05:40] !log dzahn gracefulled all apaches [19:05:45] !log enabling wikivoyage subdomains in apache for QA check [19:05:46] csteipp: ^ [19:05:46] Logged the message, Master [19:05:51] Logged the message, Master [19:15:02] !log rebooting mw60 for upgrade to precise [19:15:08] Logged the message, notpeter [19:17:45] !log running sync-dblist [19:17:50] Logged the message, Master [19:18:36] RobH, could you review this for me? [19:18:36] https://gerrit.wikimedia.org/r/#/c/32571/ [19:19:16] sbernardin: if you need to know which drive is drive 0 go to http://www.cisco.com/en/US/docs/unified_computing/ucs/c/hw/C250M1/install/replace.html#wp1053178 [19:19:45] binasher: fyi, all the imports on s3 are done (at least the 7 initial languages and for now) [19:20:13] mutante: great! is that the last of the db work? [19:20:47] yea, unless other languages are added at a later point, but none would be as large as this.
"en" and "de" are all in there [19:21:05] binasher: http://wikitech.wikimedia.org/view/User:Dzahn/wikivoyage_pastebin [19:22:55] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused [19:25:01] PROBLEM - SSH on mw60 is CRITICAL: Connection refused [19:34:55] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:40:49] !log doing a git pull in wmf-config [19:40:54] Logged the message, Master [19:43:46] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [19:46:01] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: No response from NTP server [19:53:25] !log git reset in /h/w/common/ to revert local disabling of "voy" wikis and synced dblist again [19:53:31] Logged the message, Master [20:04:56] sbernardin: moving conversation to ops [20:05:10] so yes let me know once the disk are swapped..thx [20:05:31] cmjohnson1: ok [20:12:16] RECOVERY - NTP on mw60 is OK: NTP OK: Offset -0.01435041428 secs [20:17:52] New review: Ottomata; "I talked about this new partman file with Faidon and RobH earlier. If anyone has objections we can ..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32571 [20:17:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32571 [20:25:28] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [20:25:28] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:26:28] !log installing ssd's in labsdb1 and labsdb2 [20:26:34] Logged the message, Master [20:32:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:33:25] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 35.55 ms [20:34:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [20:35:06] binasher: which partman recipe do you want on labsdb1-3? 
[20:36:47] cmjohnson1: mw.cfg for these, where the x25 should be sda and the only disk touched by the install [20:37:01] PROBLEM - SSH on analytics1011 is CRITICAL: Connection refused [20:37:06] ok..yep..i recall you telling me that before [20:37:12] thx for reminding me [20:37:33] New patchset: Catrope; "Add VisualEditor namespace creation to wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31949 [20:41:45] New patchset: Pyoungmeister; "Revert "removing mw60 from bits apaches pool for upgrade to precise"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32598 [20:42:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32598 [20:43:04] !log repooling mw60 in bits apaches [20:43:10] Logged the message, notpeter [20:46:51] cmjohnson1: intel ssd's have been installed in labsdb1 & labsdb2 [20:51:52] RECOVERY - SSH on analytics1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:02:40] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:40] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 35.32 ms [21:02:40] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 35.60 ms [21:02:40] RECOVERY - Host analytics1017 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:40] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 35.71 ms [21:02:41] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms [21:02:41] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 35.81 ms [21:02:42] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:42] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:43] RECOVERY - Host analytics1022 is UP: PING OK - Packet loss = 0%, RTA = 35.33 ms [21:02:43] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms 
[21:06:16] PROBLEM - SSH on analytics1012 is CRITICAL: Connection refused
[21:06:25] PROBLEM - SSH on analytics1013 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1018 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1017 is CRITICAL: Connection refused
[21:06:43] PROBLEM - SSH on analytics1014 is CRITICAL: Connection refused
[21:06:52] PROBLEM - SSH on analytics1022 is CRITICAL: Connection refused
[21:07:01] PROBLEM - SSH on analytics1015 is CRITICAL: Connection refused
[21:07:01] PROBLEM - SSH on analytics1020 is CRITICAL: Connection refused
[21:07:10] PROBLEM - SSH on analytics1019 is CRITICAL: Connection refused
[21:07:28] PROBLEM - SSH on analytics1021 is CRITICAL: Connection refused
[21:07:46] PROBLEM - SSH on analytics1016 is CRITICAL: Connection refused
[21:08:48] New patchset: Cmjohnson; "adding labsdb* to netboot cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32601
[21:10:10] mutante: can you check this for me ^
[21:15:34] PROBLEM - NTP on analytics1011 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:16:10] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:10] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:10] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100%
[21:16:19] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:28] RECOVERY - SSH on analytics1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:37] RECOVERY - SSH on analytics1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:37] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[21:16:46] RECOVERY - SSH on analytics1018 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:46] RECOVERY - SSH on analytics1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:46] RECOVERY - SSH on analytics1015 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:16:55] RECOVERY - Host analytics1018 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms
[21:16:55] RECOVERY - SSH on analytics1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:04] RECOVERY - SSH on analytics1019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:04] RECOVERY - SSH on analytics1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:13] RECOVERY - SSH on analytics1021 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:17:22] RECOVERY - Host analytics1021 is UP: PING OK - Packet loss = 0%, RTA = 35.35 ms
[21:17:40] RECOVERY - SSH on analytics1016 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:20:56] !log reedy synchronized php/cache/interwiki.cdb 'Updating interwiki cache'
[21:21:03] Logged the message, Master
[21:21:26] RobH: http://cl.ly/image/3N0o3c3E212I
[21:21:27] thank you!
[21:26:31] PROBLEM - NTP on analytics1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:26:31] PROBLEM - NTP on analytics1017 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:26:32] PROBLEM - NTP on analytics1014 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:07] PROBLEM - NTP on analytics1015 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:07] PROBLEM - NTP on analytics1019 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:08] PROBLEM - NTP on analytics1022 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:27:08] PROBLEM - NTP on analytics1020 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:30:31] ottomata: iterm2 :)
[21:33:49] yup!
[21:33:50] heh
[21:34:14] accept no substitutes
[21:39:18] New patchset: Demon; "Fixing SSL issue for gerrit on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32605
[21:40:19] PROBLEM - NTP on analytics1013 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:19] PROBLEM - NTP on analytics1018 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:28] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
[21:40:28] PROBLEM - NTP on analytics1012 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:40:55] notpeter: if you are not busy could you look at this for me https://gerrit.wikimedia.org/r/32601
[21:40:55] PROBLEM - NTP on analytics1021 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:42:29] +2'd
[21:42:31] still need lint
[21:42:34] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours
[21:44:31] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[21:46:23] notpeter: k...thx
[21:46:56] oh, and now gerrit's broken
[21:48:49] that's awesome
[21:48:59] csteipp: cmjohnson1 re, was at lunch
[21:49:12] worksforme
[21:49:18] mutante: how dare you.... ;-)
[21:49:24] gerrit has had temp. glitches earlier
[21:49:51] thought for a moment you mean to check why analytics hosts went down from context :p
[21:50:08] just NTP though
[21:50:31] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[21:51:03] lol @ topic branch name
[21:51:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32601
[21:51:44] mutante: No emergency, but we may need to update the de/it tables :(
[21:52:13] cmjohnson1: done
[21:52:21] cmjohnson1: oh ..:p
[21:52:31] csteipp: oh ..:P
[21:52:33] thx
[21:53:24] notpeter: i think yours was just missing the "Verified"
[21:53:32] yep
[21:53:51] si merged now?
[21:53:54] *is
[21:53:57] csteipp: missing "pageupdates"?
[21:54:02] notpeter: yes, and sockpuppet
[21:54:02] (sorry, splitbrain)
[21:54:22] No, an extension had a schema update, and they were running the old versions on de/it
[21:54:39] And I missed it when I ran the update.php
[21:54:44] ah, ok
[21:54:58] well, as long as it is not "en":)
[21:55:24] mutante: awesome! thank you
[21:55:24] Oh, I may need to update the revision table on en... could you do that tonight ;)
[21:55:28] cmjohnson1: sorry, for distracted
[21:55:31] *got
[21:55:59] np...i appreciate you looking at it for me
[21:58:11] So in all seriousness, if we have a .sql file that affects some small tables, is it possible to run that on s3?
[21:58:21] Or do we have to run the updates and reimport?
[21:58:39] run the updates locally, that is..
[21:59:27] I think it could be run
[21:59:48] but the update contents and how small they really are should be checked before, of course
[22:01:07] rename columns, add a column, and drop / add an index. DBs have 7949 rows each.
[22:02:06] 2 indexes that is. And create 1 table;
[22:08:25] sbernardin: are you still on 12?
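The schema update csteipp describes (rename columns, add a column, drop and re-add indexes, create one table) is small enough to sketch. A hypothetical illustration using sqlite3 as a stand-in; the table, column, and index names here are invented, not the extension's real schema:

```python
import sqlite3

# Stand-in schema, names invented for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ext_data (old_name TEXT, value INTEGER)")
cur.execute("CREATE INDEX ix_old ON ext_data (old_name)")

# Rename a column (RENAME COLUMN needs SQLite >= 3.25, bundled with modern Python).
cur.execute("ALTER TABLE ext_data RENAME COLUMN old_name TO new_name")
# Add a column.
cur.execute("ALTER TABLE ext_data ADD COLUMN flags INTEGER DEFAULT 0")
# Drop and re-add an index against the renamed column.
cur.execute("DROP INDEX ix_old")
cur.execute("CREATE INDEX ix_new ON ext_data (new_name)")
# Create one new table.
cur.execute("CREATE TABLE ext_log (id INTEGER PRIMARY KEY, note TEXT)")
conn.commit()

cols = [row[1] for row in cur.execute("PRAGMA table_info(ext_data)")]
print(cols)  # → ['new_name', 'value', 'flags']
```

On tables of only ~8000 rows each, operations like these complete almost instantly, which is why running the .sql directly on s3 was plausible; the caveat in the chat (check the update's actual contents first) still stands.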
[22:08:48] cmjohnson1: back on 10
[22:09:11] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32603
[22:09:21] okay
[22:10:53] New patchset: Cmjohnson; "Adding labsdb1 and 3 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:11:21] mutante: can you check the change ^ forgot about it earlier
[22:11:53] !log dzahn synchronized ./wmf-config/InitialiseSettings.php
[22:11:59] Logged the message, Master
[22:12:18] Danny_B|backup: Reedy : ^
[22:12:44] thank you very much
[22:14:32] cmjohnson1: has one little red thingie in 65
[22:14:35] Danny_B|backup: you're welcome
[22:15:32] mutante: not sure why..it is the correct space...odd
[22:15:55] oh i see
[22:18:25] New patchset: Pyoungmeister; "removing mw60 from bits apaches pool again :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32665
[22:19:13] New patchset: Cmjohnson; "Adding labsdb1 and 3 to dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:19:38] mutante: fixed
[22:20:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32663
[22:20:45] cmjohnson1: on sockpuppet
[22:21:06] cool
[22:21:07] thx
[22:21:11] np
[22:21:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32665
[22:35:42] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[22:41:22] !log provisioning labsdb1
[22:41:27] Logged the message, Master
[22:41:49] New patchset: Asher; "deploy ganglia plugin for redis" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32667
[22:46:39] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours
[22:47:42] Change restored: Asher; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471
[22:47:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32667
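The "little red thingie" Gerrit paints on a line is usually a whitespace lint hit: trailing spaces or a tab where spaces are expected, which is consistent with mutante's "it is the correct space" confusion. A minimal checker in that spirit (the sample config lines are invented for illustration):

```python
def whitespace_problems(lines):
    """Return (line_number, reason) pairs for common whitespace lint hits."""
    problems = []
    for n, line in enumerate(lines, start=1):
        # Trailing whitespace: stripping only the newline differs from a full rstrip.
        if line.rstrip("\n") != line.rstrip():
            problems.append((n, "trailing whitespace"))
        # Hard tabs, which many style checks flag in space-indented configs.
        if "\t" in line:
            problems.append((n, "tab character"))
    return problems

# Illustrative dhcpd-style snippet; the second line carries a trailing space.
sample = [
    "host labsdb1 {\n",
    "  hardware ethernet 00:00:00:00:00:00; \n",
]
print(whitespace_problems(sample))  # → [(2, 'trailing whitespace')]
```

Catching this locally (or letting the lint job do it, as in the chat) avoids a re-upload round trip through Gerrit.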
[22:47:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32471
[22:52:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:53:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.457 seconds
[23:01:25] New patchset: Pyoungmeister; "marking mw60 to use applicaitonserver/mediawiki modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32669
[23:03:38] New patchset: Asher; "move redis ganglia module to its own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32670
[23:04:12] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32670
[23:06:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32669
[23:08:51] PROBLEM - Apache HTTP on mw60 is CRITICAL: Connection refused
[23:09:27] PROBLEM - SSH on mw60 is CRITICAL: Connection refused
[23:17:25] !log added some redis metrics to ganglia for mc1-16
[23:17:31] Logged the message, Master
[23:19:30] RECOVERY - SSH on mw60 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[23:28:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.063 seconds
[23:38:33] RECOVERY - Apache HTTP on mw60 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.004 seconds
[23:42:27] PROBLEM - NTP on mw60 is CRITICAL: NTP CRITICAL: Offset unknown