[00:04:33] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [00:04:39] New patchset: Ori.livneh; "(Bug 37787) Uninstall CustomUserSignup extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18201 [00:14:58] ori-l: sad to lose CustomUserSignup extension :( [00:15:27] aude: i presume / desperately hope that you are being sarcastic? [00:17:02] i have deployment rights (as of yesterday, I think) but fenari is still rejecting my key. is it simply a matter of waiting for a puppet run, or is there some additional steps that needs to be done? [00:17:13] *are there some [00:17:20] would be nice if someone would maintain the extension [00:18:02] aude, we (e3) are going to give the sign up interface some much-needed love [00:18:08] the current signup interface is horrible [00:18:17] good! [00:18:31] > puppet ran at Thu Aug 9 00:17:23 UTC 2012 [00:18:35] for fenari [00:18:50] (per nagios) [00:18:55] 09 00:18:30 < jeremyb> > puppet ran at Thu Aug 9 00:17:23 UTC 2012 [00:19:02] so, i think that's good enough [00:19:14] odd! any idea what else could be the cause? [00:19:32] pastebin a verbose ssh ? [00:19:42] maybe -vvv ? i can't remember what a good invocation is [00:19:42] jeremyb: sure [00:19:50] i think -vv is the max? [00:20:10] ration problems. Multiple -v options increase the verbosity. The maximum is 3. [00:21:02] * aude sees that kaldari made Fancycaptcha-createaccount slightly better [00:21:14] still ugly [00:22:06] ori-l: what about an audio captcha for accessibility purposes? [00:22:21] that's one of the other problems [00:22:39] don't we just send people to ACC? [00:23:37] jeremyb: we do but that's not great [00:24:02] * aude wants to turn my screen off and be able to make an account with screen reader and keyboard [00:24:33] aude: sure. 
but it's not like we just forgot about them entirely [00:24:59] jeremyb: true [00:27:03] jeremyb / aude: http://dpaste.com/783520/ [00:28:19] i don't think there's a whole lot to it other than "fenari is rejecting my key" :) [00:29:50] ori-l: what username ? [00:29:56] and is it the key in puppet ? [00:30:05] LeslieCarr: olivneh, and yes [00:30:37] ori-l: has it actually worked anywhere yet? [00:30:51] hrm, no home directory [00:31:06] i'm homeless! i can ssh into emery, for example [00:31:20] ori-l: to be safe let's also see `ssh-keygen -l -f ~/.ssh/khorsabad_rsa.pub` [00:31:38] LeslieCarr: same happened with krinkle account [00:31:55] LeslieCarr: pm'd [00:32:08] I'm being rejected by gerrit now [00:32:19] as of ~ 20 minutes [00:32:19] hah [00:32:30] gerrit just let me in [00:32:36] Krinkle: you mean 22 or 29418? [00:32:53] krinkle@gerrit.wikimedia.org:29418/ [00:33:06] e.g. git remote update / git review -R [00:33:37] Krinkle: wfm [00:34:28] mutante: btw, last week I could ssh into integration.mediawiki.org; but now I can't anymore. fenari also still doesn't work. [00:34:46] LeslieCarr: thanks for looking, btw [00:35:18] Krinkle: ugh, checking on integration [00:35:29] created the home dirs, let's see if keys works or not [00:36:39] ok, it seems gerrit ssh works again. weird. [00:36:44] LeslieCarr: nope [00:36:47] Krinkle: your home and key are still there [00:36:53] olivneh try now [00:37:09] mutante: and integration also works again. [00:37:12] Krinkle: please try again, watching auth.log now [00:37:15] Krinkle: try now [00:37:17] weird, no clue what's going on. [00:37:18] Krinkle: oh? ok... [00:37:18] LeslieCarr: still no :/ [00:37:20] both work again. [00:37:21] added authorized_keys manually to both [00:37:37] fenari still denied though [00:38:11] ori-l: try one last time on fenari [00:39:06] ori-l: run that other command? 
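The `ssh-keygen -l -f` request above is a way to compare the fingerprint of the local public key with the key that was deployed. As a rough sketch of what that fingerprint is, assuming the colon-separated MD5 format that OpenSSH printed by default at the time (newer releases default to a SHA256 format instead), it can be recomputed from the `.pub` line. The function name and the sample key line are hypothetical; the sample blob is not a real key.

```python
import base64
import hashlib


def md5_fingerprint(pubkey_line):
    """Return the colon-separated MD5 fingerprint of one public key line.

    Expects the usual "<type> <base64-blob> [comment]" layout of a .pub file.
    """
    fields = pubkey_line.split()
    blob = base64.b64decode(fields[1])       # the raw key blob
    digest = hashlib.md5(blob).hexdigest()   # 32 hex characters
    # Group the hex digest into 16 byte pairs joined by colons.
    return ":".join(digest[i:i + 2] for i in range(0, 32, 2))


# Hypothetical input: a placeholder blob, NOT a real key.
sample = "ssh-rsa " + base64.b64encode(b"example").decode() + " user@host"
print(md5_fingerprint(sample))
```

If the fingerprint of the local key and the fingerprint of the key in puppet differ, the server is simply being offered a different key.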
[00:39:06] 09 00:31:20 < jeremyb> ori-l: to be safe let's also see `ssh-keygen -l -f ~/.ssh/khorsabad_rsa.pub` [00:39:16] ok mutante has got you guys set :) [00:39:17] bye! [00:39:24] Krinkle: one more time on fenari? [00:39:25] LeslieCarr: works! [00:39:29] woot [00:39:30] confirmed, i see the login :) [00:39:44] mutante: still denied [00:40:22] heh, wall :) [00:40:32] Krinkle: now? [00:40:43] session opened for user krinkle [00:40:50] checking with -v to see if the key is offered properly [00:40:54] wow, whole bunch of debug output [00:40:57] yes, Im in now [00:41:09] alright [00:41:11] thx [00:41:23] -v is also great for when non-technical friends are looking over your shoulder and you want to appear very impressive [00:41:50] lol [00:42:03] I remember doing $ dir -r /s on MS-DOS at some point [00:42:09] in the rooot [00:42:13] takes an hour sometimes [00:42:21] ori-l: what about xxd ? [00:42:21] and say "I'm wiping your hard drive" [00:42:29] Krinkle: hah [00:42:34] even though all it does is list everything in every dir [00:42:54] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18201 [00:43:32] mutante: thanks! where do you guys keep all the warez? [00:43:43] i can see no /appz or /gamez dir anywhere [00:44:32] okay, jokes about mission critical production systems unappreciated by operations people responsible for their upkeep. noted :P [00:44:41] anyhow, thanks for the help [00:45:22] ori-l: haha :) no problem. 
[00:45:36] * ori-l slinks out [00:45:49] ori-l: apt-cache search game [00:45:52] cya [01:26:35] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [01:39:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:41:18] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 206 seconds [01:41:36] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 227 seconds [01:47:54] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 603s [01:54:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [01:58:15] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 11s [01:58:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 10 seconds [02:02:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:19:33] PROBLEM - Puppet freshness on labstore2 is CRITICAL: Puppet has not run in the last 10 hours [02:41:36] RECOVERY - Puppet freshness on labstore2 is OK: puppet ran at Thu Aug 9 02:41:13 UTC 2012 [02:42:38] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [02:57:30] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [03:10:05] RECOVERY - Puppet freshness on srv281 is OK: puppet ran at Thu Aug 9 03:09:45 UTC 2012 [03:12:20] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [03:12:38] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [03:27:29] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [03:37:05] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Aug 9 03:36:46 UTC 2012 [04:56:36] PROBLEM - Puppet 
freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [04:56:36] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [05:33:01] New patchset: Tim Starling; "Split PoolCounter off into a separate class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18207 [05:33:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18207 [05:41:08] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [05:43:33] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [05:49:49] New patchset: Tim Starling; "Split PoolCounter off into a separate class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18207 [05:50:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18207 [05:59:17] New patchset: Tim Starling; "Split PoolCounter off into a separate class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18207 [05:59:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18207 [06:01:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18207 [06:02:35] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [06:04:03] New patchset: Faidon; "Add a missing import for poolcounter.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18208 [06:04:40] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18208 [06:04:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18208 [09:06:27] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16654 [09:14:55] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [09:25:25] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Thu Aug 9 09:24:53 UTC 2012 [09:28:52] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [09:28:52] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [09:28:52] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [10:05:56] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [11:18:04] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [11:20:10] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms [11:40:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:02:38] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [12:03:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:05:56] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:24:58] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [12:43:52] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [12:58:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [13:13:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [13:28:53] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [14:30:13] !log cp1001 squid services going offline for troubleshooting 
[14:30:23] Logged the message, RobH [14:32:00] New patchset: Ottomata; "Preparing to install Precise on analytics1011-1022 eqiad row C dell machines using partman/analytics-dell.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18235 [14:32:21] meeeeester RobH good morning to you! [14:32:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18235 [14:34:25] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: Connection refused [14:35:19] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:44] PROBLEM - Host cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:02] ottomata: you want any assistance setting those things up? [14:47:13] yeeeeehaaaaw [14:47:16] yeah, so I just committed [14:47:20] I want a review on that [14:47:28] and then I will pxe boot and see how it goes [14:47:52] also, I'd love to keep this page up to date with my servers: [14:47:53] http://wikitech.wikimedia.org/view/Category:Servers [14:48:00] but um, I don't thikn I have a wikitech account [14:48:22] need review here: [14:48:22] https://gerrit.wikimedia.org/r/#/c/18235/ [14:49:18] you should have someone else review that. I have never set up a new row file. that siad, looks fine to me [14:49:30] oh, you should get a racktables account [14:49:38] that's where we actually keep track of servers [14:49:41] inventory [14:49:47] that page is soooooooo outdated [14:49:57] and we committed to not updating it by hand [14:50:19] RobH: can you review the above ^ [14:50:29] RECOVERY - Host cp1001 is UP: PING OK - Packet loss = 0%, RTA = 35.50 ms [14:51:32] i have one [14:51:35] oh really [14:51:35] ok [14:51:42] how does it get updated? [14:51:54] i wanted a place to write down the MACs [14:51:58] although…i guess they are in puppet now [14:52:15] \yeah, also, the more I look, that checkin is a-ok [14:52:17] I +1 [14:52:25] or, want me to +2? 
[14:52:43] i think it shoudl be ok, they are new servers so, it won't hurt anything if I did something all wrong [14:52:59] yeah [14:53:03] unless I accidentally told them to self destruct or something [14:53:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18235 [14:53:20] ok, you shoudl merge it on sockpuppet now [14:53:28] !log cp1001 returned to service, resolving rt 3212 [14:53:38] Logged the message, RobH [14:53:52] hmm, how do I merge [14:53:54] cd /etc/puppet [14:53:55] git pull? [14:54:01] ottomata: as long as you didnt use the $self-destruct => true, we will be ok [14:54:05] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27407 bytes in 0.143 seconds [14:54:17] ottomata: its on labs console, the directions to merge, lemme find [14:54:41] ottomata: https://labsconsole.wikimedia.org/wiki/Git#Public_repo [14:55:30] ok, trying that [14:55:37] ottomata: cd /root/puppet/ [14:55:57] git fetch && git diff HEAD origin/production [14:56:05] git merge origin/production [14:56:09] aye cool [14:56:19] * cmjohnson1 leaves for DC [14:56:26] cmjohnson1: travel safely! [14:56:34] cool, merged! [14:56:41] ok, going to try to pxe boot analytics1011 [14:56:45] oh wait [14:56:47] wait [14:56:50] root@stafford.pmtpa.wmnet's password: [14:56:52] force puppet run on brewster [14:56:57] oh right yeah [14:56:58] oh [14:57:00] brewster ok [14:57:02] uh [14:57:03] wiat [14:57:07] why brewster? [14:57:13] root@sockpuppet:~/puppet# git merge origin/production [14:57:13] Merge made by the 'recursive' strategy. 
[14:57:13] files/autoinstall/netboot.cfg | 1 + [14:57:13] files/autoinstall/subnets/analytics1-c-eqiad.cfg | 19 +++++++ [14:57:13] files/dhcpd/linux-host-entries.ttyS0-115200 | 61 ++++++++++++++++++++++ [14:57:14] 3 files changed, 81 insertions(+) [14:57:14] create mode 100644 files/autoinstall/subnets/analytics1-c-eqiad.cfg [14:57:15] root@stafford.pmtpa.wmnet's password: [14:57:25] your agent is not properly forwarded [14:57:29] bwaaaaa [14:57:32] it should be grrrr [14:57:44] it would seem that it's not ;) [14:57:48] aye [14:57:50] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [14:57:50] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [14:57:51] ok, will figure that out, hmm [14:57:56] kk [14:57:56] once I get that, can I just run merge again? [14:58:04] won't do anything [14:58:16] why is merge asking me for my pw? [14:58:33] because it has a hook to sync everything to stafford [14:58:41] that is where the puppet actually happens [14:58:56] hm k [14:58:57] like, sockpuppet only does cert signing [14:59:03] stafford handles all the traffic [14:59:20] ag, so I need the hook to fire [14:59:21] so merges happen on sockpuppet, pull happens on stafford [14:59:25] yeah [14:59:29] so, you can check in more code [14:59:35] hah, ok [14:59:35] or do it by hand on stafford [14:59:42] hmmm, might be cool to learn that [14:59:47] but I'm not 100% on what "it" is [14:59:51] I think just a git pull [15:00:08] not a git repo in /etc/puppet [15:00:10] on stafford [15:00:25] /var/lib/git [15:00:26] I believe [15:01:03] hah [15:01:07] # Changes not staged for commit: [15:01:07] # modified: files/misc/GeoIP.dat [15:01:07] /var/lib/git/operations [15:01:08] oh that's ok [15:01:10] that is a dl update [15:01:11] weird ok [15:01:17] ok git pulling [15:01:21] wait [15:01:28] too late [15:01:29] eh? 
[15:01:38] looks good, it pulled my stuff [15:01:46] ok [15:02:20] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.20:11000 (Connection timed out) [15:02:22] ok, now force puppet run on brewster [15:02:40] (our imaging box) [15:02:49] and that should get your stuff in place to pxe boot [15:02:53] k... [15:03:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18096 [15:03:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18099 [15:03:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18101 [15:04:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18102 [15:04:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18103 [15:04:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18105 [15:05:11] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [15:05:13] hm, uh oh [15:05:13] err: /Stage[main]/Misc::Install-server::Dhcp-server/Service[dhcp3-server]: Failed to call refresh: Could not start Service[dhcp3-server]: Execution of '/etc/init.d/dhcp3-server start' returned 1: at /var/lib/git/operations/puppet/manifests/misc/install-server.pp:234 [15:05:30] should I try to start it to see what the prob might be? 
[15:05:46] yes [15:06:19] hmm syntax errors in /etc/dhcp3/linux-host-entries.ttyS0-115200 and /etc/dhcp3/dhcpd.conf, although I don't think I edited dhcpd.conf [15:06:23] oh maybe that is the trace [15:06:24] yeah [15:06:42] ah got it [15:06:58] kk [15:07:18] ok, lemme see if I can +2 [15:07:21] New patchset: Ottomata; "Fixing syntax error in linux-host-entries.ttyS0-115200" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18240 [15:07:32] nope [15:07:33] can't [15:08:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18240 [15:08:36] ah, kk [15:08:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18240 [15:09:34] ok [15:09:40] I did sockpuppet [15:09:43] force on brewster [15:10:01] oh ok [15:10:09] just did that [15:10:15] will let you know when done [15:10:18] haha [15:10:19] ok [15:10:38] ok, punch it [15:11:39] pxe boot? [15:11:43] yeah [15:13:02] ok its going, analytics1011 [15:13:07] cool [15:19:19] so i don't see it doing much in console [15:19:25] last was Checking NVRAM [15:20:18] ah poo and I just lost my console [15:20:23] i closed it and tried to open another one [15:20:40] woo [15:20:45] eh, just give it some time [15:20:47] It's serial so might be a little weird, usualy if it's blank there's nothing there or it's not redirected properly [15:20:54] oh [15:20:55] hit tab [15:21:05] anything? [15:21:13] maybe the word yes or the word no? [15:22:38] ergh, I closed my console though [15:22:41] can't get a new one [15:22:44] Info: SOL payload already active on another session [15:23:11] ah [15:23:15] yes.... [15:23:18] hurray dracs [15:23:23] just give it some time [15:23:25] ok... [15:23:29] how will I know though? 
[15:23:44] or ask RobH if there's a way to boot your old session [15:23:50] he may know wizardry [15:24:04] if its stuck [15:24:16] you can do the consoleclose command in the impi_mgmt script ;] [15:24:52] ottomata: ^ [15:25:19] ooo [15:25:20] cool [15:29:12] mark: does this look like what you wanted? https://gerrit.wikimedia.org/r/#/c/18194/ [15:30:21] er [15:30:27] why are you removing the existing file, it's still in use [15:35:17] ok, in console, no movement notpeter + RobH [15:36:13] So what exactly is happening now? You are attempting to install analytics1011 and attached to the serial console via the impi script? [15:37:34] mark: ah, correct... will ammend [15:39:30] yup [15:39:40] brewster is updated with the dhcp boot stuff [15:39:49] i did bootpxe, powercycle, console [15:42:57] New patchset: Pyoungmeister; "moving sudo definitions from sudo::applicationserver into the correct modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18194 [15:43:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18194 [15:51:11] still sitting there, notpeter, RobH [15:51:48] ottomata: you arent seeing a post at all? [15:52:09] naw its just sitting there [15:52:25] ok, disconnect and let me take a look [15:52:56] k done [15:53:38] ipmi_mgmt analytics1011.mgmt.eqiad.wmnet console [15:55:12] ottomata: thats what i just did, these take a minute to post [15:55:22] but i am going to attach a physical console to verify operation [15:55:25] k [15:57:40] cmjohnson1: think search32 has power supply problems? [15:57:44] kinda the only thing left.... [15:58:31] ottomata: didnt this work yesterday? 
[15:58:35] its not working for me either [15:58:52] we didn't get this far yesterday [15:59:02] this is the first bootpxe i've done [15:59:09] the console did work though, with bootbios [15:59:12] i did that today [15:59:22] and I did see output on console after bootpxe + powercycle [15:59:25] but only a bit [15:59:35] ottomata: but i mean now the serial redirection shows NOTHING [15:59:38] im not even seeing post now [15:59:52] hrmm, wait, now it is, wtf [16:00:04] cmjohnson1: cool cool [16:00:07] thanks! [16:01:11] ottomata: uhh, these are in the wrong file to start with [16:01:18] you have them in ttyS0, its S1 [16:01:21] its com port 2 [16:01:53] but there may be other issues, still checking [16:02:15] but all those entries you put in linux-host-entries.ttyS0-115200 need to be in linux-host-entries.ttyS1-115200 [16:02:18] as they are dells, not ciscos [16:02:26] (the 1001-1010 are in linux-host-entries.ttyS0-115200) [16:03:44] so, once it does hit the install server, it wont output the installer correctly [16:03:49] whaaaaa [16:03:49] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [16:03:49] ok [16:03:51] ottomata: did you see it hit the dhcp server? [16:03:58] haha [16:04:01] ottomata: sorry I didn't catch that :( [16:04:11] nope, not sure how to look for dhcp server hits [16:04:14] well, it may not be the only problem [16:04:20] ok, lemme fix dell stuff [16:04:21] i see it trying to dhcp and not hitting [16:04:21] ok [16:06:09] New patchset: Ottomata; "analytics dells are on S1 (com port 2), not S0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18254 [16:06:40] ottomata: its not hitting brewster [16:06:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18254 [16:06:55] so i think the network isnt setup right (either the port isnt in the right vlan, or dhcp relay isnt setup) [16:06:57] LeslieCarr: ^ [16:07:40] ah [16:07:50] ok, is that analytics-1-c ? [16:07:53] yep [16:08:06] and why'd you break it rob ? [16:08:37] my guess is petty spite. [16:09:12] bitter cynicism is usually a better bet. [16:09:25] * RobH has been left alone too long. [16:09:44] haha [16:09:44] ottomata: reviewing your commit now [16:09:47] danke [16:10:18] New review: RobH; "looks good" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/18254 [16:10:18] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18254 [16:10:30] ottomata: ok, you can merge it on sockpuppet [16:11:48] ok [16:12:32] k, merged [16:12:46] want me to run puppet on brewster? [16:12:55] aaagh still wants my pw for stafford [16:13:03] will pull on stafford [16:13:45] ok, done that [16:13:49] RobH, should I run puppet on brewster? [16:13:53] yep [16:14:01] if you pass your key when you go to update puppet [16:14:04] you wont have to enter a password [16:14:08] yeah i should be doing that... [16:14:16] but maybe not [16:14:32] well, when i login to a bastion, i pass my cluster key to it [16:14:37] so then i can continue to pass if i need [16:14:42] but meh, not sure its best practice [16:15:06] RobH: fixed [16:15:11] ottomata: ^ [16:15:30] cool, puppet on brewster looks good [16:15:37] should I try bootpxe + powercycle again? [16:16:02] ottomata: im off it, its all yours [16:16:05] so sure [16:16:05] k [16:16:36] k, done ,watching in console [16:20:32] hmm [16:20:34] so far same thing [16:20:41] LSI Corporation MPT2 boot ROM successfully installed! [16:20:41] [16:20:41] [16:20:41] [16:20:41] [16:20:42] [16:20:42] [16:20:43] Checking NVRAM.. 
[16:20:47] then a buncha blank lines [16:20:49] now just sitting there [16:27:02] RobH^ [16:28:23] ottomata: 10: APPLY BASEBALL BAT [16:28:30] ottomata: 20: GOTO 10 [16:28:54] do you know how I check dhcp to see if an11 is reaching it? [16:28:58] heheh [16:29:19] sorry, im in the datacenter [16:29:21] working two cases [16:29:25] plus helping jeff with a server [16:29:28] can I help? [16:29:28] tail daemon log (maybe messages...) on brewster [16:29:31] cannot divide myself further =P [16:29:32] look for mac in lofs [16:29:33] logs [16:29:43] someone help ottomata with install, this is his first go at them [16:29:46] RobH: I can help ottomata with this [16:29:54] thanks =] [16:30:44] for the record, messages, not daemon.log [16:31:06] 1011 [16:31:08] ? [16:31:31] yeah [16:31:33] analytics1011 [16:31:50] Aug 9 16:31:05 brewster dhcpd: DHCPDISCOVER from 04:7d:7b:a5:e1:b2 via 10.64.36.3: unknown network segment [16:32:09] so it is getting there [16:32:26] notpeter: is this host in row C? [16:32:30] yes [16:32:32] maplebed: yep [16:32:57] hm. I thought mark got all of row c set in dhcp when I was setting up the swift hosts, [16:33:03] not analytics subnet [16:33:05] but that'd be what I'd look for. [16:33:13] need to add that to dhcpd.conf [16:33:27] the analytics subnet is different? [16:33:31] yes [16:33:33] ahh i see that [16:33:49] yeah it is 10.64.22.0/24 [16:33:50] will do [16:33:56] we segregate off analytics related stuff because all these people want access and none of them have any business on our production stuff :P [16:34:30] that sounds like the wrong ip range actually [16:34:53] 'analytics1-c-eqiad' => { [16:34:53] 'ipv4' => "10.64.36.0/24", [16:34:54] 'ipv6' => "2620:0:861:106::/64" [16:34:56] that's what it should be [16:34:57] for analytics row c? 
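The "unknown network segment" message above means the DHCP server received the request through a relay address that falls in none of the subnets it knows about. A minimal sketch of that check, using the log line from the chat; the regular expression, the function names and the "before" list of declared subnets are assumptions made for illustration, not the actual contents of dhcpd.conf on brewster.

```python
import ipaddress
import re

# Log line copied from the chat above.
LOG_LINE = ("Aug 9 16:31:05 brewster dhcpd: DHCPDISCOVER from "
            "04:7d:7b:a5:e1:b2 via 10.64.36.3: unknown network segment")

# Assumed pattern for this one message type; real dhcpd logs have more variants.
DISCOVER_RE = re.compile(
    r"DHCPDISCOVER from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"via (?P<via>\d{1,3}(?:\.\d{1,3}){3})",
    re.IGNORECASE,
)


def parse_discover(line):
    """Return (mac, relay_address) for a DHCPDISCOVER line, or None."""
    match = DISCOVER_RE.search(line)
    if match is None:
        return None
    return match.group("mac"), ipaddress.ip_address(match.group("via"))


def matching_subnet(relay, declared):
    """Return the first declared subnet containing the relay address, or None."""
    for net in declared:
        if relay in net:
            return net
    return None


mac, relay = parse_discover(LOG_LINE)

# Hypothetical "before" state: the analytics row C range is not declared.
declared = [ipaddress.ip_network("10.64.22.0/24")]
print(mac, relay, matching_subnet(relay, declared))

# After declaring the range named in network.pp, the relay address matches.
declared.append(ipaddress.ip_network("10.64.36.0/24"))
print(mac, relay, matching_subnet(relay, declared))
```

The first lookup finds no subnet, which is the situation the log message reports; the second finds 10.64.36.0/24.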
[16:34:59] wahhhhhhh [16:35:00] yes [16:35:07] thats different than what leslie said [16:35:10] LeslieCarr: ^ [16:35:16] leslie picked a row B range earlier [16:35:28] ok, crap, i gotta run out for a sec to pick up lounch, i'll be back in a few mins, thanks guys [16:35:31] ottomata: so that means the dns files need updating [16:35:38] this is in network.pp btw [16:36:05] better fix it now before installing all the machines :P [16:36:19] paravoid: can you merge https://gerrit.wikimedia.org/r/#/c/18243/2 for me? [16:36:52] preilly: paravoid I got it [16:36:59] huh ? 10.64.36.0/24 is analytics c [16:37:01] notpeter: okay cool [16:37:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18243 [16:37:26] 10.64.21.0/24 is analytics b [16:37:27] notpeter: can you push it live too [16:37:32] preilly: yeah [16:37:45] oh [16:37:47] huh [16:38:01] remember we discussed that leslie? [16:38:07] i did put it wrong in the rdns zone [16:38:08] yeah [16:40:33] preilly: done [16:40:38] oh wait [16:41:02] 10.64.36 is labs 1 c in rdns … so analytics-1-c needs to be 10.34.37.0/24 [16:41:09] i'll fix up the dns files since i messed it up [16:43:08] LeslieCarr: let's put it in network.pp [16:43:23] or at least agree on on authoritative source ;) [16:43:26] hehe [16:43:27] (well, besides the router configs ;) [16:43:55] well and since the router configs already have .36, i'll switch labs to .37 [16:43:59] and update network.pp too [16:44:12] hmm [16:44:18] the other way round would be more consistent with row B though [16:44:23] but ok [16:44:53] i'm gonna go gokarting now [16:44:58] :) [16:45:02] have fun [16:45:04] tnx! [16:45:05] mark: enjoy! [16:45:06] see ya tomorrow [16:45:25] gokart? fun! 
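The mix-up above, where 10.64.36.0/24 was assigned to labs row C in reverse DNS and to analytics row C in network.pp, is the kind of clash a small cross-check between sources could flag. A sketch under stated assumptions: the `(source, name, cidr)` tuple layout and the function name are invented for illustration, and the name `analytics1-b-eqiad` is assumed (the chat only says "analytics b").

```python
import ipaddress
from itertools import combinations


def find_overlaps(assignments):
    """Return pairs of differently named subnets whose ranges overlap.

    `assignments` is an iterable of (source, name, cidr) tuples.
    """
    parsed = [(src, name, ipaddress.ip_network(cidr))
              for src, name, cidr in assignments]
    clashes = []
    for (s1, n1, net1), (s2, n2, net2) in combinations(parsed, 2):
        if n1 != n2 and net1.overlaps(net2):
            clashes.append((f"{s1}:{n1}", f"{s2}:{n2}"))
    return clashes


# Ranges as stated in the chat; the source labels are shorthand.
assignments = [
    ("network.pp", "analytics1-c-eqiad", "10.64.36.0/24"),
    ("rdns", "labs1-c-eqiad", "10.64.36.0/24"),
    ("network.pp", "analytics1-b-eqiad", "10.64.21.0/24"),  # name assumed
]
print(find_overlaps(assignments))
```

With the data above, the only reported pair is the analytics row C and labs row C assignment.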
[16:46:49] cmjohnson1: they should be labeled as such [16:47:16] cmjohnson1: it's two of the ciscos [16:47:27] there should be the same in pmtpa [16:48:35] seems we have virt0-15 in pmtpa [16:48:44] that seems like more than we should have [16:49:16] oh? [16:49:49] virt13 and virt14 should be labsdb1 and labsdb2 [16:50:09] virt15 is a spare [16:50:22] virt 12 will eventually be a db too [16:50:49] thanks [16:54:02] New patchset: Lcarr; "adding labs1-c-eqiad subnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18262 [16:54:41] Logged the message, Master [16:54:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18262 [16:55:52] ok, fixed dns and the routers [16:55:58] analytics subnet shouldbe happy now [16:56:24] LeslieCarr: http://funmeme.com/image.axd?picture=Toohappy.jpg [16:56:36] New patchset: Pyoungmeister; "moving sudo definitions from sudo::applicationserver into the correct modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18194 [16:57:07] hehehe [16:57:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18194 [16:57:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18262 [17:00:56] New patchset: Pyoungmeister; "moving sudo definitions from sudo::applicationserver into the correct modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18194 [17:01:27] hooray! swift upgrade (in labs) is successfully sending statsd metrics to ganglia! [17:01:32] http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&c=swiftupgrade&h=su-fe1&tab=m&vn=&mc=2&z=medium&metric_group=NOGROUPS_|_proxy-server.Object.GET.timing [17:01:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18194 [17:01:43] * Damianz finds maplebed cookies [17:01:49] NOMS [17:02:01] Heh @ network graph [17:02:21] yeah, that's just me launching a test. [17:02:44] requsting 100 GETs, sleeping 5 seconds, repeat. [17:19:37] New patchset: Bhartshorne; "swift changes for the upgrade to 1.5.0" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18264 [17:20:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18264 [17:33:26] ottomata: hey, dunno if you saw, try installing now [17:33:55] i just got done with analytics meeting [17:33:56] ok cool [17:34:17] did you change the dhcp stuff in puppet too? [17:34:21] to use the new subnet? [17:34:43] i can do that if not [17:37:07] LeslieCarr^ [17:37:15] oh [17:37:24] the dhcp should be using dns names [17:37:30] let me double check [17:37:52] yep, using dns, not specified ip [17:37:58] hm [17:38:06] what about dhcpd.conf [17:38:07] ? [17:38:11] ithink i need to add [17:38:21] # analytics1-c-eqiad subnet [17:38:21] subnet 10.64.36.0 netmask 255.255.255.0 { [17:38:21] ... [17:38:37] ah yeah [17:38:45] if i was unlazy i'd make that be auto generated [17:39:00] hmm [17:39:08] also analytics1-c-eqiad.cfg [17:39:10] changing that too [17:39:30] also netboot.cfg [17:39:38] i got it, will need review one sec [17:40:11] New patchset: Ottomata; "Analytics row c is on 10.64.36.0/24" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18266 [17:40:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18266 [17:42:52] ook, notpeter, or LeslieCarr, how do those look? [17:43:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18266 [17:43:38] lgtm [17:43:50] you have root now, right ? want to merge on sockpuppet ? 
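The `subnet 10.64.36.0 netmask 255.255.255.0 {` line above had to be added by hand, and LeslieCarr remarks that it could be auto-generated. As a rough illustration of that idea, the opening line of such a stanza can be derived from the CIDR notation used in network.pp. The function name is hypothetical, and the stanza body, which the chat elides, is deliberately left out.

```python
import ipaddress


def dhcpd_subnet_header(name, cidr):
    """Build the comment and opening line of a dhcpd.conf subnet stanza.

    Raises ValueError if `cidr` is not a valid IPv4 network address.
    """
    net = ipaddress.IPv4Network(cidr)
    return (f"# {name} subnet\n"
            f"subnet {net.network_address} netmask {net.netmask} {{")


print(dhcpd_subnet_header("analytics1-c-eqiad", "10.64.36.0/24"))
```

Because `IPv4Network` is strict by default, an address with host bits set (for example 10.64.36.3/24) is rejected rather than silently turned into a stanza.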
[17:44:00] looks merged to me [17:44:02] :) [17:46:58] yeah [17:47:00] oh you merged? [17:47:22] leslie did [17:47:24] good to go [17:47:24] k [17:47:55] ok, running on brewster, then will try booting again [17:48:02] did you merge on sockpuppet and run puppet on brewster ? [17:48:16] notpeter said you merged on sockpuppet [17:48:18] i will if you did not [17:48:24] oh, i just merged on gerrit [17:48:27] not on sockpuppet [17:48:28] ahh right k [17:48:29] doing so now [17:48:43] sorry for the confusion [17:48:45] np [17:54:51] ooo, looking better, got some console dialogs! [17:55:06] hmm [17:55:07] maybe [17:55:08] Network autoconfiguration failed │ [17:55:08] │ Your network is probably not using the DHCP protocol. Alternatively, │ [17:55:08] │ the DHCP server may be slow or some network hardware is not working │ [17:55:08] │ properly. [17:55:16] LeslieCarr notpeter, does that look normal? [17:55:35] i can retry, or configure network manually [17:56:38] uh... no [17:56:47] hrm [17:58:24] hm [17:58:36] that's farther than it got before at least! [18:00:54] oop, signed out of #ops for a sec, don't know why [18:01:01] did I miss any response from you guys? [18:01:19] Logged the message, Master [18:02:56] ottomata: brewster is getting dhcp requests [18:03:01] and sending back offers [18:03:07] ok [18:03:11] can I grab the console on 1011? [18:03:22] hm, yeah, lemme try to exit [18:03:26] oh that worked [18:03:31] oh [18:03:32] cool! [18:03:32] yea grab it [18:03:50] maybe do it in a screen [18:03:51] wait, is this a cisco or a dell? [18:03:52] so I can see to? [18:03:53] too [18:03:54] dell [18:03:55] um [18:04:14] ipmi_mgmt analytics1011.mgmt.eqiad.wmnet console [18:04:15] oh, I just want to poke around for now [18:05:12] k [18:05:31] If you don't know what to use here, consult your network │ │ administrator. [18:05:34] bam [18:05:39] who's our network admin ;) [18:05:43] j/k [18:06:49] how the fuck do I get out of this? 
[18:06:53] Im not used to rob's tool [18:08:29] u [18:08:30] ~. [18:08:35] yeah [18:08:35] ~. [18:08:38] exits the tool [18:09:29] not for me!~ [18:09:36] probably an os x terminal thing [18:10:48] i think i can close the console for you :p [18:10:59] shoudl I try? [18:11:15] why can't I just ssh to it [18:11:17] this is annoying [18:11:34] although it might be a symptom of the problem... [18:11:39] notpeter: can you merge https://gerrit.wikimedia.org/r/18270 [18:11:47] preilly: yeah [18:12:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18270 [18:12:51] its ipmi? [18:15:30] preilly: live [18:18:26] hehe [18:18:35] notpeter, lemme know if I can do anything [18:19:03] ottomata: execute the BASIC script I mentioned earlier [18:19:40] notpeter: thanks [18:20:17] hah [18:20:29] ok, i'll see if Rob wrote a basic IPMI interface [18:22:08] RobH: do you have the cable id's for the XC between the pfw's and the cr's ? [18:23:46] buuuuuuuuuuut, if Rob hasn't written a BASIC apply baseball bat IPMI interface…what should I do? :) [18:24:08] (notpeter, if you are working on it, ignore me and thank you, if not, gimme a hint?) [18:24:19] workin' on it [18:24:23] cool danke [18:26:28] LeslieCarr: i thought i put them in the relevant rt ticket [18:26:34] LeslieCarr: but if i didnt then i can get them [18:26:39] let me double check then [18:27:38] don't see it in my email search of rt [18:31:17] oh well, i will get them [18:33:21] LeslieCarr: can brewster talk to the 10.64.36 network? 
[18:33:33] should be able to [18:33:49] hrm [18:33:51] so [18:33:56] the pxe boot works [18:34:06] but then during shcp discovery for the installer [18:34:14] I dont see anything in brewster's logs [18:34:16] which is weird [18:34:51] so i need to get some access working for jeff first, then i can check this out [18:35:15] kk [19:02:32] LeslieCarr, lemme know how it goes, or maybe an eta til you can work on it (no worries, just trying to budget my time) [19:06:20] i'd say looking like 2pm when i'd work on it [19:06:27] now i'm going to try to head into the office physically [19:15:49] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [19:17:11] ok cool, thanks, i'll probably be out by then [19:17:19] will find other stuff to work on in the meantime [19:20:49] !log rebooting db1011 for upgrade [19:20:58] Logged the message, Master [19:22:24] !log rebooting db1009, db1010 for upgrades [19:22:33] Logged the message, Master [19:23:10] PROBLEM - Host db1011 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:04] RECOVERY - Host db1011 is UP: PING OK - Packet loss = 0%, RTA = 35.90 ms [19:24:49] PROBLEM - Host db1009 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:25] PROBLEM - Host db1010 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:52] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [19:26:20] RECOVERY - Host db1009 is UP: PING OK - Packet loss = 0%, RTA = 35.37 ms [19:26:20] RECOVERY - Host db1010 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [19:27:51] New patchset: Pgehres; "Updating donate links on the errors page." 
[operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/18331 [19:29:55] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [19:29:56] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [19:29:56] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [19:37:34] New patchset: Pyoungmeister; "moving sudo definitions from sudo::applicationserver into the correct modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18414 [19:38:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18414 [19:39:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18194 [19:39:17] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18414 [19:56:04] New patchset: Kaldari; "Turning on Curation Toolbar by default for en.wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18419 [20:01:44] drdee: stat1001 was fixed today [20:01:55] not sure if ya saw the rt ticket update =] [20:02:17] whoa cool [20:02:20] I didn't see it! [20:03:45] New review: Tychay; "config change :-)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/18419 [20:06:50] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [20:09:50] New review: Tychay; "I'm getting tired of them using Kaldari whenever Katie isn't around. ;-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/11979 [20:11:34] ^^^ someone approve? :-) https://gerrit.wikimedia.org/r/#/c/11979 [20:29:50] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18419 [20:33:51] RoanKattouw: Now that I'm in mortals for wmf-deployment, should I also be in this gerrit group? 
I'm not planning on deployments yet but I imagine this would be part of that (since I can already commit to the real repo from fenari). https://gerrit.wikimedia.org/r/#/admin/groups/21,members [20:34:09] Yes [20:34:26] Added you [20:34:28] ok [20:35:31] RoanKattouw: btw, regarding that repo. Do you know if there's any nontrivial things involved with this change ? https://gerrit.wikimedia.org/r/#/c/16241/ Or just waiting for merge? [20:35:56] That's a different repo [20:36:08] I don't know what's blocking that change [20:36:26] ? [20:36:35] How is that a different repo [20:36:41] mediawiki-config != wmf-deployment [20:36:54] it does [20:36:55] most certainly [20:37:15] apache-config !== mediawiki-config [20:37:27] Oooh, but wmf-deployment has rights on that repo, right [20:37:35] wmf-config is a dir in mediawiki-config [20:37:50] what other repos does that grant access to? [20:37:53] Didn't you name the repo? :p [20:37:57] wmf/1.20wmf* branches [20:38:03] Anyway, gotta run for lunch [20:38:09] oh, right. [20:45:49] New patchset: Asher; "moving eqiad enwiki snapshot host from db1033 to db1050" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18423 [20:46:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18423 [20:47:28] PROBLEM - MySQL disk space on db1011 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): [20:48:18] New patchset: Kaldari; "Committing Roan's live hack for VisualEditorParsoidPrefix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18424 [20:48:54] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18424 [20:48:58] RECOVERY - MySQL disk space on db1011 is OK: DISK OK [20:59:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18423 [21:04:37] New review: Catrope; "Putting this on hold until $wgVisualEditorParsoidPrefix is no longer broken" [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/18125 [21:13:21] !log stopping mysql on db1017 for a minute (and all enwiki eqiad replication with it) [21:13:29] Logged the message, Master [21:13:40] binasher: I need some advise on placement of the additional two es servers in eqiad [21:13:46] the inital 4 are all in different racks [21:13:59] but thats because we had 3 database rack + a misc rack [21:14:13] do the additional two need to be in different racks as well? [21:14:27] (i can find room, just want to know what to do) [21:14:39] hey, yesterday there was some weirdness when i tried to log in to fenari that leslie fixed -- i think i didn't have a home directory or something like that. i tried to run sync-file now and failed and it occurred to me that my environment might not be properly set up (profile / bashrc & friends) [21:14:39] they arrived today. [21:15:49] the six are going to make up 2 clusters of 3 hosts. so these 2 can share racks with others, just as long as they aren't both in the same rack. then i can make sure that each set of three spans different racks [21:16:10] and those assignments arent set yet and you can swap them to accomodate where i find spcae? 
[21:16:19] (yay, typos!) [21:16:35] ori-l: what did sync-file say? [21:16:53] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 240 seconds [21:17:01] RobH: yep, it doesn't matter which hosts go into which cluster [21:17:02] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 250 seconds [21:17:10] Reedy: that i didn't have ssh-agent running. but running ssh-agent didn't seem to fix things. i am indeed missing the files in /etc/skel -- i presume i should just copy them over [21:17:11] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 260 seconds [21:17:29] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 278 seconds [21:17:29] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 279 seconds [21:17:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 286 seconds [21:17:47] PROBLEM - mysqld processes on db1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:18:45] binasher: awesome, thanks! [21:18:54] ori-l: worth trying, though, I don't think we modify them [21:19:17] RECOVERY - mysqld processes on db1017 is OK: PROCS OK: 1 process with command name mysqld [21:20:13] !log resumed enwiki replication in eqiad [21:20:22] Logged the message, Master [21:23:29] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 372 seconds [21:24:05] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 346 seconds [21:27:41] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 24 seconds [21:27:41] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 24 seconds [21:27:41] RobH: do you know if the two new es hosts for tampa are here too? 
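The rack constraint described above (six es hosts forming two clusters of three, each cluster spanning three different racks) can be checked mechanically. What follows is only a sketch under assumptions: the host names and rack labels are hypothetical, since the log does not name them, and the function simply enumerates every way to split six hosts into two groups of three and keeps the splits where no cluster has two hosts in the same rack.

```python
from itertools import combinations


def valid_splits(rack_of):
    """Yield (cluster_a, cluster_b) splits of six hosts into two clusters
    of three where no two hosts in the same cluster share a rack."""
    hosts = sorted(rack_of)
    if len(hosts) != 6:
        raise ValueError("expected exactly six hosts")
    first = hosts[0]  # pin one host to cluster_a so each split appears once
    for rest in combinations(hosts[1:], 2):
        cluster_a = (first, *rest)
        cluster_b = tuple(h for h in hosts if h not in cluster_a)
        if all(len({rack_of[h] for h in c}) == 3 for c in (cluster_a, cluster_b)):
            yield cluster_a, cluster_b


# Hypothetical layout: four existing hosts in four different racks, plus two
# new hosts that share racks with existing ones but not with each other.
rack_of = {
    "es-a": "rack1", "es-b": "rack2", "es-c": "rack3", "es-d": "rack4",
    "es-e": "rack1", "es-f": "rack2",
}
splits = list(valid_splits(rack_of))
for cluster_a, cluster_b in splits:
    print(cluster_a, cluster_b)

# If both new hosts landed in the same rack as an existing host, three hosts
# would share one rack and no valid split would exist.
bad_layout = dict(rack_of, **{"es-f": "rack1"})
print(list(valid_splits(bad_layout)))
```

This matches the reasoning in the channel: the two new hosts can share racks with existing hosts, as long as they are not both in the same rack, and the cluster assignment can then be chosen afterwards.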
[21:27:50] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 14 seconds [21:27:50] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 9 seconds [21:28:26] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 0 seconds [21:28:35] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [21:28:35] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [21:28:44] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [21:30:33] binasher: also, is db1048 being used in any way? I dont see it in any db config [21:30:44] i would like to move it up in the rack, as db1047 needs a disk shelf added [21:30:54] and it doesnt appear to be doing something right now, so its the easy one to move. [21:31:09] (i wont be taking down db1047 right now, not yet) [21:32:51] RobH: db1048 is the gerrit master [21:33:03] oh, damn. [21:33:11] well, hrmmmmm [21:33:13] yeah :/ so that'll have to be scheduled a day or two in advance [21:33:24] well, the other option is db1046 [21:33:55] db1046 is the replica of db1048 but that's ok to take down [21:34:13] ok, so I can move that to above db1050 (I am going to go look at it, i may not do it today) [21:35:03] if you really want to move 1048, we can make gerrit use db48 in tampa for the duration of the maintenance.. they are master/master. so it would just be a minute of downtime and gerrit being slow again [21:35:32] the analytics ppl will be oh so happy for the disk shelf [21:36:26] below is even easier, lower in the cabinet the easier on me [21:36:37] cool [21:38:40] binasher: ok, so i can take down db1046 at any time as long as i log it, shutdown mysql cleanly, and do a normal shutdown [21:38:52] when its booted back up a few u higher in the rack it should just work [21:39:00] yes? 
[21:39:10] i can do it now if so [21:39:31] then i can rack the shelf so the only thing stopping analytics getting the storage space is downtime on their own server [21:40:03] yeah, that should all be ok [21:40:23] cool [21:40:28] tho you'll have to start mysql when it boots back up [21:40:36] ahh, duly noted [21:40:48] !log shutting down db1046 to migrate its position in the rack [21:40:56] Logged the message, RobH [21:40:58] drdee: fyi ^^^ db1046 = where the gerritro user is used.. i see its connected to from stat1, though has been idle forever [21:41:49] ok, its shutting down, will be afk moving it, back shortly [21:41:56] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:44:02] PROBLEM - Host db1046 is DOWN: PING CRITICAL - Packet loss = 100% [21:44:20] <^demon|away> binasher: 1046? I thought gerrit was on 1042. [21:44:27] <^demon|away> (maybe wrong, so feel free to lambast me) [21:46:06] <^demon|away> We're both wrong. It's db1048 :p [21:46:12] <^demon|away> According to puppet. [21:49:26] ^demon|away: Yeah he said it was 1048 earlier [21:49:52] See scrollback for how binasher stopped RobH from taking down db1048 to move it around :) [21:52:04] * binasher reloads a gerrit page just to make sure the above conversation wasn't triggered by something terrible happening.. ok, phew  [21:52:42] <^demon|away> RoanKattouw: Meh, scrollback is hard. [21:59:12] ok, well, db1046 is coming back now [21:59:48] New patchset: Aaron Schulz; "Added swift to multiwrite backend for all wikis." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18432 [21:59:58] \o/ [22:02:00] !log db1046 back online [22:02:09] Logged the message, RobH [22:02:10] RECOVERY - Host db1046 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms [22:03:03] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/18432 [22:04:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:07:16] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 246 seconds [22:07:25] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 255 seconds [22:11:37] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [22:11:46] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [22:19:44] ori-l: were you able to get everything deployed in the remainder of the window? [22:20:09] yes, thank you [22:20:47] yay [22:33:36] kaldari: Is there a reason you use a full-sync instead of specific files or directories? I haven't used scap before and I see other people do file-scapping a lot. Just wondering if it matters. [22:33:53] I imagine someone else might be working on another file or something. Its already hard to keep track of one file on fenari (with uncommitted changes) [22:34:56] ah, Its starting to make sense. [22:35:33] hm.. or not. in the log ic it after a singular file sync. so something happened in between? [22:37:09] I usually just sync individual files. The scap was for the launch of the curation toolbar in PageTriage which is a huge new feature with dozens of js and css files and tons of messages, etc. [22:37:27] i18n changes always require scap [22:37:29] scap extensions/PageTriage ? 
[22:37:30] to rebuild the l10n cahce [22:37:41] Krinkle: No, scap is global, it doesn't take a file or directory argument [22:37:54] "17:32 logmsgbot: reedy synchronized wmf-config/" [22:38:02] Yeah that's sync-dir [22:38:13] "19:48 logmsgbot: catrope synchronized php-1.20wmf9/extensions/VisualEditor/" [22:38:16] * RoanKattouw refers Krinkle to http://wikitech.wikimedia.org/view/How_to_deploy_code [22:38:18] right, so there is a dir one as well [22:38:21] Yes [22:38:27] And sync-file [22:38:45] It was just a big confusing in the log I saw lots of full sync that take up to an hour without it being obvious what was going on there [22:39:01] do those syncs take from the repo or the local files on fenari directly?> [22:39:08] I added the rebuildLocalisationCache.php script to the deploy instructions, but I haven't found it to work consistantly [22:39:10] they pull from NFS [22:39:26] right, so from the disk. [22:39:29] otherwise I would try using that instead of scap when possible [22:39:48] kaldari: essentially, it'd still be the same thing [22:40:02] just you could limit it to then only sync the files in cache/l10n [22:40:44] Reedy: ah, so you need to run both rebuildLocalisationCache and sync cache/l10n for it to work? [22:40:49] yeah [22:40:52] Not that I know, but I imagine if someone is vim'ing a file in wmf-config and then someone runs a full sync unannounced that may cause issues. I'd say don't edit files on fenari directly, but people do that apparently. [22:40:55] still it's hundreds of megs x 100s of servers [22:41:06] Krinkle: old habits die hard [22:41:16] yeah [22:41:23] (the 100s x 100s) [22:41:25] new habits too :) [22:42:28] kaldari: no criticism on you. Im just looking at the logs and trying to make sense of it and saw a fair amount of full-syncs from you that seemed a bit like a work-around. Which it turns out to be, so that's fine then. 
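The rules of thumb from this exchange can be written down as a tiny decision helper. This is a sketch of the reasoning only, not a deployment tool: it returns the name of a tool mentioned in the log (scap, sync-dir, sync-file) and never runs anything. The function name, the trailing-slash convention for directories, the sample file path, and the fallback for multiple paths are assumptions made for illustration.

```python
def suggest_sync_tool(changed_paths, touches_i18n=False):
    """Suggest which deploy tool named in the log fits a change.

    Illustrative only: returns a tool name and does not execute anything.
    Directories are assumed to be written with a trailing slash.
    """
    if not changed_paths:
        raise ValueError("no changed paths given")
    if touches_i18n:
        # Per the log: i18n changes always require scap, and scap is global
        # (it takes no file or directory argument).
        return "scap"
    if len(changed_paths) == 1:
        path = changed_paths[0]
        return "sync-dir" if path.endswith("/") else "sync-file"
    # Assumption: several unrelated paths are synced one at a time.
    # The log does not state a rule for this case.
    return "sync-file or sync-dir per path"


# Directory paths as they appear in the log messages; the file path is a
# hypothetical example.
print(suggest_sync_tool(["wmf-config/"]))
print(suggest_sync_tool(["php-1.20wmf9/extensions/VisualEditor/"]))
print(suggest_sync_tool(["wmf-config/SomeSettings.php"]))
print(suggest_sync_tool(["php-1.20wmf9/extensions/VisualEditor/"], touches_i18n=True))
```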
[22:42:32] I also announce any time I'm going to run a scap in #wikimedia-tech and then wait a few minutes in case anyone is doing some live hacking [22:42:50] cool [22:43:01] I only ran one scap today [22:43:07] though -tech apparently has been repurposed into a community support channel for javascript and css, so maybe here as well. [22:43:39] (or wikimediawiki tech stuff in general rather) [22:44:10] https://meta.wikimedia.org/wiki/Tech - yeah, weird stuff. I only found out about a week ago [22:44:46] which I guess is why -operations was created. lol, ops keep moving their channel away from non-ops/devs overflow [22:44:55] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [22:47:28] I'll try using Reedy's suggestion next time and see how well it works. [22:47:45] It probably won't make much difference... [22:47:54] changes are rsync'd [22:48:57] has anyone considered writing a "scap-light" script? [22:49:22] RoanKattouw: so, should we interface with /h/w/common or /apache (/usr/local/apache/common) both appear the same on fenari. Documentation pages mention both mixed. [22:49:27] kaldari: To do what? [22:49:46] Krinkle: Use /home on fenari , /apache on Apaches [22:50:01] You should normally only be touching fenari unless you're debugging a specific Apache or something [22:50:13] maybe that takes some arguments to just affect certain dirs/processes [22:50:20] so scap scups to fenari itself as well ? [22:50:26] (to /apache) [22:51:14] yeah, sync-common [22:52:06] New patchset: Pgehres; "Updating donate links on the errors page." [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/18331 [22:53:02] Reedy: but e.g. to do something for test.wikipedia.org (which runs from fenari) that would be on /h/w/ right? Or does test.wikipedia run from the apache side on fenari? [22:53:17] no, srv193 is the exception [22:53:25] that has /home mounted [22:53:36] e.g. 
require a sync first [22:53:39] just change the file and it's live on srv193 [22:53:49] right [22:54:05] srv193 is not fenari though, right ? [22:54:10] no [22:54:12] it mounts the same NFS [22:54:14] ok [22:55:04] what happened on wikitech? http://wikitech.wikimedia.org/view/Srv193 [22:55:05] [22:55:08] that wasn't there last wek [22:55:14] 1.17wmf1 [22:55:16] O_O [22:56:58] Reedy: does srv193 also serve other wikis (and execute /h/w instead of /usr/local/apache for testwiki wgDBname) or is it exclusively for test wiki ? [22:57:12] all it does is testwiki [22:57:23] ok [22:58:45] "" looks like more i18n is broken on wikitech, weird. [22:59:06] mutante tried to upgrade it, but stuff broke [22:59:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [23:05:33] Reedy: that was wikitech, which isnt on srv193 but outside the cluster [23:05:46] ah, got it [23:05:47] I know [23:05:49] heh :) [23:05:51] yep yep [23:06:20] New patchset: preilly; "new netmask for acl" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18438 [23:06:26] notpeter: ^^ [23:07:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/18438 [23:07:18] Ryan_Lane: ^^ [23:09:34] LeslieCarr: ^^ [23:09:45] one minute [23:09:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/18438 [23:10:07] got it [23:10:15] on phone with juniper [23:10:22] LeslieCarr: sorry [23:10:30] me too ! [23:10:48] preilly: pushing out now [23:11:51] notpeter: cool thanks [23:12:26] done [23:13:20] preilly: np [23:14:56] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [23:29:56] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours