[00:00:36] !log rebooting / upgrading kernel on es1003 first [00:00:41] Logged the message, Master [00:03:39] Platonides: haha. that one took me a while [00:04:13] LeslieCarr: hey, i'm just as far east as he! [00:04:28] so why arne't you out drinking ? :) [00:04:37] oooh, good idea [00:05:10] hmmmmm, should i use Leslie's whiskey? [00:05:42] it's national bourbon day [00:13:40] New patchset: Bsitu; " Add configuration variable to MoodBar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11581 [00:13:46] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11581 [00:14:46] New review: Bsitu; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11581 [00:14:48] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11581 [00:46:06] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [01:07:44] New patchset: SPQRobin; "Bug 37614 - Change namespace "Wikipidia pamandiran" to "Pamandiran Wikipidia" for bjnwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11583 [01:07:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11583 [01:33:30] New review: Jeremyb; "running for the train, will write a comment later" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11583 [01:40:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 232 seconds [01:42:04] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 243 seconds [01:47:46] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [01:48:22] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 623s [01:51:22] RECOVERY - Misc_Db_Lag on storage3 
is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 36s [01:52:07] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 20 seconds [03:51:55] New review: Jeremyb; "* The new wgMetaNamespaceTalk value matches the NS_PROJECT_TALK value in all of master, 1.20wmf[45]...." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11583 [03:58:38] can anyone tell me the prod gerrit version? [03:58:50] and how is it deployed? manual download? [03:59:07] there seem to be no war files in production puppet repo [04:17:25] New patchset: Jeremyb; "gerrit: fix some overzealous commentlink patterns" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11589 [04:17:53] New review: Jeremyb; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/11589 [04:17:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11589 [04:21:48] New review: Jeremyb; "Maybe there's a way to guarantee there's no match inside a URL or to specify precedence order for co..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/11589 [04:22:41] good night ;) [05:56:43] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:56:43] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [05:59:25] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [06:02:16] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [06:59:43] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [07:03:47] New review: Dzahn; "per mail to list" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8339 [07:03:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8339 [08:04:15] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:23] New review: Mark Bergsma; "Comments inline." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11574 [08:25:15] New review: Dzahn; "see inline comment" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6489 [08:32:44] New patchset: Dzahn; "mwscriptwikiset - do not rely on mwscript being in path (f.e. cron jobs)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6489 [08:33:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6489 [08:39:08] New review: Dzahn; "just dont feel like merging out of fear of breaking Nagios and causing notification bomb because of ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3291 [08:53:52] New review: Dzahn; "a good time to merge this would be ..ehm... a general outage :p" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3291 [08:54:49] ?? [08:54:56] for an nrpe init file? 
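[The replication-lag pages earlier in this log (db1025 at 232 s, storage3 at 623 s, with recoveries at 10 s and 20 s) come from checks that compare MySQL's Seconds_Behind_Master against warning/critical thresholds. The decision logic reduces to something like the sketch below — the warn/crit values here are illustrative, not the production check's configuration:]

```python
# Nagios-style verdict from MySQL's Seconds_Behind_Master.
# warn/crit thresholds are invented for illustration; the production
# checks quoted above use their own values.
def lag_status(seconds_behind, warn=60, crit=180):
    if seconds_behind is None:   # replication thread stopped:
        return "CRITICAL"        # SHOW SLAVE STATUS reports NULL
    if seconds_behind >= crit:
        return "CRITICAL"
    if seconds_behind >= warn:
        return "WARNING"
    return "OK"

print(lag_status(232))  # CRITICAL, as paged for db1025
print(lag_status(10))   # OK, as in the recovery message
```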
[08:55:03] i'd merge that in my sleep [08:55:09] because in the past when we had those Nagios bombs [08:55:21] puppet tried restarting it but failed [08:55:25] on all boxes [08:55:33] i don't even think those are set as critical [08:55:39] please do:) [08:55:44] why is the init script needed? [08:55:50] i broke it before, so ..hrmm [08:55:59] just cause it wasnt puppetized [08:56:07] and the file itself is already merged in repo [08:56:08] no but presumably it's in the package? [08:56:11] just not the puppet definition [08:56:34] we added changes in the past [08:56:39] trying to fix other issues [08:56:45] what other issues? [08:57:18] something about failed restarts, adding a "sleep" afair [08:57:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3291 [08:57:32] perhaps time to convert that into a sane upstart job then [08:57:37] that was to prevent some weird bug when it tried to restart to early [08:57:54] sounds right @ upstart [09:56:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:44:58] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5492 [10:45:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5492 [10:47:32] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [10:54:44] New review: Ryan Lane; "I'll be honest. I now have absolutely no idea what this code is doing." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8120 [10:57:56] +INFILE="<%= ircecho_logs.map {|k,v| v = v.map {|c| c.sub(/^#?/,'#') }.join(","); "#{k}:#{v}" }.join(";") %>" [10:58:22] haha [11:01:31] New review: Ryan Lane; "Please try to avoid using crazy lambdas when possible. Yes, they are exquisite, but I don't like fee..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8344 [11:25:51] New patchset: Mark Bergsma; "Prepare NaiveBGPPeering for multi-protocol operation" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11601 [11:25:52] New patchset: Mark Bergsma; "Attempt to make NaiveBGPPeering handle other address families" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11602 [11:25:53] New patchset: Mark Bergsma; "Clean up handling of Attribute constructors with their ancestors" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11603 [11:25:53] New patchset: Mark Bergsma; "Reimplement move of advertisements/withdrawals to MP attributes" [operations/debs/pybal] (mp-bgp) - https://gerrit.wikimedia.org/r/11604 [11:36:19] New patchset: Hashar; "'gs' package renamed 'ghostscript' in Precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11605 [11:36:40] paravoid: gs to ghostscript rename for production ^^^^^ [11:36:43] and hello :-] [11:36:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11605 [11:36:54] oh great [11:37:07] you're one step ahead :-) [11:37:17] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11605 [11:37:19] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11605 [11:37:30] the package exist in Lucid, so we might as well migrate production right now :-] [11:37:42] thx! 
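[The ircecho ERB one-liner quoted above — `INFILE="<%= ircecho_logs.map {...} %>"` — collapses a hash of log files to IRC channels into a single `file:#chan,#chan;file:#chan` string, normalizing each channel name to exactly one leading `#` via `sub(/^#?/,'#')`. Re-expressed in Python for readability; the data below is invented, only the transformation mirrors the template:]

```python
def build_infile(ircecho_logs):
    """Mimic the ERB lambda: {logfile: [channels]} -> "f:#a,#b;g:#c"."""
    parts = []
    for path, chans in ircecho_logs.items():
        # like sub(/^#?/, '#'): guarantee exactly one leading '#'
        chans = ["#" + c.lstrip("#") for c in chans]
        parts.append(path + ":" + ",".join(chans))
    return ";".join(parts)

example = {"/var/log/demo.log": ["wikimedia-dev", "#wikimedia-operations"]}
print(build_infile(example))
# /var/log/demo.log:#wikimedia-dev,#wikimedia-operations
```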
[11:42:09] PROBLEM - Host cp3002 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:09] PROBLEM - Host cp3001 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:21] erm [11:42:36] PROBLEM - Host knsq21 is DOWN: CRITICAL - Host Unreachable (91.198.174.31) [11:42:36] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:36] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:36] PROBLEM - Host knsq20 is DOWN: CRITICAL - Host Unreachable (91.198.174.30) [11:42:45] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:45] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:45] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:54] PROBLEM - Host knsq17 is DOWN: CRITICAL - Host Unreachable (91.198.174.27) [11:42:54] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:54] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:03] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:15] looking [11:44:06] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:06] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:06] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:15] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:15] PROBLEM - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:33] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:33] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:33] PROBLEM - LVS HTTPS IPv6 
on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:42] PROBLEM - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:42] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:42] PROBLEM - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:51] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:51] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:00] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:00] PROBLEM - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:00] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:09] PROBLEM - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:10] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:10] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:10] PROBLEM - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:10] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:45:54] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:33] !log csw1-esams.wikimedia.org line card 2 in trouble, power cycled it [11:46:38] Logged the message, Master [11:47:04] it 
sensed that replacement hardware has arrived at the data center I think [11:47:06] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 44110 bytes in 0.666 seconds [11:47:06] RECOVERY - LVS HTTPS IPv6 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60057 bytes in 0.665 seconds [11:47:06] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60059 bytes in 9.657 seconds [11:47:06] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80875 bytes in 9.771 seconds [11:47:07] just hours ago [11:47:15] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.442 seconds [11:47:24] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80875 bytes in 0.770 seconds [11:47:24] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39111 bytes in 1.073 seconds [11:47:24] RECOVERY - LVS HTTPS IPv6 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60058 bytes in 0.663 seconds [11:47:24] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 108.91 ms [11:47:24] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 108.83 ms [11:47:24] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 108.95 ms [11:47:25] RECOVERY - Host cp3001 is UP: PING OK - Packet loss = 0%, RTA = 121.32 ms [11:47:25] RECOVERY - Host knsq27 is UP: PING OK - Packet loss = 0%, RTA = 108.73 ms [11:47:26] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 108.65 ms [11:47:26] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 108.75 ms [11:47:27] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 109.52 ms [11:47:33] RECOVERY - LVS HTTPS IPv6 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60057 bytes in 0.665 seconds [11:47:33] 
RECOVERY - LVS HTTPS IPv6 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60058 bytes in 0.667 seconds [11:47:33] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43917 bytes in 0.664 seconds [11:47:42] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60053 bytes in 0.772 seconds [11:47:42] RECOVERY - Host knsq17 is UP: PING OK - Packet loss = 0%, RTA = 108.82 ms [11:47:42] RECOVERY - Host cp3002 is UP: PING OK - Packet loss = 0%, RTA = 109.27 ms [11:47:51] RECOVERY - LVS HTTPS IPv6 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60060 bytes in 0.664 seconds [11:47:51] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52459 bytes in 0.664 seconds [11:47:51] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 108.81 ms [11:47:51] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 108.59 ms [11:47:51] RECOVERY - Host knsq22 is UP: PING OK - Packet loss = 0%, RTA = 108.71 ms [11:48:00] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.445 second response time [11:48:00] RECOVERY - LVS HTTPS IPv6 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60056 bytes in 0.669 seconds [11:48:00] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 69663 bytes in 0.774 seconds [11:48:00] RECOVERY - LVS HTTPS IPv6 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60057 bytes in 0.882 seconds [11:48:00] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3837 bytes in 9.919 seconds [11:48:00] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60453 bytes in 0.979 seconds [11:48:28] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK 
HTTP/1.1 200 OK - 48808 bytes in 0.904 seconds [11:51:49] paravoid: while you are around I have an other occurrence of Lucid -> Precise packages migration. That one is nastier though [11:51:51] https://gerrit.wikimedia.org/r/#/c/11358/ [11:51:54] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:09] some packages providing fonts have been deleted, so we will have to explicitly distribute the fonts we need :-/ [11:55:03] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:55:50] top - 11:55:48 up 77 days, 22:08, 1 user, load average: 12923.16, 9550.86, 4640.65 [11:57:18] ouch [11:57:26] never seen something that high [11:57:45] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 2.567 seconds [11:58:30] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:59:42] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3873 bytes in 0.612 seconds [12:00:54] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.219 seconds [12:01:21] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3890 bytes in 0.439 seconds [12:02:24] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:05:15] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.218 seconds [12:10:23] I didn't get paged [12:10:29] dammit [12:14:17] mark: did you get paged for the above? [12:14:33] yes [12:15:31] sigh [12:16:09] New patchset: Hashar; "redirects (301) /w/ to /w/index.php" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/11606 [12:19:40] paravoid: would you mind looking at another Lucid to Precise packages migration? 
:D https://gerrit.wikimedia.org/r/#/c/11358/ :-] [12:22:05] I didn't forget you :) [12:22:06] New patchset: Faidon; "Add faidon to SMS Nagios group (doh!)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11607 [12:22:07] sec. [12:22:12] that ^^^ is more important :-) [12:22:17] totally :-] [12:22:30] just making sure I am still somewhere in the queue !! [12:22:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11607 [12:22:38] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11607 [12:22:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11607 [12:23:39] now we just need another outage to be able to test this :> [12:23:52] paravoid: someone called me? :D [12:24:06] so which system you want to break [12:24:07] :) [12:25:15] New review: Siebrand; "Interesting discussion, by the way. Who's the owner of operations/mediawiki-config and who should/ca..." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/11082 [12:25:38] hashar: so [12:25:57] what did you meant when you said that we'll have to ship the fonts ourselves? [12:27:15] paravoid: I meant, we need to explicitly list the fonts [12:27:22] instead of using the meta packages that regroups fonts per language [12:27:31] probably need to be reworded [12:27:43] aha [12:27:59] I wonder what they replaced language-support-fonts-* with [12:30:37] so, [12:30:38] could not find any package replacement [12:30:40] if we have to do this anyway [12:30:45] then why not list the fonts for lucid too? 
[12:31:06] I used "apt-cache depends" to get list of fonts and rdepends on precise to try to find out some meta package [12:31:22] right, I just did that too :) [12:31:47] I did not want to mess too much with Lucid :-] [12:31:52] so I just copy pasted the existing packages [12:32:01] well, if it's doing the same… [12:32:02] but now we have two classes :/ [12:32:06] sure [12:33:04] mark: btw, what did you to powercycle the linecard? [12:33:14] not that I have access to that but wondering :) [12:33:17] power-off [12:37:50] paravoid: fonts packages : https://gerrit.wikimedia.org/r/#/c/11358/ ||| https://gerrit.wikimedia.org/r/#/c/11358/3/manifests/imagescaler.pp,unified [12:38:26] grrr [12:38:36] I should move the other fonts in that imagescaler::packages::fonts [12:39:02] liberation and libertine you mean? [12:39:06] yeah [12:39:18] sure [12:39:25] love the nice commit messages too :-) [12:39:32] though that commit make it clear that we made a transition from language-support-fonts to explicit fonts definition [12:40:21] New review: Siebrand; "There's probably a reason why you approved but didn't merge. Does that need any additional informati..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10707 [12:42:01] New review: Nikerabbit; "I only merge stuff to config when I am going to deploy it immediately after, or I know someone else ..." 
[operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10707 [12:44:49] damn river [12:49:41] Patchset 4 moves fonts related packages from imagescaler::packages to imagescaler::packages::fonts [12:50:16] paravoid: ok got it rebased with all fonts in the same package https://gerrit.wikimedia.org/r/#/c/11358/ [13:05:56] :) [13:10:03] danke trying out on test [13:16:47] seems to work [13:22:34] New patchset: Hashar; "job runner now supports being run on a specific job type" [operations/debs/wikimedia-job-runner] (master) - https://gerrit.wikimedia.org/r/11610 [13:25:06] New review: Hashar; "This kind of follow up https://gerrit.wikimedia.org/r/#/c/11041/ which made jobs-loop.sh to recogniz..." [operations/debs/wikimedia-job-runner] (master) C: 0; - https://gerrit.wikimedia.org/r/11610 [13:39:25] !log adding labs-ns0 and labs-ns1 dns entries [13:39:31] Logged the message, Master [13:41:36] why does wikipedia not resolve for me? [13:44:58] i had that happen to me once [13:45:09] but it was because I had edited my hosts file and pointed wikipedia at localhost [13:45:18] the I went and told this room that wikipedia didn't work [13:45:21] then* [13:47:04] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11145 [13:47:06] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11145 [13:48:06] New review: Hashar; "Merging test related stuff, not going to kill site :-]" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11146 [13:48:09] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11146 [13:48:17] so many stuff to do [13:57:28] New review: Demon; "Test comment." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11314 [14:08:53] hey mark, i'm improving the sysctl define and using it [14:08:54] q [14:09:03] the number prefixes in sysctl.d files [14:09:14] should I just use 60- for all of our custom prefixes? [14:09:20] should I make that configurable? [14:09:22] does it matter? [14:09:32] do those just specify priority load order? [14:10:42] ottomata: as long as they don't overlap it doesn't really matter [14:11:03] as in there can't be 2 files with the same number (there currently are) [14:11:22] or as long as they don't overlap with numbers and values? [14:11:32] the latter [14:11:36] ah ok cool [14:11:48] cool, I will just leave them at 60 then [14:15:46] @infobot-ignore+ log [14:15:47] petan: Unknown identifier (log) [14:15:47] Item log is already in list [14:16:21] why do we have those bots in here? [14:20:56] New patchset: Demon; "Fix spammy IRC output once and for all." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11615 [14:21:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11615 [14:21:46] New patchset: Demon; "Fix spammy IRC output once and for all." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11615 [14:22:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11615 [14:23:41] Change abandoned: Demon; "Dropping in favor of I9c96d565." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11314 [14:26:05] <^demon> Ryan_Lane: Mind taking a look at https://gerrit.wikimedia.org/r/#/c/11615/? The logic's much better and it actually does what I intended originally. 
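[On the sysctl.d prefix question above: the numeric prefix only fixes lexical load order, and for a key set in two fragments the file applied later wins, which is why the constraint is "no two files with the same number and the same key" rather than any particular prefix value. A toy model of that ordering rule, with invented filenames and values:]

```python
# sysctl.d fragments are read in sorted filename order; when two
# fragments set the same key, the later (higher-prefixed) one wins.
def effective_sysctl(fragments):
    settings = {}
    for name in sorted(fragments):   # lexical order, like the loader
        settings.update(fragments[name])
    return settings

fragments = {
    "60-wikimedia.conf": {"net.core.rmem_max": "4194304"},
    "70-override.conf":  {"net.core.rmem_max": "8388608"},
}
print(effective_sysctl(fragments)["net.core.rmem_max"])
# 8388608 -- the 70- file overrides the 60- file
```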
[14:27:20] I swear if this breaks the hooks, I'll stab you :) [14:27:39] gimme a min, that's a lot of files [14:27:52] ^demon: https://gerrit.wikimedia.org/r/#/c/11615/2/files/gerrit/hooks/hookhelper.py,unified [14:27:56] <^demon> It's a bunch of 2 line changes. And I tested it. [14:27:57] you are doing a strict equality there [14:28:00] \o/ [14:28:03] if user in hookconfig.spammyusers [14:28:30] <^demon> I'm using the regex'd version that you see on IRC. [14:28:33] <^demon> "Demon" [14:28:40] <^demon> "L10n-bot" [14:28:40] <^demon> etc [14:28:53] * mark does a bunch of 2-line changes on ^demon [14:29:04] ^demon: oh yeah:) [14:29:47] ^demon: you could just do that regex at the log_to_file() level though [14:30:05] to factor out the code like: user = re.sub(' \(.*', "", something ) [14:30:06] <^demon> Meh, each of the messages are a bit different. [14:30:23] <^demon> I'm tired of playing with it. I just want it to fucking work. [14:30:41] <^demon> If this works I'm going to wash my hands of it and not touch it again. [14:31:09] New patchset: Hashar; "(bug 34866) Change wgLanguageCode of several wikis to be renamed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10707 [14:31:16] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/10707 [14:31:40] New review: Hashar; "Patchset 2 is a rebase" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10707 [14:31:43] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10707 [14:32:48] ^demon: well git blame know about you! 
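[The factoring suggested above — running `re.sub(' \(.*', "", ...)` once in `log_to_file()` instead of per message — amounts to normalizing the Gerrit user string before the spammy-user membership test. A sketch of that idea; the user list and inputs are examples, not the real hookhelper.py code:]

```python
import re

SPAMMY_USERS = {"L10n-bot", "Demon"}   # illustrative list

def normalize_user(raw):
    """Strip the '(ident@host)' suffix: 'Demon (chad@...)' -> 'Demon'."""
    return re.sub(r' \(.*', '', raw)

def is_spammy(raw):
    return normalize_user(raw) in SPAMMY_USERS

print(is_spammy("L10n-bot (l10n-bot@example.org)"))  # True
print(is_spammy("Hashar (hashar@example.org)"))      # False
```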
[14:32:53] you will not escape :-] [14:33:04] hume: rsync: write failed on "/apache/common-local/wmf-config/InitialiseSettings.php": No space left on device (28) [14:33:14] !log hume is out of disk space [14:33:19] Logged the message, Master [14:34:20] !log hume: 5.0G 5.0G 68K 100% /usr/local/apache [14:34:25] Logged the message, Master [14:35:09] hashar: slooooow [14:35:14] It was out of disk space on monday [14:35:25] we need to have that partition resized [14:35:44] <^demon> We've got wmf[2-5] all on it [14:35:52] need notpeter to look at it probably [14:38:35] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11615 [14:38:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11615 [14:39:02] I killed the old l10n cache dirs we dont need around [14:39:11] saved quite a bit [14:39:32] Reedy: so we are killing data now instead of removing? :D [14:39:59] yeah [14:40:01] DIE DIE DIE [14:40:15] New review: Ryan Lane; "test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11615 [14:40:20] thank Reedy :-) [14:40:28] ^demon: ^^ done [14:40:46] <^demon> Thanks. Hopefully that did it. [14:40:59] <^demon> At the very least, it'll be easier to just change the one big of config if I got the name wrong [14:41:00] hopefully, yeah :) [14:41:05] /dev/mapper/tank-archive 1.5T 271G 1.3T 19% /archive [14:41:12] I think there's some free space there ;) [14:41:14] <^demon> *bit [14:42:33] I get like 2TB myself at home [14:42:35] mostly empty [14:42:45] that is just for my laptops backups :-D [14:45:11] Reedy: willing to review a small patch for me please ? https://gerrit.wikimedia.org/r/#/c/7298/ [14:45:23] to override wfHostname() return [14:45:41] need that on labs to get ride of the non human friendly instances names such as I-0000000D0E [14:46:01] thinking of needing reviews... 
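[hume's failure above (`No space left on device (28)` during the rsync of wmf-config) is the sort of thing a pre-flight free-space check catches before the sync starts. A minimal guard, with the path and threshold purely illustrative:]

```python
import shutil

def has_headroom(path, min_free_bytes):
    """True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes

# e.g. a sync wrapper could bail out early instead of half-writing files:
#   if not has_headroom("/usr/local/apache", 500 * 1024 * 1024): abort
print(has_headroom("/tmp", 0))  # True: free space is always >= 0
```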
[14:46:16] I still have extension changes that need reviews ;) [14:46:28] https://gerrit.wikimedia.org/r/#/c/11169/ [14:46:35] https://gerrit.wikimedia.org/r/#/c/11175/ [14:46:45] I can do them probably [14:46:53] Ryan_Lane: I got https://gerrit.wikimedia.org/r/#/c/7083/ for ya [14:46:58] Ryan_Lane: which is a bad patch btw :-( [14:48:18] hashar: bad patch? [14:50:28] the idea was to replace instance id with instance id + instance name [14:50:38] but only did that at one place :/ [14:52:13] ah [14:52:35] Yeah, it would be good for all of the messages to show what was happening on what instance [14:52:44] Many of the messages are really generic [14:55:07] <^demon> Ryan_Lane: 11175 reviewed. [14:55:36] I don't see a review [14:56:05] <^demon> I meant reviewed + merged. Not inline review. [14:56:06] oh you meant 11169 :) [14:56:11] <^demon> Oh, yeah [14:56:12] <^demon> whoops [14:56:14] heh [14:56:18] thanks [14:56:52] I really would have though we would have a global message for "you need to be logged in" [14:56:54] I was so, so wrong [14:57:02] that's duplicated in a billion places [14:57:08] <^demon> Other one too :) [14:57:17] sweet. thanks [14:57:22] now I can update labsconsole :) [14:57:24] <^demon> There probably is something in output page. [14:57:33] I need to cherry-pick in the core change first, though [14:57:41] <^demon> outputpage->loggedInError(), or throw new LoggedInError or somesuch [14:57:46] ha [14:57:48] err [14:57:49] ah [14:57:55] that is what i am looking for :( [14:57:55] https://gerrit.wikimedia.org/r/#/c/11169/ [14:57:58] can't find any [14:58:06] <^demon> I honestly don't know what's going on in mediawiki these days :p [14:58:19] <^demon> Too much gerrit and python. [14:58:24] well, this is from ages ago [14:59:00] we could just add a simple NotLoggedIn exception [14:59:20] In MediaWiki? Isn't there one already? AssertEdit extension? [14:59:24] Or something. 
[15:00:14] bah chad merged it [15:02:16] yeah [15:02:18] assetedit [15:04:26] I have opened https://bugzilla.wikimedia.org/show_bug.cgi?id=37627 [15:08:34] Ryan_Lane: if you are in a review mood, I have made some changes to the apache configurations https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/apache-config,n,z [15:08:43] mostly housework [15:08:53] though one is about making /w/ to 301 to /w/index.php [15:13:23] * Ryan_Lane twitches [15:13:30] I'm not messing with apache config on a friday [15:14:30] apache configs are sorta like dns changes [15:14:33] seriously [15:14:43] one space somewhere, one bad redirect, entire squid cache is polluted [15:14:50] yes [15:14:58] I'd prefer to not have two outages this week [15:16:19] I fullly agree [15:16:39] I was barely making you aware of the existences of such changes :-D [15:16:46] heh [15:17:05] I am not even going to deploy them though I might be able to do so :D [15:50:16] !log updating dns for new cisco machines [15:50:21] Logged the message, RobH [15:50:41] Ryan_Lane: ^ so today we are setting the mgmt up, the systems are wired and racked, and leslie has the info for setting up the network when she comes online later [15:50:52] yay! [15:52:38] YAY! [15:53:44] Ryan_Lane: I have not allocated DNS or any of the mac info for installs, that I will leave to you once they are accessible. [15:53:59] New patchset: Ottomata; "Refactoring udp2log classes and defines." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11574 [15:54:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11574 [15:54:54] :( [15:57:43] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:57:43] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [15:58:21] New patchset: Ottomata; "Refactoring udp2log classes and defines." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/11574 [15:58:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11574 [16:00:12] New review: Ottomata; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11574 [16:00:42] mmmk, mark, i just made the changes you suggested (i like) for my udp2log puppet refactor [16:00:49] whenever you get a sec, would love another look over [16:15:18] drdee: you there? [16:15:25] sure am [16:15:38] oh, you should get in on this too, ottomata [16:15:49] so, I'm going to start throwing search logs at... something [16:15:57] sweet [16:16:04] should I throw them at the udp2log instances? (can probably only do one right now) [16:16:17] that sounds reasonable, ottomata what do you think? [16:16:28] I'm not sure if the filters on udp2log will accept them, is the thing [16:17:00] I guess what I'm asking is "where shall I point the hose?" [16:17:20] maybe stat1001? [16:17:24] ottomata, what do you think? [16:18:25] also, this already exists to catch the packets: https://svn.wikimedia.org/viewvc/mediawiki/branches/lucene-search-2.1/udplogger/udplogger.py?view=markup&pathrev=51097 [16:18:55] (whether or not it works or can handle the traffic, I can't say) [16:19:07] hmmm, interesting [16:19:15] probably can [16:19:20] worth trying at least [16:19:24] so you are having lucene send traffic to 8192? [16:20:16] each lucene instance can send packets of the form " ah ok [16:20:27] over udp? [16:20:31] you might as well just send to udp2log instance then [16:20:32] yeah [16:20:38] instead of having this middle man [16:20:38] so [16:20:44] udp2log itself will not drop any log lines [16:20:46] ottomata: it's a different format, I believe [16:20:49] oh, ok [16:20:53] so you just need a filter [16:20:55] it is only the filters that drop it [16:20:56] which yeah, is kinda weird [16:20:57] yes [16:20:57] and then it'll go...
somewhere [16:20:59] cool [16:21:00] so, hmmm [16:21:08] it might be better to start up a different udp2log instance for this [16:21:13] +1 [16:21:16] that way you don't mess with other people's logs [16:21:24] sure [16:21:24] i've got a change in that should make that really easy [16:21:24] waiting for review now [16:21:28] that's easy to do :) [16:21:32] https://gerrit.wikimedia.org/r/#/c/11574/ [16:22:22] also [16:22:28] how about! [16:22:28] oh oh oh [16:22:28] so [16:22:28] notpeter [16:22:38] was talking with Robla yesterday, and with dieds the day before [16:22:44] we all want to try out scribe for this stuff [16:22:53] do you need to do sampling or filtering of these logs? [16:22:59] or are you basically just trying to use a remote log server? [16:24:59] ottomata: that change looks legit to me, but I would like for mark to rubber stamp it. [16:25:18] ottomata: I am just doing what drdee asked :) [16:25:27] ah [16:25:31] you should ask him what he desires to do with them [16:25:39] ok [16:25:43] I make the firehose, not the results :) [16:25:44] ask me otto, ask me! [16:25:50] sosososo! [16:25:51] drdee [16:25:57] whatcha think about using scribe for this? [16:26:00] i was talking with robla yesterday [16:26:06] and he wants to do what you do, find a use case and start testing scribe out [16:26:09] notpeter: thanks for setting up the firehose! [16:26:21] we were talking about doing it for the nginx ssl logs [16:26:23] but this would be a safer thing to try it with [16:26:23] since it is new [16:26:38] we can probably also do both [16:26:43] i am perfectly fine with that, we would kill two birds with a single stone [16:26:44] das true [16:27:23] drdee, do you need to do sampling or filtering on these logs? [16:27:27] like, if you set up a logging instance, I'll point traffic at it, then we're getting data. can also set up scribe to do this, see how it performs [16:27:28] or do you just want to log them remotely?
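The filter model described above (udp2log itself never drops lines; each attached filter reads every log line on stdin and prints the ones it wants kept) can be sketched as a small pipe program. This is a hedged illustration only, not the linked udplogger.py or any production filter; the sampling rate and "lucene" match string here are invented for the example:

```python
#!/usr/bin/env python
# Sketch of a udp2log-style filter (illustrative, not production code):
# udp2log pipes every received log line to the filter's stdin, and
# whatever the filter writes to stdout is what actually gets logged.
import sys

def filter_lines(lines, sample_every=1, match=None):
    """Yield every Nth line that contains `match` (or every Nth line
    when no match string is given), mimicking a sampled udp2log pipe."""
    kept = 0
    for line in lines:
        if match is not None and match not in line:
            continue
        if kept % sample_every == 0:
            yield line
        kept += 1

if __name__ == "__main__" and not sys.stdin.isatty():
    # e.g. keep 1 in 1000 of the lines mentioning "lucene"
    for line in filter_lines(sys.stdin, sample_every=1000, match="lucene"):
        sys.stdout.write(line)
```

A filter like this would be wired in via the instance's config (something like `pipe 1000 /usr/local/bin/lucene-filter`; verify the exact syntax against the real udp2log config format).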
[16:27:36] i rather not sample [16:27:45] I think keeping it all will be reasonable [16:27:50] it's not super huge volumes [16:27:56] right [16:28:19] cool, perfect [16:28:24] then scribe is the perfect use case for that [16:28:25] so cool [16:28:30] New patchset: Jdlrobson; "mirror DeviceDetection.php (bug 33649)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11619 [16:28:39] ok, so i will bump up getting the scribe packages in our apt repo then [16:28:49] for now, notpeter, i guess we can spawn up a different udp2log instance [16:28:52] on oxygen I guess? [16:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11619 [16:29:00] whenever my change gets merged in [16:29:03] you should be able to do [16:29:15] ottomata: cool, sounds good! [16:29:42] udp2log::instance { 'lucene': log_directory => '/a/log/lucene' } (or wherever is appropriate) [16:29:43] notpeter, if you have some more spare cycles today, could you take a look at https://rt.wikimedia.org/Ticket/Display.html?id=3113 [16:29:51] oh and monitor => false [16:30:10] oh, weird, yeah. I can do that [16:30:27] drdee: unless there's some crazy backstory, but yeah [16:30:41] no crazy backstory :D [16:32:29] New patchset: Pyoungmeister; "send search logs to oxygen:51234" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11620 [16:32:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11620 [16:37:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11620 [16:37:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11620 [16:43:23] New patchset: Pyoungmeister; "some cleanup of search code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11621 [16:43:50] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11621 [16:45:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11621 [16:45:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11621 [16:46:06] peter, there's no udp2log instance on 51234 oxygen yet, right? [16:46:16] ottomata: correct [16:47:33] is that gonna be a prob? [16:47:42] the packets will just be dropped [16:48:24] plus, I need to restart lucene on each host for the change to take [16:49:09] ok cool [16:49:13] juuuust checkin [16:50:41] yep. legit [16:52:49] drdee: you can't log into stat1001 because it's not on [16:52:54] and probably hasn't been installed yet [16:53:02] so, there's a lot more going on there :) [16:53:06] you mean it is physically turned off? [16:53:29] when I log into its management console, I see nothing [16:54:15] New patchset: Ottomata; "/var/run has been moved to /run in Ubuntu Precise. Updating generic::mysql::server accordingly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11296 [16:54:15] well, no, I see a blinky cursor [16:54:19] so it might be turned on [16:54:22] but it's doing nothing [16:54:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11296 [16:55:53] New review: Ottomata; "OK great. Done in realm.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11296 [17:00:26] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [17:39:02] RobH: good morning [17:57:55] notpeter: so....what should happen to stat1001? [17:58:32] i was actually just about to start building it, unless notpeter was going to get it [17:58:32] I have no idea. [17:58:34] what is it for? [17:58:53] LeslieCarr: you seem to have some idea of what's going on, so go for it :) [17:59:31] Ryan_Lane: ping? 
[17:59:35] don't want to ruin your fun ;) [17:59:35] \ [17:59:48] ? [17:59:55] paravoid: ? [18:00:00] Ryan_Lane: did you manage to change wmflabs.org NS? [18:00:06] to what? [18:00:16] labs-ns0/1? [18:00:19] no [18:00:25] we're waiting till we have a secondary up [18:00:47] didn't you and mark decide to get rid of the NS CNAME? [18:00:54] yes. we did [18:00:56] it's not there anymore [18:01:04] it's an A record now [18:01:07] has been for most of the day [18:01:11] ah, okay [18:01:17] many resolvers are giving back proper results [18:01:17] how can I fix the SOA too? [18:01:30] awesome, except for our office resolvers [18:01:31] of course [18:01:47] andrew_wmf: ping? :) [18:01:56] paravoid: the SOA is broken? :( [18:02:01] hi [18:02:02] yes, I told you yesterday about it [18:02:14] oh you mean the email address? [18:02:16] hostmaster\@wikimedia.org. 20120612192602. 1800 3600 86400 7200 3600 [18:02:21] yes [18:02:26] \@ -> . [18:02:27] I'm not sure why there's a \ there [18:02:42] you need to make it into a dot [18:02:45] hostmaster.wikimedia.org [18:02:51] ah [18:02:52] ok [18:02:54] er [18:02:55] lemme fix that [18:02:55] and [18:02:57] wait [18:03:02] wtf [18:03:04] ? [18:03:16] you're lacking a field there too [18:03:32] whaaaaa [18:03:37] it should be [18:03:52] virt0.wikimedia.org hostmaster.wikimedia.org $serial ... [18:04:01] see the dot after the "serial"? [18:04:05] our serial currently is 1800 ... [18:04:55] New patchset: Lcarr; "putting in stat1001 as a precise machine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11642 [18:05:19] here's what's in LDAP: sOARecord: hostmaster@wikimedia.org 20120612192602 1800 3600 86400 7200 [18:05:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11642 [18:05:30] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11642 [18:05:31] that's wrong [18:05:33] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11642 [18:06:37] how so? SOA is defined as (sn ref ret ex min) [18:06:40] that's what we have... [18:07:09] or does the ldap schema define this differently? [18:07:14] it's [18:07:27] origin contact sn ref ret ex min [18:07:33] you're missing origin [18:07:47] primary hostmaster serial refresh retry expire default_ttl [18:07:51] so, now it's origin = hostmaster@wikimedia.org, contact = 20120612192602. (see the dot) [18:07:55] serial = 1800 [18:07:58] etc. [18:08:05] yes. [18:08:13] where's the "primary"? [18:08:16] I see [18:08:29] well, that's a bitch [18:08:39] I need to fix that in the code [18:08:45] 21:03 < paravoid> it should be [18:08:45] 21:03 < paravoid> virt0.wikimedia.org hostmaster.wikimedia.org $serial ... [18:08:52] or labs-ns0 or whatever you like [18:08:53] doesn't matter [18:09:02] * Ryan_Lane nods [18:09:04] also, why is the serial that big? [18:09:19] that's the normal way to define one? [18:09:35] using a date string is always a good idea [18:09:47] it's usually YYYYMMDDNN [18:09:51] the date strings I've used in the past end in the day + two digits. [18:09:51] where NN is a number from 00-99 [18:10:12] we have six digits [18:11:16] hah [18:11:17] motherfucker [18:11:29] or did you put the time there? 
[18:11:47] it does have the time [18:11:52] that won't actually break anything, though [18:12:02] the soarecord is actually right for some things [18:12:14] yes it will [18:12:17] serial is 32-bit [18:12:24] the number you have there is much bigger than 2^32 [18:12:26] ah [18:13:55] LeslieCarr: Heyas [18:13:57] the last part of SOA is nowadays negative TTL [18:14:01] but used to be minimum TTL [18:14:05] So we have not joined row C yet [18:14:13] cool [18:14:15] been waiting on you, the fiber is run, and the stacking cable is run [18:14:16] I wonder if not defining that has something to do with the excessive TTLs we were seeing [18:14:26] just have not pulled the d1 to d3 stacking cable yet [18:14:28] could be [18:14:31] well. no [18:14:37] are you fixing the SOA? [18:14:38] the ttl for the wmflabs domains are low [18:14:45] yes, I need to do so in a few places, though [18:14:52] great [18:15:00] then I should wait from asking Andrew to purge the office's cache [18:15:07] yes [18:15:18] RobH: cool [18:15:23] LeslieCarr: So when would you have some time for us to work on that? [18:16:21] i'll get started now [18:16:27] to fix up the switches [18:16:38] ok, the serials are working, and the row c are in a stack as well [18:16:40] Ymd + NN? [18:16:45] paravoid: ^^ [18:16:47] correct format? [18:16:52] so just lemme know what ya need when ya need it [18:16:54] YYYYMMDDNN [18:17:06] yep [18:17:09] that's the same [18:17:09] great. [18:17:20] NN is likely difficult [18:17:24] because I don't read the old record [18:17:39] another form that's being used when 100 updates are not enough [18:17:44] is epoch [18:17:47] just the epoch [18:17:57] it's less intuitive when troubleshooting [18:18:03] hm. epoch may be easier [18:18:07] but much easier to code for and has a bigger accuracy [18:18:47] seconds since the epoch? 
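The two problems being untangled here — an SOA record stored without its mname/origin field, which shifts every remaining value one place to the left, and a YYYYMMDDHHMMSS timestamp that overflows the 32-bit serial — can be demonstrated with a toy parser. This is a sketch for illustration only, not pdns's actual LDAP record handling:

```python
# Toy SOA parser illustrating the two bugs discussed above. Field names
# follow RFC 1035: mname rname serial refresh retry expire minimum.
SOA_FIELDS = ["mname", "rname", "serial", "refresh", "retry", "expire", "minimum"]

def parse_soa(rdata):
    """Split whitespace-separated SOA RDATA into named fields."""
    parts = rdata.split()
    if len(parts) != len(SOA_FIELDS):
        raise ValueError("expected %d SOA fields, got %d" % (len(SOA_FIELDS), len(parts)))
    return dict(zip(SOA_FIELDS, parts))

# The record in LDAP was missing mname entirely, so every value shifted
# by one (the contact was parsed as the serial, etc.) -- and it has only
# six fields where a full SOA needs seven:
broken = "hostmaster@wikimedia.org 20120612192602 1800 3600 86400 7200"

# The corrected record, with both FQDNs terminated by dots:
fixed = "virt0.wikimedia.org. hostmaster.wikimedia.org. 1339784844 1800 3600 86400 7200"

# Serial formats: a YYYYMMDDHHMMSS timestamp overflows the 32-bit serial
# field, while YYYYMMDDNN and a Unix epoch both fit with room to spare.
assert 20120612192602 >= 2**32   # broken: 14-digit timestamp
assert 2012061500 < 2**32        # fine: YYYYMMDDNN
assert 1339784844 < 2**32        # fine: seconds since the epoch
```

The epoch form trades human readability for never running out of per-day updates, which is the tradeoff paravoid describes above.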
[18:19:23] paravoid: hrm, do you have an idea why labsconsole.wikimedia.org would be showing properly in one office resolver and not in the other, when i had andrew restart pdns-recursor at the same time on both ? [18:19:42] they cached them at different times? [18:19:43] oh [18:19:52] they pulled from different resolvers [18:20:12] LeslieCarr: you mean you restarted both of them now? or some time in the past? [18:20:33] both of them about an hour ago [18:20:39] maybe 30 minutes ago [18:20:56] Ryan_Lane: So the mgmt is now up for the virt6-virt15, that being said, its production network isnt connected yet [18:21:03] thats what leslie is working on with us now [18:21:12] RobH: thanks [18:21:30] but if you wanted to connect to the mgmt's to pull mac NICS you can [18:22:09] paravoid: hm. can't use epoch now, can we? [18:22:15] why not? [18:22:18] serial is 1800 now. [18:22:19] the values currently used are higher? [18:22:20] ah [18:22:21] right [18:22:23] heh [18:22:25] :) [18:22:31] what Leslie said worries me [18:22:34] well, that's actually helpful for once [18:22:38] an hour ago things were supposed to be fixed [18:22:54] no. because we fixed the cname thing today [18:23:22] yes, more than an hour ago [18:23:47] could it be that some resolvers always cache NS for one day? [18:24:03] NS are fine? [18:24:41] wait. lemme check something [18:25:22] LeslieCarr: could you do a round of digs for labsconsole.wikimedia.org A & virt0.wikimedia.org for both of them? [18:25:27] New review: preilly; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/11619 [18:25:39] sorry for the trouble, it's hard to guess... :/ [18:26:08] New patchset: Jdlrobson; "mirror DeviceDetection.php (bug 33649)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11619 [18:26:41] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11619 [18:26:44] ns0, ns1, and ns2 all return the proper result for virt0 and labsconsole [18:26:52] just wanted to do a sanity check :) [18:27:13] ok. fixed the code [18:27:19] now to update the SOAs in LDAP [18:28:12] great, now both are broken [18:28:46] now that makes no sense at all [18:28:49] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/11619 [18:28:59] notpeter: can you deploy this https://gerrit.wikimedia.org/r/#/c/11619/ [18:28:59] paravoid: http://pastebin.com/R7gH0SP6 [18:29:26] soarecord: virt0.wikimedia.org hostmaster.wikimedia.org 1339784844 1800 3600 86400 7200 [18:29:35] paravoid: ^^ look correct? [18:29:41] no [18:29:57] you're missing final dots [18:30:05] virt0.wikimedia.org. hostmaster.wikimedia.org. [18:30:16] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11619 [18:30:17] notice the dots at the end of the fqdns [18:30:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11619 [18:30:22] !log rt 3113 complains stat1001 dns is bad and is why ssh doesnt work. dns is fine, so invalid reasoning, but system isnt ssh or ping responsive. [18:30:26] yes… I know [18:30:33] !log pretty sure it was never installed in the first place, rebooting to watch post and see [18:30:42] .......is morebots not about.... [18:30:45] preilly: do you need me to force a puppet run on the mobile varnish boxes? [18:30:46] so… the strange thing is the powerdns config guide for ldap leaves them out too [18:30:48] damn it. [18:30:54] labsconsole.wikimedia.org. 85967 IN A 208.80.153.135 [18:30:56] WTF?!?!?! [18:31:01] how? [18:31:08] seriously. wtf? [18:31:18] ok.
I'm switching the ns records to use labs-ns0 and labs-ns1 [18:31:21] notpeter: yes please [18:31:23] something truly fucked up is going on [18:31:28] it can't be it [18:31:32] that's the old IP too [18:31:41] and I want to just get rid of the other entries [18:32:08] well, no. there's still a problem, then [18:32:19] because people won't be able to get to labsconsole [18:32:22] LeslieCarr, andrew_wmf: can I get "grep -v ^# /etc/powerdns/recursor.conf |grep -v ^$" from the two boxes? [18:32:41] Ryan_Lane: maybe until we sort this out we should readd the old IP to virt0? [18:32:48] LeslieCarr: ? [18:32:53] I mean, things that we don't yet understand are happening and cause outages [18:32:54] that seems like a good idea [18:32:59] it's in a different vlan, though [18:33:07] just a sec [18:33:11] paravoid: don't have direct access [18:33:14] LeslieCarr: can you add the system into both vlans? [18:33:20] we should sort this out without having the pressure of an outage [18:33:23] yes [18:33:26] agreed [18:33:31] it's gone on long enough [18:33:33] !log morebots, dont leave me again! [18:33:37] andrew_wmf: thanks a lot and sorry for involving you into something that's our problem [18:33:38] Logged the message, RobH [18:33:41] ok, can you set up virt0 to accept tagging ? [18:33:55] paravoid: mind handling that? [18:34:04] !log stat1001 isnt responsive to ssh, not sure it was ever installed, rebooting [18:34:11] RobH: it wasn't installed [18:34:23] then why the hell are folks complaining [18:34:25] RobH: about to install [18:34:27] i don't know [18:34:28] ottomata: ;P [18:34:32] LeslieCarr: we have eth1 [18:34:33] woosters: ^ [18:34:37] which isn't being used on virt0 [18:34:45] no reason to mess with tags at this point [18:34:47] RobH: i just set up the dns and stuff [18:34:51] yea, robH [18:35:02] sigh [18:35:13] LeslieCarr: so, can you just map the old VLAN to virt0's eth1?
[18:35:22] ok, folks are tasking me on dns stuff when you are already working on it [18:35:45] sure i can do that [18:37:09] preilly: done [18:37:14] notpeter: thanks! [18:37:19] no prob [18:37:31] paravoid: soarecord: virt0.wikimedia.org. hostmaster.wikimedia.org. 20120218233655 1800 3600 86400 7200 [18:37:33] correct now? [18:37:38] LeslieCarr: remind me of netmask/gateway [18:37:44] excluding that it isn't using the epoch [18:37:51] heh, I was about to say that :) [18:38:11] fix the serial and it's fine [18:38:16] ok [18:38:25] it's a /26 and 208.80.153.129 [18:39:53] so, RobH/cmjohnson1 can you cable port 28 to eth1 on virt0 ? [18:40:01] and do you want me to make a ticket for that request ? [18:40:28] wmflabs.org. 3600 IN SOA virt0.wikimedia.org. hostmaster.wikimedia.org. 1339784844 1800 3600 86400 7200 [18:40:50] so the system needs two network connections? [18:41:02] RobH: I still can't ssh into stat1001 [18:41:07] it should already have a network connection [18:41:09] drdee: its not installed [18:41:18] but its not dns, dns is fine. [18:41:22] it needs the OS. [18:41:45] Leslie was working on it I think she said, but I dunno how seems shes working on a dozen things right now [18:42:15] I think we should handle the outage, rather than stat1001 [18:42:20] drdee: it's not installed yet [18:42:30] drdee: so of course we can't ssh into it [18:42:39] Ryan_Lane: btw, andrew_wmf just pasted me pdns's config in private and it's all fine [18:42:45] RobH: yes needs 2 net connections [18:42:48] yeah. I imagined it was [18:42:51] sigh [18:42:52] LeslieCarr: understood [18:42:53] well, racktables is wrong [18:43:04] or the server is [18:43:06] virt0 should have two connections. does it not? [18:43:10] Ryan_Lane: was virt1 mobile1?
[18:43:15] ah [18:43:16] right [18:43:17] yes [18:43:23] maybe it never got the second one added [18:43:26] ok, its labeled wrong in the datacenter, but racktables is fine [18:43:32] ok [18:43:47] i can add it, it will be a few minutes, i have to go get label and such [18:44:27] ok, so virt1's 2nd ethernet port is now in the old vlan as soon as it gets configured [18:46:30] LeslieCarr: done [18:47:19] paravoid/RobH I don't see an ipv4 addy on eth1 on virt0 ? [18:47:27] look again [18:47:28] i mean /Ryan_Lane [18:47:31] yay [18:47:41] so, [18:47:43] it doesn't work [18:47:45] and I know why [18:47:48] and I'm not sure if you can help it :) [18:48:01] so, the box is currently multihomed [18:48:09] it can only have one gateway though [18:48:24] that means that currently, the replies with the old IP are going to the gateway on eth0 [18:48:34] the router is probably dropping them, because of uRPF [18:48:37] or something similar to that [18:48:43] can we disable that temporarily? [18:49:23] yeah [18:50:43] !log doing a git pull for OpenStackManager on virt0 [18:51:04] hm. no bot [18:51:24] deactivated on both now .... [18:51:25] deactivated on both now …. [18:52:29] hrm [18:53:20] morebots: welcome back [18:53:22] !log doing a git pull for OpenStackManager on virt0 [18:53:27] Logged the message, Master [18:53:33] !log deactivated rpf-filter on cr1-sdtpa and cr2-pmtpa temporarily for virt0 [18:53:38] Logged the message, Mistress of the network gear.
[18:54:15] notpeter: created a new ticket for stat1001: https://rt.wikimedia.org/Ticket/Display.html?id=3120 [18:54:20] paravoid: look how amazingly stupid this is: https://gerrit.wikimedia.org/r/#/c/11662/1/OpenStackNovaDomain.php,unified [18:54:22] !log updating dns for educacao redirect [18:54:26] Logged the message, RobH [18:54:36] when I generate the SOA it's correct [18:54:41] when I update the SOA it's not [18:54:48] fail [18:55:20] well, I should say, when I make the initial SOA it's correct, and updates are not [18:55:27] which is why some were correct and others weren't [18:57:18] LeslieCarr: sure? [18:57:23] who feels like approving https://gerrit.wikimedia.org/r/#/c/11574/ ? [18:57:30] i'm sure it was deactivated [18:57:30] LeslieCarr: still does not work [18:58:20] I see packets leaving eth0 to the gateway [18:58:58] same routers btw? [18:59:58] yeah, it's going to cr1-sdtpa and cr2-pmtpa [19:01:18] any ideas? [19:01:27] I can always do policy routing on the host but it's nasty… [19:02:03] hrm [19:02:06] yeah [19:02:19] so i am definitely seeing it try to send out replies on eth0 [19:02:23] but failing [19:03:08] let me put a firewall filter to count packets coming out from the interface [19:08:37] paravoid: are you also a JunOS addict? 
:-D [19:09:01] hashar: I do know my way around ios junos and some other networking stuff, yes [19:09:17] LeslieCarr: I'll just do policy routing, no reason to spend more time with it [19:09:22] after the frenchcabal-l we need a networknerds-l mailing list :D [19:09:42] hrm [19:09:44] that's strange [19:09:53] the firewall filter i put on isn't logging any packets [19:09:55] :-/ [19:10:10] it should be logging anything with a source or dest going via the default gateway of eth1 [19:10:16] i mean going via eth1's address [19:10:23] but on eth0's subnet [19:11:39] LeslieCarr: done [19:11:56] well yay [19:12:00] ip route add default via 208.80.153.129 table 200 [19:12:00] ip route add 208.80.153.128/26 dev eth1 table 200 [19:12:00] ip rule add from 208.80.153.128/26 lookup 200 [19:12:03] jfwiw [19:12:08] cool [19:12:27] now I "just" have to make dns, apache and whatnot listen on that IP too :/ [19:12:32] yeah [19:12:36] that should be fun [19:12:43] Ryan_Lane: I'm stopping puppet [19:12:48] * Ryan_Lane nods [19:13:08] ok, we need to make dns listen on eth1 as well [19:13:22] yes, see above :) [19:13:25] not just dns either... 
[19:13:46] dns done [19:13:55] dns and apache should be sufficient [19:14:30] apache listens on inaddr_any and has no special vhost statements [19:14:31] should work [19:14:49] drdee: it looks like there won't be any stat1001 until robh gets back to eqiad --- the powercycling is failing [19:14:52] every other service is internal [19:15:16] !log adding pre-renumbering virt0's IP back on eth1; doing policy routing to work out multihoming [19:15:21] Logged the message, Master [19:15:37] !log virt0: modify pdns.conf to listen on the old IP; temporarily disable puppet [19:15:42] Logged the message, Master [19:15:53] dns is working on old ip [19:16:04] so, we should be good till we can figure out what the hell is wrong [19:16:13] hopefully [19:16:37] lemme send an update to labs-l [19:16:53] wmflabs may still have negative caches for wmflabs though [19:17:04] those are 1 hour [19:17:05] andrew_wmf: can you do me one last favor and restart both pdns recursors? [19:17:33] reportcard.wmflabs.org. 3600 IN A 208.80.153.208 [19:17:38] seems to work for me now [19:17:48] woosters: great, thanks [19:17:51] i can get to labsconsole using office dns [19:18:07] good. hopefully with the changes we've made, this will correct itself [19:18:15] we won't really know for a while [19:19:03] hm [19:19:14] I wonder if one of our auth servers is serving stale records on one of its threads [19:19:18] and that's causing everything [19:19:40] we could restart each server one by one [19:20:48] that said, it should be easy enough to spot, or things wouldn't be this fucked up [19:21:41] I just made a ton of requests for labsconsole.wikimedia.org on each, and they all return correctly [19:22:03] so, we're going to go do dim sum... [19:22:13] yea, just had erik test them as well [19:22:16] wow [19:22:19] wtf [19:22:19] Ryan_Lane: me too [19:22:22] what? [19:22:22] ;; ADDITIONAL SECTION: [19:22:22] virt0.wikimedia.org. 64402 IN A 208.80.153.135 [19:22:22] labsconsole.wikimedia.org.
86399 IN A 208.80.153.135 [19:22:27] what? [19:22:28] where?! [19:22:32] on which resolver ? [19:22:49] that's from a resolver my virtual machine is hitting [19:22:55] I don't understand that at all [19:22:58] on which query? [19:23:07] dig wmflabs.org ns [19:23:13] fucking glue records [19:23:18] I said that yesterday, didn't I? [19:23:25] ARGH [19:23:36] I said it yesterday then discarded it [19:24:16] AAAAAARGH. [19:24:29] fcking afilias [19:24:34] so, [19:24:43] !log updating dns for ms-be12 mgmt [19:24:48] Logged the message, RobH [19:24:51] did you get access to the domain registrar? [19:24:55] yes [19:25:02] okay, go in their panel and fix this [19:25:05] want to change to labs-ns0 and labs-ns1? [19:25:08] no [19:25:14] I just want to fix glue records [19:25:29] which shouldn't be there [19:25:51] that explains everything [19:26:40] well, markmonitor is slow as hell for me right now [19:29:07] so, what am I looking for again? [19:30:03] paravoid: ? [19:30:05] if it doesn't have a section called "glue records" or "A records" or something like that [19:30:09] no [19:30:11] then just change the nameservers [19:30:16] and re-add labsconsole and virt0 [19:30:23] or labs-ns0/1, whichever you prefer [19:30:37] I'll keep the same for now [19:30:55] I'll switch the order [19:31:09] okay [19:31:14] I'll reply to the incident report [19:31:18] on ops [19:31:27] do you want me to do labs-l too or are you handling this? [19:32:03] hm. that didn't work [19:32:10] it kept the order [19:32:16] shitty web tool [19:32:30] you can update labs-l too [19:32:56] k [19:33:11] ok. removed one, then added it back in [19:33:14] that should update it :) [19:35:18] seems it actually did update it, but it always displays them in alphabetical order [19:35:27] which is slightly full of fail [19:35:41] LeslieCarr: btw, no reason to keep urpf off [19:38:05] what DNS needs is a "hey, I've fucked up" broadcast feature of some sort [19:40:25] can I mail both lists with one mail? 
[19:40:37] or should we not leaking the ops list into labs? [19:40:51] *be leaking [19:40:54] bcc? [19:41:28] gah, I'm going to break threading [19:41:41] heh [19:43:01] is the dns issue better now? [19:43:04] I'll just send two mails [19:43:11] andrew_wmf: yes, I just Cc'ed you on the updated mail [19:43:26] Ryan_Lane: btw, I still see the old entries in my dig [19:43:30] "org" NS may take a while to update [19:43:35] * Ryan_Lane nods [19:43:41] usually a day [19:44:41] no, more often for sure [19:44:55] DAMN IT. I said "glue records" yesterday, didn't I? [19:44:57] argh. [19:45:12] I heard nothing of the such [19:45:14] ; [19:45:15] ;) [19:45:56] well [19:46:07] a plus is that we fixed all the other problems while we were at it [19:46:36] none of which were likely causing the issue, but it's nice that it's correct now [19:46:52] okay [19:46:56] and monday we'll make a secondary ;) [19:46:56] I have to leave now [19:46:59] same [19:47:28] friends are waitin for me for an hour and a half :( [19:47:37] have fun [19:47:40] oh [19:47:40] wow [19:47:41] sorry [19:47:56] heh, no reason to be sorry for [19:47:59] I'm glad we fixed this [19:48:08] same [19:49:44] as I also said, the CNAME as a NS as the root of the problem was not convincing me :) [19:49:47] heh [19:49:53] dammit, I should have been more persistent [19:50:04] :-) [19:54:29] hey maplebed, you around? [19:54:41] I have a puppet/monitoring Q and I think you might know some stuff [19:55:32] or someone else maybe knows [19:55:32] so [19:55:46] diederik has a cron job that runs once a day on stat1 [19:55:49] generating gerrit stats [19:55:59] i want to monitor that it is working [19:56:06] i just need to check a file once a day [19:56:17] and ensure that there is an entry at the bottom of the file for the previous day [19:56:23] each entry has a date [19:56:25] so that is pretty easy [19:56:33] but, I am unfamiliar with nagios/ganglia [19:56:38] not sure which I want to use (both?)
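The check ottomata describes here — confirm the once-a-day gerrit-stats file gained an entry for the previous day — fits the standard Nagios plugin contract: print one status line, exit 0 for OK and 2 for CRITICAL. A sketch under assumptions: the default path and the leading ISO-date line format are guesses, not the real gerrit-stats output.

```python
#!/usr/bin/env python
# Sketch of the custom Nagios/NRPE check discussed here: verify that the
# last line of a daily stats file is dated yesterday. The file path and
# the leading YYYY-MM-DD date format are assumptions about the real file.
import datetime
import sys

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

def check_freshness(path, today=None):
    """Return an (exit_code, message) pair for the given stats file."""
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        last = lines[-1]
    except (IOError, IndexError):
        return CRITICAL, "CRITICAL: %s is missing or empty" % path
    if last.startswith(yesterday.isoformat()):
        return OK, "OK: found entry for %s" % yesterday
    return CRITICAL, "CRITICAL: no entry for %s (last line: %r)" % (yesterday, last[:60])

if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = check_freshness(sys.argv[1])
    print(message)
    sys.exit(code)
```

Run under NRPE with the file path as its only fixed argument; since nothing variable is passed over NRPE, this also sidesteps the dont_blame_nrpe argument-passing concern raised in the channel.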
[19:56:43] ottomata: make a custom nagios check if a check doesn't already exist [19:56:46] or how to use them [19:56:56] nagios check, ok checking about nagios checks [19:57:10] see the checks that are defined [19:57:13] it's in a nagios template [19:57:23] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:59:23] RobH: do you mind opening an account for me too, in case something happens during the weekend? [19:59:35] RobH: markmonitor account that is [19:59:40] !log moving asw and msw-d3-sdtpa from single to dual power [19:59:45] Logged the message, RobH [19:59:50] paravoid: i rather we just page either myself, mark, or ryan in that case [20:00:27] only cuz its a single one off account mgmt via an account manager [20:00:36] its not an access list we control directly [20:00:58] uhh okay [20:01:34] we arent changing anything on markmonitor for outage resolution is my understanding [20:01:43] yes we do [20:01:44] we are going to only change it once more once we add the new nameservers on monday [20:02:01] well, Ryan did change it already, but I'm not sure if it worked [20:02:04] we are pointing it to a different nameserver? [20:02:06] there's a lag between cause and effect [20:02:13] right, but changing it again wont help... [20:02:15] no, but they are caching glue records (= IPs) as well [20:02:49] this change hasn't yet got effect, since org updates take a while [20:03:02] so, I'm wondering if it's actually fixed [20:03:09] it may not be [20:03:19] but we won't know until tomorrow [20:03:33] right [20:03:49] or earlier, but still, not now [20:03:52] glue records definitely seem correct, though. based on the behavior...
[20:03:56] they make sense [20:04:12] paravoid: if ya think ya need it, then you can have it, but ct has to approve it, since it affects a lot of things [20:04:19] ct had previously approved ryan's access [20:04:23] (on godaddy) [20:04:36] RobH: I don't need it as long as you or Ryan are able to drop by tomorrow and have a look [20:04:41] to make sure everything's okay [20:04:42] I'll be around [20:04:44] i assumed Ryan_Lane was on this [20:04:48] great [20:04:50] im going on a boat [20:04:55] ^_^ [20:05:00] (will have air and mifi) [20:05:00] RobH: with flippy floppies? [20:05:04] I didn't know the rules, and I knew I'd be here tomorrow, so that's why I asked [20:05:20] Ryan_Lane, I don't see anything helpful yet [20:05:36] re checkcommands [20:05:48] then you'll need to add a new one [20:05:51] i'm a nagios noob, ok [20:05:56] should I be using nrpe? [20:05:59] leaving now [20:06:03] good night [20:06:05] or evening [20:06:16] as long as you don't pass nrpe args, it's fine [20:06:18] paravoid: night [20:06:22] Ryan_Lane: i got my swim trunks [20:06:42] * RobH listens to song now [20:06:49] I think you'll have to use nrpe [20:07:14] so is the outage resolved? [20:07:21] can i resume harassing LeslieCarr about row C? [20:07:57] !log moving asw and msw-d3-sdtpa from single to dual power again, got sidetracked [20:08:02] Logged the message, RobH [20:08:47] RobH: yeah [20:08:49] Ryan_Lane: I just saw the serial of .org getting bumped [20:08:52] but no changes [20:08:56] :( [20:09:01] $ dig +norec +short @b2.org.afilias-nst.org org SOA [20:09:01] a0.org.afilias-nst.info. noc.afilias-nst.info.
2010100976 1800 900 604800 86400 [20:09:09] it was 972 before [20:09:24] anyway, really leaving :) [20:09:30] see ya [20:10:53] PROBLEM - Host ps1-d3-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [20:11:38] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.64 ms [20:41:00] !log updating dns for mc1-mc16 mgmt [20:41:06] Logged the message, Master [20:42:00] RobHalsell: hey [20:42:35] ? [20:42:49] LeslieCarr: sup? [20:43:14] party time [20:43:26] so, i want to actually add the switches one at a time [20:43:30] according to the documentation [20:45:59] iirc asw-c3 is connecting via a stacking cable to asw-d3 ? [20:47:43] heh i got that backwards [20:47:47] c1-d1 [20:48:36] RobHalsell: I also want to catch you before you pumpkin today, if any time remains after switching. [20:48:45] would you ping me if there is? [20:48:54] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [20:48:58] so how about we disconnect c1 completely, then remove the cable between d1 and d3 [20:49:39] RobHalsell: FYI I am going by this procedure - http://www.juniper.net/techpubs/en_US/junos/topics/task/installation/virtual-chassis-ex4200-member-adding-cli.html [20:50:47] cool [20:51:29] so first let's remove the stacking cable between d1 and d3, next let's unplug c1's power AND stacking cables, then we'll cable c1 to d1, then we turn on c1 [20:51:50] i want to make sure that c1 is not connected to either power or any other switches [20:53:34] yay [20:54:10] cool, we're single homed right now :) [20:54:22] let's connect c1 to d1 via the stacking cable [20:55:07] let's turn on c1 switch [20:55:26] * AaronSchulz braces for an outage [20:56:11] hehe [20:56:16] * Reedy boots AaronSchulz [20:56:19] well i'm monitoring chassisd and messages [20:56:28] !log attaching asw-c1-pmtpa to asw-d-pmtpa ring [20:56:33] Logged the message, Mistress of the network gear. 
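The SOA check paravoid runs above (`dig +norec +short @b2.org.afilias-nst.org org SOA`, serial 2010100976 where it was ...972 before) can be turned into a small propagation test: query each authoritative server and see whether they all agree on the serial yet. A sketch, with the comparison separated out so it needs no network; the server names are just the ones quoted in the log:

```shell
# Sketch: has a zone update propagated to all authoritative servers?
# The answer is "yes" once every server reports the same SOA serial.

all_serials_match() {
    # Reads one serial per line on stdin; succeeds only if all are equal.
    [ "$(sort -u | wc -l)" -eq 1 ]
}

soa_serial() {
    # Field 3 of "dig +short ... SOA" output is the serial.
    dig +norec +short @"$1" "$2" SOA | awk '{print $3}'
}

# Usage (requires network):
#   for ns in a0.org.afilias-nst.info b2.org.afilias-nst.org; do
#       soa_serial "$ns" org
#   done | all_serials_match && echo propagated || echo "still lagging"
```

Note this only shows the delegation data is consistent; as discussed above, resolvers may go on serving cached glue records (the nameserver IPs) until their TTLs expire, so a matching serial does not mean every client sees the change yet.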
[20:56:42] logged just in case ;) [20:57:32] * AaronSchulz chains Reedy to the desk and forces him to do cr [20:58:25] so the switch has come up [20:58:41] however not joining the VC [20:59:05] hrm, i wonder/think it may be due to software mismatches [20:59:14] cmjohnson1: have a thumb drive ? [20:59:57] we have a couple here [21:00:30] maplebed: whats up? its packing up time here. [21:00:37] its friday man! [21:00:50] RobHalsell: yeah, I figured when i didn't ping you this morning I was up a creek. [21:01:05] early next week then. swift storage nodes and SSDs. [21:01:21] yep [21:04:42] so right now the software is mismatched, i think that's preventing the virtual chassis from forming [21:04:46] i'm uploading new software [21:04:55] seeing if it's connected enough to install it on the chassis [21:07:47] RobHalsell/cmjohnson1 in my home directory on bast1001 is a file called jinstall-ex-4200-10.4R3.4-domestic-signed.tgz -- can you download it to the usb stick and plug that into the ex4200 ? [21:09:56] thanks [21:17:01] cmjohnson1: woot, slowly slowly trying to install :) [21:25:30] cmjohnson1: woot asw-c1 is rebooting [21:28:14] let's see what happens with this :) [21:28:17] still rebooting... [21:28:21] they take quite a while [21:28:31] oh doh, it's 10.4r6 not 10.4r3 [21:28:37] let's see if it works anyways (same major version) [21:30:00] hah, rebooting again to complete reinstallation for real this time [21:30:10] can you pull the usb stick ? [21:30:48] seriously, i am going to introduce a top of rack switch that reboots in 60 seconds and make billions [21:31:33] i'd have shrines in every dc ? [21:32:00] oh [21:32:08] ooo it's seeing the interfaces [21:32:08] :) [21:32:12] can you put the stick in c2 ? [21:33:36] is asw-c2 powered on ? 
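The problem being debugged here is the one the Juniper procedure linked above warns about: an EX4200 will not join a virtual chassis if its Junos release doesn't match the existing members, so each new member has to be upgraded before its stacking cables are attached. On the switch CLI that diagnosis and fix look roughly like the following (a sketch from Juniper's documented EX-series commands; the image path assumes the file was copied off the USB stick to /var/tmp):

```
show version                 # confirm the Junos release on this member
show virtual-chassis         # list members and their VC status
request system software add /var/tmp/jinstall-ex-4200-10.4R3.4-domestic-signed.tgz reboot
```

The trailing `reboot` is why each switch disappears for several minutes after the install kicks off, matching the long "still rebooting..." waits in the log.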
[21:34:25] ok, so i don't want the stacking cables connected to asw-c2, but want the switch powered on [21:34:29] so i can upgrade, then we attach it [21:34:33] then same thing with asw-c3 [21:34:42] then we can do the c3-d3 connection [21:34:48] hrm, serial plugged in and everything ? [21:35:38] yeah, why don't you plug in the usb stick to c3 so i can start it installing while we troubleshoot ? [21:37:01] oh grrrr [21:37:05] can you copy that file on again ? [21:37:13] for some dumb reason junos decided to delete the copy [21:37:17] (don't ask me why) [21:39:13] the jinstall file ? [21:39:16] i only see the jloader file [21:40:37] yeah, the switch removed the jinstall file after copying it over… because junos is stupid :( [21:40:46] do you have an account on fenari ? [21:40:55] let me put it in your home dir [21:42:21] ok, it's in your home directory now [21:42:31] on fenari [21:44:00] cool, can you put it on the usb disk ? [22:04:04] thanks [22:05:20] cmjohnson1: i don't see it yet ? [22:05:23] maybe reseat it ? [22:08:56] hrm [22:09:10] one last reseat ? [22:09:21] i sort of saw it get inserted but is erroring... [22:12:06] rebooting that switch [22:18:11] woot, installing on c3 finally [22:18:18] the reboot fixed it [22:18:49] hehe [22:18:55] have you tried turning it off and on ? [22:19:20] hehe [22:19:41] grrr, so it removed the file from the usb stick again, can you recopy it to the stick and then plug it in to c2 ? [22:20:11] hehe [22:23:39] ok, yay, so close :) [22:25:30] ok, pull the usb stick [22:26:03] :) when it's done with the install upon reboot, i'll shut it down, ask you to pull power, and then connect it via a stacking cable to c1 [22:29:47] still installing.... 
[22:31:58] hey asher, are you aware of this technique "Hardened stateless session cookies", see http://t.co/PnABtktT [22:32:08] i meant binasher ^^ [22:32:24] for a second i read that as hardened stainless steel cookies [22:32:31] which seem like a not very delicious idea [22:33:01] lol [22:34:41] still installing... [22:34:45] it's sort of sneaky now [22:34:59] it seems like it installs quicker… because half of the installation is after rebooting [22:36:16] cmjohnson1: yay [22:36:32] can you unplug asw-c2, and then plug its stacking cable into asw-c1 please ? [22:37:39] yep [22:37:52] oh yes, i do [22:37:53] thank you [22:38:04] just want to add c2 [22:38:21] power on c2 now [22:41:13] see the powering :) [22:42:12] awesome, it's trying to join the ring now [22:43:50] taking a while... [22:44:52] hrm [22:44:57] something else slow may be starting .... [22:45:32] i think we need to reinstall c1 [22:47:22] on the good side, i can do that via the cli with no usb stick :) [22:56:55] cmjohnson1: have you touched the c1 to d1 connection ? [22:57:18] oh nm [22:57:19] it's up now [22:57:21] just slooooow [22:58:36] ok, now rebooting c2 [22:58:42] which hopefully this time will attach via c1 [22:58:50] and then we can attach c3 and then reconnect the ring [23:02:01] anyone around that has access to markmonitor? :) [23:02:19] did RobH leave already? [23:02:23] dammit [23:02:38] Ryan_Lane: ping? [23:02:47] Ryan_Lane: the NS haven't changed and the serial has been bumped a few times [23:03:31] New review: Reedy; "Blame Pediapress..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/11161 [23:05:52] cmjohnson1: yay [23:06:01] c2 is happy [23:06:11] attach c2 to c3 ? [23:06:26] and unplug c3 [23:07:32] yay [23:07:36] power on c3 please [23:08:26] ok, now let's wait for another hour [23:08:33] i'm sure this is how you wanted to spend your friday night ;) [23:12:51] yay, c3 is almost up... 
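For context on the "Hardened stateless session cookies" link above: the underlying idea is a session cookie that carries its own authenticated state, so the server needs no session table. The hardened scheme in the paper additionally folds in an iterated hash of a per-user secret so that a leaked server database still doesn't let an attacker forge cookies; that extra step is omitted in the plain HMAC half sketched below (key and payload are made-up examples, not anything from Wikimedia's setup):

```shell
# Illustrative sketch of plain HMAC-signed stateless session cookies:
#   cookie = payload "." HMAC-SHA256(server_key, payload)
# The server verifies the MAC instead of looking the session up in a DB.
KEY="server-side-secret"   # example only; a real key comes from secure config

sign_cookie() {
    payload="$1"
    mac=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $NF}')
    printf '%s.%s' "$payload" "$mac"
}

verify_cookie() {
    cookie="$1"
    payload="${cookie%.*}"   # everything before the last dot
    mac="${cookie##*.}"      # everything after the last dot
    expected=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$KEY" | awk '{print $NF}')
    # NB: production code must use a constant-time comparison here.
    [ "$mac" = "$expected" ]
}
```

A shell sketch is obviously not how you'd ship this; it just shows the data flow that the paper then hardens against server-side key/database compromise.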
[23:12:53] getting there [23:15:50] can you reseat the cable between c2 and c3 ? [23:16:47] when you're done i'll reboot [23:17:19] rebooting now [23:21:43] yay it's adding its interfaces now [23:22:03] sweet, now we're almost ready to plug in c3 to d3 [23:22:54] ok, plug in c3 to d3 please [23:26:06] hrm [23:26:34] so i see it coming up yet don't see it actually put into the ring [23:26:45] Hey ops folks, could someone check the nginx error logs on the SSL termination proxies? Someone over in -tech says they got a 500 from nginx [23:27:02] Or well, that someone is here actually :) it's Thehelpfulone [23:27:16] replace {{/user|3834619|MacMed }} with {{/user|3834681|MacMed }} [23:27:43] RoanKattouw: ^ try that, I don't know if that could actually cause an issue, just that edit [23:27:49] changing the number and adding a space [23:28:45] Thehelpfulone: which cluster are you hitting? [23:28:57] i.e. do an en.wikipedia.org [23:29:05] this is on meta-wiki, the error message is 500 Internal Server Error nginx/0.7.65 [23:29:24] okay, do a "host meta.wikimedia.org" please [23:29:32] paravoid: this is windowz [23:29:35] cmjohnson1: ahha [23:29:44] it's not showing as up on 1/2 on c3 [23:29:44] "nslookup meta.wikimedia.org" then [23:29:53] in cmd? ok [23:30:12] yes [23:30:17] cmjohnson1: is it in the first or 2nd slot ? [23:30:25] cmjohnson1: if it's in the first, can you move it to the second slot ? [23:31:11] paravoid: http://pastebin.com/YAL9ppdn [23:32:54] yay!!!! [23:32:55] it's all up [23:33:03] we have a working, redundant switch ring!!! [23:33:07] * LeslieCarr pops the champagne! [23:34:52] you're free to enjoy your weekend :) [23:35:26] thank you [23:35:40] :) [23:36:06] paravoid: did you find what you were looking for from that? 
[23:36:52] I did, thanks [23:37:14] sure [23:52:23] so yeah the edit works on http:// but not https:// [23:56:57] "2012/06/15 23:25:12 [crit] 14310#0: *228018368 pwrite() "/var/lib/nginx/body/0000085479" failed (28: No space left on device), client: 0.0.0.0, server: *.wikimedia.org, request: "POST /w/index.php?title=Identification_noticeboard&action=submit HTTP/1.1", host: "meta.wikimedia.org", referrer: "https://meta.wikimedia.org/w/index.php?title=Identification_noticeboard&action=edit" [23:57:38] anyway, SSLs are running out of space [23:58:32] ssl3001 is 100% full, ssl3002 & ssl3003 are 97% [23:59:28] how *crazy* is our partitioning? [23:59:43] 9.2GB out of 250GB disks? wtf?
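The failure mode at the end of the log is worth spelling out: nginx spools large POST bodies to disk (under /var/lib/nginx/body here) before proxying them, so when that filesystem fills up, `pwrite()` fails with ENOSPC and the client gets a 500 even though the backends are healthy. A sketch of a check that would have caught ssl3001 before it hit 100% (the path and 90% threshold are illustrative; the comparison is factored out so it can be tested without `df`):

```shell
# Sketch: warn when the filesystem holding nginx's client-body buffer
# directory is nearly full, before POSTs start failing with ENOSPC.

pct_used_exceeds() {
    # $1 = usage as printed by df, e.g. "97%"; $2 = integer threshold.
    [ "${1%\%}" -gt "$2" ]
}

check_body_dir() {
    dir="${1:-/var/lib/nginx}"
    used=$(df -P "$dir" | awk 'NR==2 {print $5}')
    if pct_used_exceeds "$used" 90; then
        echo "CRITICAL - $dir filesystem at $used"; return 2
    fi
    echo "OK - $dir filesystem at $used"; return 0
}
```

Dropped into the Nagios/NRPE plugin shape discussed earlier in the log, this turns "ssl3001 is 100% full" from a post-mortem discovery into an alert.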