[00:05:02] PROBLEM - Host mw1143 is DOWN: PING CRITICAL - Packet loss = 100% [01:33:17] http://wikipedia.geo.blitzed.org/ finally stopped working [01:33:22] The end of the world :P [01:33:44] "can't determine language" PHP fatal error in /usr/local/apache/common-local/multiversion/MWMultiVersion.php line 342: [01:34:03] it is still mentioned in the apache cofnigs [01:37:07] what is that? [01:37:13] http://noc.wikimedia.org/conf/main.conf [01:37:32] ServerAlias *.wikipedia.org [01:37:33] ServerAlias wikipedia.geo.blitzed.org [01:38:05] some obscure backup domain it seems [01:38:24] http://www.gossamer-threads.com/lists/wiki/wikitech/15495 [01:41:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 275 seconds [01:44:35] wikipedia.geo.blitzed.org is an alias for cache.wikimedia.org. [01:44:36] cache.wikimedia.org is an alias for wikimedia-lb.wikimedia.org. [01:45:51] wow, 8 years ago [01:45:56] yeah [01:46:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [02:43:54] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Tue May 29 02:43:52 UTC 2012 [03:13:00] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [03:13:00] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [03:13:00] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [03:13:00] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [03:55:00] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [04:20:57] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:16:52] New patchset: ArielGlenn; "fix wrong indentation so config is initialized properly" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9215 [06:17:40] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - 
https://gerrit.wikimedia.org/r/9215 [06:17:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9215 [06:48:19] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [06:54:19] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [06:58:13] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [06:58:13] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [07:05:16] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [07:05:16] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [07:05:16] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [07:16:13] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:28:54] New patchset: ArielGlenn; "option to not regenerate existing file for wiki/date" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9219 [07:30:01] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9219 [07:30:06] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9219 [07:48:10] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:51:01] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:54:38] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9136 [07:57:19] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [07:59:16] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 
hours [08:19:15] New patchset: ArielGlenn; "fix nooverwrite so it works without verbose (*cough*)" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9223 [08:19:42] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9223 [08:19:44] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9223 [08:23:01] New patchset: ArielGlenn; "option to skip wikis with existing output files; cleanup verbose var" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9224 [08:23:34] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9224 [08:23:36] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/9224 [08:25:43] hey hashar [08:27:10] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [08:29:07] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [08:32:16] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [08:49:58] hello [08:50:07] database question again :) [08:50:43] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [08:50:45] i have a heartbeat table entry for i suppose s2 that seems not to be growing [08:50:57] | ts | server_id | file | position | relay_master_log_file | exec_master_log_pos | [08:50:57] +----------------------------+-----------+-----------------+-----------+-----------------------+---------------------+ [08:50:57] | 2012-05-26T10:06:16.001170 | 10623 | db13-bin.000418 | 428144675 | db52-bin.000045 | 627296604 | [08:51:06] | 2012-05-26T10:06:16.000890 | 10662 | db52-bin.000045 | 627296604 | NULL | NULL | [08:51:22] is this fine so far? 
[08:52:50] seems replication is still working and only slow but i have no idea why the heartbeat value is not increasing [08:53:24] ah ok the values are increasing - they are just slow [08:54:27] seems that replication is ok (as I look at http://noc.wikimedia.org/dbtree/) [09:06:27] we probably should switch to db52 as master first :D [09:07:13] *eyeroll* [10:01:31] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [10:02:34] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [10:22:23] New patchset: Hashar; "import overriding system for InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9237 [10:22:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9237 [10:44:24] New patchset: Bhartshorne; " giving those with sudo access acclogin access as well." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9239 [10:44:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9239 [10:45:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [10:47:14] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9239 [10:47:16] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9239 [12:22:25] mark: "Register by TOMORROW to participate in the Launch on 6 June." 
-- not sure if you want us to [12:22:32] just sayin :) [12:45:19] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [12:55:27] New review: Helder.wiki; "(no comment)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/9237 [13:15:10] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [13:15:11] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [13:15:11] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [13:15:11] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [13:16:49] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [13:17:43] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [13:21:01] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [13:27:55] New patchset: Bhartshorne; "the group is necessary to include the accounts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9248 [13:28:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9248 [13:28:39] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9248 [13:28:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9248 [13:36:14] New patchset: Bhartshorne; "Revert "the group is necessary to include the accounts." I didn't see that ldap::client::wmf-cluster is already on the host." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9249 [13:36:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9249 [13:36:42] New patchset: Bhartshorne; "Revert " giving those with sudo access acclogin access as well." I didn't see that ldap::client::wmf-cluster is already there." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/9250 [13:37:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9250 [13:37:17] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9249 [13:37:20] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9249 [13:37:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9250 [13:37:29] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9250 [13:43:38] hiiii everyone! [13:43:40] i need to set up backups on stat1 [13:43:41] amanda? [13:43:43] is there a wiki page out there I should read about how to do this? [13:49:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [13:53:54] @regsearch .. [13:53:54] Results (Found 62): puppet, instance, morebots, git, bang, nagios, bot, labs-home-wm, labs-nagios-wm, labs-morebots, gerrit-wm, wiki, labs, bastion, extension, wm-bot, projects, putty, gerrit, change, wikitech, revision, monitor, alert, password, unicorn, help, bz, os-change, instancelist, instance-json, leslie's-reset, damianz's-reset, amend, credentials, queue, socks-proxy, sal, info, security, logging, ask, sudo, access, $realm, keys, $site, bug, pageant, blueprint-dns, stucked, pxe, ghsh, group, pathconflict, terminology, rt, erb, regsubst, bots, wt, gerrit-search, [13:54:03] .. [13:54:31] @infobot-link #wikimedia-labs [13:54:31] petan|wk: Unknown identifier (#wikimedia-labs [13:54:32] These channels now share the same infobot db [13:54:38] @regsearch .. 
[13:54:38] Results (Found 120): morebots, bang, nagios, labs-home-wm, labs-nagios-wm, labs-morebots, gerrit-wm, wiki, labs, extension, wm-bot, gerrit, change, revision, monitor, alert, password, unicorn, bz, os-change, instancelist, instance-json, leslie's-reset, damianz's-reset, amend, credentials, queue, sal, info, logging, ask, sudo, access, $realm, keys, $site, pageant, blueprint-dns, bots, stucked, rt, pxe, ghsh, group, pathconflict, terminology, etherpad, epad, nova-resource, pastebin, newgrp, osm-bug, bastion, ryanland, afk, test, initial-login, manage-projects, rights, new-labsuser, cs, new-ldapuser, projects, quilt, labs-project, openstack-manager, wikitech, load, load-all, wl, domain, docs, address, ssh, documentation, help, account, start, link, socks-proxy, requests, magic, gitweb, labsconf, console, ping, hexmode, Ryan, resource, account-questions, hyperon, deployment-prep, security, project-discuss, project-access, putty, :), instanceproject, puppet-variables, demon, linux, git, port-forwarding, pong, logs, whatIwant, broken, damianz, accountreq, puppet, report, db, nagios-fix, whitespace, instance, hashar, bot, sexytime, Thehelpfulone, bug, [13:54:39] :) [13:54:51] this is what Ryan wanted like 4 months ago :D [13:55:00] cool :) [13:55:28] so for our geodns, does anyone know where the db lives ? [13:55:39] this one? [13:55:41] on labs [13:55:47] in a file :P [13:55:50] nope, for a completely different issue :) [13:55:51] geo dns [13:55:53] aha [13:56:16] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [13:56:19] got someone in europe who was showing a few issues www.wp.org but they're getting the american ip [13:56:45] I am in europe [13:57:12] petanb@srv:~/repo/wikimedia-bot$ ping www.wp.org [13:57:12] PING www.wp.org (69.43.160.149) 56(84) bytes of data. 
[13:57:49] I am using google dns server [13:57:53] oh [13:57:55] wikipedia.org [13:58:01] i was being lazy and typing it wp [13:58:02] oh right [13:58:13] petanb@srv:~/repo/wikimedia-bot$ ping www.wikipedia.org [13:58:13] PING wikipedia-lb.eqiad.wikimedia.org (208.80.154.225) 56(84) bytes of data. [13:58:53] oh [13:58:55] interesting [13:59:08] what ip do you get for en.wikipedia ? [13:59:28] petanb@srv:~/repo/wikimedia-bot$ ping en.wikipedia.org [13:59:28] PING wikipedia-lb.eqiad.wikimedia.org (208.80.154.225) 56(84) bytes of data. [13:59:38] same [13:59:46] :( [14:00:00] hrm [14:00:20] so on the one hand that makes sense, on the other, i think google dns is supposed to also be passing through some ip information [14:00:36] I am pretty sure an analytics guy published the recent IP db somewhere [14:00:38] and either it's not or we're not looking at it, or something else borked with that [14:00:42] can't find out the commit / gerrit change :-( [14:01:32] LeslieCarr: do we actually use that spec? [14:01:41] I'd be amazed if we did, I've never seen anyone else use it [14:02:31] hehe i don't think we do [14:02:38] in any case, aren't Google's DNS anycasted? [14:02:39] i think akamai is supposed to be using it soon ? [14:03:00] hmm [14:03:06] 8.8.8.8 is in Europe for me [14:03:15] but I get eqiad as a reply for our sites [14:03:16] yeah, though i am guessing our geoip db doesn't think it is [14:03:49] I don't think they would use the anycast IP for their outgoing queries [14:04:04] /usr/share/GeoIP [14:04:10] paravoid: 8.8.8.8 is using anycast [14:04:30] for *outgoing* queries? 
[14:04:31] so it is wherever the shortest BGP route sends your packet to ;-] [14:04:43] I know what anycast is and I know they're using it for 8.8.8.8 [14:05:06] what I'm asking is when they do recursion for you, if they're also using that IP to hit other servers [14:05:12] ((( and by the way, you should not use 8.8.8.8 if you care about your privacy ))) [14:05:18] I'd be very very amazed if they do [14:05:23] ohh sorry [14:05:29] they are most certainly using another address [14:05:45] since there are asymmetries in the internets, and they might lose the reply [14:05:59] hi, i worked on GeoIP stuff recently, [14:06:04] what's up? [14:06:04] well the reply might end up at another of their servers [14:06:19] and technically, they might be able to redirect it to the correct server. But that would be a bit of a mad thing to do [14:06:32] ottomata: I think LeslieCarr is trying to find out how the geoIP stuff works out [14:06:47] for DNS [14:06:59] for DNS... [14:07:12] hm, I don't know much about it for DNS I think [14:07:32] I recently changed how the .dat files were being downloaded and distributed, but afaik that is for IP address geolocating [14:08:47] anyway if some user in Europe use 8.8.8.8 as a DNS resolver AND its ISP has bad connectivity in Europe, its DNS query can totally be sent over sea to the US [14:09:23] then the Google US IP will be sent to our us and GeoIP will give back pmtpa / eqiad instead of knams :-] [14:09:27] and there is nothing we can do [14:09:48] again, I'm hitting a european 8.8.8.8 node but I'm getting eqiad as a reply for it [14:09:59] ohh [14:10:11] from it even [14:10:45] ohh [14:10:47] got same [14:11:26] my guess is that google's ips are all considered US or unknown or something [14:11:31] $ dig +short www.wikipedia.org @8.8.8.8 [14:11:31] wikipedia-lb.wikimedia.org. [14:11:33] wikipedia-lb.eqiad.wikimedia.org. 
[14:11:33] 208.80.154.225 [14:11:38] right [14:12:50] which is hosted somewhere behind 209.85.251.231 [14:13:06] which must be in Europe given my latency is 40ms to that node [14:13:32] damn Google does not have any reverse DNS set on its routers :-( [14:13:34] well hmm [14:13:47] X.25 was way better for that [14:15:43] oh well after looking this up more my question is moot [14:15:50] it's an Indian ISP but they're going via Europe [14:16:07] * hashar learned a new word: 'moot' [14:16:29] I guess from India to USA, the cheapest/shortest route is through Europe :-((( [14:20:07] usually it's via the pacific [14:20:16] well best [14:20:19] cheapest may not be [14:20:58] !log deployment prep killing puppet on jobrunner boxes [14:21:02] ... [14:21:08] bah [14:21:10] morebot died [14:21:20] mutante: apergos can you possibly resurrect morebot? [14:21:51] again? [14:22:13] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [14:24:22] apergos: maybe it should be automatically restarted every few hours :-D [14:24:54] apergos: or we need to add it in some kind of a loop while(1) { morebot.sh ; sleep 60; } [14:28:59] it is [14:29:02] in a frickin loop! [14:30:03] and here it is again [14:30:12] (I did not get over to linode so I guess the loop got it) [14:30:18] did it [14:30:30] yer fast [14:30:33] started it in screen, then it was killed, thought that was you [14:30:57] uhm, well, as long as it lives :) not even sure now [14:34:10] or [14:34:15] you could get nagios to restart it ;-] [14:34:33] nagios has a parameter that lets you execute a process to try to recover a faulty check [14:34:35] true [14:34:45] something like: /etc/init.d/apache2 restart hehe [14:34:52] I used that to reboot some routers past some uptime [14:35:43] good idea, just let's have a regular init script we can use first [14:35:49] hehe [14:36:14] upstart ftw!!! 
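Editor's sketch of the restart loop suggested at 14:24:54 (`while(1) { morebot.sh ; sleep 60; }`), written for a POSIX shell. The function name `keep_alive`, its arguments, and the optional run cap are hypothetical and exist only for illustration; the actual morebot wrapper is not shown in this log.

```shell
#!/bin/sh
# Hypothetical supervisor loop: rerun a command every time it exits.
# Usage: keep_alive MAX_RUNS DELAY_SECONDS COMMAND [ARGS...]
# MAX_RUNS=0 means "loop forever", like the while(1) idea in the chat.
keep_alive() {
    max_runs=$1
    delay=$2
    shift 2
    runs=0
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        # Capture the exit status without aborting if the command fails.
        status=0
        "$@" || status=$?
        runs=$((runs + 1))
        echo "run $runs exited with status $status" >&2
        # Pause before the next restart, but not after the final run.
        if [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; then
            sleep "$delay"
        fi
    done
    # Report how many times the command was started.
    echo "$runs"
}
```

Something like `keep_alive 0 60 ./morebot.sh` would correspond to the loop in the chat. The run cap is there only so the loop can be exercised without running forever; an init system such as upstart or supervisord, both mentioned in the log, would normally handle this instead.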
[14:36:27] supervisord ftw [14:36:35] mutante: http://upstart.ubuntu.com/cookbook/ [14:36:35] ew [14:37:08] k [14:37:18] Tivoli Workload Scheduler owns you all [14:37:26] Logged the message, Master [14:37:45] (it is just like a central cron job system) [14:40:19] hashar: lol [14:40:25] that's what I use everyday in my job [14:40:27] XD [14:40:30] ahah [14:40:45] you would earn massive amount of cash in some big french companies ;-D [14:40:54] (thought that would be a boring job haha) [14:40:59] I work for big german company :P [14:41:07] it's boring job [14:41:15] it works for me [15:00:12] ottomata: I just cherry-picked your 'generic::mysql::server' patch into labs, and could now use some advice. (I'm not totally clear on whether my question relates to your patches or not, but it seems like a good start.) Do you have a moment? [15:01:21] mutante: paravoid: I got a change to ignore /private/ directory in operations/puppet https://gerrit.wikimedia.org/r/#/q/I8b2a1864,n,z . Any of you could have a look at them please ? They are for test and production branches. [15:01:38] mutante: paravoid : useful when we have a private checkout there ;-] [15:02:30] andrewbogott_ [15:02:31] sure! [15:03:02] ottomata: Keep in mind that I know almost nothing about mysql. So probably this will turn out to be a trivial issue. [15:03:27] okey doke, what's up? [15:03:58] I've installed that class on an instance, and now I'm running the mediawiki web installer. It seems happy with all of its tests, but when it actually does the configuration it says it can't create a database. Which seems pretty basic... [15:04:06] Lemme pastebin the error page. [15:04:25] k [15:04:27] Or, actually, it's right here http://mwreview-foo.pmtpa.wmflabs/core/mw-config/index.php?page=Install (if you have a proxy set up for labs http) [15:04:59] pastebin here: http://pastebin.com/cNZ6kD14 [15:05:14] Is that just because i need to do a bit of by-hand db setup before running the mw installer? 
[15:05:24] hm, i don't have labs proxy set up... [15:05:32] looking... [15:06:07] Most likely I just need to tell mw the proper login/password and I'm too thick to know what that is. [15:06:14] hm, i think you need to do some by hand stuff [15:06:14] ottomata: http://dpaste.org/cKabw/ [15:06:23] the puppet stuff I set up doesn't set up any users [15:06:24] change User hashar obviously ;-] [15:06:31] so all it has is the default root no pw install [15:06:46] lemme check, I think that's what mw expects. [15:06:47] so you'd have to create whatever user (and maybe database) you need by hand [15:07:14] hashar, aye, I have an ssh proxy, but not one for http [15:07:28] ohh [15:07:36] Hm... I'm also taking for granted that the mw self-installer is stable and works. [15:08:21] ottomata: how come you can't access http? Are you doing some kind of IP over DNS to bypass a paid wifi ? [15:08:50] eh? naw, I'd just need to set up an http proxy or a tunnel through bastion or whatever to get to the labs stuff [15:08:50] hashar: will that just ignore ./private or also ./foo/bla/private ? [15:08:59] ottomata, hashar: The proxy I'm talking about is this: https://labsconsole.wikimedia.org/wiki/Help:Access#Accessing_web_services_using_a_SOCKS_proxy It's for accessing labs web hosts that don't have a public ip. [15:09:02] Pretty obscure. [15:09:34] mutante: hmm good question haha [15:09:48] mutante: I am pretty sure it is relative to the git repository root path [15:10:05] * hashar tries [15:10:05] hashar, I have a question for you about amanda backups, do you know much about that? [15:10:36] andrewbogott, I set up a mediawiki install recently and used the gui setup, but I think I created the database and user myself manually [15:10:36] mutante: you are right :-( [15:10:46] and entered those when mw installer asked me [15:10:56] ottomata: you got the wikitech page already? [15:11:04] yeah i found that [15:11:10] so that is easy [15:11:17] ottomata: ok. 
Lemme rtfm and see if I can figure that out... I'll come back and get you to hold my hand if that fails. [15:11:18] i'm editing the disklist-* files to add the host and path [15:11:21] ok cool [15:11:22] New patchset: Hashar; "git ignore /private/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6470 [15:11:27] this is /a on stat1 [15:11:32] and there is a mysql data dir in /a [15:11:37] /a/mysql [15:11:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6470 [15:11:48] this mysql stuff is just temporary workspace [15:11:53] don't need to back it up [15:11:56] so I was going to exclude it [15:12:02] does that make sense? [15:12:25] I am about to commit this [15:12:25] https://gist.github.com/2828926 [15:12:48] hashar: the one on labs is already merged [15:13:00] mutante: yeah I am ending a new change [15:13:03] cool [15:13:17] mutante: https://gerrit.wikimedia.org/r/9253 (test) and patchset 2 of https://gerrit.wikimedia.org/r/6470 (prod) [15:15:49] mutante, does excluding /a/mysql like that make sense? [15:16:08] New review: Dzahn; "yea, / was the intention instead of ." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6470 [15:16:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6470 [15:16:57] ottomata: Indeed, the gui installer is documented as only needing a login and password, and defaults to root w/no pass. So that's suspicious. [15:17:07] But, I will try by hand anyway... [15:17:55] ottomata: ehm, not sure if we want to keep the whole /a/ thing [15:18:10] no? [15:18:22] someone told me I should have amanda backups of stat1 [15:18:28] and ha, that's as much as I know [15:18:34] what does that mean? what should I do? [15:20:00] Ooh, I can't create a db on the commandline either. 
[15:20:01] mysql> create database wikidb; [15:20:01] ERROR 1006 (HY000): Can't create database 'wikidb' (errno: 13) [15:20:13] ottomata: So I must've broken your patch when I resolved conflicts :( [15:20:44] shid 9ca6e07c5fbc3e283c38b1d0ac5670116ea4ea46 [15:21:07] hmmmm [15:21:24] can you pastebin the /etc/mysql/my.cnf file [15:21:50] ottomata: f.e. ask domas about the "dammit.lt" dir in there if you get to see him [15:21:59] Yep, or you can visit the instance yourself if you want. it's called 'mwreview-foo' [15:22:59] mutante, afaik, the /a directory is just a workspace for stats related things [15:23:11] and i've been told to back it up [15:23:23] i'll ask domas, but does that mean you think I shouldn't backup /a [15:23:23] ? [15:23:24] do you already have mysql dump crons or something? [15:23:35] and then file backup the dump files [15:23:37] no, and I don't really care about /a/mysql [15:23:47] it's just temp workspace [15:24:06] fabian is the only one that uses it right now, he imports some stuff, manipulates it, and then outputs files for his own use [15:24:13] ok, well, i don't know better than you who uses it [15:24:13] ottomata: paste: http://pastebin.com/6ug9qJDF [15:24:43] andrewbogott, thanks. mwreview-foo says Permission denied (publickey). when I try to log in [15:24:57] this looks like a problem [15:25:00] datadir = /mnt [15:25:03] oh, hang on... [15:25:16] did you set datadir in puppet? [15:25:22] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [15:25:25] you can leave it blank [15:25:27] for a generic install [15:25:40] include generic::mysql::server should work fine [15:25:51] and requires the least amount of manual setup [15:25:55] if you change the datadir at all [15:25:58] ok, added you to the project, I think. [15:26:03] you have to make sure you install a mysql database at that location [15:26:04] Catching up... 
[15:26:20] I do want to set the data dir, because /mnt is a different (big) volume, as opposed to the default location. [15:26:23] either copying the default /var/lib/mysql that comes with the package to the datadir, or using mysql_install_db [15:26:59] /mnt seems like a weird place for a datadir [15:27:17] Inasmuch as the class takes a datadir arg, would it be reasonable to alter the class so that it sets things up properly for an arbitrary datadir? [15:27:23] /mnt is usually an empty directory that has subdirs on which remote filesystems are mounted [15:27:29] hmmmmmm [15:27:43] Yeah... I used /mnt upon ryan_lane's advice and didn't think about it much. [15:27:44] not a bad idea, but i dunno [15:28:00] And, for labs instances, it does seem to map to an empty volume. [15:28:01] now getting Connection closed by UNKNOWN when trying to log in [15:28:31] PROBLEM - MySQL Slave Running on bellin is CRITICAL: Connection refused by host [15:28:39] hmmm [15:28:40] PROBLEM - MySQL disk space on bellin is CRITICAL: Connection refused by host [15:28:40] PROBLEM - MySQL Slave Delay on bellin is CRITICAL: Connection refused by host [15:28:40] PROBLEM - Full LVS Snapshot on bellin is CRITICAL: Connection refused by host [15:29:14] `bellin` in sounds great [15:29:16] PROBLEM - MySQL Idle Transactions on bellin is CRITICAL: Connection refused by host [15:29:28] hmm, i dunno, having puppet alter the datadir is a little scary to me [15:29:34] PROBLEM - MySQL Recent Restart on bellin is CRITICAL: Connection refused by host [15:29:34] PROBLEM - MySQL Replication Heartbeat on bellin is CRITICAL: Connection refused by host [15:30:07] maybe if it wasn't done by default [15:30:10] hashar, whatdya think about that [15:30:27] if my generic::mysql::server stuff ran mysql_install_db if the datadir was empty? [15:30:35] ottomata: The datadir is set via a role class. so it isn't modified directly in your class. 
[15:30:53] labsmysql.pp [15:31:15] ottomata: I am far from being a puppet guru [15:31:26] aye, but it is more a mysql question [15:31:26] I am surely not going to have any final word ;-] [15:31:40] ottomata: I am far from being a mysql guru [15:31:48] I am surely not going to have any final word ;-] [15:31:50] hahah, ok [15:31:52] ottomata: You might be able to log in now. [15:31:54] If that's still interesting. [15:31:56] (I should make that an alias ;)))))))))) ) [15:32:03] ahhhhh, man i always get hashar mixed up with binasher [15:32:13] I was here first!! ;-D [15:32:17] hashar lives in france, yes? [15:32:21] yes [15:32:24] yes! [15:32:26] and not at all a MySQL guru [15:32:54] I just know a few tricks to optimize queries and the very basic sysadmin stuff (like show process list; and settings variables) [15:33:09] When you said copy default /var/lib/mysql database... do you just mean that whole dir? [15:33:37] yeah, with cp -a [15:33:44] 'k [15:33:48] the default datadir is /var/lib/mysql [15:33:55] and [15:33:59] make sure mysql is not running when you copy it [15:34:24] so if you change the datadir, you need to make the datadir contain the same stuff as a default datadir [15:34:38] the mysql-server package sets a datadir up there [15:34:42] so you can just copy it [15:34:56] or use mysql_install_db to install a new one (the package does this when it is installed) [15:35:17] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100% [15:35:33] still can't get into mwreview-foo: Connection closed by UNKNOWN [15:35:46] but don't worry about it [15:38:00] New patchset: Ottomata; "Setting up amanda backups for /a on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [15:38:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9258 [15:40:25] mutante: who whould I ask to review/test that? [15:40:54] ottomata: who/what creates the /var/lib/mysql dir? 
Does the Debian package do that? [15:41:15] yes [15:45:27] * andrewbogott wonders how to stop and start mysql from within a puppet class [15:47:05] like permanently start stop [15:47:08] or just make it restart ? [15:47:24] because you could do service mysql / ensure => stopped/running [15:47:40] hm [15:47:50] don't use puppet to start/stop [15:48:00] puppet will ensure that it is one or the other [15:48:07] puppet should be used to set up how things *should* be [15:48:17] if you are temporarily messing with stuff [15:48:24] then disable puppet for a bit while you work: [15:48:28] sudo service puppet stop [15:48:39] (don't forget to turn it back on when you are done though ) [15:48:55] New review: Dzahn; "would it make sense to move the inclusion of "backup::client" into the statistics role class? (have ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9258 [15:49:09] ottomata: ^, i can look at it closer tomorrow... gotta leave now though [15:49:19] but that was my thought at first glance [15:50:05] ok cool, i thought about that too, will comment in gerrit [15:50:13] ottomata: Right, I need puppet to suspend mysql while it moves files. [15:50:22] maybe... not quite there yet. [15:50:25] disable puppet [15:50:32] til you have everything as it should be [15:50:36] sudo service puppet stop [15:50:42] then turn off mysql yourself [15:50:51] sudo service mysql stop [15:50:56] Um... trying to puppetize the process, see. [15:51:06] oh the moving of the datadir? [15:51:09] right. [15:51:13] ahhh [15:51:14] ok [15:51:29] Though I fear that the current system starts mysql before it gets to the point of moving things. 
[15:51:37] i wouldn't puppetize the moving of the datadir [15:51:41] use mysql_install_db [15:51:43] so [15:52:40] exec {"install_data_dir": command => "mysql_install_db", refreshonly => true, unless => "test -d $datadir/mysql", notify => Service['mysql'] } [15:52:42] or something like that :) [15:52:50] oh, naw, remove the refreshonly [15:52:58] exec {"install_data_dir": command => "mysql_install_db", unless => "test -d $datadir/mysql", notify => Service['mysql'] } [15:53:04] Oh, ok, that seems straightforward. [15:53:10] exec {"install_datadir": command => "mysql_install_db", refreshonly => true, unless => "test -d $datadir/mysql", notify => Service['mysql'] } [15:53:14] http://www.puppetcookbook.com/posts/restart-a-service-when-a-file-changes.html [15:53:25] <--maybe you can do that and just change "running" to "stopped" [15:53:36] laters [15:53:48] (there is also "disabled" for services) [15:54:15] yeah, all you need to do is install the db if it doesn't exist, and then restart mysql [15:54:37] you can run mysql_install_db while mysql is running (pretty sure anyway) [15:54:57] as long as the datadir is not where the currently running mysqld thinks the datadir is (from when it was first started) [15:56:30] New review: Ottomata; "Yeah, I thought about that too. But, I'm not so sure. Since you need to edit files that are not ma..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/9258 [15:58:09] bah! mysql_install_db fails and tells me to look in an empty dir for the explanation [15:58:38] sudo? :) [15:58:50] Yeah, I am. [15:58:57] does the datadir exist? [15:59:07] It creates it before failing. [15:59:25] no explanation [15:59:28] what about mysql lost [15:59:29] ummmm [15:59:32] somewhere in /var/log [15:59:43] tail -n 100 /var/log/{mysql*,mysql/*} [16:01:26] ok, thanks, I may be getting somewhere now... [16:01:38] Thought I had to specify the dir for mysql_install_db but it's getting it from the config. 
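Editor's sketch of the guard behind the `unless => "test -d $datadir/mysql"` exec discussed above, written as plain shell so the logic can be tried outside Puppet. The function names `datadir_action` and `ensure_datadir` and the `INSTALL_CMD` override are hypothetical. Passing `--datadir` explicitly is also an assumption made here; in the chat the tool picks the directory up from the MySQL config.

```shell
#!/bin/sh
# Hypothetical helper mirroring the `unless` check from the chat:
# mysql_install_db creates a `mysql` subdirectory inside the datadir,
# so its presence is taken to mean "already initialised".
datadir_action() {
    datadir=$1
    if [ -d "$datadir/mysql" ]; then
        echo "skip"
    else
        echo "install"
    fi
}

# Run the installer only when the datadir is not initialised yet.
# INSTALL_CMD is an assumed override so the sketch can be dry-run on a
# machine without MySQL; by default it calls mysql_install_db.
ensure_datadir() {
    datadir=$1
    install_cmd=${INSTALL_CMD:-mysql_install_db}
    if [ "$(datadir_action "$datadir")" = "install" ]; then
        "$install_cmd" --datadir="$datadir"
    fi
}
```

On a real host this would be followed by a mysql restart, which is the job of the `notify => Service['mysql']` part of the exec in the chat.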
[16:01:54] aye ja [16:02:09] New patchset: Hashar; "remove old wgNoticeBanner_Harvard2011 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9205 [16:02:15] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9205 [16:02:36] New review: Hashar; "Patchset 2 is a rebase on master." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9205 [16:02:39] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9205 [16:03:26] ottomata: Does mysql always append 'mysql' to the end of 'datadir'? Or should I change my $datadir to /mnt/mysql? [16:04:49] it does not [16:04:55] it does create a database called 'mysql' [16:04:59] inside of the data dir [16:05:08] for mysqld's internal use [16:05:16] but [16:05:19] i do think that [16:05:25] 1. /mnt is a terrible place for a datadir :p [16:05:55] and [16:05:55] 2. using a subdir called 'mysql' is good [16:06:04] just like /var/lib/mysql is the default [16:06:11] i created this class for stat1 [16:06:14] and my datadir there is /a/mysql [16:06:15] New patchset: Hashar; "Clean up search-redirect.php per code conventions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9206 [16:06:21] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9206 [16:06:24] New patchset: Ottomata; "Setting up amanda backups for /a and /home on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9258 [16:06:46] New review: Hashar; "Patchset 2 is a rebase." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9206 [16:06:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9258 [16:06:47] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9206 [16:07:17] soooo many changes [16:07:22] it looks like it is never going to end [16:08:18] mutante, ottomata, reasonable? https://gerrit.wikimedia.org/r/#/c/9260/ [16:08:30] Or do I need yet another step in there to restart mysql? [16:10:40] ahrhrh [16:10:55] !log srv187 and srv188 are out of disk space [16:10:58] Logged the message, Master [16:11:00] the notify => Service bit should restart mysql if the command runs [16:11:02] so that is good [16:11:10] can someone clean them up? I just deleted /apache/common-local/search-redirect.php [16:11:18] could you also add some comments there describing what is happening? [16:11:36] i think checking $datadir/mysql for existence is good, because mysql_install_db always creates that [16:11:40] say something about that [16:11:45] oh, and remove the refreshonly => true bit [16:11:48] that was a mistake to add [16:16:25] 'k [16:17:08] !log /usr/local/apache/common-local is 4G where as / is 7G on srv187. Looks like deploying wmf2 + wmf3 + wmf4 will require partitions to be resized. [16:17:11] Logged the message, Master [16:17:13] and [16:17:17] there is nothing I can do I guess [16:17:47] 846M /usr/local/apache/common-local/php-1.20wmf4 [16:17:47] 1.8G /usr/local/apache/common-local/php-1.20wmf3 [16:17:48] 1.3G /usr/local/apache/common-local/php-1.20wmf2 [16:19:33] hashar: srv187 and 1888 are disabled in pybal [16:19:35] they get no traffic [16:19:42] ohhh grat [16:19:46] they can be removed from dsh groups and decommed [16:19:48] thanks notpeter ;-] [16:20:04] can you possible remove them from the dsh groups please ? [16:20:04] heya, LeslieCarr [16:20:08] any opinion on this? 
[16:20:09] sure [16:20:11] https://rt.wikimedia.org/Ticket/Display.html?id=3025 [16:20:13] (if that is easy/simple to do of course) [16:20:18] yep [16:20:20] i'd like the analytics cluster to be able to reach the outside internet [16:20:25] is there a default route we can set? [16:20:26] ottomata: Take two: https://gerrit.wikimedia.org/r/#/c/9260/ [16:20:31] they don't need public IPs [16:20:37] just a way to reach the internet [16:21:27] ottomata: oh so we don't have any nat set up :( [16:21:38] at all [16:21:54] andrewbogott, yup, I think that will work…although, I usually use full paths to executables, I'm not sure if we have the default search path configured in puppet or not [16:21:55] looking at ticket now [16:21:58] so it would be safer to do that [16:22:17] LeslieCarr, thanks, [16:22:29] hrm, so you want to be running apt-get directly ? [16:22:48] at least for now, yeah, we are just testing things out right now, different versions, different pacakges [16:22:55] we aren't sure what we will be setting up with production puppet yet [16:22:56] so yeah, if you're not sure what you'd want, labs is the best place to test stuff out [16:23:01] right now we are in try things out perioud [16:23:02] well [16:23:05] is there a reason that testing diff versions wouldn't work on labs ? [16:23:06] we are doing benchmarking tests [16:23:10] ah [16:24:04] we've got 10 beefy machines, we're going to try a few of different setups and benchmark them for a while [16:24:09] once we decide on what we want [16:24:19] we will reinstall all the machines, and set them up again via puppet [16:25:00] btw, ottomata, what is your labs username? Is it just 'ottomata'? 
[16:25:13] New patchset: Hashar; "use protocol-relative url for nostalgiawiki wgSiteNotice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9204 [16:25:15] otto [16:25:19] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9204 [16:25:50] Ah, that's why you couldn't access then... [16:25:50] labs,svn,git,gerrit: otto [16:25:50] irc,mediawiki,places-where-otto-not-available: ottomata [16:26:20] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9204 [16:26:22] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9204 [16:26:41] Um... wait, your labsconsole name is Ottomata though. [16:26:52] hmm, yes [16:26:53] but his shell is otto? [16:27:02] yes? [16:27:02] yes [16:27:04] i see "mediawiki" in the list [16:27:12] of ottomatas [16:27:13] ;) [16:27:19] yes [16:27:21] (labsconsole is a wiki!) [16:27:22] yeah, shell vs. labsconsole use different names, which confuses everyone. [16:27:26] But is reasonable. [16:27:39] yes? [16:27:42] yes. [16:27:51] andrewbogott: well it could just show both names in the web. maybe it does already but more prominently then [16:28:14] New review: Hashar; "https://nostalgia.wikipedia.org/wiki/HomePage?cache=none" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9204 [16:32:29] hmm [16:32:34] looks like the cluster is fine [16:32:40] so I am probably going to call it an end [16:32:42] and get home [16:33:05] definitely not a good time to start deploying apache configuration changes ! 
[16:33:46] PROBLEM - MySQL Slave Delay on db1038 is CRITICAL: CRIT replication delay 175351 seconds [16:34:13] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 179609 seconds [16:35:52] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 170807 seconds [16:39:59] New patchset: Pyoungmeister; "decom of all srv boxes lower than 190" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9262 [16:40:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9262 [16:40:54] !log decom of all srv lower than 190 [16:40:58] Logged the message, notpeter [16:44:57] I'm trying to forward-port http://svn.wikimedia.org/svnroot/mediawiki/trunk/debs/php5-fss/ to precise [16:45:03] why the directory there is "debian.lucid"? [16:45:19] should I rename it to precise, or copy it? [16:47:19] yep [16:47:42] shall imake a ticket in rt for the decom? [16:48:16] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9262 [16:48:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9262 [16:48:40] definitely! [16:49:13] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [16:49:20] ottomata, mutante, puppet does not like 'unless => "test -d $datadir/mysql"' at all. [16:49:29] err: Failed to apply catalog: Parameter unless failed: 'test -d /mnt/mysql/mysql' is not qualified and no path was specified. Please qualify the command or specify a path. [16:50:19] Because I guess it doesn't know to execute it? [16:50:28] dobson is unhappy (authdns-update won't finish and puppet won't finish). I think it all stems from pdns_control not responding. [16:50:43] maplebed: how did you convert debs/pybal to git? should I do it for php5-fss too? 
[16:50:55] er [16:51:06] I"m tempted to stop pdns, kill all outstanding pdns_control processes, and restart it. [16:51:16] mark: how did you convert debs/pybal to git? should I do it for php5-fss too? [16:51:19] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8980 [16:51:22] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8980 [16:51:32] paravoid: with git-svn [16:51:43] anybody know how badly that will fuck with the cluster running? I hope the answer is 'it won't.'. [16:52:13] maplebed: you can kill just one auth server with no issues [16:52:15] there are 2 more [16:52:18] just make sure the others are up [16:52:49] I've recently run dig against all three. that's a good enough test, right? [16:52:58] sure [16:53:04] nagios will also complain if one is down [16:53:09] [C [16:53:10] ok, thanks. [16:53:14] whoops [16:54:07] !log kicking pdns on dobson to try and make it happy again. [16:54:10] Logged the message, Master [16:54:38] heh... it won't stop either, since that's a pdns_control command. [16:54:46] * maplebed brings out the kill [16:55:07] Oh, it doesn't know where 'test' is. [16:55:13] PROBLEM - Puppet freshness on search17 is CRITICAL: Puppet has not run in the last 10 hours [16:55:46] andrewbogott [16:55:53] right, give it full paths to executables [16:56:06] ottomata [16:56:06] andrewbogott, yup, I think that will work…although, I usually use full paths to executables, I'm not sure if we have the default search path configured in puppet or not [16:56:29] Yeah, I caught that before but for some reason thought that 'test' was implicit rather than just a regular old unix tool. 
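Note on the exec discussed above: a minimal sketch of how the pieces from this conversation might fit together, not the manifest that was actually merged. It drops `refreshonly`, keeps the `unless` guard on `$datadir/mysql`, and gives the exec a search path so that `test` and `mysql_install_db` resolve, which is one of the two fixes the error message suggests (the other is fully qualifying each command). The class name `mysql_datadir_sketch`, the parameter default, and the directories listed in `path` are assumptions for illustration only.

```puppet
# Hypothetical class name and default; the log only mentions /mnt/mysql and /a/mysql.
class mysql_datadir_sketch ($datadir = '/a/mysql') {

  # Initialise the data directory only if it has not been initialised yet.
  # mysql_install_db reads datadir from the MySQL config and always creates
  # the internal 'mysql' database there, so that directory works as a guard.
  exec { 'install_datadir':
    command => 'mysql_install_db',
    # Assumed locations of the binaries; without 'path' Puppet rejects
    # unqualified commands such as 'test -d ...'.
    path    => ['/usr/bin', '/bin'],
    unless  => "test -d ${datadir}/mysql",
    # Restart mysql only when the command actually ran.
    notify  => Service['mysql'],
  }

  # Puppet keeps the service in the declared state on every run.
  service { 'mysql':
    ensure => running,
  }
}
```

Because the guard makes the exec a no-op once `$datadir/mysql` exists, repeated puppet runs should leave a populated datadir and a running mysqld untouched.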
[16:56:47] ahhh ok [16:59:16] PROBLEM - Puppet freshness on search15 is CRITICAL: Puppet has not run in the last 10 hours [16:59:17] PROBLEM - Puppet freshness on search13 is CRITICAL: Puppet has not run in the last 10 hours [16:59:43] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Tue May 29 16:59:36 UTC 2012 [17:00:05] cmjohnson1: yeah, we're going to extend the existing row D stack into row C [17:00:14] but probably with fibers instead of stacking cables, because of length [17:01:27] mutante: puppet's unfrozen on dobson; it was hanging on pdns_control. though of course now it's broken for a different reason... [17:02:16] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay 0 seconds [17:02:16] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [17:02:53] can someone review this for me? [17:02:57] i'm not sure who knows much about backups [17:02:57] https://gerrit.wikimedia.org/r/#/c/9258/ [17:05:32] LeslieCarr, any insight from reading that RT ticket? [17:05:52] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 0 seconds [17:06:19] PROBLEM - Puppet freshness on search14 is CRITICAL: Puppet has not run in the last 10 hours [17:06:19] PROBLEM - Puppet freshness on search16 is CRITICAL: Puppet has not run in the last 10 hours [17:06:19] PROBLEM - Puppet freshness on search19 is CRITICAL: Puppet has not run in the last 10 hours [17:06:47] ottomata: can we give those machines external ip's and no private data to start ? 
[17:07:20] a single public IP would be enough, sure [17:07:30] although, 'no private data' might make benchmarking some things a little annoying [17:07:59] we intend to use some default benchmarks, as well as for a few or our own usecases, which would entail running log data through them [17:08:22] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4483 [17:08:24] i'd set up NAT on analytics1001 if you'd give it an external facing IP [17:10:04] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 89863 seconds [17:14:28] well each would need a public ip [17:14:37] eh [17:14:41] use iptables to do some nat ? [17:15:25] well if you wanna do the config i would definitely be ok with that ;) [17:15:37] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue May 29 17:15:27 UTC 2012 [17:16:56] ottomata: also, 2992 [17:17:06] (fyi i just got back from being gone for 2 weeks, so catching up) [17:17:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:17:30] if you want to take that info and make a little class of "geoip::update" or whatever, that would be cool [17:17:54] ottomata: make sure that any license keys are only in the private repo though. we do NOT want any of them in public ... [17:17:57] LeslieCarr, yeah I can do the config [17:18:12] want me to update both tickets ? [17:18:13] for NAT, especially since it is temporary [17:18:16] and re: GeoIP [17:18:18] yup! [17:18:21] that is what is done [17:18:28] GeoIP.conf is in private repo [17:18:49] put in place on puppetmaster, which runs geoipupdate to dl the .dat files into the volatile directory [17:19:11] clients then can include geoip class with provider => 'puppet' to sync the files via puppetmaster fileserver [17:19:23] ah cool, reading this ticket! 
[17:19:26] i've not seen it [17:19:27] New patchset: Pyoungmeister; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [17:19:39] both? i can update 2992 [17:19:42] what's the other? [17:19:43] cool [17:19:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [17:19:49] oh ha [17:19:52] yes, i've seen this ticket, doh [17:19:54] i created it [17:19:57] 3025 [17:19:58] :) [17:20:00] right, it is done [17:20:01] i will resolve it [17:20:06] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4483 [17:20:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [17:20:12] Responding 3025 with what i just said here :) [17:20:22] ok cool [17:20:27] i will resolve 2992 [17:20:30] hey woosters, looks like my shell access doesn't work. who should i bug about it? [17:20:32] which server for 3025 ? [17:20:47] kaldari: do you have an rt ticket ? [17:20:59] nope :( [17:21:05] analytics1001 would be fine…are you doing that via puppet? [17:21:15] if so, then I willl suggest a different machine [17:21:34] I've been messing with some stuff on analytics1001, and puppet is a little funky there atm [17:21:51] if you are just adding the network configs manually [17:21:55] then analytics1001 is fine [17:22:03] guess I'll get an rt ticket [17:25:06] there is an rt ticket [17:25:12] for the original request [17:25:28] maplebed worked on it [17:26:01] kaldari: did I send you an email asking you to test and let me know if it worked? I know I meant to but I mighta just pinged in IRC. [17:26:08] I think there were steps that I left out. [17:26:25] maplebed: I don't think so [17:26:47] damn. sorry, I meant to make sure you tested it before closing the ticket. 
[17:27:00] that's ok, I filed a new ticket :) [17:29:08] oh wait, now I remember. you already had an account, so it was supposed to just work. [17:29:16] maplebed: yeah [17:29:29] can you log in anywhere? [17:29:38] (eg fenari) [17:29:52] I can log in lots of places, but not fenari :) [17:30:52] I remember looking at the key in the config, and it looked right [17:31:04] but I have no home dir on fenari [17:31:17] and it gives me a permission denied error [17:31:18] I think I need to make you a homedir. [17:31:31] New patchset: Pyoungmeister; "adding mobile::vumi to silver and zhen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9267 [17:31:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9267 [17:32:23] Jeff_Green: you were doing that for the other new users, right? is it on nfs1? [17:32:50] maplebed: yeah, you need to make a homedir on fenari [17:32:57] puppet doesn't handle that [17:33:26] do I make it on nfs1 or on fenari and let NFS to its job? I suppose it doesn't matter... [17:33:39] does puppet create the .ssh dir once the homedir exists? [17:33:42] well, it's just an nfs share, so anywhere it's mounted [17:33:44] I don't think so [17:34:47] maplebed: making homedirs? yes [17:34:54] i meant to document it at the top of admin.pp but completely forgot [17:35:03] did you make .ssh dirs too? [17:35:08] or did puppet? [17:35:10] New patchset: Sara; "Don't bother applying ganglia::collector role to hooft.esams or streber, as neither of these actually run gmetad. (Change 9265 is the corresponding change in the test branch.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9268 [17:35:23] totally manual, puppet won't touch homedirs on the nfs systems [17:35:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9268 [17:35:40] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9267 [17:35:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9267 [17:35:55] ssmollett: the ganglia::collector role applies to gmond, doesn't it? to make it not deaf? [17:36:51] ssmollett: oh, nevermind. [17:37:03] maplebed: that's ganglia_aggregator. [17:37:11] too many similar names. [17:37:22] yup. [17:37:45] Jeff_Green: it doesn't even populate authorized_keys with the key in puppet? [17:37:53] nope [17:38:23] so you manually also created bsitu's authorized keys (for example)? [17:38:28] puppet has a resource/directive/variable/somethingerother called managehome or something, it's 'no' for nfs hosts [17:38:50] ssmollett: I'm pushing your stuff out [17:39:12] notpeter: ok. [17:39:13] and the calls to the ssh authorized key resource are skipped if it's 'no' too [17:39:50] New review: Sara; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9268 [17:40:17] kaldari: try again now please [17:40:54] works now [17:40:58] thnkas! [17:41:41] yw. [17:43:08] so, http://svn.wikimedia.org/svnroot/mediawiki/trunk/debs/php5-fss/ has only the debian.lucid/ dir; where's the rest of the source? 
[17:43:17] (please don't tell me apt-get source :) [17:47:39] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [17:47:39] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [17:47:48] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 1 seconds [17:47:58] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay 0 seconds [17:48:39] New patchset: Bhartshorne; "use sudo_user only for distributions newer than hardy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9270 [17:49:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9270 [17:49:20] mark: ^^^ that's what you mean, right? [17:50:21] ottomata: okay, not updating ip's via puppet though [17:50:56] ? [17:51:18] oh responding to wrong ticket [17:51:19] right, so, IPs are not in puppet usually, right? [17:51:20] Change merged: Sara; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9268 [17:51:24] i mean wrong message [17:51:24] nm [17:51:25] yes [17:51:28] so analytics1001 is fine [17:51:29] ok cool [17:51:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9270 [17:51:49] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9270 [17:56:44] thanks LeslieCarr! lemme know when that is up and I will try it up + set up NAT for the others [17:58:27] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:00:10] maplebed: yep [18:00:33] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [18:03:25] ::sigh:: puppet on dobson is still broken, but at least the sudo stuff isn't the reason anymore. [18:04:53] but I think I'll work on it more after dinner. 
[18:05:57] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 188 seconds [18:06:24] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 194 seconds [18:16:09] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [18:29:58] New patchset: Hashar; "import overriding system for InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9237 [18:30:06] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9237 [18:30:43] New review: Hashar; "Patchset 2 add a warning at top of wmf-config/InitialiseSettings-wmflabs.php" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9237 [18:30:59] can anyone explain what the purpose is of gmetad is on manutius? and which of the gmetad.conf config options need to be different from those on nickel? [18:36:06] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 193 seconds [18:36:06] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 193 seconds [18:41:30] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [18:41:48] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 191 seconds [18:41:57] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 187 seconds [18:42:38] (please don't tell me apt-get source :) [18:42:47] how about debcheckout? ;) [18:42:51] * apergos tells jeremyb apt-get source :-P [18:43:16] what was the answer anyway? [18:43:22] what was the question? :-P [18:43:33] * apergos is off he clock if you can't tell... [18:43:36] *the [18:43:51] 29 17:43:07 < paravoid> so, http://svn.wikimedia.org/svnroot/mediawiki/trunk/debs/php5-fss/ has only the debian.lucid/ dir; where's the rest of the source? [18:44:18] apergos: have you sync'd clocks w/ paravoid ? 
[18:44:33] no [18:44:37] I use ntpd :-P [18:45:42] I thought so, I added that to svn initially at some point [18:46:08] damn, /me stabs viewvc [18:46:10] i wants git [18:46:14] heh [18:46:37] following our build practices at the time [18:46:45] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:58] really? huh [18:47:05] off the clock! :-) [18:47:48] it's eqiad... i thought those were not being used? or at least could live without for now? [18:47:48] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [18:48:02] to flap or not to flap. that is the question [18:48:16] mw8? really? [18:48:31] mw8.pmtpa.wmnet [18:48:33] err [18:48:40] yeah, i'm braindead [18:48:42] sorry! [18:48:42] New patchset: preilly; "remove Opera Mini IP Addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9277 [18:48:44] Ryan_Lane: are you online [18:48:54] Ryan_Lane: can you merge https://gerrit.wikimedia.org/r/9277 [18:49:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9277 [18:49:15] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9277 [18:49:31] preilly: I'm getting on a plane really, really soon [18:49:46] Ryan_Lane: can you do it really fast? [18:49:47] maybe LeslieCarr can? (preilly ^^) [18:49:52] have a safe flight ryan-lane [18:49:54] LeslieCarr: are you available? [18:50:24] woosters: thanks [18:50:48] yes, have a safe and bumpless flight ;) [18:50:54] hey [18:50:56] oh what's up [18:51:00] LeslieCarr: can you merge https://gerrit.wikimedia.org/r/#/c/9277/ [18:51:02] i'm around :) [18:51:11] ok, checking this out now [18:51:42] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [18:52:51] assuming that you definitely want to remove those ip's, looks good - do we need to do the normal push and flush the mobile cache ? 
[18:52:57] or will a push suffice [18:53:10] LeslieCarr: yes the normal push and flush would be great [18:56:03] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 229 seconds [18:56:06] preilly: :D [18:56:12] phrasing? [18:56:21] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 245 seconds [18:57:02] ok. it's close to boarding time [18:57:04] * Ryan_Lane waves [18:57:13] Ryan_Lane: it's LeslieCarr's phrasing? [18:57:20] jeremyb: no, patricks [18:57:27] hehehe [18:57:31] Ryan_Lane: he quoted verbatim [18:57:37] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9277 [18:57:38] both, then :) [18:57:40] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9277 [18:57:48] well he left off a few words [18:58:30] ha ha [18:59:04] preilly: how's it look ? [18:59:12] !log flushed mobile varnish caches after push [18:59:16] Logged the message, Mistress of the network gear. [18:59:16] oh wait 1 minute [18:59:47] hah, didn't do the puppet pull [18:59:54] i am still in vacation mode [19:00:16] ha ha [19:04:15] anyways, how about now [19:07:00] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.723 second response time [19:07:55] preilly: ^ [19:08:02] LeslieCarr…busy day after vaca, eh? 
i know you are busy, so no worries, but I am wondering about an eta for the analytics1001 ip [19:09:19] LeslieCarr: looks good [19:09:38] hehe [19:09:42] um, in a bit [19:11:10] mk, ty [19:31:45] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 186 seconds [19:31:54] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds [19:32:55] ottomata: going to mess with analytics1001 now … killing your connection to it [19:34:53] that's fine [19:35:02] closed it myself :) [19:35:05] well after i figure out how to get mgmt access on the new ciscos [19:35:49] db40 (parser cache) may need a look. see #-tech and http://ganglia.wikimedia.org/latest/?c=MySQL%20pmtpa&h=db40.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:35:58] m, uhh, i have root if you need me to do something [19:36:36] LeslieCarr: sounded like it was doc'd pretty well. or else andrewbogott has experience doing it. (mgmt on new ciscos) [19:36:56] oh i have root as well, want to haveserial access [19:36:57] hrm [19:37:12] andrewbogott: hey - any advice on getting serial access [19:37:22] LeslieCarr: they mgmt interface on the ciscos likes 'admin' rather than 'root'. Is that what's blocking you? [19:37:23] andrewbogott: on the new ciscos - "connect host" isn't getting me anything [19:37:49] Oh... you mean you're already talking to the serial interface? [19:38:16] yeah [19:38:36] [re db40 aka parser cache (so not in a cluster?)] first reported issue was ~19:00 UTC and ganglia shows lots of spikes right before then. they generally haven't recovered after spiking [19:38:46] (click my link above) [19:39:46] LeslieCarr: What happens when you 'connect host'? [19:39:49] ganglia thinks it has 12 procs and the lowest load avg for the last hour is 34. 
(but usually more like 80) [19:39:59] CISCO Serial Over LAN: \ Close Network Connection to Exit [19:40:06] and then nothing - i can hit enter 50 times and nada [19:41:15] I take it you know there to be a running OS on the server? [19:41:37] as long as the mgmt console is actually connecting ot the correct server, then yes [19:42:39] Hm... I'm probably useless then. You're talking about analytics1001? I can have a look but will most likely just confirm your findings. [19:42:39] going back further in ganglia the typical load avg seems to be ~20. there was an earlier spike at ~17:35 and then plateau at the high level and then ~19:00 it spiked even higher [19:42:58] (and now it's still extra high) [19:43:50] andrewbogott: yes, analytics1001 [19:45:31] Reedy: Could you check if there is an uncommitted change or accidental merge-commit in php-1.20wmf3? [19:45:36] Reedy: https://en.wikipedia.org/wiki/Special:Version [19:45:37] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;h=5451033fd3be633b46398aa037251dc964258df1 [19:45:38] 404 error [19:46:30] yeah [19:46:33] thanks andrew [19:48:43] Logged the message, Master [19:48:47] Logged the message, Master [19:48:50] Logged the message, Master [19:49:04] Logged the message, Master [19:49:07] Logged the message, Master [19:49:10] Logged the message, Master [19:49:23] Logged the message, Master [19:49:26] Logged the message, Master [19:49:30] Logged the message, Master [19:49:40] Logged the message, Master [19:49:44] Logged the message, Master [19:49:47] Logged the message, Master [19:49:57] <^demon|busy> We should really really restrict that to people with real hostmasks :\ [19:50:04] Logged the message, Master [19:50:07] Logged the message, Master [19:50:11] Logged the message, Master [19:50:21] that's a separate issue [19:50:27] Well [19:50:27] is that the pdf machines ? [19:50:30] or what ? 
[19:50:34] At least they use IPv6 [19:50:40] Logged the message, Master [19:50:44] Logged the message, Master [19:50:47] Logged the message, Master [19:50:53] we should give these pdf boxen real, accurate rdns [19:51:37] <^demon|busy> Oh, was this actually legit? I thought it was spam. [19:51:51] i'm certain i've seen it before [19:52:02] whether it was legit in the past i can't say for certain [19:52:42] <^demon|busy> jeremyb: May 8th, March 9th, at the very least. [19:53:02] LeslieCarr: Yep, I'm seeing the same nonresponsive terminal that you are. [19:53:07] <^demon|busy> March 6-8th, even. [19:53:08] Possibly because it is actually hung? [19:53:08] :( [19:53:23] but i'm in analytics1001 in band [19:53:27] it's up and responsive [19:53:32] can you reset the mgmt interface as root? [19:53:56] jeremyb: The mgmt interface is working fine, it's the console that's hanging. [19:54:13] huh... [19:54:23] not entirely sure what that means. i thought ocnsole was part of mgmt [19:54:38] tried `echo foo >/dev/console` ? [19:54:46] console* [19:54:49] Well... I'm presuming that the console is working, but that whatever it is that the console is displaying is not working. [19:55:01] Although that's not the only possibility. [19:55:32] so, who to poke about db40? seems to not be fixing itself? [19:55:36] :( [19:57:06] i mean, without this i could easily leave analytics1001 compltely unreachable [19:57:48] LeslieCarr: reboot without changing IP and see what happens? [19:57:59] LeslieCarr: also, < jeremyb> tried `echo foo >/dev/console` ? [19:58:15] i mean the machine itself is in fine working order [19:58:36] hrm [19:58:38] so that works [19:58:45] echoing info into /dev/console [19:59:59] Hey, yeah, I see the echo too. [20:00:22] So we don't have a shell running on the console (or have a broken one.) [20:00:35] so, parser cache has bad a small bump in overall get count. but tp99 has gone really crazy. 
https://graphite.wikimedia.org/dashboard/temporary-15 [20:00:56] you could try booting the getty or whatever it is [20:01:40] LeslieCarr: I'm looking at a guide that describes config stuff you have to do within linux to enable SOL. Of course, I expect that ubuntu does all these things by default anyway... [20:02:04] SOL? [20:02:23] serial over lan [20:02:57] (Why would IBM publish a guide to configuring cisco servers I wonder?) [20:03:03] hah [20:03:54] who knows with ibm [20:06:23] Um... here is that giant link, in case you feel like browsing unhelpful docs: [20:06:23] http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CHYQFjAD&url=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3Dt%26rct%3Dj%26q%3D%26esrc%3Ds%26source%3Dweb%26cd%3D4%26ved%3D0CHYQFjAD%26url%3Dhttp%253A%252F%252Fpublib.boulder.ibm.com%252Finfocenter%252Fbladectr%252Fdocumentation%252Ftopic%252Fcom.ibm.bladecenter.mgtmod.doc%252FSOL_Setup_Guide_PDF.pdf%26ei%3D2SnFT8D_M6nO2AX5_LXvAQ%26usg%3DAFQjCNFM37Mh2MfZIyM3JBc4dozS89e3 [20:06:24] 2SnFT8D_M6nO2AX5_LXvAQ&usg=AFQjCNFM37Mh2MfZIyM3JBc4dozS89e3Gg [20:06:38] Dammit, why won't google just give me the actual link rather than a googlified link? [20:07:08] Hm... actually there's little reason to think that that guide applies at all :( [20:07:23] hah [20:07:34] @infobot-link blah [20:07:35] Krinkle: Unknown identifier (blah [20:07:35] Permission denied [20:08:16] @infobot-link blah [20:08:16] Krinkle: Unknown identifier (blah [20:08:16] Permission denied [20:08:23] petan|wk: :D [20:08:43] ungooglified: http://publib.boulder.ibm.com/infocenter/bladectr/documentation/topic/com.ibm.bladecenter.mgtmod.doc/SOL_Setup_Guide_PDF.pdf [20:09:02] hexmode: do you know who renamed "Site requests" to "Site configuration" in Bugzilla? 
can't see anything in the SAL [20:09:07] @trustadd .*@wikimedia/Krinkle admin [20:09:08] Successfuly added .*@wikimedia/Krinkle [20:09:15] @infobot-link blah [20:09:15] Krinkle: Unknown identifier (blah [20:09:15] This channel is already linked to #wikimedia-labs if you want to link it to another channel, you need to remove it [20:09:31] Krinkle: if you have access to bots-1 you can insert yourself as global admin [20:09:38] Thehelpfulone: there may be something on bugzilla, let me check. [20:09:40] just sudo su wmib [20:09:43] petan|wk: ok [20:09:49] * Krinkle makes a note for later, busy with something else now [20:09:51] sure [20:10:05] petan|wk: thx, I'll work on something else, now. Nice job on the sharing [20:10:06] I think I created a category for this bot on bugzilla before i lost my powers XD [20:10:14] LeslieCarr: Sorry, I am now out of ideas. Well, all but one: reboot it while connected to SOL and see if the startup info displays. That may not be a good option if anyone but you is using it. [20:10:17] so you can create a bugs for that [20:10:22] and I will try to fix that [20:10:48] otto's not here... [20:11:14] Thehelpfulone: no, sorry [20:11:44] ok, we should probably get people to log things when they make changes :) [20:12:08] hrm [20:12:08] well [20:12:13] let's try it [20:12:27] !log reloading analytics1001 [20:12:30] Logged the message, Mistress of the network gear. 
[20:12:48] hrm [20:12:57] ok, so we're seeing the loading screen [20:12:58] that's ogod [20:12:59] good [20:13:09] oh god i forgot it takes like 20 minutes to test the memory [20:13:09] blecch [20:13:27] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 186 seconds [20:13:31] <^demon|busy> LeslieCarr: That sounds like an excuse for a coffee break :) [20:13:45] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 204 seconds [20:14:47] rotfl [20:15:25] !log "Site requests" was renamed to "Site configuration" under the Wikimedia product in Bugzilla, don't know who did it though [20:15:29] Logged the message, Master [20:15:38] Thehelpfulone: or when! [20:16:01] <^demon|busy> Or why it needed !logging ;-) [20:16:01] we have a list of 20ish suspects ;) [20:16:09] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:18:20] hrm, now it's fine [20:18:22] stupid ciscos [20:18:42] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms [20:18:51] Reboot ftw! [20:20:30] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 189 seconds [20:20:57] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 203 seconds [20:25:05] Logged the message, Master [20:25:08] Logged the message, Master [20:25:12] Logged the message, Master [20:25:17] Logged the message, Master [20:25:20] Logged the message, Master [20:25:24] Logged the message, Master [20:26:20] New patchset: Thehelpfulone; "meta sysop +/- transadmin self,crat -transadmin" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9333 [20:26:26] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9333 [20:28:51] Reedy: is the correct process for getting someone to review that to add someone as a reviewer? I added you to it, does that send you an email or something? 
[20:30:18] <^demon|busy> Yes, that does notify them. And it adds it to their dashboard. [20:30:57] so is there a list of people that are tasked with reviewing each repo ^demon|busy? [20:31:08] <^demon|busy> No [20:31:27] i find it helpful to also poke the person on email or irc sometimes [20:31:36] depending on the person [20:31:56] <^demon|busy> I only poke manually on IRC/e-mail if it's urgent. If it's not, I hate bothering people with spam. [20:32:03] <^demon|busy> :) [20:32:53] fair enough, but there should be one so that these changes do get reviewed - who knows the information to make one? :) [20:33:08] <^demon|busy> The information's all stored in gerrit. [20:33:21] <^demon|busy> Someone who's got a free weekend and wants to do it could whip up a tool. [20:33:53] phew, back, was on satellite internet and a storm rolled in! [20:33:57] * jeremyb has been begging for a way to ask for review without having to specify a reviewer [20:34:06] ottomata: wtf, where are you? [20:34:09] no storm here [20:34:15] in cumberland, maryland [20:34:19] oh [20:34:24] brooklyn's dry! [20:34:32] aye, i hear it is pretty warm up there this week too [20:34:36] <^demon|busy> No storm here either. [20:35:26] ^demon|busy: compared http://forecast.weather.gov/zipcity.php?inputstring=richmond,+va vs. http://forecast.weather.gov/zipcity.php?inputstring=richmond+va ? [20:35:57] <^demon|busy> I'm looking out my window and I see sunshine and scattered clouds. [20:36:06] <^demon|busy> I find "looking out the window" more accurate :) [20:36:12] hah [20:36:21] not the forecast, look at the other parts of the page [20:37:42] ^demon, are you in richmond, va? [20:37:58] * jeremyb wonders who decided to put "wtf" in a URL path for an NWS page... [20:38:21] <^demon|busy> ottomata: Yup. 
Hopefully just one more year :) [20:39:33] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 187 seconds [20:40:06] aye, i'm from gloucester, va [20:40:09] in richmond occasionally too [20:40:32] looks like db40's recovered ~1/2 way. and tp99 is much better [20:40:45] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 205 seconds [20:43:27] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:45:54] <^demon|busy> ottomata: I don't think I've ever made it out to Gloucester. I'm down in Norfolk/Va Beach pretty often though. [20:46:15] aye [20:46:23] yeah, no reason you'd go there :) [20:46:27] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 183 seconds [20:46:36] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 182 seconds [20:46:58] <^demon|busy> ottomata: That's what my friend says about Suffolk too :p [20:47:30] are you from VA? [20:47:40] <^demon|busy> Yup. Grew up just south of Richmond. [20:49:29] !log renaming analytics1001.eqiad.wmnet to analytics1001.wikimedia.org [20:49:33] Logged the message, Mistress of the network gear. [20:56:21] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [21:00:19] hm, interesting [21:00:37] didn't realize we'd have to change the hostname [21:06:33] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 7 seconds [21:06:33] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 6 seconds [21:07:35] New patchset: Aaron Schulz; "Enabled new thumb purge hook on remaining wikis." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9338 [21:07:41] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9338 [21:08:09] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9338 [21:08:11] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9338 [21:09:06] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [21:09:38] hrm [21:09:54] so for some reason analytics1001.wikimedia.org still thinks it's .eqiad.wmnet .. [21:10:05] hm, is that bad? [21:10:09] i can get in [21:10:15] well yeah but puppet's not running [21:10:19] i'm guessing you want puppet to run [21:10:21] ah, that is for another reason [21:10:26] remember I said I was messing with puppet on 1001? [21:10:32] oh did you mess stuff up ? [21:10:38] yes/no, [21:10:42] on purpose yes [21:10:45] hehe [21:10:51] its temp [21:10:57] which is why I said not to pick 1001 if you needed puppet [21:10:59] right now [21:11:05] do you need to run puppet to get this set up? [21:11:05] well, ok, it's done [21:11:09] nope [21:11:11] ok cool [21:11:12] phew [21:11:13] just thought you would want it [21:11:15] good [21:11:15] yay [21:11:16] yeah, i got it [21:11:19] yayyyy, ok cool [21:11:22] awesome [21:11:24] i can ping! [21:11:26] yayyyy thank youuuuu [21:11:37] ottomata: OK, let's try again... can you access mwreview-qux? [21:11:56] so, is this machine no longer reachable under the eqiad name from fenari? [21:12:50] andrewbogott [21:12:52] yup :) [21:12:54] LeslieCarr: you might need to regen the cert and sign it? [21:13:06] jeremyb: yeah, that's the not working bit [21:13:06] ottomata: Ok... 2nd question, can you figure out why mysql won't start? 
[21:13:16] ottomata: it definitely shouldn't be reachable via the eqiad name [21:13:26] ok, good to know [21:13:58] if it is, then i have no idea what's going on [21:14:11] andrewbogott, need sudo [21:14:22] oops, one minute [21:19:15] ottomata: OK, can you sudo now? [21:19:38] nopers [21:20:03] well, dammit. I added you to sysadmin and netadmin and reran puppet... [21:20:22] Is there something else I need to do? [21:20:38] oops, new sudo features... [21:22:44] make sure ottomata's logging out of his shell before trying again... [21:23:10] old procs (including shells) won't pick up new group memberships [21:23:12] I think [21:25:17] oo, works now! [21:25:42] cool. [21:31:11] ottomata: so, any guesses what I broke and how I broke it? [21:32:12] apparmor is not happy…not sure why yet [21:32:19] not your fault though I think! [21:35:17] for some reason it is trying to create its .sock file in /run/mysql [21:35:23] it should be in /var/run/mysqld [21:36:05] hmm, or maybe that is my fault, still not sure.. [21:38:30] ahhhhh! [21:38:38] on labs, /var/run is symlinked to /run [21:38:40] hm [21:39:20] no idea why that would be [21:39:24] Ryan_Lane, you there? [21:39:35] hmm, actually, i'll go ask in @# labs [21:39:37] i think he's on a plane to berlin ? [21:39:43] ah makes sense [21:39:45] are you going to Berlin? [21:39:48] Leslie? [21:39:57] nope - doing aids lifecycle instead [21:40:03] Leslie is going to Los Angeles... the hard way. [21:40:04] www.tofighthiv.org/goto/leslie [21:40:06] yep [21:40:10] nice! [21:42:55] ok andrewbogott [21:43:14] in puppet [21:43:17] where you include the mysql class [21:43:21] try doing [21:43:48] socket => /run/mysqld/mysqld.sock, [21:43:48] pid_file => /run/mysqld/mysqld.pid, [21:43:51] so [21:44:03] those are params that your mysql class takes? [21:44:06] yes [21:44:36] * jeremyb is back, was on phone [21:44:50] err, and i don't even know where i'm typing! 
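Note on the socket path exchange above (21:35 to 21:43): the symptom described is that /var/run is a symlink to /run on the labs instance, so a socket configured under /var/run/mysqld really lives under /run/mysqld, and a confinement profile written against the configured path may not match. A minimal sketch of that idea, assuming nothing about the actual puppet class: `resolve_socket_path` is a hypothetical helper invented for illustration, and the directory names in the demo are throwaway temp paths, not the real host layout.

```python
import os
import tempfile


def resolve_socket_path(run_dir, service="mysqld"):
    """Build <run_dir>/<service>/<service>.sock using the symlink-free run_dir.

    Hypothetical helper: it only shows why a configured path and the real
    path can differ when the run directory is a symlink.
    """
    real_run = os.path.realpath(run_dir)  # follows symlinks such as /var/run -> /run
    return os.path.join(real_run, service, service + ".sock")


if __name__ == "__main__":
    # Recreate the layout in a temp dir: "run" is real, "var_run" links to it.
    with tempfile.TemporaryDirectory() as root:
        real = os.path.join(root, "run")
        link = os.path.join(root, "var_run")
        os.mkdir(real)
        os.symlink(real, link)

        configured = os.path.join(link, "mysqld", "mysqld.sock")
        print("configured:", configured)
        print("resolved:  ", resolve_socket_path(link))
        # Both spellings resolve to the same real location.
        print(resolve_socket_path(link) == resolve_socket_path(real))
```

Passing the resolved path explicitly (as with the socket and pid_file parameters suggested in the log) sidesteps the mismatch instead of relying on the symlink.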
[21:45:18] yeah, so, newer ubuntus are getting rid of /var/run [21:45:25] huh [21:45:25] which is where mysql used to keep it's socket files [21:45:36] and my puppet class assumed as much [21:46:19] specifying them manually will make the apparmor config file contain the real path, and hopefully be nicer to mysql [21:47:02] ottomata: https://gerrit.wikimedia.org/r/#/c/9342/ [21:47:18] what mysql package is this anyway? an asher build or from ubuntu or? [21:47:28] mysql? good Q [21:48:01] looks good [21:48:14] * jeremyb is about to amend andrewbogott [21:48:21] (9342) [21:48:25] Always have to run it once just to get the commas in the right place... [21:48:45] hah [21:48:54] took me a few secs to parse what you said [21:49:15] jeremyb: Can you really amend a different person's patch? [21:49:20] surely [21:49:21] I have no objection, just never tried. [21:51:32] ok, i gotta run for the day [21:51:38] I reviewed, but I don't ahve approval powers [21:51:55] ottomata: OK, thanks for your help! No doubt I will bother you more in the morning :) [21:52:01] please do :) [21:52:09] laters everyone! [21:52:09] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 24 seconds [21:52:18] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [21:53:24] andrewbogott: whatever this is includes base, right? [21:54:05] 'whatever this is' meaning the puppet manifest, or meaning the instance? [21:54:39] instance [21:54:42] catalog [21:55:26] why are you using the test branch anyway? (or is this not for analytics1001?) [21:57:31] nope, unrelated. [21:58:09] jeremyb: The big picture is, I want labs to have a one-click (or close to one-click) way to set up a single-host MW install. [21:58:30] I believe that all the labs instances get 'base' whether I like it or not. [21:59:06] labsmw.pp contains the top-most puppet class, the one I'm actually selecting for my tests. [22:01:34] sorry phone again. 
back in 2 mins [22:04:39] back [22:06:57] jeremyb: Are you about to amend that patch? Or should I go ahead and approve and test? [22:07:46] ugh, as soon as hang up I get another call! [22:07:57] but now i turned off my ringer ;P [22:08:13] andrewbogott: i'm planning to amend very shortly [22:08:22] 'k [22:09:27] will you test after i push? [22:10:39] by 'push' do you mean 'submit to gerrit'? [22:10:42] I mean, yes, either way. [22:11:03] well submit is a loaded term [22:11:04] ;( [22:11:17] (can mean submit for merge instead of "push for review") [22:11:35] Hm... I tend to think 'submit for review' vs. 'push to head' [22:11:40] So I guess they're opposites for me. [22:11:44] Anyway. [22:11:50] well go read some gerrit docs then ;) [22:12:00] I always use 'git review'. [22:12:00] or look at a failed merge after a submit [22:12:06] So I don't use 'push' unless I'm actually... pushing. [22:12:19] But I guess no one at wikimedia actually pushes these days. [22:12:30] and i don't do review, i just push [22:15:40] Right, but when you say 'push' you mean 'push to gerrit' right? This confuses me for some reason. [22:15:50] sometimes [22:15:53] push can be either [22:16:25] submit is clicking submit in the web UI or doing it with git review or some other tool but without sending any new blogs up to the server [22:18:12] No, I mean, /right now/ are you going to push to gerrit? [22:18:18] * aude confused [22:18:21] See, my confusion is justified! [22:18:57] hehe [22:19:08] yes [22:20:18] figuring out how to work with puppet and gerrit for maps stuff [22:20:26] on labs? [22:20:30] yes [22:20:38] for the hackathon [22:21:00] and wikidata [22:21:25] aude: The system right now is super stupid, paravoid is in the process of improving things. [22:21:32] but the whole workflow is confusing [22:21:36] But probably not before the hackathon. 
[22:21:38] andrewbogott: true [22:22:12] I think it will still be confusing post-paravoid-changes, but at least our dumb mistakes will be more private. [22:22:23] if we get going this weekend and then have something working for the next hackathon at wikimania for maps, i'll be happy [22:22:52] At the moment the process is very streamlined, it's just out of order. [22:23:01] i see [22:23:03] the changes are in btw :) [22:23:06] Because 'test' comes at the end. [22:23:09] Really? [22:23:12] oh yes [22:23:23] there are a few things missing, like reverting back to the central puppetmaster [22:23:28] Are they 'in' in some way that's invisible? Or could I be using a branch right now if I wanted? [22:23:36] nope, they're in the test branch [22:23:44] you add the puppetmaster::self class to your instance [22:23:49] then you get a local puppetmaster [22:24:06] cool! [22:24:09] then /etc/puppet/manifests is a symlink to /var/lib/git/operations/puppet/manifests [22:24:19] and any changes that you do there take immediate effect when you do puppetd -vt [22:24:54] I was hoping Ryan would have a more closer look [22:25:01] paravoid: no more puppetd and `puppet` no longer defaults to the apply face. if we ever get to telly that is [22:25:07] (rc is out) [22:25:08] then communicate it some more [22:25:23] by a page in labsconsole perhaps [22:25:23] New review: Reedy; "It's useful to mention any relevant bug reports in the commit summary" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/9333 [22:25:27] jeremyb: yeah, i know [22:25:32] paravoid: Makes sense. But I'm happy to be an early adoptee if you're ready for testers. [22:26:30] I am [22:26:31] please do! :) [22:26:47] andrewbogott: you want to test? [22:26:56] I should have said team rather than Ryan [22:26:56] maybe i need a better name for the var too [22:27:02] Is there any more documentation, other than 'add the puppetmaster::self class to your instance'? 
[22:27:04] or i failed lint [22:27:13] nope, that should be enough [22:27:16] And, paravoid, is the puppet master project-wide, or instance-local? [22:27:20] then cd /etc/puppet/manifests and hack away [22:27:23] instance local [22:27:33] ok, easy. [22:27:42] we can't do central per-instance/branch puppet master unless we fully migrate to module [22:27:45] modules [22:27:52] when we'll have modules, we can use environments [22:28:03] then we can use the branch->environment mapping that many people use [22:29:23] makes sense. [22:30:08] andrewbogott: ok passed lint now [22:30:31] jeremyb: Ultimately the {base::run} reference should go in generic::mysql::server and elsewhere... [22:30:43] but it's safer to test it in my isolated case for starters... [22:30:49] sure [22:32:34] oh, good puppet has an elsif [22:32:51] andrewbogott: does that mean it worked? [22:33:04] It's going to be a bit before I can tell. [22:33:30] Partly because your change is... not directly related to what I was trying to fix, so I have to sort out what's happening now. [22:34:33] My socket is still /var/run/mysqld/mysqld.sock but I don't know if that's because your patch is wrong or just because it's left over [22:37:29] jeremyb: OK, the path construction is working properly, modulo an extra / [22:37:42] It didn't actually make mysql work, but it's probably a step in the right direction. [22:38:32] * andrewbogott spins up yet another instance [22:39:00] heh. does that mean you just threw one away? [22:39:13] About to, yeah. [22:40:32] In theory, applying a puppet change to an existing instance should be the same as applying it to a fresh instance. [22:40:37] In practice, less so [22:41:16] probably wrong place to ask but when i look at https://labsconsole.wikimedia.org, the projects and instances links don't work for me :( [22:41:34] not just me i'm sure [22:41:35] all of them? [22:41:40] define don't work? 
[22:41:50] jeremyb: all of them [22:42:04] not the links on the side bar [22:42:15] 29 22:41:40 < jeremyb> define don't work? [22:42:17] the ones under usage [22:42:35] those used to work and was how i could see what projects there are [22:43:07] * jeremyb can repeat his question again ;) [22:43:08] * aude can only see maps and knows i'm member of bots and wikidata [22:44:00] aude: There's a link at the top to filter by project. [22:44:05] 'Show project filter' [22:44:10] * jeremyb waits for slow computer [22:44:12] It's not a bug, it's a feature :) [22:44:31] whoa, my first login since the token system [22:46:26] andrewbogott: it works for instances though i can only see instances for my projects [22:46:36] * aude can't see what other projects exist now [22:47:50] yeah, that's annoying [22:48:13] it's recent [22:48:42] The filter is an improvement but needs work. [22:48:45] query is [[Resource Type::project]] [22:49:00] + additional data to display [22:50:58] * jeremyb has edited [[user:aude]] ;) [22:51:40] heh [22:51:54] * aude can see https://labsconsole.wikimedia.org/wiki/Property:Project [22:51:59] then find projects that way [23:13:55] paravoid: so is this puppetmaster::self ready for general consumption? or at least beta testing? [23:14:17] paravoid: what about the state of the git repo? is test going away? [23:14:20] beta testing I'd say [23:14:22] no [23:15:19] k [23:15:33] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [23:15:33] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [23:15:33] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [23:15:33] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [23:57:33] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours