[00:00:55] TimStarling: so, any perf concerns from moving HTTP_X_Forwarded_Proto earlier in all.conf?
[00:01:17] TimStarling: so that it can be used in redirect targets for other conf files
[00:01:20] is that a gerrit change?
[00:01:24] not yet
[00:02:00] i made some other changes to the redirects system. escaping and a new line at EOF
[00:02:08] (those are already merged)
[00:02:20] (PS1) Lcarr: Adding in semicolon for proper ferm syntax [operations/puppet] - https://gerrit.wikimedia.org/r/107516
[00:02:46] you are referring to:
[00:02:46] RewriteRule . - [E=HTTP_X_Forwarded_Proto:%{HTTP:X-Forwarded-Proto}]
[00:02:46] RewriteCond %{ENV:HTTP_X_Forwarded_Proto} !=https
[00:02:47] RewriteRule . - [E=HTTP_X_Forwarded_Proto:http]
[00:03:02] ?
[00:03:13] yes
[00:03:41] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: Connection timed out
[00:03:46] jenkins you are so slow today..... a watched jenkins never code reviews
[00:03:47] well, this cannot be in the server context
[00:04:35] should i just copy it to other files?
[00:04:38] ok, well the docs say it can be in the server context, but how does that make sense?
[00:04:53] how not?
[00:05:01] (CR) Lcarr: [C: 2] Adding in semicolon for proper ferm syntax [operations/puppet] - https://gerrit.wikimedia.org/r/107516 (owner: Lcarr)
[00:05:17] btw, a use case: https://wikipedia.org/foo
[00:05:33] ah, here we go: "Note that rewrite configurations are not inherited by virtual hosts. This means that you need to have a RewriteEngine on directive for each virtual host in which you wish to use rewrite rules."
[00:05:43] hah
[00:05:43] makes more sense now, yes?
[00:06:15] so a RewriteRule in the server context won't be executed if the request matches a
[00:06:31] TimStarling: so what say you? copy it?
[00:06:37] yes, copy it
[00:08:49] hah, how old is this?
wwwportals.conf:31: RewriteRule ^/(mailman|pipermail)(.*)$ http://mail.wikipedia.org/$1$2 [R=301,L]
[00:15:11] PROBLEM - NTP on netmon1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[00:35:51] cajoel and i are looking at ntp on netmon1001
[00:39:06] !log on ssl1, testing CPU hotplug feature for possible power impact
[00:39:11] Logged the message, Master
[00:44:13] made no difference, for the record
[00:45:13] it's the thought, etc.
[00:50:11] (PS1) Aaron Schulz: Try increasing scap fanout to 40 [operations/puppet] - https://gerrit.wikimedia.org/r/107522
[00:57:11] TimStarling: Did you turn of NSA mode while you were there?
[00:57:17] *off
[00:58:52] don't be silly
[00:59:09] you know there's no way to turn off NSA mode without physically reflashing the RAC firmware
[01:00:05] (PS2) Ori.livneh: Try increasing scap fanout to 40 [operations/puppet] - https://gerrit.wikimedia.org/r/107522 (owner: Aaron Schulz)
[01:00:14] (CR) Ori.livneh: [C: 2 V: 2] Try increasing scap fanout to 40 [operations/puppet] - https://gerrit.wikimedia.org/r/107522 (owner: Aaron Schulz)
[01:01:30] at LCA there was a keynote by Matthew Garrett, he talked about boot security and how there's no way to tell whether or not you've been hacked by the NSA
[01:02:12] he said that if you want to work with something that you can trust completely, you should work with sheep or cattle, not computers
[01:07:38] Pfft
[01:07:45] I'm not sure you can trust cattle
[01:07:51] That can maul and trample you pretty easily
[01:08:38] maul?
[01:08:53] Not the right word..
[01:09:01] not the right cattle
[01:10:05] AaronSchulz: ran puppet on tin, if you want to test
[01:10:39] ok
[01:10:56] There's a word for it...
[01:11:01] Headbutted I guess explains it
[01:11:35] or gored
[01:11:50] What's the proper way to make tests for stuff that requires MediaWiki to be run by a web server?
[01:12:03] uh, wrong channel.
[01:12:21] I wasn't really meaning with horns
[01:13:11] RECOVERY - NTP on netmon1001 is OK: NTP OK: Offset -0.001911044121 secs
[01:21:12] springle: seem to be a lot of slow Block::newLoad queries on enwiki? Were some cold slaves brought in or something?
[01:21:23] I don't think that's a new query
[01:25:15] ok so lesliecarr and cajoel were working on ferm on netmon1001 to support pmacct, and the firewall changes blocked NTP for a while. for the moment i've manually added a rule to allow connections to udp dpt:123, looking at how to do this the Ferm Way..
[01:25:39] I would remove ferm
[01:25:42] it wasn't there before
[01:26:09] we added it as an attempt to manage the nat redirects
[01:26:21] (signing off for real now)
[01:26:51] AaronSchulz: yep, db1049 got hammered by those yesterday on warm up. it was missing an index. but in the process of fixing I also found bug 60035 helping to make it all slow
[01:27:19] jgage: hrm, it started blocking ?
[01:27:46] the default ruleset it loaded did not have a hole for udp:123
[01:27:53] so, yes
[01:28:21] :(
[01:29:19] springle: I can look at the query generation
[01:29:36] great!
[01:29:43] i didn't get a chance
[01:32:25] jgage: what do you think about adding udp123 as a default accept for ferm ?
[01:33:38] seems reasonable
[01:34:48] interesting that the input chain is default-deny, then allows all tcp, but udp only to 500.
[01:36:27] which is weird....
[01:38:52] (PS1) Lcarr: allow udp by default with ferm [operations/puppet] - https://gerrit.wikimedia.org/r/107532
[01:39:14] jgage: https://gerrit.wikimedia.org/r/#/c/107532/ ?
[01:41:12] (CR) Gage: [C: 2] allow udp by default with ferm [operations/puppet] - https://gerrit.wikimedia.org/r/107532 (owner: Lcarr)
[01:41:38] :)
[01:42:44] woot i puppet merged it
[01:43:16] cool
[01:43:50] IMHO that should have belonged in base::firewall.
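
A minimal sketch of why NTP stopped answering on netmon1001, assuming a simplified first-match packet filter with a default-deny policy. The `Rule` class, `decide` function and both rule lists are hypothetical illustrations and are not ferm syntax or the actual ruleset from the puppet change; only the "default-deny, all tcp allowed, no hole for udp:123" shape comes from the log.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """One hypothetical filter rule; dport=None matches any destination port."""
    proto: str
    dport: Optional[int]
    action: str


def decide(rules: List[Rule], proto: str, dport: int, default: str = "DROP") -> str:
    """Return the action of the first matching rule, else the chain default."""
    for rule in rules:
        if rule.proto == proto and rule.dport in (None, dport):
            return rule.action
    return default


# Assumed shape of the default ruleset: all tcp allowed, nothing for udp/123.
base_rules = [Rule("tcp", None, "ACCEPT")]
# Same ruleset with an explicit hole for NTP (udp destination port 123).
ntp_rules = base_rules + [Rule("udp", 123, "ACCEPT")]

print(decide(base_rules, "udp", 123))  # falls through to the default policy
print(decide(ntp_rules, "udp", 123))   # matched by the new udp/123 rule
```

Matching on the destination port keeps the hole narrow: other udp traffic still falls through to the default policy.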
[01:45:18] (PS1) Lcarr: dport is important [operations/puppet] - https://gerrit.wikimedia.org/r/107534
[01:45:50] ah
[01:46:04] we should have these things in one location ....
[01:46:09] that's it, i quit!
[01:46:18] heh
[01:46:24] (i am too easily amused because that's still not yet old)
[01:46:29] (CR) Lcarr: [C: 2 V: 2] dport is important [operations/puppet] - https://gerrit.wikimedia.org/r/107534 (owner: Lcarr)
[01:48:05] ok woo
[01:48:18] that is making puppet happy-ish
[01:48:23] still wants ip6table love
[01:53:38] looks like you just need to prepend domain (ip ip6)
[01:54:34] well... that sounds like what i need to do tomorrow ;)
[01:54:38] au revoir!
[01:57:19] grrrit-wm: OK for me to grab an LD for Roan for tomorrow? Single VE change, backport.
[01:57:21] Err.
[01:57:29] greg-g: OK for me to grab an LD for Roan for tomorrow? Single VE change, backport.
[01:57:35] * James_F sighs.
[02:08:46] James_F: A) what kind of change? :) link?
[02:08:48] email is good
[02:08:57] post 5pm ;)
[02:14:05] greg-g: It's a fix for a fatal JS error on save that occurs any time you edited a template
[02:14:18] I've been a bit BOLD and already put it on the [[Deployments]] page
[02:21:14] !log LocalisationUpdate completed (1.23wmf10) at Wed Jan 15 02:21:14 UTC 2014
[02:21:20] Logged the message, Master
[02:28:10] springle: any clue how to repro bug 60035?
[02:30:09] (PS2) Mattflaschen: Enable GuidedTour on translated languages [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105997 (owner: Reza)
[02:30:49] (CR) Mattflaschen: "I changed the URL in the commit message, since this is unrelated to pt-br."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105997 (owner: Reza)
[02:40:10] !log LocalisationUpdate completed (1.23wmf9) at Wed Jan 15 02:40:09 UTC 2014
[02:40:16] Logged the message, Master
[03:13:57] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jan 15 03:13:57 UTC 2014
[03:14:04] Logged the message, Master
[03:17:42] AaronSchulz: none. although your question on the IPs was interesting. commented on bug 60035
[03:43:35] (PS1) Springle: warm up db1040 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107542
[03:44:06] (CR) Springle: [C: 2] warm up db1040 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107542 (owner: Springle)
[03:44:15] (Merged) jenkins-bot: warm up db1040 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107542 (owner: Springle)
[03:45:09] !log springle synchronized wmf-config/db-eqiad.php 'warm up db1040'
[03:45:16] Logged the message, Master
[03:52:04] (PS7) Ori.livneh: Kibana puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/106169 (owner: BryanDavis)
[03:52:11] (CR) Ori.livneh: [C: 2 V: 2] Kibana puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/106169 (owner: BryanDavis)
[04:02:05] (PS1) Ori.livneh: Fix Kibana init.pp file location [operations/puppet] - https://gerrit.wikimedia.org/r/107544
[04:02:38] (CR) Ori.livneh: [C: 2 V: 2] Fix Kibana init.pp file location [operations/puppet] - https://gerrit.wikimedia.org/r/107544 (owner: Ori.livneh)
[04:06:54] springle: so will these 'contributions' servers have a very low 'load' value?
[04:07:43] that way they wouldn't be picked by wfGetDB() normally (often) unless there was the right $group...the low weights could still be adjusted relative to each other
[04:07:52] correct
[04:09:39] also, if it is sharded by user what about newbie contribs?
[04:10:39] i don't follow. you mean new user ids?
[04:10:42] meh, that had to filesort all results anyway
[04:10:52] so I guess that doesn't make much difference probably
[04:11:34] it's the "newest 1% of users" query (rev_user >= X ORDER BY rev_timestamp DESC)
[04:11:57] ah gotcha
[04:12:06] well that should be improved too
[04:12:30] most likely is that partitioning by range(rev_user) is best for biggest wikis
[04:12:52] so that would hit the last partition
[04:15:09] I assumed it was module...is the range partitioning well balanced?
[04:15:17] *modulo
[04:15:34] though range would be good for that specific query...not as much for the other ones by user
[04:17:37] i was testing hash & modulo earlier
[04:18:18] hash is nice and simple allowing good balancing for equality
[04:18:53] range is potentially more flexible but needs custom ranges for each wiki depending on the distribution of user_id activity
[04:19:59] although i find that mostly it's user ids < 100000 causing slow queries, maybe because of age or because it includes most bots. =0 is also high
[04:20:43] same trend for all the large wikis
[04:21:05] is hash just the modulo of the hash (for non-integers)?
[04:21:12] yes
[04:21:17] right
[04:21:30] RewriteRule ^/([a-z]{2}|meta)/(.*)$ %{ENV:HTTP_X_Forwarded_Proto}://$1.wikipedia.org/wiki/$2 [R=301,L,NE]
[04:21:42] TimStarling: no way to do that with the new redirects scheme, right?
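
A rough sketch of what the rewrite rule quoted just above computes, written as plain Python rather than Apache configuration. The function names are hypothetical; the pattern and target come from the quoted RewriteRule, and the protocol fallback mirrors the HTTP_X_Forwarded_Proto rules quoted at the start of the log. The NE (no-escape) flag is not modelled here.

```python
import re
from typing import Optional, Tuple

# Same pattern as the quoted RewriteRule: a two-letter code or "meta", then the rest.
PATTERN = re.compile(r"^/([a-z]{2}|meta)/(.*)$")


def forwarded_proto(header_value: Optional[str]) -> str:
    """Mimic the env-var rules: anything other than exactly 'https' becomes 'http'."""
    return "https" if header_value == "https" else "http"


def redirect(path: str, header_value: Optional[str] = None) -> Optional[Tuple[int, str]]:
    """Return (status, location) when the rule would fire, else None."""
    match = PATTERN.match(path)
    if match is None:
        return None
    proto = forwarded_proto(header_value)
    return 301, f"{proto}://{match.group(1)}.wikipedia.org/wiki/{match.group(2)}"


print(redirect("/en/Foo", "https"))
print(redirect("/meta/Main_Page"))
print(redirect("/wiki/Foo"))
```

The target host depends on a capture group from the request path, which is the part that a static source-to-destination mapping cannot express.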
[04:24:34] you're referring to a rewrite rule which I moved out of redirects.conf specifically because there was no way to do it in the new redirects scheme
[04:27:12] good answer, danke :)
[04:27:40] (PS1) Springle: db1040 to full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107546
[04:29:11] (CR) Springle: [C: 2] db1040 to full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107546 (owner: Springle)
[04:29:19] (Merged) jenkins-bot: db1040 to full steam [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107546 (owner: Springle)
[04:30:16] !log springle synchronized wmf-config/db-eqiad.php 'db1040 to full steam'
[04:30:23] Logged the message, Master
[04:38:37] TimStarling: maybe getReaderIndex shouldn't set mReadIndex if $group was set?
[04:39:24] that sounds broken
[04:40:34] paravoid: /srv/deployment/librenms is owned by you, should be 'sartoris'
[04:40:41] aka git-deploy aka trebuchet aka trebuchet-deploy
[04:40:54] it should only be set if both $group and $wiki were false
[04:41:29] $this->mLoads[$i] > 0
[04:41:44] that condition is probably the reason it doesn't break everything completely
[04:44:09] Is git-deploy interactive?
[04:44:23] no
[04:44:51] We have an interactive deploy script at work. It's pretty nice.
[04:47:46] Gloria: interactive in which way?
[04:48:29] Ryan_Lane: it muders people that break the deployment
[04:48:34] *murders
[04:48:47] the CI system should do that, not the deployment system ;)
[04:48:54] Ryan_Lane: It gives a prompt asking "where do you want to deploy to?"
[04:49:02] And you select S for staging, L for live, Q for quit.
[04:49:10] Then it gives a list of repos and says "which repo do you want to push?"
[04:49:18] And you type in the repo name (with tab-completion).
[04:49:39] are you likely to be eaten by a grue?
[04:49:49] Then it clones the repo, does a git log diff, says "these are the users to be contacted if something goes wrong".
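
A minimal sketch of the prompt flow Gloria describes (S for staging, L for live, Q for quit, then a repo name with completion). Everything here is hypothetical: the function names, the repo list and the prefix matching stand in for a script that is not shown in the log, and this is not how git-deploy or Trebuchet works.

```python
from typing import List, Optional

# Menu keys as described in the chat: S for staging, L for live, Q for quit.
TARGETS = {"s": "staging", "l": "live"}


def parse_target(answer: str) -> Optional[str]:
    """Map a menu answer to a deploy target; None means the user chose to quit."""
    key = answer.strip().lower()
    if key == "q":
        return None
    if key not in TARGETS:
        raise ValueError(f"unrecognised choice: {answer!r}")
    return TARGETS[key]


def complete_repo(prefix: str, repos: List[str]) -> List[str]:
    """Crude stand-in for tab-completion: repo names starting with the prefix."""
    return sorted(name for name in repos if name.startswith(prefix))


# Hypothetical repo names, used only to exercise the helpers.
repos = ["kibana", "librenms", "logstash"]
print(parse_target("S"))
print(complete_repo("l", repos))
```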
[04:49:51] git deploy knows what you're trying to deploy, since it runs from within a repo
[04:50:22] ori: Not usually.
[04:50:23] I'd like to add stage selection (git deploy canary)
[04:50:43] Well, everything gets pushed to staging first.
[04:50:45] Which is nice.
[04:50:53] Wikimedia would probably benefit from that.
[04:50:55] Gloria: you should have it say "You are unlikely to be eaten by a grue", then
[04:51:04] Gloria: it actually wouldn't much
[04:51:12] That's basically what Beta Labs is.
[04:51:18] So I imagine it would. ;-)
[04:51:27] that's already automated
[04:51:35] but it runs from master
[04:51:39] Right.
[04:51:59] test.wikipedia.org would be a canary
[04:52:12] rather than beta
[04:52:15] ori: I got a log added so that I could tail it on IRC.
[04:52:37] Ryan_Lane: Yes, I suppose you're right.
[04:52:54] I don't deploy, so it makes little difference to me. The process never sounded very enjoyable, though.
[04:52:57] Ryan_Lane: https://dpaste.de/zHAw/raw everything ran successfully....nothing got deployed?
[04:53:22] 2 minions pending (2 reporting)
[04:53:24] pending
[04:53:41] the reporting could be clearer
[04:53:56] well
[04:53:57] # NOTE : Sync ran ok! Everything looks good! Automatically finishing
[04:54:00] ah, I should add a troubleshooting section for checking if minions are pinging
[04:54:03] # YAY : 'sync' for 'kibana/kibana' completed successfully (now at kibana/kibana-20140115-044735)
[04:54:11] ori: that's due to the perl git-deploy's stupidity
[04:54:14] i usually take 'YAY' in all caps to be an indication of success
[04:54:34] What about 'NOTE' and 'INFO' and 'USER'?
[04:54:51] all of that output is crap
[04:55:03] NOTE: all of that output is crap *
[04:55:14] Or maybe that'd be INFO:.
[04:55:26] we need to upgrade salt
[04:55:48] there's some bug with grains and subscription
[04:55:57] salt -G 'deployment_target:kibana' test.ping <-- not pinging
[04:58:02] ori: ok, I restarted salt on the minions, it'll deploy now
[04:58:09] adding to the troubleshooting section
[04:58:30] YAY
[04:58:37] oh, it was already there: https://wikitech.wikimedia.org/wiki/Trebuchet#Initial_fetches_are_failing_.28minions_forever_pending.29
[04:58:49] Wait, trebuchet wasn't a joke?
[04:59:03] trebuchet is the name of the system
[04:59:24] I thought that was sartoris. :-/
[04:59:28] git deploy is an interface to it. it'll soon be replaced by trebuchet trigger: https://github.com/trebuchet-deploy/trigger/
[04:59:36] sartoris never existed on wikimedia servers
[04:59:48] We shall never speak of it again.
[05:00:08] it was a rewrite of the perl git deploy in python
[05:00:45] it was gpl licensed, but trebuchet is apache2 licensed, so I wrote trigger from scratch
[05:03:01] wait, trebuchet trigger?
[05:03:10] ryan!
[05:03:16] you spent your name quota!
[05:03:32] trigger is a part of trebuchet ;)
[05:03:37] it's the frontend
[05:03:45] trebuchet turtle
[05:03:49] It's the recursive potion.
[05:03:53] the backend is trebuchet
[05:03:56] portion *
[05:04:03] Typos, please stop ruining my jokes.
[05:04:10] Gloria: :)
[05:05:09] ori: heh. I send you a link to that ages ago, I'm surprised you didn't call me out on that then :)
[05:05:13] *sent
[05:07:11] puppet runs trebuchet trigger as sartoris to push to salt minions
[05:07:17] any questions?
[05:07:20] :)
[05:07:26] I'll rename the user to trebuchet ;)
[05:07:42] and also, the name of trigger doesn't matter. users will just know "git deploy "
[05:07:47] i love giving you a hard time about this, i don't know why
[05:08:00] it's not my fault it's been renamed so many times :(
[05:08:14] we should make it a tradition
[05:08:16] !blame
[05:08:19] rename it every month
[05:08:23] hahaha. no thanks.
I'm the upstream now
[05:08:50] do you know how much of a pain in the ass it is to set everything up? github, pypi, launchpad, etc, etc
[05:10:52] is brewster no longer the apt host?
[05:11:32] it isn't?
[05:11:36] paravoid: who in the world blocked access to brewster from iron?
[05:11:42] it's *killing* me
[05:12:45] !log adding git-python to the apt repo. it comes from the saltstack ppa.
[05:12:52] Logged the message, Master
[05:13:00] !log make that python-git
[05:13:06] Logged the message, Master
[05:18:28] (PS2) Jeremyb: clean up chapters redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106107
[05:18:30] (PS2) Jeremyb: rm trailing slash from destinations where unneeded [operations/apache-config] - https://gerrit.wikimedia.org/r/106110
[05:18:32] (PS2) Jeremyb: final (I hope!) fix for protorel redirects [operations/apache-config] - https://gerrit.wikimedia.org/r/106109
[05:18:40] (PS2) Jeremyb: move {jobs,careers}.wikimedia.org to redirects.dat [operations/apache-config] - https://gerrit.wikimedia.org/r/106108
[05:20:06] (PS6) Ori.livneh: Proxy logstash.wikimedia.org via misc varnish cluster [operations/puppet] - https://gerrit.wikimedia.org/r/106170 (owner: BryanDavis)
[05:21:41] (CR) Ori.livneh: "Faidon, is this OK for now?" [operations/puppet] - https://gerrit.wikimedia.org/r/106170 (owner: BryanDavis)
[05:24:57] could use some QA on those changes, particularly https://gerrit.wikimedia.org/r/106109
[05:26:01] * jeremyb will poke Jeff tomorrow but maybe i can also get some extra sanity checking
[05:26:27] mutante: ^
[05:26:39] and of course TimStarling :-)
[05:26:43] * jeremyb sleeps
[05:27:34] (/me already nominated Gloria for review :P)
[05:30:16] (PS7) BryanDavis: Proxy logstash.wikimedia.org via misc varnish cluster [operations/puppet] - https://gerrit.wikimedia.org/r/106170
[05:30:38] (CR) Jeremyb: "hello world?"
[operations/puppet] - https://gerrit.wikimedia.org/r/94111 (owner: Jeremyb)
[05:43:34] (CR) BryanDavis: Proxy logstash.wikimedia.org via misc varnish cluster (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/106170 (owner: BryanDavis)
[07:00:18] gerrit is... very... slow , both git review and the web interface
[07:00:52] It's rebelling for having to work after 5pm
[07:01:11] ori restarted gerrit a few days ago for it being too damn slow
[07:01:13] I think
[07:09:56] it's awful, isn't it?
[07:10:21] gerrit?
[07:10:22] yeah
[07:10:27] who would've thunk
[07:10:34] the performance regression, i mean
[07:10:38] it's semi-unusable
[07:10:45] it's as slow as ever for me
[07:10:58] I thought it was for everyone until I tried to use it from SF, and boy did that make a difference
[07:57:49] morning
[08:10:07] morning paravoid
[08:14:20] (PS2) Faidon Liambotis: RT 3645 - give iodine a predictable IPv6 address [operations/puppet] - https://gerrit.wikimedia.org/r/94111 (owner: Jeremyb)
[08:14:26] (PS3) Faidon Liambotis: Give iodine a predictable IPv6 address [operations/puppet] - https://gerrit.wikimedia.org/r/94111 (owner: Jeremyb)
[08:14:35] (CR) Faidon Liambotis: [C: 2] Give iodine a predictable IPv6 address [operations/puppet] - https://gerrit.wikimedia.org/r/94111 (owner: Jeremyb)
[08:15:10] looks like beta labs is down
[08:16:07] (PS8) Faidon Liambotis: Proxy logstash.wikimedia.org via misc varnish cluster [operations/puppet] - https://gerrit.wikimedia.org/r/106170 (owner: BryanDavis)
[08:16:14] (CR) Faidon Liambotis: [C: 2] Proxy logstash.wikimedia.org via misc varnish cluster [operations/puppet] - https://gerrit.wikimedia.org/r/106170 (owner: BryanDavis)
[08:17:52] (PS4) Faidon Liambotis: Add logstash.wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/105105 (owner: BryanDavis)
[08:18:02] (CR) Faidon Liambotis: [C: 2] Add logstash.wikimedia.org [operations/dns] -
https://gerrit.wikimedia.org/r/105105 (owner: BryanDavis)
[08:25:28] (CR) Physikerwelt: added basic hbase support (3 comments) [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381 (owner: Physikerwelt)
[08:28:44] (PS1) Faidon Liambotis: Revert "Initial commit of pmacct module" [operations/puppet] - https://gerrit.wikimedia.org/r/107550
[08:32:30] (PS2) Faidon Liambotis: Revert "Initial commit of pmacct module" [operations/puppet] - https://gerrit.wikimedia.org/r/107550
[08:32:48] (CR) Faidon Liambotis: [C: 2] Revert "Initial commit of pmacct module" [operations/puppet] - https://gerrit.wikimedia.org/r/107550 (owner: Faidon Liambotis)
[08:38:40] (PS5) Physikerwelt: added basic hbase support [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381
[08:40:39] (CR) Physikerwelt: [C: -1] "I did not test the recent changes... I'll come back to this later (i.e. my environment is a little bit different since I'm using vagrant)" [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381 (owner: Physikerwelt)
[08:42:08] (CR) Faidon Liambotis: [C: -1] "One real easy to fix comment (breakage), plus two very OCD comments." (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 (owner: Mwalker)
[08:45:23] (CR) Physikerwelt: "I'm not sure about the default values." (9 comments) [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/99381 (owner: Physikerwelt)
[08:46:29] (PS1) Faidon Liambotis: Give Nik & Chad root access to lucene search boxes [operations/puppet] - https://gerrit.wikimedia.org/r/107551
[08:47:35] (CR) Faidon Liambotis: [C: 2] Give Nik & Chad root access to lucene search boxes [operations/puppet] - https://gerrit.wikimedia.org/r/107551 (owner: Faidon Liambotis)
[08:53:50] (CR) Faidon Liambotis: [C: -1] "I'm afraid we can do none of that at all.
We don't do git clones on production boxes, especially from untrusted sources (GitHub) & we don'" [operations/puppet] - https://gerrit.wikimedia.org/r/96552 (owner: Addshore)
[09:01:54] paravoid: regarding your comments on my patch; the path /srv/ocg/tmp wont have parents; but can I assume (because of trebuchet) that /tmp/deployment/ocg will be there for subfolders (and that will be safe?)
[09:02:48] !log jenkins: python-git package receiving an "update" 0.3.2~RC1-1~precise1 -> 0.3.2.RC1-1 (same version assuming we repackaged it somehow)
[09:02:54] Logged the message, Master
[09:05:26] (CR) KartikMistry: "Nitpick in changelog." (1 comment) [operations/debs/vips] - https://gerrit.wikimedia.org/r/102617 (owner: coren)
[09:08:05] !log jenkins: restarting Zuul to make sure it uses the new python-git
[09:08:12] Logged the message, Master
[09:24:12] (CR) Hashar: [C: -1] "What Faidon said, plus some other nitpicks :-]" (7 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/96552 (owner: Addshore)
[09:24:26] paravoid: and it is not like I told Wikidata team to NOT use github :-D
[09:25:25] the good thing is that addshore is friendly and willing to comply
[09:26:17] (PS1) Spage: Enable Flow on meta and enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107553
[09:30:47] (CR) Spage: "I think this could go out any time, but we definitely want it for the Thursday update of phase2 wikis to 1.23wmf10."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/107553 (owner: Spage)
[09:42:31] (PS1) Matanya: gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555
[09:43:15] (CR) jenkins-bot: [V: -1] gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya)
[09:49:02] mwalker: I'm talking about /srv/ocg itself
[09:49:56] (PS2) Matanya: gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555
[09:57:11] jeremyb: i see requestors there
[10:02:24] (PS1) Ori.livneh: Kibana: use a plain block for /status [operations/puppet] - https://gerrit.wikimedia.org/r/107556
[10:02:52] (CR) Ori.livneh: [C: 2 V: 2] Kibana: use a plain block for /status [operations/puppet] - https://gerrit.wikimedia.org/r/107556 (owner: Ori.livneh)
[10:03:34] paravoid: i merged your sudo for chad / nik change
[10:03:41] oh did I forget to do that
[10:03:44] thanks
[10:04:01] root@tungsten:~# ls -lah /var/lib/graphite/search_index
[10:04:01] -rw------- 1 _graphite _graphite 2.5K Dec 12 20:53 /var/lib/graphite/search_index
[10:04:07] tungsten's broken in this sense
[10:04:20] I wonder how it works right now
[10:04:22] that's not the real location, is it?
[10:04:35] i think i have it in /var/lib/carbon
[10:04:43] i was just nervous about deleting that one
[10:04:50] oh, right
[10:05:03] but it still only works by dint of an explicit chmod
[10:05:15] so you're right that that needs a fix, i haven't forgotten
[10:05:30] yeah, /usr/bin/graphite-build-search-index is locally modified
[10:05:31] (CR) Dzahn: download: lint clean (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107341 (owner: Matanya)
[10:06:57] it was elves
[10:07:12] pretty sure it was elves
[10:07:24] your logstash change doesn't make any sense
[10:07:26] i did revert the other one
[10:07:33] (or i'm missing something)
[10:08:21] well, tried logstash.wikimedia.org, varnish 503
[10:08:31] well, it's a right change, it just won't fix the effect you're seeing
[10:08:44] varnishlog on cp1043 says backend is sick
[10:08:51] it 404s
[10:08:52] curl /status, get nada
[10:09:09] GET /status HTTP/1.1
[10:09:09] Host: logstash.wikimedia.org
[10:09:09] HTTP/1.1 404 Not Found
[10:09:37] try again
[10:09:47] well i haven't done logstash1003 yet
[10:09:49] but the other two work
[10:09:57] (PS2) Matanya: download: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107341
[10:10:29] (CR) Dzahn: [C: 1] cpufrequtils: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107346 (owner: Matanya)
[10:12:05] you only tell me when i'm wrong, paravoid :P
[10:12:14] hm?
[10:12:21] it works now
[10:12:32] by dint of my change which doesn't make any sense and won't fix the effect
[10:12:39] yes, it works, you're right
[10:13:08] i'm not sure why, though
[10:13:20] missing RewriteEngine on?
[10:13:26] does LocationMatch require it?
[10:13:34] no, it shouldn't
[10:13:41] no
[10:14:37] anyways, i'm sufficiently embarrassed to go clean up the search index file permissions thing for good, bbiaf
[10:14:55] ori: you should go to bed
[10:15:05] it's harder to type in bed
[10:15:46] (CR) Dzahn: "inline comment and it should have a role class where you set "git.wikimedia.org" and then pass it as a variable to be used in init.pp and " (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya)
[10:16:05] I suspect the regexp is already bound
[10:16:08] and ^ $ don't work
[10:16:20] not that this makes much sense
[10:16:48] mutante: there is a role already
[10:19:39] (CR) Matanya: "the role class is in manifests/role/gitblit.pp and has the system role there. should i move to init.pp?" [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya)
[10:21:32] (CR) Dzahn: download: lint clean (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107341 (owner: Matanya)
[10:22:46] paravoid: graphite-build-search-index is pretty annoying
[10:23:06] it uses mktemp and then mvs it into place so it's not world-readable
[10:23:26] we didn't have a reason to NOT have NRPE on virt0, did we?
[10:23:33] and it chowns it to _graphite:_graphite but it makes sense to run carbon as _graphite and graphite-web as www-dev
[10:24:31] you don't want graphite-web to be tampering with whisper files ever, so it makes sense to run it as a different user than carbon
[10:24:41] (CR) Dzahn: "ah, ok.
no the system role should stay in role class, but the apache site and cert name should be set there instead of inside the module" [operations/puppet] - https://gerrit.wikimedia.org/r/107555 (owner: Matanya)
[10:29:19] (PS1) Faidon Liambotis: kibana: redirect to HTTPS [operations/puppet] - https://gerrit.wikimedia.org/r/107557
[10:29:29] (PS3) Matanya: download: lint clean [operations/puppet] - https://gerrit.wikimedia.org/r/107341
[10:29:43] (CR) Dzahn: [C: 1] nrpe: enable on virt0 [operations/puppet] - https://gerrit.wikimedia.org/r/107424 (owner: Gage)
[10:30:50] (CR) Dzahn: [C: 2] "thx for solving the path conflict, and fixes the feed" [operations/puppet] - https://gerrit.wikimedia.org/r/107376 (owner: Nemo bis)
[10:31:22] (CR) Dzahn: "what, "needs verified"? it already is..sigh" [operations/puppet] - https://gerrit.wikimedia.org/r/107376 (owner: Nemo bis)
[10:32:55] (PS3) Matanya: gitblit: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/107555
[10:35:44] (CR) Dzahn: "< mutante> did we have any reason to NOT have NRPE on virt0? ..."
[operations/puppet] - https://gerrit.wikimedia.org/r/107424 (owner: Gage)
[10:36:52] (CR) Dzahn: "need to check firewalling before this (ferm) and/or maybe putting that on the replacement host right away" [operations/puppet] - https://gerrit.wikimedia.org/r/107424 (owner: Gage)
[10:37:28] (CR) Dzahn: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/107376 (owner: Nemo bis)
[10:38:58] hashar: 107376 says it's Verified by jenkins but also "needs verified", what am i missing
[10:39:34] touches an .erb
[10:40:43] (CR) Faidon Liambotis: [C: 2] kibana: redirect to HTTPS [operations/puppet] - https://gerrit.wikimedia.org/r/107557 (owner: Faidon Liambotis)
[10:44:48] (PS1) Faidon Liambotis: kibana: enable Apache2 rewrite module [operations/puppet] - https://gerrit.wikimedia.org/r/107558
[10:45:03] (CR) Faidon Liambotis: [C: 2] kibana: enable Apache2 rewrite module [operations/puppet] - https://gerrit.wikimedia.org/r/107558 (owner: Faidon Liambotis)
[10:45:17] (CR) Faidon Liambotis: [V: 2] kibana: enable Apache2 rewrite module [operations/puppet] - https://gerrit.wikimedia.org/r/107558 (owner: Faidon Liambotis)
[10:45:31] (CR) Dzahn: [C: 1] "heh, a 2-spaces -> 4-spaces lint, +hashar" [operations/puppet] - https://gerrit.wikimedia.org/r/107347 (owner: Matanya)
[10:49:55] (CR) Dzahn: [C: 1] dynamicproxy: lint clean (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/107339 (owner: Matanya)
[10:50:11] (CR) Matanya: "I would change the package declarations to be more consistent, but not without input from hashar.
e.g" [operations/puppet] - https://gerrit.wikimedia.org/r/107347 (owner: Matanya)
[10:56:00] (CR) Dzahn: "i think he does that to be prepared to add more packages or had more than 1 here before" [operations/puppet] - https://gerrit.wikimedia.org/r/107347 (owner: Matanya)
[10:59:38] (CR) Hashar: "That is the reason indeed, so adding a new package would just mean inserting a single line." [operations/puppet] - https://gerrit.wikimedia.org/r/107347 (owner: Matanya)
[11:04:31] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 69 statistics
[11:05:04] mutante: https://gerrit.wikimedia.org/r/#/c/107376/ confuses me, now V+1 but still no merge
[11:05:27] Nemo_bis: me too, i asked hashar
[11:05:31] RECOVERY - MySQL Processlist on db1006 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 2 statistics
[11:12:57] Nemo_bis: it does +1, but we want it to +2, why it doesnt do that, we have been asked to file a bug now:)
[11:14:49] Nemo_bis: oh, you know, i think i remember now
[11:15:03] Nemo_bis: it's because you made the change and you are not in that trusted regex
[11:15:19] at least related to that, we'll find out
[11:21:11] (PS19) Mwalker: Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352
[11:24:08] mutante: nope, Nemo_bis isn't there.
[11:27:35] twkozlowski: is this the exact same thing then ?
[11:27:46] https://bugzilla.wikimedia.org/show_bug.cgi?id=60082
[11:27:58] the one we had with one of your changes once
[11:28:47] so there are 2 options, add to regex or manual verify to merge
[11:29:50] matanya must be on it or it would happen on other puppet changes too
[11:40:06] what mutante ?
[11:41:20] yes, i'm on the list with two email accounts [11:41:46] matanya: yea, that makes jenkins give +2 on your changes [11:41:53] makes sense then if nemo isnt [11:45:18] mutante: https://gerrit.wikimedia.org/r/#/c/102460/2/layout.yaml <-- for example [11:45:45] it's so readable:) [11:46:08] yea, i remember we added odder to that [11:48:53] (03CR) 10Faidon Liambotis: "Ping?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [11:51:18] mutante: yyh, sorry [11:51:30] mutante: yeah, that's the thing exactly [11:51:42] (03CR) 10Dzahn: [V: 032] [Planet] Fix Guillaume's blog feed URL [operations/puppet] - 10https://gerrit.wikimedia.org/r/107376 (owner: 10Nemo bis) [11:53:01] twkozlowski: yep, and as matanya pointed out people always just verified his changes manually and quite a few in gerrit history [11:55:25] we can ask hasharAway to add him when he is back [11:58:59] (03CR) 10Dzahn: [C: 031] rm trailing slash from destinations where unneeded [operations/apache-config] - 10https://gerrit.wikimedia.org/r/106110 (owner: 10Jeremyb) [12:09:13] mutante: thanks :) [12:10:47] guillom: welcome, i had noticed it was broken and others found the correct URL:) [12:11:21] mutante: yes, Nemo_bis was kind enough to fix it :) [12:16:33] so many pings for a 5-characters change :D [12:40:52] (03CR) 10Hashar: [C: 032] adding '*.openbeelden.nl' to the wgCopyUploadsDomains array. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107138 (owner: 10Dan-nl) [12:41:02] (03Merged) 10jenkins-bot: adding '*.openbeelden.nl' to the wgCopyUploadsDomains array. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107138 (owner: 10Dan-nl) [12:42:26] !log hashar synchronized wmf-config/InitialiseSettings.php 'adding '*.openbeelden.nl' to the wgCopyUploadsDomains array. 
{{gerrit|107138}}' [12:42:33] Logged the message, Master [12:52:19] (03PS1) 10Matanya: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 [13:15:04] (03PS2) 10Matanya: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 [13:16:08] (03PS3) 10Matanya: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 [13:45:29] mutante: are you managing the tickets related to pmtpa ? [13:45:46] found out a proxy named url-downloader.wikimedia.org which is in tmpta [13:46:25] I am not even sure why we have a dedicated proxy though since we have brewster/carbon [13:46:52] because the site depending on an arbitrary proxy on brewster is such a great idea? :) [13:47:27] so you got one dedicated to ensure site is not disrupted whenever the shared proxy is down / tweaked / under maintenance ? [13:47:27] hashar: you can open a ticket for that [13:47:40] that and it is a lot more restricted [13:47:46] matanya: IRC has a better SLA :-] [13:47:59] mark: make sense :D thx [13:48:04] hashar: only if mark is around [13:48:35] matanya: mark has tweaked his IRC client to trigger a fire alarm in his house whenever I ping him over IRC. [13:48:44] hashar: 4075 Unable to use url-downloader to access HTTPS sites [13:48:48] (or I guess he did since he is very responsive) :D [13:48:58] and it's mentioned on 6157 old DNS servers dobson and linne [13:49:12] linne is a Wikimedia Upload-by-URL proxy (misc::url-downloader). [13:49:23] yeah that is it [13:49:29] so it is tracked in RT Thanks mutante! [13:49:30] then 6157 [13:49:43] linked to 6099: (Daniel Zahn) what's left in Tampa [new] [13:49:47] yep, it is [13:50:28] should it be a module mutante ? [13:52:07] matanya: url-downloader? 
eh, only if we keep using it, see comments above if it even makes sense i dunno right now [13:52:28] i'll leave as it is now [13:55:26] yes it makes sense, yes we keep using it [13:55:57] ok, well then i guess it should also be a module [13:55:59] matanya: [13:56:27] will do it soon. the review queue is huge though :) [13:56:36] i know:) thx [13:56:59] only 15 [13:58:39] yea, other ops join in on matanya's lint changes if you like [14:00:41] * mark hates massive lint changes though [14:01:10] almost as much as I hate puppet lint itself [14:01:48] well yea, now they are many changes because we said they shouldn't be too massive [14:02:20] not sure either, but it's well-meaning [14:02:29] it's certainly well-meaning [14:04:23] it's complicated, sometimes "sneaking them in" is appropriate, sometimes people rightfully say they should not combine lint and operational changes [14:07:02] mutante: here you go, minor stuff [14:07:21] mark: you can review my modules instead if you wish :) [14:08:10] matanya: thanks [14:12:02] (03CR) 10Mark Bergsma: etherpad: convert into a module (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [14:12:53] all the webserver stuff is a bit controversial [14:13:05] I think I like the newer webserver::apache stuff if we make it a bit better [14:13:33] having to supply a full apache config file for each and every vhost, as in the old system, is ridiculous [14:13:38] but he puppet labs apache module sucks too [14:13:56] so webserver::apache attempts to cover 90% of cases, and we can make it flexible where needed [14:14:00] mark: i spoke with andrewbogott on writing a whole new webserver module [14:14:48] how will it be different? 
[14:15:03] based on webser::apache and keeping apache::mods and expending it as well as apache::vhosts [14:16:00] that makes no sense [14:16:00] so far when moving from misc to modules i put the apache template into /modules/foo/templates/apache/ so each module has the config as mark said, but still one step above from having them all in /templates/apache/sites and there might even still be /files/apache/sites [14:16:15] they're different concepts, so mixing them doesn't work [14:16:50] i wasn't clear mark [14:17:13] i mean have a module webserver that will hold only apache config [14:17:29] the sites should go into their appropriate module [14:17:38] what sites? [14:17:47] e.g. git.wikimeida.org [14:18:11] belongs to gitblit module [14:18:18] yeah, but what does that mean? [14:18:26] do you want to put the vhost apache config file for gitblit in the gitblit module? [14:18:55] it makes some sense to me, but i might be worng here [14:19:04] well [14:19:10] that's not really better than what we have been doing so far [14:19:13] what I would like to see [14:19:21] is no having to write an apache vhost config for each and every site again [14:19:27] because they're 90% the same [14:19:40] so a long time ago I experimented with webserver::apache:: and subclasses [14:19:45] which handle most cases for you [14:20:02] i use apache_site{} but what it does is just the symlink from sites-available to sites-enabled, so maybe that could do more than that and also create the config [14:20:02] and I actually think most sites we have should use something like that instead of shipping a config file [14:20:20] so you want a generic apache::vhost that will handle most of this? 
[14:20:25] that already exists [14:20:41] its webserver::apache::site [14:20:48] it's just not tested and used a lot yet [14:20:55] ah [14:20:59] i believe only torrus, smokeping and librenms use it [14:21:02] and we should turn it into a proper module etc [14:21:04] i see your point [14:21:16] added to the todo list [14:21:18] unless we find severe problems with it [14:21:44] so need to ditch puppetlabs apache module, which mostly sucks [14:21:50] we should ALSO have a way to supply a full config file (as we always do now) for the few cases where we need to [14:22:05] yes, and also ditch the apache_site stuff and all that in webserver.pp which is NOT webserver::apache [14:22:10] except that is used all over the place right now [14:22:17] heh, indeed [14:22:28] that is webserver::apache2 classes [14:22:28] one goal of webserver::apache is also [14:22:34] the ability to mix multiple services on one machine [14:22:41] so if service A needs the php module for apache [14:22:45] and service B needs it too [14:22:53] they both can declare that in their respective modules without it conflicting [14:23:34] the holy grail [14:23:37] that would be very good because we run into this all the time, a class that setups up Apache in some way is ok, _until_ we try to combine it with another existing one on the same host [14:23:41] in various ways [14:23:41] hopefully that already works but it's not tested well yet [14:24:35] webserver::apache2::site also makes e.g. 
SSL easy [14:24:43] you can just "turn on" SSL and it'll do the config for you [14:24:49] instead of having to copy the whole vhost config etc [14:24:58] also, stuff like putting NameVirtualHost *:443 into ports.conf globally and not repeating it in the site configs ,etc [14:25:05] yes [14:25:07] is an annoyance currently [14:25:11] all that is retarded right now [14:25:55] that is a big task, but i'll try to work it out [14:26:05] well [14:26:07] instead of doing this [14:26:17] perhaps just first convert a few (simple) services that now use apache_site and the like [14:26:21] to use webserver::apache2 instead [14:26:27] it doesn't matter yet that it's not a module yet [14:26:33] let's first see if it works well and what we need to change about it [14:26:52] (03PS1) 10Hashar: beta: pull VisualEditor individually [operations/puppet] - 10https://gerrit.wikimedia.org/r/107575 [14:26:57] turning it into a module is a bit of copy paste, that's not the issue ;) [14:27:11] torrus now uses webserver::apache2 [14:27:14] libenms and smokeping too [14:27:16] mark: do that was my initial plan [14:27:18] but nothing else yet I believe [14:28:14] (03CR) 10Hashar: [C: 04-1] "The VisualEditor has to be unregistered from mediawiki/extensions.git first which is change https://gerrit.wikimedia.org/r/#/c/107574/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107575 (owner: 10Hashar) [14:28:31] ok, so for now: doing the url-download module, fixing upon your comments the etherpad module and start fixing webserver [14:28:41] i'll be busy in the forseenable time [14:29:08] let me know if I can help [14:32:20] will the url-downloder be converted to varnish too? [14:32:26] no [14:32:36] varnish is reverse proxy only [14:32:41] we will probably convert it to squid 3 at some point [14:32:43] but that can wait [14:32:43] mark: by the way, in the etherpad mosule, you said the passwords should move to the role, did you refer to all the passwords or just the first? 
[14:32:50] all [14:32:55] ok [14:33:55] faidon used webserver::apache2 for the first time too last week, he also may have some ideas and comments on it [14:34:38] (03CR) 10Hashar: "And regarding splitting up packages.pp , I am not sure it offers any added value since all those packages needs to be installed on the con" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [14:34:54] mark: ok, thanks for that, i'll also look into replacing some minor misc. services with apache2 [14:35:09] matanya, we can review each other, i'm sure [14:35:51] PROBLEM - RAID on db1016 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [14:36:25] mutante: thanks, we will find out :) [14:36:31] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 1 copy to table, 67 statistics [14:36:52] db1016 is me replacing disk [14:37:31] RECOVERY - MySQL Processlist on db1006 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics [14:37:58] !log replacing failed disk ms1001 [14:38:03] Logged the message, Master [14:38:11] (03CR) 10Matanya: "# FIXME: split this! is right at the top, and it does make sense to me to separate them according to type, e.g. python stuff in packages::" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107347 (owner: 10Matanya) [14:38:44] cmjohnson1: 1016 or 1006 ? [14:39:31] (03CR) 10Hashar: "You are right daniel. 
Potentially we could create a new group 'mwlog' and add apache / mwdeploy / l10nupdate to it then make /var/log/med" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83574 (owner: 10Reedy) [14:40:24] mutante 1016 [14:41:05] cmjohnson1: matanya [14:42:08] !log replacing failing disk on db1029 slot 8 [14:42:15] Logged the message, Master [14:45:11] PROBLEM - RAID on db1029 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [14:47:27] (03CR) 10Alexandros Kosiaris: [C: 032] Changed date format in l10nupdate-1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/106892 (owner: 10Tinaj1234) [14:47:41] (03CR) 10Alexandros Kosiaris: [V: 032] Changed date format in l10nupdate-1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/106892 (owner: 10Tinaj1234) [14:48:12] !log replacing failing disk db1031 slot 6 [14:48:18] Logged the message, Master [14:51:41] PROBLEM - RAID on db1031 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [14:53:55] (03CR) 10Alexandros Kosiaris: [C: 032] Mark iptables as being soon deprecated. [operations/puppet] - 10https://gerrit.wikimedia.org/r/107399 (owner: 10Jkrauska) [14:54:51] PROBLEM - Varnish HTCP daemon on cp1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:55:01] PROBLEM - Varnish traffic logger on cp1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:55:31] PROBLEM - Varnish HTTP text-backend on cp1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:55:44] mutante: just to verify i got the point, in etherpad we it both http and https [14:56:39] matanya: eh, i see that we currently have both when i try in browser, and there _was_ a discussion about enforcing https and redirect or not [14:56:46] and the result was to leave it like that, afair [14:56:52] details from akosiaris [14:57:47] if that is what you meant at all, but yeah, either way you need both virtual hosts [14:58:07] mutante: o_0 [14:58:26] twkozlowski: what?:) [14:58:51] !log reedy updated /a/common to {{Gerrit|I26f2f5322}}: db1040 to full steam [14:58:52] https://bugzilla.wikimedia.org/show_bug.cgi?id=60085#c1 [14:58:52] installs httpseverywhere on twkozlowski [14:58:57] (03PS1) 10Reedy: Call updateBitsBranchPointers in multiversion/switchAllMediaWikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107579 [14:58:58] Logged the message, Master [14:59:13] you have some bot of sorts, mutante? :) [14:59:47] hah, not anymore, i used to run eggdrop [15:00:04] I meant the comment on Bugzilla :) [15:00:34] twkozlowski: oh, hehe, not yet, but i had to do that when i see "Keyword: shell" being added [15:00:43] and then people think that means its blocked by ops [15:00:44] heh [15:00:52] that's what 'ops' is for [15:00:58] mutante: so using only webserver::apache::site { 'etherpad.wikimedia.org':} wouldn't be enough [15:01:13] twkozlowski: ok, but which part of it requires shell ? [15:01:19] 'shell' is generally added to all site request bugs than can be solved through Gerrit [15:01:23] it sucks, I know [15:01:26] besides deployment window [15:01:37] mutante: just the deployment, I guess. [15:02:05] ack, that's what i added to make clear anyone can work on the next step, and shell is actually 2 steps later [15:02:08] I know it sucks but I didn't decide this :) [15:02:08] matanya: yeah... both HTTP and HTTPS is desired there. 
There has been a long long discussion about it. [15:02:37] that's above my paygrade and something for Andre to decide, I guess [15:02:39] akosiaris: details in short? [15:02:43] matanya: one config can have 2 virtual hosts [15:03:11] you dont want etherpad.wikimedia.org and etherpad.wikimedia.org.ssl in sites-enabled or so [15:03:15] just one file [15:03:24] but two VirtualHosts [15:05:20] mutante: so webserver::apache::site { 'etherpad.wikimedia.org':} with what parameters? i don't see any paramter allow me to use two vhosts [15:05:29] i'm mostly blind [15:05:38] matanya: is it not apache2 ? [15:05:57] the class [15:06:02] matanya: I think you need to include that twice. [15:06:11] (03CR) 10Alexandros Kosiaris: Bump ParsoidCacheUpdateJobOnDependencyChange runners (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107420 (owner: 10Aaron Schulz) [15:07:11] mutante: no, it is webserver::apache::site, if i understand mark comment correctly [15:07:17] matanya: i always used the other method putting a template with both virtual hosts in sites-available, then enabling it with apache_site(), i don't know yet for that [15:07:29] didnt look yet, but i will [15:08:13] matanya: details in long https://gerrit.wikimedia.org/r/#/c/84873/ :-) [15:09:01] RECOVERY - RAID on db1016 is OK: OK: optimal, 1 logical, 2 physical [15:10:05] matanya: http/https redirect/enforcing yes/no tickets are generally fun :) [15:10:29] so much fun [15:12:11] matanya: and sometimes they end up with changes in the repo of httpseverywhere extension [15:12:37] or you got new bugs there as well [15:12:53] yeah, i had really annoying issues with commons when httpseverywhere was enables on the site [15:13:02] *d [15:13:03] you can ask RoanKattouw_away i think [15:13:09] to get an upstream merge [15:13:15] of their regex with wikimedia services [15:13:29] twkozlowski: https://bugzilla.wikimedia.org/show_bug.cgi?id=49494 ? 
[15:14:31] mutante: I think one of the maintainers watches my github clone [15:14:49] I added a branch but didn't notify them and it got pulled in upstream :D [15:15:08] Reedy: :) nice! [15:15:24] when the net organizes itself ... [15:16:11] RECOVERY - RAID on db1029 is OK: OK: optimal, 1 logical, 2 physical [15:16:37] Poking the guy in irc usually sees other things merged within 24-48 hours though [15:17:47] the task is to sync their regex with: https://wikitech.wikimedia.org/wiki/Httpsless_domains [15:17:57] ok, mutante i'm leaving it aside until this is clear to me. [15:19:05] https://github.com/reedy/https-everywhere/blob/master/src/chrome/content/rules/Wikimedia.xml [15:19:30] grr, the RT link template is broken [15:19:34] matanya: It's pretty easy to do [15:19:35] [15:19:46] Just need to add/update that URL as approprate [15:19:48] stats.wm.org is fixed , correct certificate, i see [15:19:50] *regex [15:20:05] Might be a couple of dead things on that list at a quick glance [15:20:39] project2 for starters [15:21:31] (03CR) 10Addshore: "The idea was to puppetize this but not have it running in production, only on a Labs instance" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [15:21:41] RECOVERY - RAID on db1031 is OK: OK: optimal, 1 logical, 2 physical [15:23:45] yongle, ersch, bayle, bayes [15:24:12] uhm... you can search them all by hostname in RT, use advanced search, match subject and status _isn't open_ [15:24:45] wlm.wm.org [15:24:56] is also dead [15:25:28] Might be worth sticking it in an etherpad or something [15:27:13] well, that wiki page is supposed to be that:) [15:27:18] with ticket links and all [15:27:25] if the RT template wasnt broken for some reason [15:27:39] plus there is tracking bug on BZ [15:27:50] (03CR) 10Ottomata: "If you don't set a default for $master_host, it will force users of this class to set it. Puppet will throw an error if it isn't set. 
Th" (031 comment) [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/99381 (owner: 10Physikerwelt) [15:46:37] (03PS1) 10Matanya: url-downloader: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107590 [16:23:00] mutante: on https://rt.wikimedia.org/Ticket/Display.html?id=1960 ? [16:23:06] mutante: (requestors) [16:24:17] jeremyb: no, confirmed [16:24:53] jeremyb: my guess for the reason: it was Anthony and when he left WMF somebody removed him from all , or so [16:25:13] i agree it's strange though that you see him in people tab [16:25:29] mutante: but others have left... i guess we should try to find someone else configured like him [16:25:40] mutante: also, couldn't find him on staff page... [16:25:46] (wmfwiki) [16:25:49] * jeremyb runs away [16:26:25] jeremyb: user doesnt exist anymore and has been deleted, (but should have been just deactivated) ? [16:26:55] it's also really old [16:27:08] and resolved [16:37:32] (03PS1) 10Odder: Let admins add users to three groups on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107596 [16:43:26] (03PS1) 10Odder: Let bureaucrats add users to accountcreator on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107599 [16:43:49] <^d> paravoid: Thanks for access to the old search boxen. [16:52:02] hah http://blog.waja.info/2014/01/15/bye-bye-nagios-plugins/ [16:53:13] greg-g: because they are icinga-plugins now:?:P reads [16:53:44] oh, trademark issue? 
duh [16:53:59] good that we dont have nagios.wm anymore [16:56:50] yep :) [16:58:16] greg-g, i think zero will skip today [16:58:23] k [17:04:28] (03CR) 10Ricordisamoa: [C: 031] Let bureaucrats add users to accountcreator on elwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107599 (owner: 10Odder) [17:12:26] (03PS3) 10Ottomata: Fixes for ganglia types and non standard gmond aggregator ports [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107501 [17:13:35] (03CR) 10Ottomata: [C: 032 V: 032] Fixes for ganglia types and non standard gmond aggregator ports [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107501 (owner: 10Ottomata) [17:18:14] (03CR) 10Liangent: [C: 04-1] "Doesn't do what it claims." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107566 (owner: 10Odder) [17:18:30] !log updated ganglios in apt to 1.3 [17:18:36] Logged the message, Master [17:20:17] (03CR) 10Odder: "Go away." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107566 (owner: 10Odder) [17:22:16] lolwut [17:39:53] (03CR) 10Faidon Liambotis: [C: 032] Collection Renderer (Now a module!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102352 (owner: 10Mwalker) [17:39:55] (03Abandoned) 10Odder: Enable setting logo per language variant on zh.voy [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/107566 (owner: 10Odder) [17:44:48] (03CR) 10Jgreen: [C: 032 V: 031] Collection Renderer (Now a module!) 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102352 (owner: 10Mwalker) [17:54:47] (03PS4) 10Dzahn: turn wikistats into module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 [17:55:50] (03CR) 10Dzahn: turn wikistats into module - WIP (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 (owner: 10Dzahn) [17:56:30] (03PS5) 10Dzahn: turn wikistats into module - WIP [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 [17:56:32] (03PS1) 10Ottomata: Updating debian/control to use newer standards, no longer depending on nagios3, removing pyversions file [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107608 [17:56:58] heya paravoid [17:57:02] coudl you comment on this real quick? [17:57:02] https://gerrit.wikimedia.org/r/#/c/107608/1/debian/control [17:57:08] i'm not looking for a full debian package review here [17:57:18] i didn't write it and I don't want to go down the rabbit hole of reorganizing everything [17:57:31] buuut, i'm not sure about the Suggests and standards bits [17:57:39] I think suggests is good here, [17:57:46] as far as I can tell, this does not actually depend on nagios at all [17:58:04] it does need a gmetad.conf file installed somewhere, but that is ganglia, not nagios [17:58:05] can i do "go back to previous PS" within a gerrit change with just web ui, maybe? [17:58:08] aside from that it is just a script that can be run by nagios [18:00:34] mutante: I think going back is not possible at all; you have to submit a new change altogether. [18:01:40] mutante: you mean resporing a previous patchset? 
i think you can only do that from command line [18:01:45] No you don't [18:02:01] yea, i just want PS1 again [18:02:01] use git checkout with the correct string for the patchset [18:02:05] then just git review -R [18:02:12] /whatever [18:02:13] ^ what i meant [18:02:17] yea, ok, that was the alternative [18:02:23] just thought maybe there is a button [18:02:25] thx [18:02:42] Unfortunately not [18:02:52] Being able to rebase an older patch to become the current patch would be nice [18:03:13] "revert" but just on a PS [18:03:21] that isn't merged [18:03:25] nods [18:13:48] ^d: Any idea if there's a FR in gerrit for something like that? [18:13:53] revert to older patchset or similar [18:14:17] <^d> I think there's an upstream bug for it. [18:15:08] (03PS1) 10BryanDavis: kibana: Block access to /status via Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/107609 [18:27:24] is gerrit really slow again? [18:28:16] (03CR) 10Matanya: [C: 031] turn wikistats into module [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 (owner: 10Dzahn) [18:28:44] yes AaronSchulz [18:31:06] see also: java [18:31:22] <^d> Not slow for me. 
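Note on the 18:01-18:02 exchange about getting an older patchset back from the command line: Gerrit publishes every patchset under a ref of the form refs/changes/<last two digits of the change number>/<change number>/<patchset>. The sketch below only builds that ref and prints the commands one might run. The helper names, the "origin" remote name, and the extra amend step are assumptions for illustration; nobody in the channel spelled them out.

```python
def patchset_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref for one patchset of a change.

    Gerrit shards refs by the last two digits of the change number,
    zero-padded, e.g. change 94409 lives under refs/changes/09/.
    """
    if change <= 0 or patchset <= 0:
        raise ValueError("change and patchset must be positive integers")
    shard = f"{change % 100:02d}"
    return f"refs/changes/{shard}/{change}/{patchset}"


def restore_commands(change: int, patchset: int, remote: str = "origin") -> list[str]:
    """Return shell commands that re-submit an old patchset as the newest one.

    Assumption: the remote is called "origin". The amend step is also an
    assumption; it is there because Gerrit usually refuses a commit it has
    already seen, and amending gives the commit a new SHA while keeping the
    Change-Id in the commit message.
    """
    ref = patchset_ref(change, patchset)
    return [
        f"git fetch {remote} {ref}",
        "git checkout FETCH_HEAD",
        "git commit --amend --no-edit",
        "git review -R",  # -R: do not rebase before submitting
    ]


if __name__ == "__main__":
    # Change 94409 (the wikistats module change mentioned above), patchset 1.
    for command in restore_commands(94409, 1):
        print(command)
```

This only prints the commands rather than running them, so it is safe to try anywhere; whether the result is what you want still depends on the state of the change in Gerrit.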
[18:37:30] (03CR) 10Guido.iaquinti: [C: 031] kibana: Block access to /status via Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/107609 (owner: 10BryanDavis) [18:42:51] (03CR) 10Guido.iaquinti: [C: 031] apt: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107358 (owner: 10Matanya) [18:44:17] (03CR) 10Guido.iaquinti: [C: 031] base: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107355 (owner: 10Matanya) [18:45:54] (03CR) 10Guido.iaquinti: [C: 031] deployment: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107343 (owner: 10Matanya) [18:46:42] (03CR) 10Guido.iaquinti: [C: 04-1] mail.pp puppet lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/104807 (owner: 10Hashar) [18:47:57] (03CR) 10Guido.iaquinti: [C: 031] decom hooper,eiximenis [operations/puppet] - 10https://gerrit.wikimedia.org/r/107159 (owner: 10Alexandros Kosiaris) [18:49:26] (03PS4) 10Guido.iaquinti: download: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107341 (owner: 10Matanya) [18:51:56] (03CR) 10Guido.iaquinti: [C: 031] cpufrequtils: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107346 (owner: 10Matanya) [19:00:32] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:00:32] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:00:32] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2539.09699445 [19:00:32] RECOVERY - Kafka Broker Messages In on analytics1022 is OK: OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate is 2533.74183034 [19:00:32] RECOVERY - Varnishkafka Delivery Errors on cp4019 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:00:32] RECOVERY - Varnishkafka Delivery Errors on cp4020 is OK: OK: 
kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:01:01] RECOVERY - Varnishkafka Delivery Errors on cp4011 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:01:11] RECOVERY - Varnishkafka Delivery Errors on cp4012 is OK: OK: kafka.varnishkafka.kafka_drerr.per_second is 0.0 [19:15:19] cool! [19:18:51] (03PS1) 10Ori.livneh: Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 [19:19:02] (03PS2) 10Ori.livneh: Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 [19:19:32] ugh, typo [19:19:41] (03CR) 10jenkins-bot: [V: 04-1] Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [19:19:47] yes, i know [19:20:27] (03PS3) 10Ori.livneh: Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 [19:22:24] paravoid: can you review? i imagine you'd prefer to abide by the debian package maintainer's framework but i don't like it, don't have the time at the moment to file a bug and wait for a fix, and i think that this is a tidy workaround. 
[19:27:29] (03CR) 10Faidon Liambotis: [C: 04-1] "Yes on the concept :)" (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [19:27:31] (03CR) 10Ori.livneh: "Faidon: I imagine you'd prefer to abide by the debian package maintainer's framework but I don't like it, don't have the time at the momen" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [19:27:34] ah [19:27:35] heh [19:33:28] (03CR) 10Ori.livneh: Add 'graphite-index' script and cron job (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [19:37:02] (03PS4) 10Ori.livneh: Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 [19:37:20] !log replacing failing disk db1031 slot 8 [19:37:21] (03CR) 10Ori.livneh: "I will rm /sbin/graphite-auth manually" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [19:37:26] Logged the message, Master [19:37:58] want to see something cool? [19:38:04] always [19:38:14] http://noc.wikimedia.org/~faidon/graphitus/dashboard.html?id=cdn.http.reqerror [19:38:24] gdash replacement [19:38:29] I like it very much [19:38:33] what do you think? [19:38:41] (03CR) 10Matanya: turn wikistats into module (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94409 (owner: 10Dzahn) [19:39:14] if you click on the graphs, it turns them into rickshaw graphs so you can hover over them and see the actual timestamp/value [19:39:21] (which is very useful when correlating with logs) [19:39:35] plus multiple parameters, like inserting dates, hours back etc. [19:39:42] oh and it's all client-side (duh!) 
[19:39:54] the histogram functionality is nice [19:40:04] needs cloudflare and bootstrapsomething unblocked in noscript (and all of cloudflare and bootstrapwhatever, not just specific subdomains) [19:40:18] yeah if we'd use it, we'll switch to serving these locally [19:40:36] I didn't bother for testing it [19:41:33] * greg-g nods [19:42:16] paravoid: it is cool but crashes me when clicking histogram [19:42:41] PROBLEM - RAID on db1031 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [19:43:10] paravoid: i dig. the interface is not perfect but the code looks clean and hackable. [19:43:54] yup, there's a few things I'd change too [19:43:57] it's very new [19:44:00] first commit is 2 months ago [19:44:14] the "ZOMG UPDATE IN 10..9..8.." ticker is really distracting [19:44:44] I don't mind it, but I can totally see why others might [19:45:02] I was hoping it changed to "ZOMG!..." when it was less than 10 seconds to go, but alas... [19:45:13] greg-g: submit a pull request :P [19:45:15] :) [19:45:33] * greg-g goes to get lunch [19:46:41] paravoid: with respect to gdash, it's a lot of config overhead for what really should have been a static site generator, which is a point that you made originally and that i find convincing [19:47:02] if you look at the code to generate the dashboard html from the graph definition files, it's pretty small, and it's all ruby [19:47:17] I hate this, sure, but this wasn't my motivation [19:47:18] so i was thinking of making it a puppet type [19:47:22] I really want more interactive graphs [19:47:53] if there's a log spike in 5xxs, I see the alert, I see the spikes, how am I supposed to grep the logs for it? 
[19:48:04] I need a timestamp, the graphs can't provide this right now [19:48:25] and well, sure, I can script it [19:48:29] well, kibana provides a better workflow for that anyhow [19:48:38] since it integrates plotting and log grepping [19:48:52] but yeah, i know what you mean [19:48:59] sure, but I think having more interactive graphs is useful nevertheless [19:49:04] yeah [19:49:43] (03CR) 10Hashar: sanity test for refreshWikiversionsCDB (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [19:50:00] (03PS5) 10Hashar: sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 [19:50:12] (03CR) 10jenkins-bot: [V: 04-1] sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [19:50:14] (03PS6) 10Hashar: sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 [19:50:22] (03CR) 10jenkins-bot: [V: 04-1] sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [19:51:08] (03CR) 10Hashar: "/srv/ssd/jenkins-slave/workspace/operations-mw-config-tests/../multiversion/refreshWikiversionsCDB: not found" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [19:51:15] paravoid: any experience with , or ? 
[19:51:58] none, although I've seen their web page too [19:52:02] web pages [19:52:17] I've generally tried to avoid anything that needs more stuff on the server side, at least for now [19:52:37] I've also played with http://noc.wikimedia.org/~faidon/giraffe/ btw [19:52:42] (it's not great) [19:52:44] yeah, i didn't like giraffe to be honest [19:52:53] graphitus looks better [19:52:54] (03PS7) 10Hashar: sanity test for refreshWikiversionsCDB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 [19:52:57] yup [19:54:13] (03CR) 10Hashar: "Uses __DIR__ and drop the extra ../ :-]" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [19:55:19] i should probably step back a bit from graphite once i've cleared the TODOs [19:55:38] it's really ops infrastructure, makes sense for ops to own it [19:57:37] ottomata: an1012 will need a reinstall but it's all yours again [19:58:02] !!!!!!!!!! [19:58:04] :D :D :D [19:58:43] it only took 3 mother boards but it's fixed [19:59:12] thank youuu wooooo [19:59:34] ottomata is dancing on the channel floor [19:59:49] paravoid: updated the patch with your suggestions, btw [20:00:46] (03CR) 10Faidon Liambotis: [C: 032] Add 'graphite-index' script and cron job [operations/puppet] - 10https://gerrit.wikimedia.org/r/107616 (owner: 10Ori.livneh) [20:01:26] thanks, i'll run puppet on tungsten, revert local mods to the debian package's script, and delete the /sbin copy [20:02:18] paravoid: speaking of which, what's the proper way to restore a file to the package's version? [20:02:21] "CVE Request: Apache Archiva Remote Command Execution 0day" [20:02:23] funny [20:02:43] ori: there is none; apt-get install --reinstall $package would do it, but possibly do other things as well [20:03:02] i have a copy of the unmodified script, i'll just copy it into place [20:11:19] omg how did ganglios ever work? 
[20:12:02] very bad ottomata [20:12:14] i just noticed this [20:12:21] so, parse_ganglia [20:12:24] downloads the xml [20:12:32] and then strips out everything but each metric line, and stores that in a file per host [20:12:33] fine. [20:12:43] but then it uses xml libs to try to parse those files! [20:12:48] which have been converted to invalid xml! [20:13:24] yeah, ganglios isn't very friendly [20:13:42] RECOVERY - RAID on db1031 is OK: OK: optimal, 1 logical, 2 physical [20:13:43] i worked with a bit, and decided it wasn't worth the time [20:13:51] i mean, i could write this way better (coudln't we all?!) but oof, i didn't notice it was so bad [20:13:54] i thought i just had to fix a couple of bugs [20:14:01] but it doesn't work for most of what it should do [20:14:12] i think i could make it store the full xml so it would still parse [20:14:13] hmm [20:14:26] rather than stripping out extra elements [20:14:31] but i'm not sure if that would break other stuff [20:14:59] (03PS1) 10Ori.livneh: graphite-web: set DEFAULT_CACHE_DURATION to 120s [operations/puppet] - 10https://gerrit.wikimedia.org/r/107634 [20:15:11] i think it would, the xml part is a dependency for other parts iirc ottomata [20:15:38] yeah, it looks like everything uses the ganglios. 
methods [20:15:43] which all expect xml [20:15:50] (03CR) 10Ori.livneh: [C: 032 V: 032] "Merging per https://etherpad.wikimedia.org/p/GraphiteTODO" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107634 (owner: 10Ori.livneh) [20:15:51] dunno why ganglia_parser is stripping out xml [20:19:09] oof [20:21:08] (03PS2) 10Ori.livneh: Make it possible for logmsgbot to report to more than one channel [operations/puppet] - 10https://gerrit.wikimedia.org/r/101816 [20:23:50] (03CR) 10Ori.livneh: [C: 032] Make it possible for logmsgbot to report to more than one channel [operations/puppet] - 10https://gerrit.wikimedia.org/r/101816 (owner: 10Ori.livneh) [20:26:18] !log db1031 swapping failing disk slot 11 [20:26:24] Logged the message, Master [20:28:47] qchris, ^d: 'git pull's from Gerrit are terribly slow today [20:29:01] <^d> Someone else said that. [20:29:06] yesterday too [20:29:19] <^d> Let's upgrade to 2.8 [20:29:20] i did [20:29:41] PROBLEM - RAID on db1031 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [20:29:48] Mhmmm will that solve responsiveness? [20:29:55] Meh. Let's upgrade :-) [20:30:06] nice approch [20:30:23] when in doubt, upgrade the software! [20:30:25] <^d> qchris: 2.7-rc2 sucks. 
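Editor's note on the ganglios problem discussed above (20:12 to 20:15): a file holding only the metric lines has no single enclosing element, so it is not well-formed XML and a standard parser rejects it. A minimal sketch of that failure and of one possible workaround follows. The sample metric lines and the `HOST` wrapper are illustrative assumptions, not ganglios code or its on-disk format.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-host cache content: bare metric lines, no enclosing element.
STRIPPED = (
    '<METRIC NAME="load_one" VAL="0.42" TYPE="float"/>\n'
    '<METRIC NAME="cpu_idle" VAL="97.1" TYPE="float"/>\n'
)


def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parse_metrics(fragment):
    """Parse bare <METRIC/> lines by wrapping them in a synthetic root.

    The wrapper element name is an assumption made for this sketch.
    """
    root = ET.fromstring("<HOST>" + fragment + "</HOST>")
    return {m.get("NAME"): m.get("VAL") for m in root.iter("METRIC")}


if __name__ == "__main__":
    print(is_well_formed(STRIPPED))   # the stripped file alone does not parse
    print(parse_metrics(STRIPPED))    # wrapped in a root element, it does
```

Wrapping at read time leaves the stored files untouched, which may matter if other parts of ganglios depend on the current layout, as suggested in the discussion.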
[20:30:29] :) [20:30:33] <^d> 2.8.1 is out now [20:30:38] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&h=ytterbium.wikimedia.org&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Miscellaneous+eqiad [20:30:43] ^d you're the gerrit master :-) [20:30:47] the sharp decline was me restarting the server [20:30:59] it had a lot of java procs busy-waiting on a futex [20:31:05] might be good to check if that's still the case [20:31:21] * ^d has no idea [20:31:57] :%s/java/c/g [20:32:25] So it seems to be the git server that's slow [20:32:28] Gerrit operations are fast [20:32:44] Like, even merging things via Jenkins is fast [20:32:51] It's just that fetching things is really slow [20:33:41] As in, git pull and git review -d operations take multiple minutes [20:34:17] yep [20:35:47] <^d> I've wondered how much smaller the repo would be if we could prune history prior to the git migration. [20:35:58] <^d> But I've not found an easy git-filter-branch one-liner. [20:36:34] matanya: i had forgotten about this [20:36:35] https://github.com/larsks/check_ganglia [20:37:14] ottomata: didn't know it exist [20:37:18] thanks for this [20:37:34] ^d: you mean on core? please don't, why would you? [20:37:55] <^d> I've thought about forking the repo with way less history. [20:37:57] <^d> Make it fast. [20:38:04] <^d> Old stuff is boring and slow. [20:38:07] it's not slow [20:38:13] ooof i had seen that like a year ago and forgotten about it [20:38:16] (other than on gerrit, but that's gerrit's fault) [20:38:16] i should probably be using this :/ [20:38:49] let me know how it is ottomata [20:39:01] ^d: it's slow for ops/puppet too [20:39:10] <^d> Big repos. [20:39:14] <^d> Lemme try gc'ing them [20:40:40] <^d> We're supposed to be gc'ing all of them weekly. [20:40:45] <^d> I hope that cron still works :\ [20:41:07] RoanKattouw: btw, what do you mean by 'git server'? is there some division of labor between gerrit proper and some other tool? 
[20:41:38] I'm talking about the bit that serves things from the git repo to git clients over ssh/https [20:42:01] right, it's all part of the same app, no? [20:42:06] Yes I think so [20:42:19] ^d: How would that explain sudden slowness in trying to fetch single commits? [20:42:41] <^d> big repo -> poorly packed -> hard to fetch [20:43:24] <^d> operations/puppet went from 294M -> 104M [20:43:27] <^d> That should help :) [20:43:31] <^d> mw/core still going [20:44:49] <^d> This was the thing that caused a ton of slowness way back when. [20:44:59] <^d> Then we repacked everything and it was all better. [20:45:02] <^d> I wonder if the cron's broken. [20:45:07] does it log? [20:45:14] to the stash! [20:45:20] <^d> I don't think so. [20:45:32] ^d: do you know if our version of jgit includes https://eclipse.googlesource.com/jgit/jgit/+/84afea9179932995d1e59f8fda4e6b11217382ad%5E!/ ? [20:46:03] <^d> ori: I honestly don't know offhand. qchris? [20:46:51] ori: Thats from February ... I think it should be contained. Let me see if I find something... [20:48:16] <^d> We're still running 1e7090b [20:48:22] <^d> Which is https://git.wikimedia.org/tree/gerrit/1e7090b1249b14115b68ee7a5b9beb7a1350a597 [20:49:11] <^d> Just based on the dates alone I'd think we have it. [20:49:27] ^d: operations/pupet is faster now, thanks [20:49:34] <^d> Good. [20:49:43] <^d> mw/core went from 1.1G to 554M [20:49:44] !log deployment-prep updating elasticsearch to 0.90.10 [20:49:45] <^d> That's better. [20:49:51] Logged the message, Master [20:50:11] ori: Yes. Is contained. [20:50:16] (03PS1) 10BryanDavis: kibana: Restrict URLs proxied without authentication [operations/puppet] - 10https://gerrit.wikimedia.org/r/107639 [20:50:20] qchris: thanks for checking [20:50:41] so let's figure out what happened to the cron job [20:50:46] <^d> !log gerrit: running gc on all repositories. think weekly cron to do this might be broken. 
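Editor's note on the gc discussion above: the weekly job was said not to log, which makes a silent failure hard to spot. A minimal sketch of a gc pass that records before and after sizes for each repository follows. The directory layout (bare repositories whose path ends in `.git` under one root), the function names, and the logger name are assumptions for illustration and are not taken from the actual Gerrit cron job.

```python
import logging
import os
import subprocess

log = logging.getLogger("repo-gc")  # hypothetical logger name


def dir_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def percent_saved(before, after):
    """Size reduction as a percentage of the original size."""
    return 0.0 if before == 0 else 100.0 * (before - after) / before


def run_git_gc(repo):
    """Run 'git gc' against one repository directory."""
    subprocess.run(["git", "--git-dir", repo, "gc", "--quiet"], check=True)


def gc_all(root, gc=run_git_gc):
    """gc every directory ending in '.git' under root and log the sizes."""
    results = {}
    for dirpath, dirs, _files in os.walk(root):
        if dirpath.endswith(".git"):
            dirs[:] = []  # do not descend into the repository itself
            before = dir_size(dirpath)
            gc(dirpath)
            after = dir_size(dirpath)
            log.info("%s: %d -> %d bytes (%.1f%% saved)",
                     dirpath, before, after, percent_saved(before, after))
            results[dirpath] = (before, after)
    return results
```

With the figures quoted in the channel, `percent_saved(294, 104)` comes to roughly 65 percent for operations/puppet. Passing the gc step as a parameter keeps the walk and the logging testable without invoking git.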
[20:50:52] Logged the message, Master [20:53:30] (03CR) 10Milimetric: "looks great, I like the flexible type stuff" [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107501 (owner: 10Ottomata) [20:55:07] milimetric: thanks, except i am discovering that ganglios sucks and i probably can't use it [20:55:19] :() [20:55:22] I mean :( [20:55:48] does that make you stuck? [20:57:54] ottomata: oh crap! [20:59:17] milimetric: i found something that looks more complete [20:59:20] https://github.com/larsks/check_ganglia [20:59:21] trying it now [21:00:23] hey at least this'll make a great blog post when you get it stable [21:00:36] but poke me if I can help at all [21:00:41] RECOVERY - RAID on db1031 is OK: OK: optimal, 1 logical, 2 physical [21:05:11] PROBLEM - Puppet freshness on rhodium is CRITICAL: Last successful Puppet run was Wed 15 Jan 2014 06:04:59 PM UTC [21:09:20] <^d> greg-g: Is today's LD spoken for? [21:13:32] ^d: you looking for us or something else? [21:13:47] <^d> Us. I wanted to sync out the config for the new cirrus logs. [21:13:53] <^d> They'll be going into new wmf branch tomorrow. [21:15:17] ah [21:15:20] cool [21:17:03] !log aaron synchronized php-1.23wmf10/includes/filebackend/SwiftFileBackend.php 'bc2a0ddbff4c3e4bff8e931163fd3d1c0340a78e' [21:17:09] Logged the message, Master [21:17:17] ah, nice [21:18:01] ^d: https://gerrit.wikimedia.org/r/#/c/107711/1 silly mistake [21:18:08] (03PS1) 10BryanDavis: Logstash: Configure Elasticsearch to automatically create indices [operations/puppet] - 10https://gerrit.wikimedia.org/r/107716 [21:24:04] (03CR) 10Manybubbles: [C: 031] Logstash: Configure Elasticsearch to automatically create indices [operations/puppet] - 10https://gerrit.wikimedia.org/r/107716 (owner: 10BryanDavis) [21:30:57] !log Jenkins: generation of MediaWiki documentation with doxygen got broken since Zuul upgrade.
Fix: {{gerrit|107718}} [21:31:03] Logged the message, Master [21:36:08] (03PS1) 10Ottomata: Casting port to int in case it comes from CLI [operations/debs/check_ganglia] - 10https://gerrit.wikimedia.org/r/107720 [21:36:21] (03CR) 10Ottomata: [C: 032 V: 032] Casting port to int in case it comes from CLI [operations/debs/check_ganglia] - 10https://gerrit.wikimedia.org/r/107720 (owner: 10Ottomata) [21:36:49] (03PS1) 10Mwalker: Do not include and declare the OCG class [operations/puppet] - 10https://gerrit.wikimedia.org/r/107721 [21:38:41] (03CR) 10Jgreen: [C: 032 V: 031] Do not include and declare the OCG class [operations/puppet] - 10https://gerrit.wikimedia.org/r/107721 (owner: 10Mwalker) [21:38:44] ^d: a little busy, 3 already, but that sounds needed, no? [21:38:56] <^d> Yeah. Should be harmless. [21:39:00] k [21:39:11] either with the crew of 3, or after/before is fine with me [21:39:39] ottomata: ok to merge your stuff? [21:40:34] ? [21:40:44] not my stuff Jeff_Green [21:41:18] orly? curious. [21:41:41] Matt Walker [21:41:41] ? [21:42:00] he just did a two one-line change, that part I expected [21:42:05] h mabye ori [21:42:11] Make it possible for logmsgbot to report to more than one channel [21:42:18] that fits yeah [21:43:01] odd that there's no gerrit botspew [21:46:23] I sent a new email to ops-requests and I didn't get an autoreply with a link to the new ticket. Is that normal? [21:46:27] Jeff_Green: ^ [21:47:34] sumanah: yes/no. from what I've heard it's known broken atm [21:48:15] Jeff_Green: ah ok. the subject line was "check on the Bugzilla daemon?" in case it's helpful to look through logs for that [21:49:27] I'm 3 tasks deep and don't have time right now to grep logs. 
can you poke in the ticket via web instead [21:49:41] ^d: https://gerrit.wikimedia.org/r/#/c/107543/ will fix some annoying queries [21:50:27] (03PS1) 10Ottomata: Initial deb packaging [operations/debs/check_ganglia] (debian) - 10https://gerrit.wikimedia.org/r/107723 [21:51:27] paravoid: ^ :) reviewy? [21:54:20] sumanah: interesting. your ticket did get created [21:55:20] anybody got a favorite tool for drawing network diagrams? lucidachart is making me unhappy, omnigraffle is $99. other options? [21:55:50] dia is cost effective and sort of functional [21:56:17] cool, i haven't used that in years. i'll give it a shot, thanks. [21:56:32] it's less crashy than it used to be [21:56:43] but not much better otherwise :-( [21:56:48] heh [21:57:30] This script requires the Python Imaging Library - http://www.pythonware.com/products/pil/ [21:57:32] jgage: I think greg-g has recent opinions on that too [21:57:40] huh, maybe that should be on terbium [21:57:43] and hates dia least [21:57:54] csteipp: I can't even add captchas then :/ [21:58:08] I think dia is the best option on linux(sic) right now [21:58:31] hmm? [21:58:33] it does what I want, easily, and is relatively intuitive. Way more intuitive than massaging Inkscape into a flowchart creator [21:58:54] it doesn't do everything, of course, and isn't actively maintained, but, meh [21:59:02] csteipp: https://wikitech.wikimedia.org/wiki/Generating_CAPTCHAs doesn't work anymore [21:59:32] for sumanah's sake: I still do hate it ;) [21:59:50] (the exporting to non-dia is flakey) [22:00:09] AaronSchulz: As in you suddenly can't run that in production? [22:00:32] I don't know how long ago that stopped working [22:00:43] nobody ever runs that ;) [22:02:02] captcha.py hasn't been changed in a while... so it might be a change in the server config? [22:02:30] It worked fine for me on Ubuntu 12.04 [22:02:38] thanks, greg-g. i've just downloaded the free trial of omnigraffle, gonna give that a whirl.
[22:02:43] paravoid: maybe python-imaging should be installed on all maintenance servers? [22:29:03] where are the config files for the beta cluster's wikis? [22:31:17] jackmcbarn: same place the main site's are [22:31:24] just appended with -labs [22:31:53] thanks [22:31:58] np [22:33:08] aww icinga is still restricted :( [22:35:19] I can't log into RT and when I ask for a password reset it doesn't show up :/ [22:43:32] sumanah: open rt ticket on it [22:43:52] matanya: you're probably right [22:44:08] however right now I am suspecting that a bunch of email-related stuff is just slowed down [22:44:20] and my ticket would be a dupe [22:44:22] on your end? [22:45:00] no, not on my end [22:45:39] i'm pretty sure there is an issue with rt password resets [22:45:50] (the email requests) [22:48:22] yeah, I think Jeff said that too, so no point me filing another ticket about it [22:48:56] I'll leave my original request alone and go do other stuff [22:49:02] sumanah: do you need anything specific? i'm in rt now anyway [22:49:39] matanya: #6642, "check on the Bugzilla daemon?" [22:50:10] what about it [22:50:19] no updates since your comment [22:50:34] matanya: I am trying to figure out why I got the "we think it's resolved" msg [22:50:36] was that an error? [22:51:06] jeff green marked it as resolved and then it was moved to open [22:51:25] huh [22:51:27] ok [22:51:48] to what bug are you referring there?
[22:51:50] I realize now I didn't include the actual link to the bug https://bugzilla.wikimedia.org/show_bug.cgi?id=58208 which ought to have a comment today from Dan [22:52:22] computers a voodoo [22:52:26] *are [22:52:46] (there are a few actually, but that and https://bugzilla.wikimedia.org/show_bug.cgi?id=60095 should) [22:52:47] and i should stop typing typos [22:52:58] and I should, you know, actually include relevant info in tickets [22:54:29] sumanah: the bug is solveable if you either use puppet or a packaged version [22:55:31] matanya: yeah, the analytics team is talking about doing a vagrant/puppet setup thing to make dev setup easier [22:57:23] sumanah: there is one already [22:58:40] sumanah: https://git.wikimedia.org/tree/operations%2Fpuppet.git/872f8574977ca5ffe79b52c2778b965ba739f2ac/modules%2Flabs_vagrant [23:00:08] matanya: ok, that's the general labs vagrant, but is there an actual manifest already for wikimetrics? I don't think there is [23:00:32] matanya: when you said "the bug is solveable if you either use puppet or a packaged version" you meant the httplib2 bug, right? [23:00:34] no there isn't as far as i know [23:00:41] yes sumanah [23:01:31] got it. There is not a really tested dev-friendly vagrant/puppet solution for getting a personal wikimetrics dev env up and running [23:02:18] i might create one if i knew requeirements [23:03:33] matanya: the README at https://gerrit.wikimedia.org/r/#/c/107710/ gives the manual instructions - you could puppetize that [23:05:31] i'll try to give it a look on the weekend, i promised mark. and others stuff i need to finish before [23:06:00] cool [23:08:25] sumanah: who should deal with it from analyics side? [23:08:42] matanya: we can talk in #wikimedia-analytics :-) [23:12:20] so, https://ganglia.wikimedia.org/latest/?c=Text%20caches%20eqiad&h=cp1066.eqiad.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2 ? 
[23:12:51] se4598 noticed that, what looks like the text caches having some issues [23:13:21] greg-g: also all esams text caches overloaded apparently [23:16:19] only thing in the SAL around there is db related [23:16:20] :/ [23:16:30] paravoid: still on? [23:16:40] ori: thoughts? [23:26:04] robla: on? [23:29:04] that is one broken server [23:29:26] yep [23:29:56] kswapd? [23:30:12] oh right, and no buffer memory [23:30:36] no paravoid? [23:31:42] not unless he just doesn't like me :) [23:32:15] (03PS1) 10Aaron Schulz: Minor scap comment tweak [operations/puppet] - 10https://gerrit.wikimedia.org/r/107738 [23:32:26] esams just looks normal [23:32:45] cp1066 will need a reboot [23:33:53] !log on cp1066: stopping varnish -- kswapd has gone crazy and is continuously flushing all buffers [23:33:59] Logged the message, Master [23:34:15] greg-g: am now...still need me? [23:34:41] RECOVERY - Varnish HTCP daemon on cp1066 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [23:34:43] robla: nope, was going to have you poke someone ops-like in the office, I got one better (Tim) [23:34:51] RECOVERY - Varnish traffic logger on cp1066 is OK: PROCS OK: 2 processes with command name varnishncsa [23:35:14] TimStarling: you need to reboot the machine? stopping varnish won't be enough? [23:35:21] RECOVERY - Varnish HTTP text-backend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.001 second response time [23:36:25] greg-g: fwiw, I actually don't see anyone physically here from Ops [23:36:54] robla: /me nods, was going to accept an o-ri.
[23:37:19] that's because they keep quitting [23:37:26] ops should have a no-SF policy [23:38:15] matanya: it seems unlikely, since it is a kernel bug [23:38:41] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: Connection refused [23:38:58] TimStarling: yep ^ [23:39:21] PROBLEM - Varnish HTTP text-backend on cp1066 is CRITICAL: Connection refused [23:39:30] matanya: just for icinga throwing https://bugzilla.wikimedia.org/show_bug.cgi?id=60112 in [23:40:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [23:40:21] hm? [23:40:37] thanks se4598 [23:40:40] right, so it looks like cp1066 is running the old 3.2 kernel [23:40:42] sorry, i missed the pings in this channel; could someone recap with an executive summary? [23:40:51] or not, if tim already knows what is going on [23:40:59] ori: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=cp1066.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Text+caches+eqiad [23:41:21] well, see the ops list from november with subject "Varnish eqiad periodic 503s & latency spikes" [23:41:33] ori: cp1066 kswapd has gone crazy and is continuously flushing all buffers [23:41:59] bblack pushed out a new build recently IIRC to work around these very issues [23:42:12] paravoid was also involved, so ping ping. [23:42:35] a new build of varnish? [23:43:10] Jan 5: 19:26 bblack: varnish mobile caches updated to -wm27 package (crash fix, http/0.9 patch removal to test effect on logging anomalies) [23:43:51] well, it's not a crash [23:44:14] cp1065 is running 3.2.0 as well [23:47:08] there's also cp1065 freaking out dec 20th: 19:09 mutante: cp1065 - [19060.904886] BUG: scheduling while atomic: kworker/11:0/27073/0x00000200 [23:47:41] RECOVERY - Varnish HTTP text-frontend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 198 bytes in 0.001 second response time [23:48:17] wtf? [23:48:44] how hard can it be to kill a service? 
[23:49:01] TimStarling: you can check on vanadium the eventlogs from varnish, no? [23:49:05] greg-g: / TimStarling: bblack is going to have a look as well. [23:49:07] the BUG one is likely really a kernel/hw issue [23:49:25] did someone just start varnish on cp1066? [23:49:37] cp1066 probably hasn't been upgraded [23:49:38] puppet, perhaps? [23:49:44] puppet [23:49:48] tail /var/log/puppet.log? [23:49:56] right, it runs from cron now doesn't it? [23:50:28] I only stopped the puppet agent [23:50:51] yes [23:50:58] /etc/cron.d/puppet [23:50:59] I stopped cron [23:51:13] I'm going to upgrade the varnish package on just cp1066 for now, since it's having issues [23:51:30] in general we've been holding back on upgrades to the text ones because paravoid's still looking at other things there [23:51:41] PROBLEM - Varnish HTTP text-frontend on cp1066 is CRITICAL: Connection refused [23:51:43] but they're way behind now on a lot of little fixes [23:52:21] RECOVERY - Varnish HTTP text-backend on cp1066 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.005 second response time [23:52:33] well, judging by ganglia, at least 5 of the eqiad text varnishes have broken kernels [23:52:58] (thanks se4598 ;) ) [23:54:54] who else is active on cp1066 as root? [23:54:57] did you stop varnish? [23:55:12] yes, many many times [23:55:19] and I logged it [23:55:21] PROBLEM - Varnish HTTP text-backend on cp1066 is CRITICAL: Connection refused [23:55:32] I mean, just now since I said I was upgrading [23:55:34] yes [23:55:43] ori! :D do you have a moment? -- Jeff earlier was going to push a fix to puppet for me; but you apparently had some unpublished bot stuff between my patch and what's live which he didn't want to take responsibility for -- I'd absolutely make you cookies if you pushed those changes :D [23:55:46] why? [23:56:00] because I want to reboot the server once I'm done investigating the kernel versions [23:56:26] why would you try to run it with a broken kernel? 
[23:56:35] it'll just fail again some other day [23:57:07] mwalker: ah, yes, I forgot to run puppet-merge. I will do so, but let's wait a couple of minutes for things to calm down. [23:57:19] *nods* makes sense [23:57:35] I just wanted to see that the problem was invariant, the varnish package on all texts is very old and lacks a lot of bugfixes [23:57:55] (which can probably affect behaviors that trigger kernel bugs or not) [23:58:27] well, you may be right, cp1068 has the same kernel and it has the smoothest cache usage graph [23:58:47] in any case, 1066 still needs a reboot in the short term [23:58:52] feel free whenever [23:59:06] does cp1068 have a newer package? [23:59:34] nope [23:59:51] all the text nodes should still be on -wm16, current is -wm27