[00:00:21] (CR) Rush: [C: 2 V: 2] trusty friendly phabricator [operations/puppet] - https://gerrit.wikimedia.org/r/143510 (owner: Rush)
[00:03:00] (Abandoned) Yuvipanda: toollabs: Don't collect NFS stats [operations/puppet] - https://gerrit.wikimedia.org/r/143236 (owner: Yuvipanda)
[00:04:26] (Abandoned) Rush: phabricator: use apache::site, not apache::vhost [operations/puppet] - https://gerrit.wikimedia.org/r/143378 (owner: Ori.livneh)
[00:04:48] chasemp: :)
[00:10:04] (PS1) Rush: my.cnf.erb syntax fixup [operations/puppet] - https://gerrit.wikimedia.org/r/143523
[00:12:22] (CR) Rush: "example:" [operations/puppet] - https://gerrit.wikimedia.org/r/143523 (owner: Rush)
[00:13:27] (CR) Dzahn: [C: 1] my.cnf.erb syntax fixup [operations/puppet] - https://gerrit.wikimedia.org/r/143523 (owner: Rush)
[00:14:12] (CR) Rush: [C: 2] my.cnf.erb syntax fixup [operations/puppet] - https://gerrit.wikimedia.org/r/143523 (owner: Rush)
[00:15:07] (CR) BryanDavis: [C: 1] "Applied on deployment-bastion: apache vhost looks good and trebuchet works." [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:15:54] (PS1) Dzahn: replace deprecated erb template variable syntax [operations/puppet] - https://gerrit.wikimedia.org/r/143526
[00:17:24] chasemp, mutante: bd808's patch is the last apache::vhost, could one of you review?
it's scoped to beta where it is already applied
[00:19:14] (CR) Rush: labs: role::deployment - port apache::vhost to apache::site (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:19:33] ah, that's a good point
[00:19:37] bd808: i'll amend
[00:22:10] (PS4) Ori.livneh: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:22:27] PROBLEM - HTTP on francium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 700 bytes in 0.003 second response time
[00:22:27] PROBLEM - Memcached on francium is CRITICAL: Connection refused
[00:22:27] (PS5) Ori.livneh: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:23:50] (CR) Rush: [C: 1] "cool" [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:23:54] (PS6) Ori.livneh: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:24:18] chasemp: i can't merge it myself; i can only +2 my own patches
[00:24:33] ok merging sorry
[00:24:44] (CR) Rush: [C: 2 V: 2] labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142407 (owner: BryanDavis)
[00:24:58] wee, thanks
[00:25:30] Ah. Good.
I didn't like the $::fqdn thing etiher
[00:25:57] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Puppet has 1 failures
[00:26:23] (PS1) Dzahn: deprecated syntax in icinga checkcommands.cfg.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143527
[00:27:45] (CR) Rush: [C: 1] deprecated syntax in icinga checkcommands.cfg.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143527 (owner: Dzahn)
[00:29:21] (PS1) Ori.livneh: apache: get rid of apache::vhost [operations/puppet] - https://gerrit.wikimedia.org/r/143528
[00:31:51] (PS1) Dzahn: deprecated syntax in mysql/generic_my.cnf.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143529
[00:33:02] (PS2) Dzahn: deprecated syntax in mysql/generic_my.cnf.erb [operations/puppet] - https://gerrit.wikimedia.org/r/143529
[00:35:50] (PS1) Rush: legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531
[00:37:29] (PS2) Rush: legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531
[00:38:03] chasemp: what do you mean by "#designated to iridium can be below it soon "
[00:38:26] yeah fixed the lingo there
[00:38:35] I realized my brain didn't catchup to my hands
[00:38:59] but essentially used hostname matching for radon to do the same below it for iridium, even tho iridium isn't in there yet
[00:39:03] was what I meant to say
[00:39:41] (CR) Dzahn: legalpad to radon (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/143531 (owner: Rush)
[00:40:08] chasemp: ah,hm, so iridium will be nonlegal-ph ?
[00:40:26] heh, I was going to to just make iridum teh default namespace
[00:40:30] phabricator-
[00:40:42] the db's for radon are named like
[00:40:47] phlegal-maniphest
[00:40:47] etc
[00:40:54] phlegal is cleared with springle btw
[00:41:00] imho it should be 2 role classes to avoid "if ($::fqdn" inside a role
[00:41:10] like, we already apply role to just the right nodes...
[00:41:11] k, so yeah I wasn't sure
[00:41:26] would it be
[00:41:27] (PS6) Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[00:41:31] role::phabricator::radon?
[00:41:38] role::phabricator::legalpad
[00:41:40] role::phabricator::legal
[00:41:43] the lattetr
[00:41:46] and then the default?
[00:41:48] dont use a node name in there
[00:41:55] role::phabricator::phabricator seemed odd
[00:41:58] so I didn't go there
[00:42:11] role::phabricator::default?
[00:42:15] role::phabricator::production ?
[00:42:20] yea, or that
[00:42:27] I'm sold will amend
[00:43:35] it could even be.. role::phabricator::legal inherits role::phabricator::base
[00:43:48] but not trying to make it more complicated.. need food still:)
[00:43:51] well I don't love that because legal doesn't really inheret from base
[00:43:57] heh understood
[00:44:21] yea, just make 2 roles in that one role file
[00:46:35] (PS1) Ori.livneh: apache: improve docs; provision apache::def before apache::{mod_conf,conf} [operations/puppet] - https://gerrit.wikimedia.org/r/143532
[00:47:26] paravoid: that about wraps it up for my ::apache adventures
[00:47:29] i feel pretty good about the result
[00:47:38] i hope opsen find it nice to use
[00:47:55] I haven't seen it yet
[00:48:00] who will migrate us off webserver.pp? :)
[00:48:17] (CR) Dzahn: [C: 1] apache: improve docs; provision apache::def before apache::{mod_conf,conf} [operations/puppet] - https://gerrit.wikimedia.org/r/143532 (owner: Ori.livneh)
[00:49:00] dunno, whomever is up for it. i have to turn to other things. i did get rid of apache_site, apache_mod, apache::vhost, and a bunch of other monstrosities
[00:49:22] (PS3) Rush: legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531
[00:49:24] it should be pretty straightforward
[00:49:30] mutante: thanks
[00:50:46] wanna do https://gerrit.wikimedia.org/r/#/c/143528/ as well?
it's easy to verify that apache::vhost is gone, you can grep for it
[00:52:05] (CR) Dzahn: legalpad to radon (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/143531 (owner: Rush)
[00:55:45] (CR) Dzahn: [C: 1] "looks like it's not used in prod, yea, grepped for it" [operations/puppet] - https://gerrit.wikimedia.org/r/143528 (owner: Ori.livneh)
[00:56:19] (CR) Ori.livneh: [C: 2] apache: get rid of apache::vhost [operations/puppet] - https://gerrit.wikimedia.org/r/143528 (owner: Ori.livneh)
[00:56:29] (CR) Ori.livneh: [C: 2] apache: improve docs; provision apache::def before apache::{mod_conf,conf} [operations/puppet] - https://gerrit.wikimedia.org/r/143532 (owner: Ori.livneh)
[00:57:25] mutante: :))))
[00:57:35] i may have a little ceremony now and release a dove and some balloons into the sky
[00:58:06] * YuviPanda gives ori a Tuna sandwich
[00:58:54] ori: add some flash drives with Korean WP :)
[00:59:05] heh
[00:59:24] it applied correctly on zirconium, random app server, and tin
[01:00:33] ori: congrats, that entire refactoring
[01:00:41] checked antimony, no issues
[01:00:55] (PS7) Yuvipanda: labs: Enable MinimalPuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[01:01:00] chasemp: ^ I wonder if I did a puppet goofup
[01:01:03] should chasemp put those variables inside the role class?
[01:01:15] in the diamond::collector::minimalpuppetagent
[01:01:34] mutante: fwiw I tested that does work
[01:01:48] and I want those variables in more than one role class so it seems weird do dupe them for each
[01:01:52] idk
[01:02:03] * YuviPanda goes to test his patch
[01:02:10] chasemp: ok, then just ..
you want to use the new rolename in site.pp now
[01:02:26] oops thought I fixed it honestly, coming up
[01:03:38] (PS4) Rush: legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531
[01:04:46] (CR) Dzahn: [C: 1] "let's add base::firewall , but that's for tomorrow :)" [operations/puppet] - https://gerrit.wikimedia.org/r/143531 (owner: Rush)
[01:05:25] YuviPanda: not sure what you mean by puppet goofup
[01:06:01] (PS5) Rush: legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531
[01:06:09] (CR) Rush: [C: 2 V: 2] legalpad to radon [operations/puppet] - https://gerrit.wikimedia.org/r/143531 (owner: Rush)
[01:09:00] (CR) Rush: [C: -1] "few thoughts...instead of" [operations/puppet] - https://gerrit.wikimedia.org/r/143193 (owner: Yuvipanda)
[01:09:18] chasemp: weirdly, I'm running into permission issues reading that yaml file as an unprevilaged user
[01:10:47] hmmm should be world readable
[01:11:02] chasemp: yeah, but even an ls /var/lib/puppet fails for me, and looking at the perms it shouldn't
[01:11:03] ah but not on precise?
[01:11:13] I am on precise.
[01:11:19] precise: -rw-r----- 1 root root (that I see)
[01:11:24] oh
[01:11:26] trusty:
[01:11:26] -rw-r--r-- 1 root root
[01:11:33] idk why the diff, puppet version?
[01:12:01] chasemp: maybe I should slip in something to make sure it's readable?
[01:12:21] I would rather put in the header that it requires puppet agent version X
[01:12:26] and then restrict it to trusty hosts in puppet
[01:12:35] uh, well, tools can't use it...
[01:12:42] do we have puppet3 backported to precise?
[01:12:50] that is a question I don't know, I thought yes?
[01:12:56] _joe_ is your guy there
[01:13:14] chasemp: there are no trusty hosts in tools, and I don't think we're going to migrate for a few months at least.
[01:13:36] we must have puppet3 on precise somehow, see f.e.
host iron
[01:13:46] Version: 3.4.3-1~ubuntu12.04.1
[01:13:49] interesting
[01:13:55] puppet is 3.4.3 but it's precise
[01:13:58] but now really out :)
[01:13:59] chasemp: I think it's an ensure => latest problem
[01:14:20] it's Version: 2.7.11-1ubuntu2.7 on the precise box I'm checking
[01:14:31] chasemp: yeah, but apt-cache show tells me a 3.4 exists
[01:14:31] in prod
[01:14:42] chasemp: It's not installed by default, but it's available.
[01:14:47] chasemp: see "iron"
[01:14:58] (It is installed by default in Labs though)
[01:15:52] YuviPanda: I think your next step is to see about mass ensure latest?
[01:16:06] but I don't know why it isn't done now, etc etc
[01:16:18] and I wonder what'll break :D
[01:16:28] scumbag puppet writes a status files......not user readable
[01:16:45] chasemp: can't I just set permissions on that file and let it be for this patch, and do the mass upgrade later? there are 250 hosts affected by this change...
[01:17:07] I wouldn't personally be in favor of changing perms for a file managed by a deb outside of the deb
[01:17:24] hmm
[01:17:45] I understand your thought, I just....that's a slippery slope
[01:17:46] chasemp: hmm,a ctually
[01:17:53] chasemp: the box I'm on actually has puppet 3.4
[01:18:03] chasemp: and... still permission errors?!
[01:18:20] yeah...wtf
[01:18:38] -rw-r--r-- 1 root root 599 Jul 2 01:13 last_run_summary.yaml
[01:18:46] drwxr-xr-t 3 puppet puppet 4096 Jul 2 01:13 state
[01:18:49] what's the 't' bit?
[01:19:20] sticky bit
[01:19:26] https://en.wikipedia.org/wiki/Sticky_bit
[01:19:28] http://www.linuxdevcenter.com/pub/a/linux/lpt/22_06.html
[01:19:36] beat me and better link
[01:19:38] touche
[01:19:43] YuviPanda: Sticky. On directories, it means only the owner of a file can unlink it
[01:19:49] ah, right
[01:19:53] shouldn't actually matter here
[01:20:02] It's silly, 'cause s is something else.
[01:20:08] YuviPanda: I think I've seen where a new install vs.
an upgrade can be permissions silly
[01:20:11] sometimes
[01:20:17] try installing from scratch the package
[01:20:22] and then try upgrading it, etc
[01:20:23] yuvipanda@graphite-test:~$ ls /var/lib/puppet
[01:20:27] ls: cannot open directory /var/lib/puppet: Permission denied
[01:20:35] chasemp: even ls reports it, so something weirder is happening
[01:20:53] oh
[01:20:56] I can access it fine now
[01:20:59] wtf?
[01:21:01] * YuviPanda is so confused atm
[01:21:12] is it locked when it's running an update?
[01:21:28] alice in wonderland moment here maybe
[01:22:06] [2014-07-02 01:21:00,022] [Thread-68] graphite.graphite-test.puppetagent.puppetagent.time_since_last_run 444 1404264060
[01:23:35] chasemp: ghost in the machine?
[01:24:04] chasemp: also do you think this metric would be useful in prod?
[01:24:09] I don't think so because the precise box I'm looking at is still bad perms
[01:24:29] chasemp: right, but all labs boxen seem to have puppet 3.4.3
[01:24:40] is good point
[01:24:41] Coren: ^ is that assessment right? I checked a few precise boxes and they all have 3.4.3
[01:25:04] Yeah, all the labs boxen should be 3.4.3
[01:25:19] chasemp: let me move things around
[01:25:31] * YuviPanda found about 4 boxes with stale pids due to /var/log exhaustion a few days ago
[01:26:04] Yeay diamond?
[01:26:37] Coren: heh :)
[01:27:01] Coren: it shouldn't do that anymore, since labs rotates logs more aggressively now. but really, we shouldn't have 2G /vars
[01:28:00] I beg to differ; we shouldn't have things that logs GB of stuff in days rather.
[01:28:21] Unless it's actually expected/needed for the use in which case you then allocate /var/log accordingly. :-)
[01:28:59] Coren: it was logging about 400M over 5 days, and since most instances are at about 1.3G full normally they went over.
but yeah, this particular instance was diamond's fault, but I still think we should put biglogs everywhere
[01:29:04] By definition, GB of logs for something you aren't actually trying to debug means you're logging too much stuff at to low a level and any signal is drowned.
[01:29:21] indeed.
[01:29:36] Coren: not defending diamond :)
[01:29:39] well, in this case at least
[01:30:09] it was a silly runaway log, should be cool now
[01:30:40] chasemp: updating to reflect your comments
[01:31:20] chasemp: shouldn't it be files/collector/ rather than collectors, to match manifests?
[01:31:51] (CR) TTO: "I love that this change has five +1s! It's such a happy sight." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141100 (https://bugzilla.wikimedia.org/58932) (owner: TTO)
[01:32:21] YuviPanda: the soul of semantics :)
[01:32:28] :D
[01:32:33] idk I think ppl are using plural that I've seen
[01:32:49] but I'm not going to -1 becuase of it
[01:32:57] (PS8) Yuvipanda: labs: Enable MinimalPuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[01:33:06] chasemp: I want to be consistent with manifests, so we change both or leave 'em be
[01:33:16] * YuviPanda tests
[01:35:31] YuviPanda: probably should throw in the header of the collector that it requires puppet version 3 or greater
[01:35:37] prevent much confusion later
[01:35:44] chasemp: yeah, makes sense
[01:37:34] chasemp: hmm, I see it in diamond.log but not in graphite
[01:37:47] what is it?
[01:38:33] [2014-07-02 01:14:00,006] [Thread-40] graphite.graphite-test.puppetagent.puppetagent.time_since_last_run 24 1404263640
[01:38:35] the actual metrics
[01:38:41] also I wonder if my code is goofing up somehow.
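For readers following the time_since_last_run metric in the diamond.log lines above: a minimal Python sketch of how such a value could be computed, assuming the collector reads an epoch timestamp in seconds for the last run out of last_run_summary.yaml. The function name and sample values are hypothetical and are not taken from the actual patch.

```python
import time


def time_since_last_run(last_run_epoch, now=None):
    """Return whole seconds elapsed since the last Puppet run.

    Both arguments are Unix timestamps in seconds. Python's time.time()
    returns seconds, unlike Java's System.currentTimeMillis(), which
    returns milliseconds.
    """
    if now is None:
        now = time.time()
    return int(now - last_run_epoch)


# Hypothetical values: a run recorded 24 seconds before the sample time.
print(time_since_last_run(1404263616, now=1404263640))  # prints 24
```

Passing `now` explicitly keeps the calculation easy to check by hand; mixing a milliseconds value into either argument would inflate the result by a factor of 1000.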
[01:39:07] aaah, I'm not
[01:39:13] unix timestamps are in seconds
[01:39:25] OH DEAR GOD I'VE BEEN WRITING JAVA FOR TOO LONG TO EXPECT MILLISECONDS
[01:40:13] epoch time :)
[01:40:24] (PS9) Yuvipanda: labs: Enable MinimalPuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[01:40:39] you can watch the creation log on the graphite box to see if they are being created
[01:40:50] yeah that's what I'm doing now
[01:41:40] chasemp: uh oh, weird. I don't actually see it in the whisper files :|
[01:43:42] so you see it logged, and other metrics show up in graphite
[01:44:01] I would give it a minute? graphite creation tends to be throttled, unsure of the lab config
[01:44:11] (PS1) Rush: mediawiki oauth for phab [operations/puppet] - https://gerrit.wikimedia.org/r/143549
[01:44:18] chasemp: hmm, alright.
[01:44:32] (CR) jenkins-bot: [V: -1] mediawiki oauth for phab [operations/puppet] - https://gerrit.wikimedia.org/r/143549 (owner: Rush)
[01:44:45] (PS2) Rush: mediawiki oauth for phab [operations/puppet] - https://gerrit.wikimedia.org/r/143549
[01:46:32] (CR) Rush: [C: 2 V: 2] "this was otherwise approved and merged, just moving the radon tag up to implement SUL" [operations/puppet] - https://gerrit.wikimedia.org/r/143549 (owner: Rush)
[01:46:53] chasemp: aha, found my bug.
[01:47:49] (PS10) Yuvipanda: labs: Enable MinimalPuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[01:48:29] YuviPanda: have to step away I'll look if you still need eyes in the morning?
[01:48:31] getting late here
[01:48:50] chasemp: heh, 7 AM here :) yeah, I'll get someone to merge and if not poke you tomorrow
[01:48:54] chasemp: thanks for the help/advice!
[01:49:14] I should go to sleep soon
[01:50:37] Coren: can you merge ^? works now :)
[01:51:44] gah, no.
[01:51:46] this is weird
[01:52:39] ah, right.
so the previous, buggy patch's metrics just showed up, and the new one's will in a while
[01:52:52] (PS11) Yuvipanda: labs: Enable MinimalPuppetAgent collector on all nodes [operations/puppet] - https://gerrit.wikimedia.org/r/143193
[01:56:33] (CR) Yuvipanda: [C: 1] "Tested. Metric are coming in now \o/" [operations/puppet] - https://gerrit.wikimedia.org/r/143193 (owner: Yuvipanda)
[01:57:50] * YuviPanda goes to sleep
[02:15:56] !log LocalisationUpdate completed (1.24wmf10) at 2014-07-02 02:14:53+00:00
[02:16:04] Logged the message, Master
[02:27:28] !log LocalisationUpdate completed (1.24wmf11) at 2014-07-02 02:26:24+00:00
[02:27:33] Logged the message, Master
[02:57:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 2 02:56:33 UTC 2014 (duration 56m 32s)
[02:57:43] Logged the message, Master
[03:00:43] !log rebooting lead
[03:00:48] Logged the message, Master
[03:02:37] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100%
[03:02:57] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[03:04:08] (PS1) Faidon Liambotis: Add role::mail::mx to lead [operations/puppet] - https://gerrit.wikimedia.org/r/143554
[03:04:59] (CR) Faidon Liambotis: [C: 2] Add role::mail::mx to lead [operations/puppet] - https://gerrit.wikimedia.org/r/143554 (owner: Faidon Liambotis)
[03:05:34] (CR) Faidon Liambotis: [V: 2] Add role::mail::mx to lead [operations/puppet] - https://gerrit.wikimedia.org/r/143554 (owner: Faidon Liambotis)
[03:08:57] PROBLEM - DPKG on lead is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[03:09:37] PROBLEM - Disk space on lead is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied
[03:10:37] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: Puppet has 1 failures
[03:11:57] RECOVERY - DPKG on lead is OK: All packages OK
[03:14:37] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 1 seconds
ago with 0 failures
[03:17:20] (PS1) Faidon Liambotis: spamassassin: don't create user/group debian-spamd [operations/puppet] - https://gerrit.wikimedia.org/r/143555
[03:17:43] (CR) Faidon Liambotis: [C: 2] spamassassin: don't create user/group debian-spamd [operations/puppet] - https://gerrit.wikimedia.org/r/143555 (owner: Faidon Liambotis)
[03:18:21] (CR) Faidon Liambotis: [V: 2] spamassassin: don't create user/group debian-spamd [operations/puppet] - https://gerrit.wikimedia.org/r/143555 (owner: Faidon Liambotis)
[03:20:20] (PS1) Faidon Liambotis: spamassassin: fix failed dependency [operations/puppet] - https://gerrit.wikimedia.org/r/143556
[03:20:39] (CR) Faidon Liambotis: [C: 2 V: 2] spamassassin: fix failed dependency [operations/puppet] - https://gerrit.wikimedia.org/r/143556 (owner: Faidon Liambotis)
[03:21:40] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: Complete puppet failure
[03:21:50] PROBLEM - spamassassin on lead is CRITICAL: PROCS CRITICAL: 0 processes with args spamd
[03:22:40] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[03:22:50] RECOVERY - spamassassin on lead is OK: PROCS OK: 3 processes with args spamd
[03:24:29] (PS1) Faidon Liambotis: Add AAAA for lead.wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/143557
[03:25:03] (CR) Faidon Liambotis: [C: 2] Add AAAA for lead.wikimedia.org [operations/dns] - https://gerrit.wikimedia.org/r/143557 (owner: Faidon Liambotis)
[03:26:30] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 999.082192522
[03:31:06] (PS1) Faidon Liambotis: mail: add lead as second smarthost, remove mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/143558
[03:34:16] (PS1) Faidon Liambotis: MX switch, part 3 [operations/dns] -
https://gerrit.wikimedia.org/r/143559
[03:34:18] (PS1) Faidon Liambotis: MX switch, part 4 [operations/dns] - https://gerrit.wikimedia.org/r/143560
[03:56:14] (CR) KartikMistry: [C: 1] Fix string interpolation in cxserver [operations/puppet] - https://gerrit.wikimedia.org/r/143361 (owner: Nikerabbit)
[03:59:22] (CR) KartikMistry: [C: 1] "LGTM as it is fixed as per long discussion and agreement :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140723 (owner: Nikerabbit)
[04:40:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:38:42 UTC
[04:42:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:38:42 UTC
[04:44:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:38:42 UTC
[04:46:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:38:42 UTC
[04:48:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:38:42 UTC
[04:49:15] odd
[04:50:33] RECOVERY - Puppet freshness on db1058 is OK: puppet ran at Wed Jul 2 04:50:25 UTC 2014
[04:51:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:50:25 UTC
[04:53:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:50:25 UTC
[04:55:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:50:25 UTC
[04:57:53] PROBLEM - Puppet freshness on db1058 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 04:50:25 UTC
[04:58:10] icinga: Warning: The results of service 'Puppet freshness' on host 'db1058' are stale by 0d 0h 0m 45s (threshold=0d 0h 1m 15s)
[04:58:37] what breaks a threshold check like that
[04:58:43] RECOVERY - Puppet freshness on db1058 is OK: puppet ran at Wed Jul 2 04:58:41
UTC 2014
[06:27:31] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:30] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:30] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:30] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:40] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:40] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:40] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:40] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:41] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:41] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:41] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:50] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:50] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:50] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:50] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:50] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:00] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:00] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:10] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:20] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:20] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on mw1150 is
CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:40] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:50] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:20] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.023 second response time
[06:36:50] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:37:40] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[06:45:01] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:45:11] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:45:21] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:45:31] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:41] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:45:51] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on searchidx1001 is OK:
OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:42] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:52] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:52] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:52] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:52] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:47:01] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:47:21] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:47:41] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:50:53] <_joe_> we don't know about service dependencies do we
[07:01:32] (Abandoned) Matanya: apt::puppet: remove puppet 2.7 conditional [operations/puppet] - https://gerrit.wikimedia.org/r/143283 (owner: Matanya)
[07:02:16] (CR) Matanya: [C: 1]
Remove unused upstream Openstack module [operations/puppet] - https://gerrit.wikimedia.org/r/141835 (owner: Andrew Bogott)
[07:35:46] good morning
[07:36:04] hey hashar
[07:36:28] godog: I completely forgot about my puppet change for Zuul :-(
[07:36:46] hashar: no worries
[07:40:02] <_joe_> godog: seee ^, don't you think we need to start to define some service dependency in icinga?
[07:40:12] <_joe_> like, if the puppet master fails, all nodes will fail
[07:40:41] <_joe_> and we know that, and we don't need to be flooded
[07:42:21] +1
[07:42:38] though maintaining dependencies is not an easy task :-/
[07:43:17] indeed, at least for checks that we know for sure are going to fail/flood that'd be good
[07:45:24] !log umounted (empty and broken) sdk1 from ms-be3003 and wipe its first sectors, no more remounts
[07:45:30] Logged the message, Master
[07:47:56] (PS2) Filippo Giunchedi: swift: add icehouse pinning to ms-be1* [operations/puppet] - https://gerrit.wikimedia.org/r/143282
[07:48:03] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: add icehouse pinning to ms-be1* [operations/puppet] - https://gerrit.wikimedia.org/r/143282 (owner: Filippo Giunchedi)
[07:49:06] (CR) Hashar: [C: 2] "The change only impacts -labs file which is safe. The service fully rely on beta and not on production at all or another labs project."
(1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140723 (owner: Nikerabbit)
[07:49:20] (Merged) jenkins-bot: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140723 (owner: Nikerabbit)
[07:49:55] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:54:46] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time
[07:58:45] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:59:10] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[08:00:31] !log upgrading ms-be1001 to swift icehouse
[08:00:36] Logged the message, Master
[08:08:12] <_joe_> we need one more puppetmaster i'd say
[08:14:50] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 6.878 second response time
[08:16:44] ah, strontium/palladium are at capacity?
[08:17:50] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:18:01] <_joe_> godog: they contionue to spawn 502s from time to time
[08:18:09] <_joe_> gotta love puppet scalability
[08:18:38] <_joe_> maybe we can look in optimizing them if it hasn't been done thoroughly
[08:20:15] hashar: thanks.
[08:20:51] hashar: https://gerrit.wikimedia.org/r/#/c/143361/ if you're free :)
[08:21:07] ahah
[08:21:09] usual typo
[08:21:20] (CR) Hashar: "usual typo :-D" [operations/puppet] - https://gerrit.wikimedia.org/r/143361 (owner: Nikerabbit)
[08:21:27] (CR) Hashar: [C: 1] Fix string interpolation in cxserver [operations/puppet] - https://gerrit.wikimedia.org/r/143361 (owner: Nikerabbit)
[08:21:43] akosiaris: would you merge a trivial puppet typo for cxserver please ?
https://gerrit.wikimedia.org/r/#/c/143361/ :-D
[08:26:26] PROBLEM - MySQL Processlist on db1068 is CRITICAL: CRIT 73 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[08:27:26] RECOVERY - MySQL Processlist on db1068 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics
[08:29:46] (CR) Alexandros Kosiaris: [C: 2] Fix string interpolation in cxserver [operations/puppet] - https://gerrit.wikimedia.org/r/143361 (owner: Nikerabbit)
[08:32:47] hashar: done
[08:36:37] kart_: merged :D
[09:04:01] (PS2) Matanya: swift: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140654
[09:04:49] godog: can you please review this before i need to do another painful rebase ? :)
[09:05:07] hashar: akosiaris: \0/
[09:05:17] (CR) jenkins-bot: [V: -1] swift: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140654 (owner: Matanya)
[09:05:55] matanya: seems like jenkins does not like it
[09:06:05] me too :/
[09:06:27] :D
[09:08:07] matanya: thanks! can it way until I'm done with https://wikitech.wikimedia.org/wiki/Swift/Icehouse however?
[09:08:12] wait even
[09:08:38] as long as i keep rebasing godog:)
[09:09:34] matanya: rebasing swift you mean? I don't expect many changes soon hence painless rebases in theory
[09:09:59] matanya: btw I wasn't aware you were on a quest to remove tabs from puppet :)
[09:10:09] then cool, i thought that about this one too, but you see what happened ...
[09:10:28] godog: my goal is to allow jenkins to vote on style
[09:11:32] heheh
[09:16:29] (PS3) Matanya: swift: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140654
[09:18:08] godog: at your spare time, it is ready. I'll keep it rebased as needed. thanks!
[09:18:56] matanya: ack, I'll eyeball it but the timing is a bit unfortunate :( thanks for keeping it up to date :))
[09:21:07] (PS2) Matanya: cache: lint [operations/puppet] - https://gerrit.wikimedia.org/r/140678
[09:21:46] (PS2) Matanya: mailrelay: convert 'true' into a real boolean [operations/puppet] - https://gerrit.wikimedia.org/r/143251
[09:22:00] (PS2) Matanya: apt: minor lint [operations/puppet] - https://gerrit.wikimedia.org/r/143271
[09:22:28] * matanya keeps rebasing forever ...
[09:22:42] godog: I am running those 3 through the catalog differ, I have them
[09:22:48] plus some others :-)
[09:23:29] matanya: I fixed it yesterday, it should not be long now
[09:23:38] thanks akosiaris!
[09:28:48] (PS2) Spage: add new Mantle extension, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142151 (https://bugzilla.wikimedia.org/66094)
[09:32:37] akosiaris: ah ok! that's good proof too
[09:56:12] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[10:43:11] (CR) Tim Landscheidt: [C: -1] "No, the package has been requested by Daniel in bug #52717 and actually you (:-)) submitted the Puppet change. The fix to correct the dep" [operations/puppet] - https://gerrit.wikimedia.org/r/142819 (owner: Yuvipanda)
[10:57:15] (PS1) TTO: Add additional upload domain for Erasmus University [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143593
[10:57:28] (PS2) TTO: Add additional upload domain for Erasmus University [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143593 (https://bugzilla.wikimedia.org/67355)
[11:02:43] (CR) Hashar: [C: 1] Add additional upload domain for Erasmus University [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143593 (https://bugzilla.wikimedia.org/67355) (owner: TTO)
[11:03:00] Reedy: around?
I am not sure how to deploy nowadays :-D
[11:03:15] I would like to push an update of wmf-config/InitialiseSettings.php I guess scap-file still works isn't it ?
[11:03:19] It's roughly the same
[11:03:22] Review it, submit
[11:03:25] git pull in /a/common
[11:03:33] sync-file wmf-config/InitialiseSettings.php FOOBAR
[11:03:40] (CR) Hashar: [C: 2] Add additional upload domain for Erasmus University [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143593 (https://bugzilla.wikimedia.org/67355) (owner: TTO)
[11:03:42] trying :_D
[11:03:47] (Merged) jenkins-bot: Add additional upload domain for Erasmus University [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143593 (https://bugzilla.wikimedia.org/67355) (owner: TTO)
[11:03:51] that's on tin
[11:05:13] !log hashar Synchronized wmf-config/InitialiseSettings.php: additional upload domain for Erasmus University {{gerrit|143593}} {{bug|67355}} (duration: 00m 06s)
[11:05:14] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly
[11:05:18] Logged the message, Master
[11:05:27] hmm
[11:05:40] bd808|BUFFER: scappy is nice. My first ever prod deploy using it \O/
[11:06:48] Reedy: thank you!
[11:17:14] (CR) Alexandros Kosiaris: "Yes it replaces both old puppet checks" [operations/puppet] - https://gerrit.wikimedia.org/r/142560 (owner: Dzahn)
[11:30:52] (PS1) Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[11:32:01] <_joe_> ok, off to lunch, if anyone needs the puppet catalog compiler, I'll fix it in the afternoon to work how we want now (as a change differ)
[11:32:04] (CR) jenkins-bot: [V: -1] nutcracker: move config in puppet, work with trusty packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597 (owner: Giuseppe Lavagetto)
[11:37:57] (PS2) Springle: Make dbstore1002 handle s5 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/143399 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[11:38:18] (CR) Springle: [C: 2] Make dbstore1002 handle s5 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/143399 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[11:40:25] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 09:40:01 UTC
[11:42:22] * YuviPanda waves
[11:49:25] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:35] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:45] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:49:46] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:49:55] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:05] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:15] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:50:15] PROBLEM - Disk space on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:25] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:35] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:45] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:55] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:51:35] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[11:51:35] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[11:51:35] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[11:51:45] RECOVERY - DPKG on tungsten is OK: All packages OK
[11:51:46] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 2.960 second response time
[11:51:55] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[11:52:05] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9055 bytes in 0.014 second response time
[11:52:05] RECOVERY - Disk space on tungsten is OK: DISK OK
[11:52:15] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[11:52:15] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[11:52:25] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[11:52:25] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1340 seconds ago with 0 failures
[12:00:03] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jul 2 12:00:02 UTC 2014
[12:08:33] PROBLEM - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:09:57] !log upgraded PHP5 to 5.3.10-1ubuntu3.12+wmf1 on test.wikipedia.org
[12:10:02] Logged the message, Master
[12:10:33] RECOVERY - DPKG on mw1017 is OK: All packages OK
[12:18:34] !log upgraded PH5 to 5.3.10-1ubuntu3.12+wmf1 on deployment-apache01 and deployment-apache02 (beta)
[12:18:39] Logged the message, Master
[12:21:51] (PS2) Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[12:29:48] (PS3) Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[12:32:05] (CR) Faidon Liambotis: nutcracker: move config in puppet, work with trusty packages (6 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/143597 (owner: Giuseppe Lavagetto)
[12:34:04] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 10:32:54 UTC
[12:41:21] (CR) Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/143597 (owner: Giuseppe Lavagetto)
[12:42:39] (PS1) Hoo man: Properly handle single quotes in mwgrep search terms [operations/puppet] - https://gerrit.wikimedia.org/r/143599
[12:43:53] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:43:53] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:43:53] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:44:13] PROBLEM - Disk space on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:44:13] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:44:13] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:44:23] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:44:43] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[12:44:43] RECOVERY - DPKG on tungsten is OK: All packages OK
[12:44:43] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning.
[12:45:03] RECOVERY - Disk space on tungsten is OK: DISK OK
[12:45:03] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9055 bytes in 0.127 second response time
[12:45:03] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time
[12:45:13] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[12:46:25] (PS2) Faidon Liambotis: mail: add lead as second smarthost, remove mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/143558
[12:48:06] (PS1) Alexandros Kosiaris: osm-dbs to labsdb1006 and labsdb1007 [operations/dns] - https://gerrit.wikimedia.org/r/143600
[12:48:08] (PS1) Alexandros Kosiaris: osm-dbs to labsdbs mgmt IPs rename [operations/dns] - https://gerrit.wikimedia.org/r/143601
[12:50:21] mark: should I deploy https://gerrit.wikimedia.org/r/#/c/80973/ ?
:)
[12:50:26] like, now :)
[12:54:25] (Abandoned) Hashar: Updating debian package files [operations/debs/adminbot] - https://gerrit.wikimedia.org/r/68935 (owner: AzaToth)
[13:00:04] K4-713: The time is nigh to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T1300)
[13:02:29] !log Jenkins: dropping history of puppet related jobs after 90 days. {{gerrit|136992}}
[13:02:34] Logged the message, Master
[13:07:26] (PS3) Hashar: zuul: split conf file for server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/141572
[13:08:45] (PS3) Hashar: zuul: migrate statsd_host to zuul::server [operations/puppet] - https://gerrit.wikimedia.org/r/141657
[13:10:14] (CR) Mark Bergsma: [C: -1] "Let's get redundant transport first before we put even more on esams..." [operations/dns] - https://gerrit.wikimedia.org/r/80973 (owner: Faidon Liambotis)
[13:10:37] !log Jenkins being busy deleting history files
[13:10:43] Logged the message, Master
[13:11:17] (CR) Manybubbles: [C: 1] Properly handle single quotes in mwgrep search terms [operations/puppet] - https://gerrit.wikimedia.org/r/143599 (owner: Hoo man)
[13:13:21] (CR) Hashar: [C: -1] "I hate the configuration files duplication. Need to find something smarter to generate the conf files." [operations/puppet] - https://gerrit.wikimedia.org/r/141572 (owner: Hashar)
[13:16:33] (PS6) Faidon Liambotis: Switch Central/South Asia to esams [operations/dns] - https://gerrit.wikimedia.org/r/80973
[13:16:35] (PS2) Faidon Liambotis: Switch more Asia-Pacific countries to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/140064
[13:17:07] (CR) Faidon Liambotis: [C: 2] Switch more Asia-Pacific countries to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/140064 (owner: Faidon Liambotis)
[13:17:39] jenkins down?
[13:18:11] (CR) Faidon Liambotis: [V: 2] Switch more Asia-Pacific countries to ulsfo [operations/dns] - https://gerrit.wikimedia.org/r/140064 (owner: Faidon Liambotis)
[13:18:11] na busy
[13:18:37] deleting the million of files in the history of puppet jobs was probably not a good idea
[13:19:16] it is resuming
[13:19:31] I forced V+2 it
[13:19:52] I think I managed the one java thread that was locking everything \O/
[13:39:00] hashar: adminbot is not used anymore?
[13:39:21] AzaToth: no clue. I have abandoned the patch because it was too old :D
[13:39:42] feel free to restore it
[13:39:45] (CR) Andrew Bogott: [C: 2] mailrelay: convert 'true' into a real boolean [operations/puppet] - https://gerrit.wikimedia.org/r/143251 (owner: Matanya)
[13:40:02] hashar: well, the main issue was the total lack of license
[13:41:23] hashar: and what do you mean it was too old?
[13:41:35] there's been no aditional commit to the repo since
[13:41:37] reopen it if you want
[13:41:59] I randomly abandon changes that had no activities after several months
[13:42:03] ok
[13:42:04] assuming they are bitrotting
[13:42:54] (Restored) AzaToth: Updating debian package files [operations/debs/adminbot] - https://gerrit.wikimedia.org/r/68935 (owner: AzaToth)
[13:43:30] hashar: you could merge the change if you want
[13:43:49] the actual changeset doesn't depend on the license
[13:44:02] only the use of the code
[13:53:44] (PS3) Giuseppe Lavagetto: mediawiki: manage the apache config via puppet [operations/puppet] - https://gerrit.wikimedia.org/r/143329
[13:58:10] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Complete puppet failure
[13:59:10] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[14:00:24] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 12:00:02 UTC
[14:00:47] (CR) coren: [C: 2] "That seems like an odd confusion
between metric and alert, but meh." [operations/puppet] - https://gerrit.wikimedia.org/r/143193 (owner: Yuvipanda)
[14:00:59] Coren: why so?
[14:01:34] they're both metrics.
[14:02:56] Well, "time since puppet was last run" is strictly a metric but not a quantitatively useful one. (I could see why puppet run length might be, since you'd want to know if there is a trend)
[14:04:15] Coren: you can see the interval between puppet runs with this metric
[14:05:04] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3279.81390801
[14:05:10] should be 20 minutes in general, but won't be due to issues, so you can measure how far you are from expected 20 minutes
[14:05:38] The interval is fixed by definition; so the graph would be a pretty sawtooth except when puppet fails to run, in which case the only actually useful data is that it didn't. Hence "alert vs metric" :-)
[14:08:41] Another way to put it: every sample has only a single bit of information that switches between (reset the value) and (increase the value by the sample rate).
[14:08:56] (PS1) Filippo Giunchedi: swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611
[14:09:54] I see your point Coren :)
[14:13:35] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jul 2 14:13:33 UTC 2014
[14:21:35] (CR) jenkins-bot: [V: -1] swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[14:22:50] (CR) Giuseppe Lavagetto: "@Brian: I'm not proficient enough with beta to be able to suggest you how to use this.
Consider this commit a mid-step on the road to a fu" [operations/puppet] - https://gerrit.wikimedia.org/r/143329 (owner: Giuseppe Lavagetto)
[14:24:13] (PS2) Filippo Giunchedi: swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611
[14:27:19] (CR) Andrew Bogott: "Can you explain a bit better in the header comments how to set up the mediawiki install that this relies on? It's not clear to me how thi" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[14:27:24] (PS3) Filippo Giunchedi: swift: rewrite middleware integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611
[14:28:15] (CR) Andrew Bogott: "...but let's keep the old checks until the new checks are running reliably." [operations/puppet] - https://gerrit.wikimedia.org/r/142560 (owner: Dzahn)
[14:39:16] (Abandoned) Chad: Pool 2 wikis (dewiki, frwiki, jawiki) get Cirrus as primary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140754 (owner: Chad)
[14:39:22] (Abandoned) Chad: Move remaining pool 3 wikis to Cirrus as primary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140753 (owner: Chad)
[14:39:46] (PS2) Andrew Bogott: Modify nova role to better support labs uses. [operations/puppet] - https://gerrit.wikimedia.org/r/141836
[14:39:48] (PS2) Andrew Bogott: Remove unused upstream Openstack module [operations/puppet] - https://gerrit.wikimedia.org/r/141835
[14:41:25] (CR) Andrew Bogott: [C: 2] Remove unused upstream Openstack module [operations/puppet] - https://gerrit.wikimedia.org/r/141835 (owner: Andrew Bogott)
[14:44:23] (CR) Andrew Bogott: [C: -2] "This will never be merged, but I'm salvaging bits of it for other patches."
[operations/puppet] - https://gerrit.wikimedia.org/r/53989 (owner: Andrew Bogott)
[14:44:30] (PS2) Chad: Reverse Cirrus config, all wikis get it by default now [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142607
[14:44:32] (PS3) Chad: Move commons over to Cirrus as primary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140752
[14:55:36] (PS1) Phuedx: Re-enable the anonymous signup invite experiment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143614
[14:56:19] anomie: looks like I'll do swat today too
[14:56:47] manybubbles|away: Ah, some came in last minute. Go for it.
[14:57:04] man, I've been |away for ever!
[14:57:10] (CR) Manybubbles: [C: 1] Remove two rights from editors on ruwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143094 (https://bugzilla.wikimedia.org/67304) (owner: John F. Lewis)
[14:57:38] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 12:57:11 UTC
[14:57:52] maybubbles :p
[14:57:54] (CR) Manybubbles: [C: 1] Reverse Cirrus config, all wikis get it by default now [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142607 (owner: Chad)
[14:58:04] JohnLewis: want me to do yours first?
[14:58:15] <^d> maybubbles?
[14:58:18] maybubbles: I'm not bothered :)
[14:58:24] ^d: ?
[14:58:31] maybubbles: Your nick :p
[14:58:33] Missing the n
[14:58:53] can't type sometimes
[14:59:03] maybubbles sounds fun though
[14:59:08] (CR) Manybubbles: [C: 2] Remove two rights from editors on ruwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143094 (https://bugzilla.wikimedia.org/67304) (owner: John F. Lewis)
[14:59:15] (Merged) jenkins-bot: Remove two rights from editors on ruwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143094 (https://bugzilla.wikimedia.org/67304) (owner: John F.
Lewis)
[14:59:32] manybubbles: Use that as your nick in May :D
[14:59:38] greg grossimer? am I good to work on this? https://bugzilla.wikimedia.org/show_bug.cgi?id=51497
[14:59:49] (PS4) Filippo Giunchedi: swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611
[14:59:57] dogeydogey: he is greg-g on irc
[15:00:04] manybubbles, anomie, ^d: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T1500)
[15:00:20] greg-g -- can I work on this? https://bugzilla.wikimedia.org/show_bug.cgi?id=51497
[15:00:46] (CR) Filippo Giunchedi: swift: rewrite middle integration test (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[15:00:50] ooh, monitoring for the beta cluster :p
[15:00:53] !log manybubbles Synchronized wmf-config: SWAT Remove two permissions from some editors on ruwiki (duration: 00m 07s)
[15:00:57] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jul 2 15:00:48 UTC 2014
[15:00:58] Logged the message, Master
[15:01:08] matanya are there any bugs I can work on?
[15:01:13] JohnLewis: done - can you verify?
[15:01:24] manybubbles: looking
[15:01:45] (CR) Filippo Giunchedi: "I've clarified the commit message to explain what's the intended use case, it should be more clear now but let me know if I'm missing some" [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[15:01:53] dogeydogey: you should ask the ops team this question. I guess mark or paravoid can answer best
[15:02:17] manybubbles: looks good
[15:02:17] mark or paravoid got any good tasks for a new contributor to work on?
[15:02:31] JohnLewis: sweet
[15:02:37] ^d: time for our noop
[15:02:52] <^d> Yes. InitialiseSettings needs to go first.
[15:03:23] (CR) Manybubbles: [C: 2] Reverse Cirrus config, all wikis get it by default now [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142607 (owner: Chad)
[15:03:31] (PS4) Giuseppe Lavagetto: nutcracker: move config in puppet, work with trusty packages [operations/puppet] - https://gerrit.wikimedia.org/r/143597
[15:03:33] ^d: can do
[15:03:33] (Merged) jenkins-bot: Reverse Cirrus config, all wikis get it by default now [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142607 (owner: Chad)
[15:03:38] <_joe_> manybubbles: \o/
[15:03:52] _joe_: its not quite that happy - 11 are still opted out
[15:04:00] and we'll slowly opt them in over the next two month
[15:04:15] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s)
[15:04:16] <_joe_> manybubbles: pool 4 is our of prod with this change, is it?
[15:04:19] Logged the message, Master
[15:04:32] !log manybubbles Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 05s)
[15:04:38] Logged the message, Master
[15:04:46] <^d> _joe_: pool 4 still has commons, that's monday.
[15:04:46] _joe_: Monday we're moving commons over
[15:04:52] <_joe_> ok
[15:05:02] that'll be the last one, save for some annoying stragglers
[15:05:07] <_joe_> :)
[15:05:10] <^d> Nothing else is on pool 4.
[15:06:43] ^d: except for indiscussionpage and maybe some other undexpected stragglers - we'll have to look once we cut over commons
[15:07:04] <^d> lemme double check again.
[15:07:10] <^d> I think unexpected stragglers is 0.
[15:07:13] !log swap complete - logged off of tin
[15:07:17] Logged the message, Master
[15:07:36] (PS3) Giuseppe Lavagetto: mediawiki: collect apc variables via diamond [operations/puppet] - https://gerrit.wikimedia.org/r/142250
[15:07:52] manybubbles: swap ?
[15:08:09] <_joe_> when is the deploy window over?
[15:08:14] ^d: I believe the "right" way to check is to download the search log from oxygen.wikimedia.org/a/log/lucene/lucene.log and grep for the servers in group4
[15:08:24] !log *SWAT* complete
[15:08:28] Logged the message, Master
[15:08:29] _joe_: right now
[15:08:34] I just finished it
[15:08:38] <_joe_> ok!
[15:08:46] <^d> manybubbles: I check lucene.pp in puppet.
[15:08:47] <_joe_> I wanted to merge the first apc change
[15:08:55] <_joe_> apc-monitoring
[15:08:56] _joe_: you have the floor
[15:09:08] ^d: yeah - I've been checking there too
[15:09:18] I think that the "ultimate" authority is the actual log of queries though
[15:10:00] Coren: re: metrics vs alert, true :) but it's fairly useful, I'd say
[15:10:49] <^d> _joe_: The upside of that config change is at least...if you create a new wiki, nothing special has to be done :)
[15:11:15] <_joe_> ^d: that sounds good!
[15:12:47] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[15:15:25] (CR) Andrew Bogott: [C: 1] swift: rewrite middle integration test [operations/puppet] - https://gerrit.wikimedia.org/r/143611 (owner: Filippo Giunchedi)
[15:15:28] awe, did I mess it up!
[15:16:06] looks like I need tin again.
[15:16:08] any objections?
[15:16:17] <_joe_> no
[15:16:39] <_joe_> I've aborted my crazy idea of merging the change without proper testing :D
[15:16:46] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 18s)
[15:16:47] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly
[15:16:50] Logged the message, Master
[15:17:09] <_joe_> this check *is* useful after all :)
[15:17:14] <_joe_> happy to see this
[15:17:16] _joe_: yeah!
[15:17:22] I'm not very good at this I guess
[15:17:33] !log manybubbles Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 28s)
[15:17:34] people suck at repeating tasks
[15:17:38] Logged the message, Master
[15:17:58] ^d: now its done
[15:18:07] also, we should look at raising the timeouts for insource://
[15:19:18] !log done with SWAT for real this time
[15:19:22] Logged the message, Master
[15:21:37] (PS1) Andrew Bogott: Requote a boolean. [operations/puppet] - https://gerrit.wikimedia.org/r/143616
[15:21:45] <^d> th
[15:21:46] <^d> x
[15:21:50] <^d> ^ manybubbles
[15:21:51] <^d> I can't type.
[15:22:00] ^d: welcome!
[15:22:48] (CR) Rush: [C: 1] Requote a boolean. [operations/puppet] - https://gerrit.wikimedia.org/r/143616 (owner: Andrew Bogott)
[15:23:50] (PS2) Andrew Bogott: Requote a boolean. [operations/puppet] - https://gerrit.wikimedia.org/r/143616
[15:24:09] (PS1) Giuseppe Lavagetto: catalog-compiler: move compilation of changes to puppet 3 by default [operations/software] - https://gerrit.wikimedia.org/r/143618
[15:24:38] (CR) Giuseppe Lavagetto: [C: 2] catalog-compiler: move compilation of changes to puppet 3 by default [operations/software] - https://gerrit.wikimedia.org/r/143618 (owner: Giuseppe Lavagetto)
[15:25:36] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:25:46] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:38] <_joe_> tungsten needs love
[15:26:46] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:26:49] (CR) Rush: [C: 1] "I <3 this" [operations/puppet] - https://gerrit.wikimedia.org/r/143616 (owner: Andrew Bogott)
[15:26:53] <_joe_> or hate, whichever motivates us more
[15:26:56] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:26:56] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:26:56] PROBLEM - Disk space on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:06] PROBLEM - gdash.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:16] PROBLEM - check configured eth on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:26] PROBLEM - check if dhclient is running on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:26] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:27:36] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1031 seconds ago with 0 failures
[15:27:37] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical
[15:27:37] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[15:27:46] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning.
[15:27:46] RECOVERY - Disk space on tungsten is OK: DISK OK
[15:27:46] RECOVERY - DPKG on tungsten is OK: All packages OK
[15:27:56] RECOVERY - gdash.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 9055 bytes in 0.016 second response time
[15:28:07] RECOVERY - check configured eth on tungsten is OK: NRPE: Unable to read output
[15:28:17] RECOVERY - check if dhclient is running on tungsten is OK: PROCS OK: 0 processes with command name dhclient
[15:28:17] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning.
[15:29:29] akosiaris: did you notice that the puppet check threw a few dozen criticals last night?
[15:35:45] andrewbogott: yeah, I 've wanted to ask you. Did you figure out why ?
[15:36:02] and the race condition still exists in virt1009, right ?
[15:38:48] (CR) Alexandros Kosiaris: "Maybe diamond supports using other boolean meant strings like rsync does?
yes,no worked nice in rsync" [operations/puppet] - https://gerrit.wikimedia.org/r/143616 (owner: Andrew Bogott)
[15:39:05] we are btw gonna see more of this ^
[15:42:20] dogeydogey: you can, but I'm worried it isn't an easy task. YuviPanda has some opinions on it. You two should sync up.
[15:45:36] YuviPanda let me know if this is something you want to pass off to me, otherwise I can work on something else
[15:45:54] akosiaris: I didn't figure out why. I looked at a couple of servers that were throwing alarms and they seemed fine...
[15:46:01] And, yeah, virt1009 is still inconsistent.
[15:46:35] Someone said something last night about thinking that a hiccup on the server side (e.g. palladium) caused the storm but I don't know if that's right.
[15:47:25] (PS1) Reedy: Add pr_index table from Proofread Page extension [operations/software] - https://gerrit.wikimedia.org/r/143622
[15:48:46] YuviPanda: the dogeydogey question is re https://bugzilla.wikimedia.org/show_bug.cgi?id=51497
[15:51:14] (CR) Tpt: [C: 1] Add pr_index table from Proofread Page extension [operations/software] - https://gerrit.wikimedia.org/r/143622 (owner: Reedy)
[15:56:13] (CR) Phe: [C: 1] "Needed for future improvement of wikisource tools."
[operations/software] - 10https://gerrit.wikimedia.org/r/143622 (owner: 10Reedy) [15:56:45] (03PS3) 10Krinkle: Requote a boolean [operations/puppet] - 10https://gerrit.wikimedia.org/r/143616 (owner: 10Andrew Bogott) [15:57:14] (03CR) 10Krinkle: Requote a boolean (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143616 (owner: 10Andrew Bogott) [15:58:00] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jul 2 15:57:56 UTC 2014 [16:09:06] (03PS3) 10Ottomata: Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [16:13:42] (03PS4) 10Ottomata: Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [16:19:03] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:15] hashar: since the patch for deploying ContentTranslation is merged. When it should be available on http://en.wikipedia.beta.wmflabs.org/ ? [16:21:39] (03PS1) 10Ottomata: Don't set zookeeper configs in hive-site.xml if $zookeeper_hosts is empty [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143630 [16:22:01] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [16:27:15] kart_: mediawiki-config changes are deployed just after merge. so should be enabled [16:27:22] kart_: need to get out, daughter back home sorry [16:30:24] !log upgrading jenkins to jenkins_1.554.3_all.deb on the apt repo [16:30:28] Logged the message, Master [16:32:57] hey akosiaris, thanks for the review [16:33:11] i think the apache module is in pretty good shape now, what do you think? 
[16:33:21] (03CR) 10Ottomata: [C: 032 V: 032] Don't set zookeeper configs in hive-site.xml if $zookeeper_hosts is empty [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143630 (owner: 10Ottomata) [16:33:52] (03PS5) 10Ottomata: Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 [16:34:49] there are configuration snippets that are repeated in the codebase (ldap auth specifically) that seem to beg for some abstraction to facilitate reuse [16:34:54] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.005 second response time [16:35:12] but i haven't been able to come up with a particularly clean way of doing it [16:38:13] <_joe_> ori: let's not ovecomplicate things :) [16:38:24] <_joe_> I think the module is pretty fine now [16:38:35] yeah, me too. i'm happy with it [16:39:06] some copypasta is better than abstractions that overreach and get in the way [16:39:50] that makes 3 of us :-) [16:42:52] what sort of precautions should we take for deploying the maxclients patch? (https://gerrit.wikimedia.org/r/#/c/137947/) [16:43:29] (03CR) 10Gage: [C: 032] Use CDH5 Hive for labs instances [operations/puppet] - 10https://gerrit.wikimedia.org/r/143476 (owner: 10Ottomata) [16:43:38] probably makes sense to add that to the deployment calendar [16:44:00] any preference with respect to day of week / time of day? [16:45:23] <_joe_> ori: I'd say apache-config + twemproxy config are very delicate as well [16:45:44] twemproxy is such a headache because it's "nutcracker" in trusty [16:45:52] could we perhaps backport faidon's package to precise? 
[16:46:40] it'd be horrible to introduce if versioncmp($::lsbdistrelease, '14.04') $pkg = 'nutcracker' else 'twemproxy' garbage to the code [16:47:41] I was just deliberating that earlier :) [16:49:57] alternately we could revel in it and celebrate confusion and introduce a third name, unrelated to the first two [16:50:03] <_joe_> ori: yes that was the other possibility (I don't see that as a tragedy as long as you mask inside the module) [16:50:16] paravoid: btw, today is my last day this week, so if I don't chat with you later, god speed and be safe. [16:50:23] greg-g: cheers! [16:50:33] I'll be online on and off next week as well [16:50:36] <_joe_> ori: it's not just the package name, there is a change for that [16:50:48] paravoid: oh right, well, pre-emptive be safe then ;) [16:50:58] _joe_: yeah, twemproxy's upstart file is provisioned by the module, but nutcracker's is included with the package [16:51:04] ;) [16:51:15] <_joe_> ori: look at the change I submitted [16:51:20] * ori does [16:51:36] oh hey, look at that. i missed that entirely. [16:51:36] <_joe_> this is what we need to do if we don't backport the new package [16:51:45] ori: already reviewed too ;) [16:52:51] (03PS1) 10Krinkle: Add OBSOLETE [operations/debs/testswarm] - 10https://gerrit.wikimedia.org/r/143635 [16:53:15] _joe_: nice work! [16:53:33] What is the policy on utterly obsolete repos? Is operations comfortable with having them deleted? [16:53:40] yes [16:53:41] Or do we want to keep it? [16:53:44] from a glance at least at puppet I don't there would be many places to put if trusty / else no? [16:54:10] specifically debs* kind of repos. 
The others are going to be removed either way [16:54:23] godog: i was mostly worried about the amount of thinking required to get it right, but i think _joe_ got it [16:54:48] https://icinga.wikimedia.org/icinga/ [16:55:27] jingle bells, jingle bells, jingle all the way [16:55:57] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hosts=all&style=hostservicedetail&hoststatustypes=12&hostprops=2097162&servicestatustypes=28&serviceprops=2097162&nostatusheader [16:56:01] specifically [16:56:11] ^d: So, these have been redundant for 1+ year, I just cleared them out: integration/gruntjs.git, integration/testswarm.git, integration/grunt-contrib-wikimedia.git [16:56:26] lead: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied <-- that one, i remember how we added the nagios/icinga user into an additional group for that, i think it was that [16:56:29] ori: ye I was looking at the code review too, I was thinking more of a nutcracker module tbh (again, from a quick look at puppet) [16:56:34] paravoid: the 'puppet last run' crits can probably be ignored, that test is still troubled. [16:56:35] mutante: yeah I'll fix that [16:56:55] mutante: I provisioned that box at 6am, didn't really feel like fixing it at the time :P [16:57:30] still missing_thankyous, we're so uncouth [16:57:48] godog: yeah, we should rename it to nutcracker [16:58:21] <^d> Krinkle: You're admin on wikimedia github, right? [16:58:37] I'm not [16:58:45] <^d> Ok nvm. [16:58:48] only collab, no admin [16:59:16] (03PS24) 10Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [16:59:24] <_joe_> godog: having separate modules here and doing the conditional in the role is *exactly* the pattern we should stay away from. 
Apart from that, I'll backport the new nutcracker package to precise and that will simplify everything [16:59:36] paravoid: cool:) i think we just added user to exim group but with an "exec" in puppet [17:00:03] _joe_: it is a transitional thing though, not like labs/nolabs [17:00:07] <_joe_> godog: IMO, puppet is useful if implementation details are masked within modules that offer one function [17:00:13] ottomata: will you do the SoS? [17:00:17] <^d> Krinkle: Heh, you are I think :p [17:00:34] ^d: Thx [17:00:49] <_joe_> godog: it's common for things to change between distro versions, but I do agree that backporting will make the module more uniform [17:00:57] what's up with dobson? [17:01:15] it's still running but refuses all in icinga.. looking [17:01:26] ^d: I guess the reason you're checking there is because gerrit doens't delete mirrors? [17:01:31] _joe_: i think backporting would be cleaner, yeah, but the way you did it in https://gerrit.wikimedia.org/r/#/c/143597/ looks ok too [17:01:45] <^d> Krinkle: Yeah, so was wondering if I'd need to clean up github or make you DIY :p [17:01:54] np [17:02:35] <_joe_> ori: also, as paravoid noted, it would mean we try the new nutcracker _before_ releasing hhvm to jobrunners [17:02:43] <_joe_> which is a neat plus I think [17:03:08] yep [17:03:09] <^d> Krinkle: All 3 gone from gerrit + gitblit [17:04:09] _joe_: yep backporting works too, as long as we stick with one name for everything [17:04:23] <_joe_> godog: eh :) [17:04:52] paravoid: the packages work really well in mediawiki-vagrant btw [17:04:53] nutproxy? [17:04:53] <^d> chasemp: I've already got a patch for the elastic index thing :) [17:05:01] mutante: sounds... gross [17:05:22] heh, it seemed better than twemcracker [17:05:50] runs puppet on erbium [17:06:00] why does twemproxy/nutcracker have two names? can we settle on one? 
[17:06:02] which was reported as errors in icinga, but then it does finish [17:06:16] jgage: because upstream are jerks [17:06:26] yay [17:06:43] jgage: they started with a rename and got tired halfway through and stopped responding to pings [17:06:50] ha [17:10:25] <_joe_> nutcracker is a lame name [17:12:27] (03PS1) 10ArielGlenn: add template code for the dumps secondary download index.html [operations/puppet] - 10https://gerrit.wikimedia.org/r/143638 [17:15:23] Generic::Systemuser[file_mover]/Ssh_authorized_key[file_mover@systemuser]: Could not evaluate: No such file or directory - /var/lib/file_mover/.ssh [17:15:26] on erbium [17:16:23] (03CR) 10ArielGlenn: [C: 032] add template code for the dumps secondary download index.html [operations/puppet] - 10https://gerrit.wikimedia.org/r/143638 (owner: 10ArielGlenn) [17:16:30] apergos: :) [17:17:17] does that include the changed HTML from godog's patch? [17:17:49] his patch already went live [17:18:17] it just had no effect, that page is meant to be copied to the main index of the web server i guess or something (original code inherited years ago) [17:18:21] I'm repurposing it [17:18:38] this won't have any impact til I edit the script that generates the index page [17:20:13] apergos: gotcha! thanks [17:20:38] yea, i knew it was merged but not applied [17:20:57] oh it was applied all right [17:21:16] it was on the hosts. just the code doesn't happen to use that file as they thought it did [17:21:28] oh, ok [17:22:11] Jeff_Green: /a/log/fundraising]: Skipping because of failed dependencies , you know about the file_mover thing? [17:22:44] mutante: that's regarding udp2log collection and rotation? [17:23:07] yes, sudo_user { 'file_mover': privileges => ['ALL = NOPASSWD: /usr/bin/killall -HUP udp2log'] } [17:23:44] (03PS1) 10ArielGlenn: make monitor use download-index.html template for index page [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/143640 [17:23:47] ok. 
I understood it when we set it up, but it's been a while. what happened that broke it? [17:25:16] Jeff_Green: maybe something related to replacing generic::systemuser, though, it still uses that [17:25:27] so it can't find /var/lib/file_mover/.ssh anymore [17:25:30] (03CR) 10ArielGlenn: [C: 032] make monitor use download-index.html template for index page [operations/dumps] (ariel) - 10https://gerrit.wikimedia.org/r/143640 (owner: 10ArielGlenn) [17:25:31] looking [17:26:15] Jeff_Green: /var/lib/backupmover vs. /var/lib/file_mover ? does that tell you anything? [17:26:29] what's the manifest in question? [17:26:43] manifests/misc/fundraising.pp [17:27:35] wait, maybe not.. umpf.. node 'erbium.eqiad.wmnet' inherits 'base_analytics_logging_node' { [17:28:52] which host is this dying on? [17:28:57] erbium [17:29:13] more an analytics thing [17:29:36] running puppet there so I can see what it barfs [17:29:39] ^d nice, link? [17:29:57] <^d> Not up yet. arc diff is hanging. [17:30:02] Jeff_Green: yea, see the "no such file or directory" above, the rest is all dependency errors [17:30:10] YuviPanda: what's the process for adding a !log bot to channel? [17:30:33] ori: hmm, not sure. I saw only https://wikitech.wikimedia.org/wiki/Morebots [17:31:38] YuviPanda: ah thanks, that's useful actually [17:31:44] Jeff_Green: it's manifests/role/logging.pp [17:31:45] ori: :) [17:32:13] mutante: ya [17:32:40] but then i don't see "/var/lib/file_mover" being defined as the home, anywhere [17:33:08] we can try and replace generic::systemuser .. that's what we want anyways [17:33:24] mutante: that would be the default for a system user named file_mover [17:33:42] then how did it disappear from the actual server ?;p [17:33:47] still trying to understand what this is doing [17:34:56] why doesnt it just create that directory then... 
[17:35:09] well [17:35:20] one possibility is that $log_directory is not what we expect it to be [17:36:11] we are being murdered slowly by abstraction [17:36:21] the generic::systemuser part, it doesn't use $log_directory [17:36:27] yes, agreed [17:36:37] killing this is removing a layer of abstraction [17:36:47] yay [17:37:36] ok, so just so we're totally on the same page... [17:38:16] it's blowing up when the it tries to write the authorized key file to /var/lib/file_mover/.ssh/* [17:38:24] ack [17:38:33] because that directory simply doesn't exist [17:38:36] right [17:38:42] backtracking from there... [17:38:50] on line 395, generic::systemuser [17:38:55] is supposed to setup that user [17:38:59] managehome true to create and set homedir? [17:39:17] chasemp: the thing is, it's broken BEFORE we replaced generic::systemuser [17:39:20] it still uses that [17:39:22] line 395 in which manifest? [17:39:31] manifests/role/logging.pp [17:39:40] mutante: ah [17:39:46] that should simply create the user's home , no? [17:39:50] but it doesnt [17:39:51] i don't think that line actually creates the user [17:40:10] i think it just requires that that user was created somewhere else before file {} will run [17:40:22] chasemp: we can try to just replace it now to fix it though [17:40:25] so where are we actually creating the user? [17:41:16] generic::systemuser { 'file_mover': [17:41:22] name => 'file_mover', [17:41:28] uid => 30001, [17:41:33] i think that is it [17:41:39] oh! 
THAT line 395 hah [17:42:12] let's try and replace that whole thing [17:42:16] with a normal "user" [17:42:24] that's what we are doing anyways [17:42:54] i was just looking at this error totally unrelated first, because it shows up as puppet failure in icinga [17:43:21] (03PS1) 10ArielGlenn: fix % in python template for dumps index page [operations/puppet] - 10https://gerrit.wikimedia.org/r/143645 [17:43:23] the whole thing confuses me [17:43:30] i still would be wondering why the old way doesnt just create the home [17:43:41] right [17:43:59] generic::systemuser should create home before it tries to write the auth keys file [17:44:05] is generic::systemuser just broken? [17:44:30] right, i think so [17:45:38] (03CR) 10ArielGlenn: [C: 032] fix % in python template for dumps index page [operations/puppet] - 10https://gerrit.wikimedia.org/r/143645 (owner: 10ArielGlenn) [17:49:47] (03PS2) 10Reedy: Add pr_index table from Proofread Page extension [operations/software] - 10https://gerrit.wikimedia.org/r/143622 [17:49:49] (03PS1) 10Reedy: Normalise quotes used. Sync fullviews [operations/software] - 10https://gerrit.wikimedia.org/r/143649 [17:54:11] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 02 Jul 2014 15:53:41 UTC [17:55:21] Jeff_Green: chasemp: it would be this https://gerrit.wikimedia.org/r/#/c/137999/ [17:55:29] let's try ? 
[17:55:39] like, it's already broken in another way anyways:) [17:56:09] (03CR) 10Rush: [C: 031] "seems good" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137999 (owner: 10Rush) [17:56:11] looks right to me [17:56:23] me too [17:56:31] (03CR) 10Dzahn: [C: 032] logging-replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137999 (owner: 10Rush) [17:59:45] Role::Logging::Systemusers/Ssh_authorized_key[file_mover]: Could not evaluate: No such file or directory - /var/lib/file_mover/.ssh [17:59:53] we still get that line, BUT [17:59:58] all the dependency errors are gone [18:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T1800) [18:04:53] greg-g: I noticed a regression just now on mediawiki.org, can't investigate right now, but wanted you to know [18:04:58] On pages like https://www.mediawiki.org/w/index.php?title=Manual:Configuration_settings&action=edit&section=21 [18:05:12] There's odd output saying "No matching items in log." [18:05:22] huh [18:05:26] yeah [18:05:28] looks like something is trying to display a deletion or protection log even though there is nothing there and no reason to output that [18:05:36] I'll report a bug and cc you, who else? [18:05:47] Aaron and Niklas maybe [18:05:49] kk [18:05:50] thanks [18:06:36] It shows up on every single edit page. Not critical but quite confusing for the average user I imagine. 
[18:07:27] yeah [18:11:38] Krinkle|detached: there is already a bug for that IIRC [18:11:42] grrr [18:11:44] (03PS1) 10Ottomata: Use CDH5 for oozie in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143653 [18:11:45] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [18:11:46] https://bugzilla.wikimedia.org/show_bug.cgi?id=67425 [18:11:47] usually that's an old protection which is not logged [18:12:36] (03PS2) 10Withoutaname: Restore defaults for nowikibooks bureaucrats [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139428 (https://bugzilla.wikimedia.org/42105) [18:12:37] Nemo_bis: nothing obvious was recommended by BZ for "No matching items in log." [18:12:54] https://bugzilla.wikimedia.org/buglist.cgi?bug_status=__open__&content=%22No%20matching%20items%20in%20log.%22&list_id=326288&order=relevance%20desc&query_format=specific [18:13:14] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jul 2 18:13:10 UTC 2014 [18:13:17] The page existed in 2006 or earlier and there was also a history import and move, the protection might either come from another title or come from before 2006 [18:14:48] Hm.. looks like something recently broke the integration.mediawiki.org redirect [18:14:50] http://integration.mediawiki.org/ci/job/MediaWiki-phpunit/ [18:14:52] now redirects to the root of the wikimedia.org domain instead of preserving the path [18:17:14] (03CR) 10Ottomata: [C: 032 V: 032] Use CDH5 for oozie in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143653 (owner: 10Ottomata) [18:18:04] (03PS3) 10Withoutaname: New namespace "Carte" for rowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139766 (https://bugzilla.wikimedia.org/66530) [18:19:05] robh: is there a limit to file size in RT ? 
[18:19:24] dunno [18:19:35] ive put in 300dpoi hand scan docs before though [18:19:44] in the range of 5mb each [18:20:03] how large is the nda scan? [18:20:20] (03PS1) 10Yurik: Updated firefox App to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143657 [18:20:23] 16.5 MB [18:20:38] but i can reduce it if that is an issue [18:20:59] How high res is your signature? :P [18:21:07] (03CR) 10Yurik: [C: 032] Updated firefox App to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143657 (owner: 10Yurik) [18:21:28] (03CR) 10Yurik: [C: 032] Added a comment to remove dup config setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143497 (owner: 10Yurik) [18:21:33] Reedy: 1600 dpi :P [18:22:35] (03PS1) 10Withoutaname: Add VisualEditor to Wikimania 2015 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143659 [18:23:10] (03Merged) 10jenkins-bot: Updated firefox App to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143657 (owner: 10Yurik) [18:23:12] (03Merged) 10jenkins-bot: Added a comment to remove dup config setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143497 (owner: 10Yurik) [18:23:26] I remember there being talk about adding tests (e.g. a nagios health check) to ensure our many different kinds of redirects are working properly. especially those from apache-config for wikis since those rewrites are quite repetitive and easy to break. While those are auto-generated now, they should still be checked. [18:23:29] I prefer 8dpi for my signatures, gives it that old-skool feel [18:23:58] _joe_: I'm volunteering to write such as check. It'd need to make http requests and assert the http status and location header in response. Can you give me a pointer? [18:24:03] Krinkle: Tim wrote a 'DSL' for the redirects recently (well, months ago) [18:24:24] greg-g: Yeah, 'auto generated'. But better code doesn't justify absence of tests. 
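The check Krinkle volunteers to write at 18:23:58 (make an HTTP request, then assert the status code and the Location header) could look roughly like the sketch below. It is only an illustration, not an existing Icinga plugin: the function names are hypothetical, and the expected target used in the usage note assumes the integration.mediawiki.org redirect is meant to preserve the request path, which is how the log describes its earlier behaviour.

```python
import http.client
from urllib.parse import urlsplit


def evaluate_redirect(status, location, expected_status, expected_location):
    """Return a list of problems; an empty list means the redirect is as expected."""
    problems = []
    if status != expected_status:
        problems.append(f"status {status}, expected {expected_status}")
    if location != expected_location:
        problems.append(f"Location {location!r}, expected {expected_location!r}")
    return problems


def fetch_redirect(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status, Location header or None). http.client never follows
    redirects on its own, so the first response is what gets inspected.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def check(url, expected_status, expected_location):
    """Fetch `url` and compare the response against the expectation (hypothetical helper)."""
    status, location = fetch_redirect(url)
    return evaluate_redirect(status, location, expected_status, expected_location)
```

Under those assumptions, `check("http://integration.mediawiki.org/ci/job/MediaWiki-phpunit/", 301, "https://integration.wikimedia.org/ci/job/MediaWiki-phpunit/")` would return an empty list for a path-preserving redirect and one Location problem for the behaviour reported at 18:14:52. Keeping the comparison in `evaluate_redirect` separate from the network call makes it possible to test the logic without touching a live server.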
[18:24:35] greg-g: There's already a tool for checking it from the web side [18:24:39] jenkins tests them on submit [18:24:41] Uh, Krinkle that is [18:24:50] Krinkle: Jeff_Green wrote 'apache-fast-test' [18:24:51] Krinkle: just informing in case you hadn't seen :) [18:25:09] robh: can you look to see if it made it ? [18:25:14] apach-fast-test ~/urls.txt pybal [18:25:20] mutante: Was that to me? What does jenkins test on submit (and how?) [18:25:21] grep output [18:25:37] <_joe_> Krinkle: I'm dining now, maybe later? [18:25:39] Reedy: Sounds interesting, is that a icinga check? [18:25:46] Nope, it's a cli tool [18:25:47] "apache-fast-test" [18:25:49] it really needs a better name [18:25:50] Right [18:25:56] matanya: hrmm, nope... just email it directly to me =] rhalsell@wikimedia.org [18:25:59] https://github.com/wikimedia/operations-puppet/find/production [18:26:00] Jeff_Green: apache-pretty-fast-test [18:26:02] * Nemo_bis points out https://wikitech.wikimedia.org/wiki/Special:Search/apach-fast-test [18:26:03] https://github.com/wikimedia/operations-puppet/blob/production/modules/apachesync/files/apache-fast-test [18:26:16] Reedy: punch_webservers_in_the_nose [18:26:25] !log yurik Synchronized docroot/bits/WikipediaMobileFirefoxOS/: (no message) (duration: 01m 03s) [18:26:26] i stand by the 'fast' part though :-P [18:26:29] there's no need for violence here [18:26:30] Logged the message, Master [18:26:33] wtf-is-apache-doing [18:26:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [18:26:44] ha [18:26:52] (03PS1) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:26:56] Krinkle: "operations-apache-config-lint" ,something like https://integration.wikimedia.org/ci/job/operations-apache-config-lint/488/console [18:27:03] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin [18:27:34] mutante: right, that's a syntax linter 
[18:28:05] Jeff_Green: Hm.. interesting, it uses pybal to test every individual node, not just from the outside in general. [18:28:07] That's nice [18:28:12] !log yurik Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 04s) [18:28:20] Logged the message, Master [18:28:25] Should misc-web-eqiad be in pybal? I don't see it on noc.wikimedia.org [18:28:33] !log yurik ^ was a noop - comment fix [18:28:38] Logged the message, Master [18:28:44] (03PS2) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:29:06] robh: can you please check your mail box ? [18:29:27] nothing yet [18:29:36] ahhh [18:29:40] it went into my spam.... wtf [18:30:40] matanya: So i was able to attach it directly to the ticket via reply in the web interface [18:30:47] (03CR) 10Greg Grossmeier: "This hasn't been synced out yet, has it?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [18:30:52] so it's on there now for mark to see [18:31:10] if you wanna send me a higher res copy i can try the same, whatever works right? =] [18:31:11] Krinkle: yeah, i wrote it after the terror of my first config deploy [18:31:12] thanks a lot robh [18:31:18] very welcome [18:31:30] robh: i tried from the web, it failed [18:31:42] and tried via mail, and also failed [18:31:57] and without reducing your gmail also rejected it [18:32:07] Krinkle: it's been very useful for diagnosing code mirroring snafus and apache reload failures [18:32:23] Yep [18:32:33] (03CR) 10Greg Grossmeier: "nvm, seems noc.wikimedia.org was out of date? ignore me for now." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) (owner: 10Withoutaname) [18:32:55] Jeff_Green: Can/Should it be extended to also check web services not served by apaches/api, but by misc? 
[18:33:12] Krinkle: indeed, what they say, also, if all servers give you the same result it summarizes them, if one is different you get multiple lines, very useful to find the "rogue" apaches who did not get config synced [18:33:14] for example, integration.wikimedia.org and the integration.mediawiki.org redirect, those are in puppet and served by misc-eqiad-web [18:33:20] !log reprepro include, trusty-wikimedia (main/universe): nutcracker, libicu 4.8, libzip 0.11, hhvm, {php,hhvm}-wikidiff2, {php,hhvm}-fss, {php,hhvm}-luasandbox, ffmpeg2theora [18:33:24] ori: ^ [18:33:25] Logged the message, Master [18:33:38] !log yurik Synchronized php-1.24wmf11/extensions/: Updating JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 38s) [18:33:43] Logged the message, Master [18:34:22] Krinkle: I don't see why not. it would just be a matter of choosing from the different available pybal pools [18:35:08] Jeff_Green: Where does test_urls.txt come from? I don't see it in puppet and fetching 'http://mw1018.eqiad.wmnet/test_urls.txt' from tin.eqiad.wmnet yielded nothing useful [18:35:12] I think it can be used that way now by feeding it a list of IPs instead of getting a list from pybal, but that's awkward [18:35:29] Krinkle: there is currently talk about letting all apache configs be deployed differently, that may influence it quite a bit [18:35:47] Krinkle: https://gerrit.wikimedia.org/r/#/c/143329/ [18:36:14] test_urls.txt is just an example. you throw a list of URLs you want to test into a file, and feed the file as a command line argument [18:36:59] so when I was messing with apache's redirect config, I made a list of URLs that should/shouldn't redirect to test before/after the change [18:37:13] (03CR) 10Krinkle: mediawiki: manage the apache config via puppet (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143329 (owner: 10Giuseppe Lavagetto) [18:37:31] Jeff_Green: Right, so we don't have a list somewhere of the current state that should pass that test? 
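The summarizing behaviour described at 18:33:12 (identical results from all servers collapse into one line, a differing server gets its own line) can be sketched as below. This is not the code of apache-fast-test; it is a minimal illustration of the grouping idea, and the function names, the result shape (a status and Location pair per server) and the server names in the usage note are assumptions.

```python
from collections import defaultdict


def summarize(results):
    """Group servers by the response they returned for one URL.

    `results` maps a server name to a (status, location) tuple.
    Returns a list of (response, sorted server names), largest group first.
    """
    groups = defaultdict(list)
    for server, response in results.items():
        groups[response].append(server)
    return sorted(
        ((response, sorted(servers)) for response, servers in groups.items()),
        key=lambda item: (-len(item[1]), item[1]),
    )


def rogue_servers(results):
    """Return the servers whose response differs from the most common one."""
    summary = summarize(results)
    return [server for _, servers in summary[1:] for server in servers]
```

For example, given results for three hypothetical hosts where `mw1017` and `mw1018` return the same 301 and `mw1019` returns a 200, `summarize` yields two groups and `rogue_servers` returns `["mw1019"]`, the server that likely missed the config sync.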
[18:37:33] !log yurik Synchronized php-1.24wmf10/extensions/: Updating JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 55s) [18:37:38] Logged the message, Master [18:37:41] there is a bug, rolling back [18:38:05] Krinkle: right, afaik there's no shared list that everyone uses [18:38:31] (03PS3) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:38:34] Krinkle: but it would be trivial to collect some lists and throw them in a shared location on the deploy server [18:39:16] Jeff_Green: Could you or help me change this script so that it can also be used to test 'misc-eqiad', and subsequently set up an incigna check that periodically ensures a short list always passes (doesn't have to be an exhaustive list of all domains and all types of redirects, but pick one from each type, should add up to about 20 or so) [18:39:24] (03PS4) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:39:34] Krinkle: sure [18:39:46] But before that, integration.mediawiki.org is currently broken. [18:39:53] it drops the path of the url [18:40:23] causing various active links onwiki, documentation and elsewhere to break, and also be corrected wrong (since it is 301, link juice is hurting) [18:40:57] chasemp: ping ? [18:41:00] https://raw.githubusercontent.com/wikimedia/operations-puppet/production/modules/contint/files/apache/integration.mediawiki.org [18:41:05] "Redirect permanent / https://integration.wikimedia.org/" [18:41:09] matanya: hello [18:41:14] afaik apache does not drop the path by default, it is interpreted as a prefix [18:41:17] This used to work fine [18:41:24] hi, wondering about https://gerrit.wikimedia.org/r/#/c/143523/1 [18:41:51] question? [18:41:53] you added @ before the var, which is correct, but only for var in scope. 
[18:42:18] !log yurik Synchronized php-1.24wmf10/extensions/: Reverting previous update to JsonConfig, ZeroBanner, ZeroPortal (duration: 01m 20s) [18:42:20] the erb in question is called out of scope, so how can this work? [18:42:23] Logged the message, Master [18:42:31] Krinkle: is this a recent breakage? do you have any idea what caused it? [18:42:34] I think the correct way would be scope.lookupvar [18:42:47] Jeff_Green: It worked a few weeks ago. I know that the CI team didn't make any changes to it lately. [18:42:50] or I'm missing something here [18:42:56] (03PS5) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:43:09] I suspect something generic that applied to all apaches. Since none of the relevant files were changed. [18:43:17] misc-web-lb specifically maybe [18:43:36] apache refactoring? [18:44:00] matanya: I understand your thought I think, but not why you think those variables are out of scope [18:44:13] they are defined in the manifest the template is called from [18:44:22] but possibly my verbiage on this is poor [18:45:00] lookupvar can be used for both [18:45:00] chasemp: where is generic_my.cnf.erb called from ? [18:45:45] that is for my.cnf.erb which is called from mysql::config [18:45:46] no? [18:46:29] right [18:46:42] and where is basedir var called from ? [18:48:33] so I think the confusion is that it's weird :) granted this predates me [18:48:42] so vars in mysql::config [18:48:47] are in scope for this template yes? 
[18:48:50] yes [18:48:54] and at first look basedir is not that [18:48:55] but [18:49:00] but those from params aren't [18:49:10] it is defined in modules/mysql/manifests/params.pp [18:49:15] right [18:49:19] and that is referenced directly in config.pp [18:49:23] mysql::config [18:49:38] it's like python import, you can import a lib from a lib that has imported it [18:49:41] the reference translates [18:49:45] thereby in scope [18:49:47] tho weird [18:50:14] I see the inherits mysql::params [18:50:54] (03CR) 10Nikerabbit: "I have no clue why this apparently doesn't enable the extension on labs." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [18:50:54] does that mean inheritance includes all var in scope too ? [18:51:03] it must :) [18:51:16] ok, makes sense now. thanks! [18:55:43] (03PS6) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [18:59:56] (03PS7) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [19:00:32] !log yurik Synchronized php-1.24wmf10/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal - bug fix (duration: 01m 15s) [19:00:35] Logged the message, Master [19:01:49] greg-g, few min late with depl, one more patch [19:02:28] Krinkle: jfyi I'm still poking around trying to figure out the situation re. integration [19:04:12] Jeff_Green: thx [19:04:38] Jeff_Green: example url http://integration.mediawiki.org/ci/job/MediaWiki-phpunit/ [19:04:55] or rather, https://integration.mediawiki.org/mediawiki-core-qunit/ [19:05:14] arg, sorry; https://integration.mediawiki.org/ci/job/mediawiki-core-qunit/ ! 
[19:08:43] !log yurik Synchronized php-1.24wmf11/extensions/: update to JsonConfig, ZeroBanner, ZeroPortal - bug fix (duration: 01m 40s) [19:08:48] Logged the message, Master [19:09:45] greg-g, done [19:10:48] (03CR) 10Matanya: [C: 04-1] "erb is called on modules/wikistats/manifests/web.pp and var is called on modules/wikistats/manifests/init.pp so out of scope and needs sc" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143526 (owner: 10Dzahn) [19:12:21] (03PS1) 10Dzahn: bump Bugzilla's TTL back to regular 1H [operations/dns] - 10https://gerrit.wikimedia.org/r/143675 [19:12:50] (03CR) 10Matanya: [C: 031] deprecated syntax in icinga checkcommands.cfg.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/143527 (owner: 10Dzahn) [19:13:59] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [19:15:48] (03PS11) 10Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) [19:18:29] Krinkle: hmm, this gets into varnish which is an area in which I have no experience [19:18:36] is there an RT ticket already for this issue? [19:18:50] Nope, I just ran into it in the middle of something [19:19:05] I'd rather file it publicly though. [19:19:54] <^d> chasemp: Patch up for Elastic :) https://secure.phabricator.com/D9798 [19:20:00] <^d> Also twentyafterfour ^ [19:20:17] Jeff_Green: Hm.. I know I came to you with the issue, but I just realised, from your apache-fast script, one can request to one of the hosts directly [19:20:27] (03PS8) 10Ottomata: Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 [19:20:29] Have you tried reproducing the issue with curl and -H Host: integration.mediawiki.org ? [19:20:44] ^d note on that "The author of this revision has not signed all the required legal documents. The revision can not be accepted until the documents are signed." 
[19:20:46] that would rule out varnish [19:20:52] I don't think it's varnish [19:20:57] <^d> chasemp: Bah, e-mail mismatch? [19:21:05] fuck yeah, oopen sauce [19:21:27] ^d idk, but you can add additional emails if that it's it to your profile I think [19:21:31] <^d> Hmmmm [19:21:37] your a rebel atm [19:21:38] <^d> Because I definitely got D9321 merged. [19:21:39] (03CR) 10Gage: [C: 032] Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 (owner: 10Ottomata) [19:22:02] ^d maybe that message is a bug itself :) [19:22:08] Krinkle: i'm just working my way through all the layers of the system and there are parts I don't understand yet [19:23:58] (03CR) 10Ottomata: [C: 032 V: 032] Use CDH5 for pig, sqoop and hue in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/143660 (owner: 10Ottomata) [19:24:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin [19:24:34] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [19:28:23] (03PS8) 10Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) [19:44:33] * hashar facepalms writing puppet code [19:45:03] Use DjVu :) [19:45:09] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 979.550281652 [19:45:25] Uh that was about RT and attached scans [19:46:00] ottomata: some kafka alarms looking like a SuperLongJavaMethodCall above ^^ :D [19:46:59] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: Puppet has 1 failures [19:50:49] springle: so is RB replication coming now with TS dead? [19:52:03] interesting! 
tsk tsk [19:56:01] (03PS4) 10Hashar: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 [19:56:23] (03CR) 10Hashar: [C: 031] "I can't find a better way. Would refactor later on using some kind of define." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [19:56:41] (03PS4) 10Hashar: zuul: migrate statsd_host to zuul::server [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 [19:57:15] (03CR) 10Hashar: [C: 031] "This patch definitely works." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141657 (owner: 10Hashar) [20:00:04] gwicke, subbu, cscott: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T2000) [20:00:47] (03PS1) 10Rush: outbound mail for legalpad.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143741 [20:00:53] (03CR) 10BryanDavis: "Nikerabbit: I think the problem that 'wikipedia' is not a valid wiki name for beta; Sam thinks it should be 'wiki' instead." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [20:01:07] (03CR) 10jenkins-bot: [V: 04-1] outbound mail for legalpad.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143741 (owner: 10Rush) [20:02:02] (03PS2) 10Rush: outbound mail for legalpad.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143741 [20:03:11] jouncebot: thanks! fire in the hole! [20:03:58] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:07:14] (03CR) 10Rush: [C: 032 V: 032] outbound mail for legalpad.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/143741 (owner: 10Rush) [20:13:01] (03CR) 10Nikerabbit: "Curious. That probably means that the two instances of 'wikipedia' above are not doing anything either. 
They are the same as defaults so I" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [20:16:51] (03PS1) 10Nikerabbit: Enable ContentTranslation extension on beta Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143746 [20:16:56] !log updated Parsoid to version 6afcb8df [20:17:05] Logged the message, Master [20:17:41] (03CR) 10Nikerabbit: "https://gerrit.wikimedia.org/r/143746" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [20:20:19] cscott: congrats on your first Parsoid deploy! [20:24:02] (03CR) 10BryanDavis: [C: 031] Enable ContentTranslation extension on beta Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143746 (owner: 10Nikerabbit) [20:24:03] cscott, indeed ;) [20:24:17] jouncebot: next [20:24:17] In 0 hour(s) and 35 minute(s): Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T2100) [20:25:47] (03CR) 10BryanDavis: "Nikerabbit: You should add this to today's SWAT window (https://wikitech.wikimedia.org/wiki/Deployments)." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143746 (owner: 10Nikerabbit) [20:32:01] (03PS1) 10MarkTraceur: Remove remaining surveys for Media Viewer [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143750 [20:32:19] cajoel: thanks!! [20:32:38] paravoid: at the beach! 
[20:32:44] cajoel: ooops [20:32:45] sorry about that [20:32:53] (03PS3) 10Faidon Liambotis: mail: add lead as second smarthost, remove mchenry [operations/puppet] - 10https://gerrit.wikimedia.org/r/143558 [20:33:00] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mail: add lead as second smarthost, remove mchenry [operations/puppet] - 10https://gerrit.wikimedia.org/r/143558 (owner: 10Faidon Liambotis) [20:36:40] sometimes I wish wikitech.wikimedia had the thanks extension installed [20:37:09] "thanks for updating that task in the post-mortem" over email or IRC is too disruptive [20:37:16] bd808: ^ :P [20:37:23] it's got WikiLove [20:37:36] talk pages are almost abandoned, you're not going to bother anyone if you use them [20:37:40] greg-g: :) yw [20:37:53] Reedy: Enable thanks and then you'll get some WikiLove :p [20:38:06] I don't have shell access to the right host :P [20:38:29] https://wikitech.wikimedia.org/wiki/User_talk:BryanDavis#A_beer_for_you.21 [20:38:43] Reedy: 1. Find some who does or get it; 2. do it or get them to do it 3. take credit (if necessary) 4. 
get WikiLove :D [20:38:51] Reedy: Fix wikitech to run on the cluster then [20:39:03] Wikitech is the easy part [20:39:08] OSM/SMW et al isn't [20:39:09] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1642.75674892 [20:39:17] RECOVERY - Disk space on lead is OK: DISK OK [20:41:02] (03PS1) 10Faidon Liambotis: exim: set o+x to /var/spool/exim4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143752 [20:42:16] (03PS2) 10Faidon Liambotis: exim: set o+x to /var/spool/exim4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143752 [20:42:44] (03CR) 10Faidon Liambotis: [C: 032] exim: set o+x to /var/spool/exim4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143752 (owner: 10Faidon Liambotis) [20:43:36] (03PS1) 10Andrew Bogott: Remove some obsolete gluster scripts: [operations/puppet] - 10https://gerrit.wikimedia.org/r/143753 [20:44:43] (03PS1) 10Chad: Prevent massively distructive actions against Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/143754 [20:44:51] (03CR) 10Faidon Liambotis: [V: 032] exim: set o+x to /var/spool/exim4 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143752 (owner: 10Faidon Liambotis) [20:44:59] ^d: s/dist/dest/ ? [20:45:13] (03PS2) 10Chad: Prevent massively destructive actions against Elasticsearch [operations/puppet] - 10https://gerrit.wikimedia.org/r/143754 [20:45:26] <^d> Whoops, thx. [20:45:35] :) [20:45:58] twentyafterfour: so the oauth call that fails is it s2s or c2s? [20:46:18] well .. 
it's kinda both [20:46:32] it's a post from the client to the server but that request kicks off a server to server connection [20:46:38] and it's timing out after 15 seconds [20:46:56] so I think that the server to server part is what's acttually failiing and that our logs are misleading [20:47:18] <_joe_> from what I see, one that fails is clearly the s2s call, and it results in a 503 as the timeout of varnish or nginx is smaller? [20:47:33] via tcpdump on radon (server) I see varnish try to hit the url http://legalpad.wikimedia.org/auth/login/mediawiki:wmf/ [20:47:40] for awhile and then serve the generic bad page error [20:47:40] <_joe_> assuming phabricator runs behind nginx [20:48:01] (03PS2) 10Faidon Liambotis: MX switch, part 3 [operations/dns] - 10https://gerrit.wikimedia.org/r/143559 [20:48:04] right [20:48:08] <_joe_> chasemp: so the time out is in the call to the remote server I bet [20:48:11] so the ssl proxy is nginx and varnish, on the server itself is apache [20:48:27] <_joe_> go on the phab host and strace the process that is running after you made a request [20:48:34] <_joe_> that will confirm this [20:48:39] apahce logs attached http://fab.wmflabs.org/T364 [20:48:49] (03CR) 10Faidon Liambotis: [C: 032] MX switch, part 3 [operations/dns] - 10https://gerrit.wikimedia.org/r/143559 (owner: 10Faidon Liambotis) [20:49:03] we need to see what http requests are being made on the server side to debug further. [20:49:13] !log switching non-wikimedia.org MX to polonium/lead (from polonium/mchenry) [20:49:17] see why it's either not getting a response from mediawiki oauth or it's sending a weird request [20:49:18] Logged the message, Master [20:49:32] <_joe_> it's the first I'd see [20:49:39] (03PS2) 10Faidon Liambotis: MX switch, part 4 [operations/dns] - 10https://gerrit.wikimedia.org/r/143560 [20:49:49] <_joe_> s/see/say/ [20:50:14] <_joe_> chasemp: on it as well [20:50:21] <_joe_> radon, right? 
[20:50:25] does the outgoing connection from phabricator to mediawiki.org have to go through a proxy as well? [20:50:37] _joe_: yes [20:50:46] twentyafterfour: no [20:50:55] it should be direct as it's all "in house" is my understanding [20:50:57] <_joe_> twentyafterfour: are you trying to reach an external IP? [20:51:10] <_joe_> chasemp: what is the oauth url? [20:51:16] it's just doing regular dns lookup [20:51:31] <_joe_> twentyafterfour: what is the host we're trying to reach? [20:51:36] this works fine from radon ping mediawiki.org [20:51:43] _joe_: mediawiki.org [20:51:49] <_joe_> it's just mediawiki.org? [20:51:51] <_joe_> ok [20:51:54] https://mediawiki.org [20:52:17] (03CR) 10Manybubbles: [C: 031] "Note: This requires a restart to take effect. We're not planning a restart super soon, so we'll just have to not be stupid." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143754 (owner: 10Chad) [20:52:17] callback: https://legalpad.wikimedia.org/auth/login/mediawiki:wmf/ [20:52:35] chasemp: the callback is where the user gets redirected at the end [20:53:16] can you wget https://mediawiki.org/w/index.php?title=Special:OAuth/initiate from the phabricator box? [20:53:30] yes [20:53:39] phabricator is simply trying to hit that url through curl [20:55:03] http://fab.wmflabs.org/P10 [20:55:11] details [20:55:20] <_joe_> ok connection works fine it seems [20:55:28] <_joe_> chasemp: you should POST to it [20:55:41] <_joe_> right? [20:55:59] !log pfw1-eqiad: s/mchenry/lead/; all smtp_out rules have [ polonium lead ] as destination-address now [20:56:02] Jeff_Green: ^ [20:56:04] Logged the message, Master [20:56:21] paravoid: ok. fixing frack mail config too [20:56:27] _joe_: yes? [20:56:31] last call is ..O/.POST /auth/login/mediawiki:wmf/ HTTP/1.1 [20:56:34] then radon says [20:56:39] <_joe_> twentyafterfour: chasemp first of all, use www.mediawiki.org [20:56:41] As received by the server, this request had a nonzero content length but no POST data. 
[20:57:01] <_joe_> chasemp: mmmh how can that happen? [20:57:02] springle: is the schema change for https://gerrit.wikimedia.org/r/#/c/117373/ finished? [20:57:10] <_joe_> chasemp: where did you find this log? [20:57:23] captured it on radon during an attempt at oauth [20:57:32] tcpdump [20:57:35] <_joe_> chasemp: seems the phabricator plugin is buggy [20:57:39] (03PS1) 10Faidon Liambotis: Kill wiki-mail.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/143762 [20:57:49] <_joe_> or the mediawiki oauth is [20:57:50] paravoid: :D [20:59:37] _joe_: buggy how? [20:59:44] the post making it back is errant, so either varnish or the responding mediawiki host I guess? or the request was weird to begin with? [21:00:04] spagewmf: The time is nigh to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T2100) [21:00:29] I really don't think it could be the phabricator code or my code causing this, it's really odd. that 500 never makes it back to the client. [21:00:48] <_joe_> twentyafterfour: that is varnish in the middle probably [21:01:00] yeah varnish is screwing it up somewhere [21:01:04] <_joe_> just let me try something guys [21:01:07] varnish is screwing what? [21:01:14] I can clearly see the post being sent and it has a body, but when phabricator gets it there is no post body [21:01:24] <_joe_> chrome as a nice 'copy as cURL' feature I plan to exploit [21:01:29] truncating tthe http post [21:01:35] <_joe_> twentyafterfour: let me test this [21:01:39] _joe_: ok [21:02:21] <_joe_> twentyafterfour: I'm pretty sure that message comes from the remote server [21:02:34] which message? [21:02:48] "As received by the server, this request had a nonzero content length but no POST data. " [21:02:53] that message is from phabricator [21:02:56] paravoid: we are debugging oauth through misc-web-lb with mediawiki.org [21:03:16] or attempting [21:03:37] hey are there docs somewhere on packaging the mozilla way? 
[21:03:39] wow, jouncebot, cool. greg-g OK to do this? [21:04:02] "the mozilla way?" [21:04:05] <_joe_> twentyafterfour: that message is sent from the remote server [21:04:17] <_joe_> tried the curl my browser does locally from radon [21:04:20] spagewmf: yep [21:04:27] <_joe_> so that I would bypassh varnish [21:04:33] <_joe_> and got a beautiful 500 [21:04:37] <_joe_> pastebinning it [21:04:41] _joe_: no it's not from the remote server [21:04:45] yeah I see ..d.HTTP/1.1 500 Internal Server Error [21:04:46] it's a phabriicator error message [21:04:51] at the beginning of the process captured [21:05:00] <_joe_> ok, let me paste what I do have. [21:05:33] damn this keyboard keeps double-triggerinng [21:06:11] <_joe_> http://paste.debian.net/107856/ [21:06:40] <_joe_> so, the problem is after varnish [21:06:43] (03CR) 10EBernhardson: [C: 032] "lgtm" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142151 (https://bugzilla.wikimedia.org/66094) (owner: 10Spage) [21:06:49] <_joe_> the oauth call fails for some reason [21:07:04] (03Merged) 10jenkins-bot: add new Mantle extension, required by coming Flow [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142151 (https://bugzilla.wikimedia.org/66094) (owner: 10Spage) [21:08:25] _joe_: yes but that's a different situation that we get when requestting from outside. I could debug the oauth problem if we had the same behavior from outside the proxy [21:08:42] <_joe_> the only thing varnish does, is turning the internal error in a 503 [21:09:03] <_joe_> which may suck, and you may want to change for legalpad specifically for the moment [21:09:04] does it take 15 seconds to fail when you do it locally [21:09:11] <_joe_> no [21:09:19] <_joe_> because that's varnish re-trying [21:09:20] yeah I see varnish retrying for that period [21:09:28] <_joe_> oh how stupid am I [21:09:31] retrying? 
really that's bad behavior [21:09:32] <_joe_> chasemp: yes [21:09:44] <_joe_> twentyafterfour: it is not in the general case. [21:09:52] on a post it is [21:09:58] <_joe_> Guys, you just discovered you need to configure varnish! [21:10:14] <_joe_> twentyafterfour: if it failed? [21:10:25] yes [21:10:29] <_joe_> it really depends [21:10:31] _joe_: thanks for looking at this man, what do you think we are missing varnish side? [21:10:36] <_joe_> and you can configure it [21:10:42] I always prefer fail fast to a hung connection that eventually times out [21:11:02] (03PS1) 10Andrew Bogott: Add a default logfile to manage-nfs-volumes-daemon. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143763 [21:11:15] <_joe_> chasemp: I'd start bypassing most of the config honestly, but I'm no expert on our varnishes. you should really wait for brandon later [21:15:42] I'm still 95% convinced that varnish is truncating requests and it's not just turning 500 into 503 [21:16:00] because phabricator is seeing a request with no post body yet the content length is 48 [21:16:13] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [21:17:11] Reedy, greg-g: I'm pleasantly confused, Mantle is already in /a/common/php-1.24wmf11 even though it wasn't in extension-list [21:17:29] spagewmf: They're completely different [21:17:55] https://github.com/wikimedia/mediawiki-tools-release/commit/55dbba0f13101fb3251f169276ae9f84d8a679b5 [21:18:08] I did it purposely before last weeks branching as I knew it was incoming [21:18:36] Reedy: thanks! I was going to check that after the deploy. OK, so I just need to deploy the config change that requires it if wmgUseFlow [21:19:06] And scap for any messages (as it wasn't in extension-list already) [21:19:19] I'm guessing you've added it to wmf10? [21:21:44] ok who was it that said "first off use www." ... :D [21:21:59] Reedy: no, as only mediawiki.org will be using it in wmf11. 
But now you remind me and ebernhardson who's taking over that there's a reason to addnew extesnion to every branch in use [21:22:11] are there docs somewhere on debian packaging for wikimedia? [21:22:55] and when I say "using in wmf11", I mean "requiring" but not actually exercising the code until wmf12 [21:23:21] ori: graphite seeems kind of broken [21:23:28] <_joe_> twentyafterfour: how can you say you're 95% convinced of that? [21:23:28] * AaronSchulz was trying to find LogPager graphs [21:23:38] <_joe_> I whoued you a request from radon itself [21:23:41] <_joe_> that failed [21:23:51] <_joe_> and gave the _extact_ same error in the logs [21:23:54] https://gdash.wikimedia.org/dashboards/indexpager/ gah [21:23:56] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:24:07] <_joe_> but then, it's 11 PM and it's your debugging chore :) [21:24:15] (03PS1) 10Andrew Bogott: Store a list of orhpaned project volumes for later cleanup. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143765 [21:24:40] _joe_: because I didn't see the exact same error in the logs, I guess I missed it [21:24:51] (03PS2) 10Andrew Bogott: Store a list of orphaned project volumes for later cleanup. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143765 [21:26:08] <_joe_> btw, not it works [21:26:14] <_joe_> did you do something? [21:26:17] yes [21:26:26] <_joe_> :) [21:26:39] <_joe_> so if NOW it does not work externally [21:26:40] https://www.mediawiki.org [21:26:45] <_joe_> it's varnishe's fault [21:27:09] <_joe_> oh man, you were not following redirects, which makes a ton of sense btw [21:27:17] ha shit yep [21:27:21] works [21:27:49] <_joe_> so, that was not varnish after all :) [21:28:07] mukunda was it just base uri https://www.mediawiki.org [21:28:08] ? 
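Aside on the OAuth exchange above: the fix that worked was pointing the client at https://www.mediawiki.org instead of the bare domain, because the client was not following redirects. The sketch below shows that general behaviour under stated assumptions; it is not the Phabricator or MediaWiki code. A throwaway local server (hypothetical `RedirectingHandler`, hypothetical paths `/initiate` and `/canonical/...`) stands in for a host that redirects non-canonical requests, and Python's `http.client`, which never follows redirects on its own, stands in for the OAuth client.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in: redirects every non-canonical path, answers canonical ones."""

    def do_POST(self):
        # Drain the request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        if self.path.startswith("/canonical"):
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # A non-canonical request gets a redirect instead of an answer.
            self.send_response(301)
            self.send_header("Location", "/canonical" + self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        # Keep the demo quiet.
        pass


def post_without_following(port, path, body):
    """POST once and report (status, Location header); redirects are never followed."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", path, body=body)
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def demo():
    """Start the local server, POST to both paths, and return the two results."""
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        bare = post_without_following(port, "/initiate", b"a=1")
        canonical = post_without_following(port, "/canonical/initiate", b"a=1")
    finally:
        server.shutdown()
        server.server_close()
    return bare, canonical
```

Calling `demo()` returns one result per request. The non-canonical POST yields only the redirect status and its `Location` header, while the canonical one yields the real response. A client that treats any non-success status as a failure would report an error on the first call even though the service itself is healthy.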
[21:28:33] (03CR) 10Calak: [C: 031] Remove some obsolete gluster scripts: [operations/puppet] - 10https://gerrit.wikimedia.org/r/143753 (owner: 10Andrew Bogott) [21:28:42] <_joe_> chasemp: the oauth client for security reasons tends not to follow redirects [21:28:52] Reedy: so, i'm a bit confused :) it sounds like we need to add it to wmf10 which is fine, but looking at the commit history in wmf11 i dont see where Mantle submodule was added there either. I was thinking i could just cherry pick whatever is in 11 into 10 via the previous patch [21:28:56] <_joe_> that's the reason why I told you before to use www.mediawiki.org [21:28:57] (Mantle) [21:30:19] i can just manually add a submodule via the wikitech directions, but i was thinking cherry-pick from before would be less error prone [21:31:40] spagewmf: Well, Flow is used on wmf10 and wmf11, right? [21:31:47] ebernhardson: It was included at branching time for wmf11 [21:32:18] Reedy: flow is in wmf10 and wmf11, but only the new code we deploy tomorrow (wmf12?) 
will use Mantle [21:32:28] Right [21:32:34] But the config change unconditionally includes it [21:32:44] ahh, i get you [21:32:53] And scap will barf with it being in extension-list but only in one of the branches [21:33:36] yup, ok i'll just prep a submodule add via the wikitech directions [21:34:04] ah someone gave me a writeup on build packaging yesterday I forgot who [21:38:29] !log rebooting analytics1021 to check bios cpufreq setting [21:38:35] (03CR) 10Andrew Bogott: [C: 032] Requote a boolean [operations/puppet] - 10https://gerrit.wikimedia.org/r/143616 (owner: 10Andrew Bogott) [21:38:35] Logged the message, Master [21:40:07] !log blog updated to newest release, no downtime [21:40:11] Logged the message, Master [21:41:35] ok just ran kafka controlled-shutdown 21 [21:42:17] !log ebernhardson Synchronized php-1.24wmf10/extensions/Mantle/: Sync new Mantle extension in 1.24wmf10 (duration: 00m 20s) [21:42:22] Logged the message, Master [21:42:42] Reedy: so afaict everything is now ready for me to merge mediawiki-config and scap (for the single i18n message used in Special:Version), i've never used scap is there anything i should know? [21:43:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [21:43:12] Not really [21:43:17] !log ebernhardson Started scap: (no message) [21:43:21] Logged the message, Master [21:45:16] thx ebernhardson [21:46:48] chasemp: so... 
it looks like even on machines with puppet 3 the last status file isn't readable by non-root [21:48:19] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [21:48:29] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1018 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [21:49:19] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 12.0 [21:59:16] those icinga alerts are me, ignore [21:59:32] i scheduled maintenance on analytics1021 but clearly i should have scheduled downtime for those checks as well [22:06:16] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:06:26] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:06:26] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1018 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [22:10:07] (03PS1) 10QChris: Feed logs from ssl terminators again into webstatscollector's filter [operations/puppet] - 10https://gerrit.wikimedia.org/r/143775 [22:16:51] !log rebooting analytics1022 to check bios cpufreq setting [22:16:56] Logged the message, Master [22:19:43] !log ebernhardson Finished scap: (no message) (duration: 36m 25s) [22:19:48] Logged the message, Master [22:20:29] * bd808 grumbles about how long l10n updates take to generate and scap [22:22:22] * hashar points at l10noid (SOA approved) [22:23:21] GET /tab-article-description/1.24wmf10/1/?lang=en [22:24:25] If memcached is too slow I doubt that a home grown ReST service will be faster [22:28:26] yup [22:28:38] will need a better strategy :-/ [22:29:26] sleeping time [22:37:16] 
!log rebooting analytics1021 to change bios "system profile" from PPW (OS) to PPW (DAPC) [22:37:19] Logged the message, Master [22:38:51] (03CR) 10Swalling: [C: 031] "Correct set of wikis" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143614 (owner: 10Phuedx) [22:40:47] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [22:42:06] dammit i guess my scheduled maintenence was too short [22:42:21] analytics1021 coming back up now, hopefully with its cpu running at full speed [22:43:26] RECOVERY - Host analytics1021 is UP: PING WARNING - Packet loss = 61%, RTA = 0.75 ms [22:43:42] woo success [22:44:45] (03PS1) 10Reedy: Remove old static-stable [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143780 [22:44:54] (03CR) 10Ori.livneh: [C: 032] Apache config for Wikivoyage using mod_proxy_fcgi [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [22:49:06] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 229.67073727 [22:49:58] kafka rebalancing.. [22:52:06] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2324.09019434 [23:00:05] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140702T2300) [23:00:15] I'll take it [23:01:13] PROBLEM - DPKG on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:01:13] PROBLEM - MediaWiki profile collector on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:01:23] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:23] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [23:02:23] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:02:23] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:03:37] (03CR) 10MaxSem: [C: 032] Meta: automatic translation workflow state changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 (owner: 10Awight) [23:03:43] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:49] (03Merged) 10jenkins-bot: Meta: automatic translation workflow state changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 (owner: 10Awight) [23:04:09] (03CR) 10MaxSem: [C: 032] Disable mobile upload CTA on wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142155 (https://bugzilla.wikimedia.org/66958) (owner: 10MaxSem) [23:04:13] RECOVERY - MediaWiki profile collector on tungsten is OK: OK: All defined mwprof jobs are runnning. [23:04:13] RECOVERY - DPKG on tungsten is OK: All packages OK [23:04:33] PROBLEM - uWSGI web apps on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:05:23] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [23:05:23] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [23:05:23] RECOVERY - uWSGI web apps on tungsten is OK: OK: All defined uWSGI apps are runnning. 
[23:05:30] (03Merged) 10jenkins-bot: Disable mobile upload CTA on wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142155 (https://bugzilla.wikimedia.org/66958) (owner: 10MaxSem) [23:05:33] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.150 second response time [23:06:56] !log maxsem Synchronized wmf-config/: (no message) (duration: 00m 07s) [23:07:01] Logged the message, Master [23:07:14] awight, deployed your change - please test:) [23:08:29] MaxSem: It's not on the swat list, but it would be nice to see https://gerrit.wikimedia.org/r/#/c/143746/ merged (prod no-op change for beta) [23:08:49] (03CR) 10MaxSem: [C: 032] Enable ContentTranslation extension on beta Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143746 (owner: 10Nikerabbit) [23:08:56] (03Merged) 10jenkins-bot: Enable ContentTranslation extension on beta Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143746 (owner: 10Nikerabbit) [23:09:03] bd808, it doesn't even need a swat [23:09:24] True but it needs to be synced to tin [23:09:26] and thanks [23:10:26] (03PS1) 10BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) [23:12:53] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: 10BryanDavis) [23:13:21] MaxSem: ooh great, thank you! 
[23:14:41] (03PS2) 10BryanDavis: [WIP] Allow puppetmaster to send logs to logstash [operations/puppet] - 10https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) [23:16:02] (03PS4) 10Dzahn: stats - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:16:25] (03CR) 10jenkins-bot: [V: 04-1] stats - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:16:46] thanks ojenkins [23:16:59] (03CR) 10Ottomata: [C: 031] stats - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:17:12] thanks otto :) [23:19:19] Nikerabbit: ContentTranslation is showing up in Special:Version for http://en.wikipedia.beta.wmflabs.org/ now. \o/ [23:20:10] (03PS5) 10Dzahn: stats - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:21:42] (03CR) 10Dzahn: [C: 032] "PS5: quoting fix" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:24:57] (03CR) 10Dzahn: "checked stat1002, PASS:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:27:33] (03CR) 10Dzahn: "@ottomata: cronjobs of stats user untouched and still existing, UID unchanged" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138004 (owner: 10Rush) [23:29:34] (03PS4) 10MaxSem: Add a handler for HHVM fatals [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/120180 [23:30:03] (03PS4) 10Dzahn: facilities, replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/138009 (owner: 10Rush) [23:31:11] (03CR) 10Dzahn: [C: 032] "i don't see this class being used currently" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138009 (owner: 10Rush) [23:35:32] (03PS4) 10Dzahn: openstack-replace generic::systemuser with user [operations/puppet] - 
10https://gerrit.wikimedia.org/r/138002 (owner: 10Rush) [23:44:42] (03PS1) 10Dzahn: openstack-unquoted file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143791