[01:05:28] 3Project-Creators, ops-requests, operations, ops-core: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#954295 (10Aklapper) "Mail" is now https://phabricator.wikimedia.org/tag/mail_%5Bplaceholder%5D/ for the time being and we don't need to prefix every... [01:36:49] (03PS2) 10Andrew Bogott: Don't specify provider => upstart [puppet] - 10https://gerrit.wikimedia.org/r/181540 [01:39:04] 3Wikimedia-Mailing-lists, operations: Upgrade Mailman to 2.1.15 - https://phabricator.wikimedia.org/T52864#954340 (10MZMcBride) Can we just install Mailman 3 instead? [01:49:41] (03CR) 10Andrew Bogott: [C: 032] Don't specify provider => upstart [puppet] - 10https://gerrit.wikimedia.org/r/181540 (owner: 10Andrew Bogott) [01:54:46] (03PS2) 10Andrew Bogott: Don't include base::instance-upstarts on Debian. [puppet] - 10https://gerrit.wikimedia.org/r/181541 [01:56:03] (03CR) 10Andrew Bogott: [C: 032] Don't include base::instance-upstarts on Debian. [puppet] - 10https://gerrit.wikimedia.org/r/181541 (owner: 10Andrew Bogott) [02:05:04] (03PS1) 10Andrew Bogott: Remove generate-ganglia-conf and the projectgid fact. [puppet] - 10https://gerrit.wikimedia.org/r/182755 [02:06:05] (03Abandoned) 10Andrew Bogott: Fix the projectgid fact -- take two [puppet] - 10https://gerrit.wikimedia.org/r/181535 (owner: 10Andrew Bogott) [02:06:25] (03CR) 10Andrew Bogott: [C: 032] Remove generate-ganglia-conf and the projectgid fact. [puppet] - 10https://gerrit.wikimedia.org/r/182755 (owner: 10Andrew Bogott) [02:16:29] (03CR) 10Andrew Bogott: "bootstrap-vz has an 'unattended-upgrades' plugin which I'm using for the labs debian images. 
It installs the 'unattended-upgrades' packag" [puppet] - 10https://gerrit.wikimedia.org/r/181539 (owner: 10Andrew Bogott) [02:21:17] (03CR) 10Andrew Bogott: "Ubuntu docs say, about unattended upgrades: "If you want the script to automatically reboot when needed, you not only need to set Unattend" [puppet] - 10https://gerrit.wikimedia.org/r/181539 (owner: 10Andrew Bogott) [02:26:38] (03CR) 10Faidon Liambotis: "Is it really needed in Ubuntu after all?" [puppet] - 10https://gerrit.wikimedia.org/r/181541 (owner: 10Andrew Bogott) [02:28:53] (03CR) 10Faidon Liambotis: "Do we set Unattended-Upgrade::Automatic-Reboot? I doubt it. I also doubt it's a good idea to do so, imagine the whole Labs fleet rebooting" [puppet] - 10https://gerrit.wikimedia.org/r/181539 (owner: 10Andrew Bogott) [02:37:54] (03CR) 10Andrew Bogott: "I just tracked this patch back to its source, and that labs include has been included since labs puppet day one. Is it possible it's ther" [puppet] - 10https://gerrit.wikimedia.org/r/181541 (owner: 10Andrew Bogott) [02:38:47] andrewbogott: the question is whether bootstrap-vz (or the tool that preceded it) already provisions this [02:39:06] production's debian-installer does and we rely on that [02:44:21] paravoid: I can dig through the vmbuilder source but… the status quo clearly works, and it seems safe to presume that the file was added for a reason. So I don't quite understand why you're suspicious? [02:45:52] because I've found tons of unneeded crap in base in the past :) [02:47:01] either because of human error/bad judgement back then, or even more often, surrounding circumstances having changed since [02:47:18] but whatever, it doesn't matter much [02:49:42] well, I can confirm that 'getty' and 'tty' don't appear anywhere in the vmbuilder source. It's possible that it's being set up indirectly elsewhere though. [02:56:38] (03PS3) 10Andrew Bogott: Remove update-notifier-common from labs instances.
[puppet] - 10https://gerrit.wikimedia.org/r/181539 [03:01:05] hey andrewbogott [03:01:07] welcome back? :) [03:01:19] Semi-back, yes :) [03:01:22] How's it going? [03:01:34] not bad. mostly labsdb stuff, but... [03:01:58] andrewbogott: is there documentation on how to ‘restart’ labs dns servers? at least temporarily, until Coren figures out how to replace that? [03:02:16] it’s been *really* flaky last week [03:02:19] Would restarting help? I thought the issue was overload [03:03:06] (03PS15) 10Andrew Bogott: contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) (owner: 10Krinkle) [03:04:22] andrewbogott: not sure if it would, but I was afraid of trying that because I know nova-network and pdns are in some… sort of intricate set up? [03:04:28] so having it documented would be good anyway [03:04:33] (03CR) 10Andrew Bogott: [C: 032] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) (owner: 10Krinkle) [03:05:07] I'm sure it is documented, I'm looking... [03:05:14] Although I guess you probably searched already [03:06:19] I just searched wikitech, yeah [03:06:22] hm [03:06:45] well, it's pretty simple -- basically, when you restart opendj, pdns dies. 
So you have to restart opendj (on both hosts) and then restart pdns (on both hosts) right after [03:06:50] I'm not sure where best to write that :) [03:07:27] (03PS3) 10Andrew Bogott: contint: Move tmpfs Require to caller to support labs' jenkins-deploy [puppet] - 10https://gerrit.wikimedia.org/r/173511 (owner: 10Krinkle) [03:08:08] (03PS4) 10Andrew Bogott: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: 10Krinkle) [03:08:52] (03CR) 10Andrew Bogott: [C: 031] "This looks OK to me -- I'll wait for hashar to review the dependency and then I'll merge both." [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: 10Krinkle) [03:14:29] (03CR) 10Krinkle: [C: 031] Move composer.json into repository root [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180589 (owner: 10Legoktm) [03:15:33] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:15:33] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:16:23] andrewbogott: just on wikitech, /Labs_DNS perhaps? [03:16:30] documenting the current setup would be nice. [03:16:44] yeah, I'll start a page after lunch [03:16:51] andrewbogott: hmm, so restart opendj + restart pdns basically? [03:16:52] cool [03:17:27] YuviPanda: yes -- you /might/ get away with doing that on each server in sequence, I'm not sure. 
I always restart both opendjs and then both pdns [03:17:41] hmmm [03:17:41] ok [03:17:44] But, again, I don't think this will help as I believe the outages are related to dnsmasq [03:19:01] yeah, that’s true [03:19:17] I just wanted to know :) [03:30:24] (03CR) 10Tim Landscheidt: "@paravoid: tools-dev is a Precise instance, but I spun up another instance (toolsbeta-pam-sshd-motd-test) to test the change properly and " [puppet] - 10https://gerrit.wikimedia.org/r/181789 (owner: 10Tim Landscheidt) [03:59:20] (03CR) 10KartikMistry: "Should this fixed elsewhere too? (eg: cxserver)" [puppet] - 10https://gerrit.wikimedia.org/r/181540 (owner: 10Andrew Bogott) [03:59:44] andrewbogott: ^^ [04:09:24] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [04:13:10] kart_: yeah, lines like that should probably be removed everywhere [04:18:34] (03PS1) 10KartikMistry: Remove provider => upstart [puppet] - 10https://gerrit.wikimedia.org/r/182763 [04:28:23] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:38:13] (03PS1) 10KartikMistry: Don't specify provider => upstart [puppet] - 10https://gerrit.wikimedia.org/r/182764 [04:40:04] (03CR) 10KartikMistry: "Warning: untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/182764 (owner: 10KartikMistry) [04:55:59] YuviPanda: https://wikitech.wikimedia.org/wiki/Labs_DNS please amend as you see fit :) [06:29:33] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:43] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:52] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:53] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:13] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:56] I can't reproduce the mw1144 failure, so those are most likely false alarms [06:37:43] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:23] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:45:43] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:03] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:43] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:56:33] (03PS1) 10Gergő Tisza: Enable CORS support logging on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182767 [07:28:40] (03PS1) 10Andrew Bogott: Make nembus an ldap server [puppet] - 10https://gerrit.wikimedia.org/r/182768 [07:31:51] (03CR) 10Andrew Bogott: [C: 032] Make nembus an ldap server [puppet] - 10https://gerrit.wikimedia.org/r/182768 (owner: 10Andrew Bogott) [07:32:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [07:32:42] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[07:42:43] (03PS1) 10Andrew Bogott: Install the ldap-codfw cert on nembus. [puppet] - 10https://gerrit.wikimedia.org/r/182770 [07:43:23] PROBLEM - puppet last run on nembus is CRITICAL: CRITICAL: puppet fail [07:43:43] (03CR) 10Andrew Bogott: [C: 032] Install the ldap-codfw cert on nembus. [puppet] - 10https://gerrit.wikimedia.org/r/182770 (owner: 10Andrew Bogott) [07:44:33] PROBLEM - LDAP on nembus is CRITICAL: Connection refused [07:44:52] PROBLEM - LDAPS on nembus is CRITICAL: Connection refused [07:47:03] RECOVERY - LDAPS on nembus is OK: TCP OK - 0.043 second response time on port 636 [07:47:53] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:48:02] RECOVERY - LDAP on nembus is OK: TCP OK - 0.043 second response time on port 389 [08:18:17] PROBLEM - Host nembus is DOWN: PING CRITICAL - Packet loss = 100% [08:19:27] RECOVERY - Host nembus is UP: PING OK - Packet loss = 0%, RTA = 43.30 ms [08:19:43] All this nembus stuff is me -- I'm setting up a new ldap server [08:19:44] or, trying [09:06:30] (03CR) 10Dzahn: [C: 031] Force HTTPS for graphite [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [09:07:39] (03CR) 10Dzahn: "if it doesn't work in labs, should we add a $realm check? or change the setup in labs so it is closer to prod in the first place?" [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [09:12:38] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "We have internal applications (like gdash or grafana) using graphite and they have no reason to be redirected to HTTPS on their specific e" [puppet] - 10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [09:13:46] (03CR) 10Giuseppe Lavagetto: [C: 031] Revoke Brett Simmer's key [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [09:16:07] (03CR) 10Dzahn: "we also have "absent: meta group for absented users". 
afaik it should be added there but chase could confirm" [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [09:20:11] (03PS3) 10Giuseppe Lavagetto: hiera: make puppet fail if the mwyaml backend fails to lookup [puppet] - 10https://gerrit.wikimedia.org/r/181550 [09:22:32] (03CR) 10Yuvipanda: hiera: make puppet fail if the mwyaml backend fails to lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/181550 (owner: 10Giuseppe Lavagetto) [09:22:38] _joe_: minor naming nit. [09:23:02] greetings [09:23:21] <_joe_> ciao godog [09:23:36] hello, back here as well [09:23:42] I’ve always thought ‘ciao’ was same as ‘cya' [09:23:46] * YuviPanda waves at mutante [09:24:02] <_joe_> YuviPanda: it is both "hi" and "cya" [09:24:20] _joe_: I’m also guessing that the rescue => detail in mwcache.rb is what makes puppet fail when wikitech isn’t available? [09:24:28] while just ‘skipping’ if the page doesn’t exist [09:24:38] <_joe_> exactly [09:24:49] wiktionary says it's from "your slave" :p [09:24:58] <_joe_> also, it originally comes from the venetian "s'ciavo", which means that [09:25:16] _joe_: adding a comment to that effect at that rescue would also be nice [09:25:24] <_joe_> YuviPanda: nod [09:25:31] outside of that looks good :) [09:26:09] * YuviPanda should write some ruby at some point to see how it is these days [09:26:14] spent the weekend learning a bit of haskell [09:26:27] <_joe_> nah [09:26:34] <_joe_> learn lisp! 
[09:27:07] my problem is that I need to build something with a new language to properly learn it, and I can’t think of any non-wiki related thing to build [09:27:22] and if I’m building a wiki related thing I’d want to be as easily maintainable as possible, which precludes a lot of languages [09:27:39] maybe I should write an edit counter :P [09:27:45] make another IRC bot :p [09:28:44] happy new year \o/ [09:29:04] YuviPanda: irc bots should be nice in haskell [09:29:31] valhallasw`cloud: hmm, maybe I could rewrite ircyall in haskell? [09:29:33] or scheme, even [09:30:03] YuviPanda: mmm, I wouldn't write anything production-grade in it for now :-p [09:30:12] heh :D [09:30:15] I’ll just wait up [09:30:21] other people should be able to maintain it too, etc [09:30:24] indeed [09:30:31] that’s why I haven’t written anything with it [09:30:42] I think it has to be PHP/Python/JS for people to be able to easily maintain it [09:30:47] from inside the WM community at least [09:30:57] java even [09:31:28] do people have babel templates for programming languages on user pages? you could actually get the numbers then [09:32:17] you do, I think. http://en.wikipedia.org/wiki/User:Yuvipanda [09:32:23] I should update that page [09:32:45] * YuviPanda removes all the programming language userboxes [09:33:44] write a bot in haskell to do it :) [09:35:09] http://ftp.ccc.de/congress/31C3/h264-hd/31c3-6243-en-de-The_Perl_Jam_Exploiting_a_20_Year-old_Vulnerability_hd.mp4 [09:35:17] that was the Perl talk :p [09:35:22] (03PS4) 10Giuseppe Lavagetto: hiera: make puppet fail if the mwyaml backend fails to lookup [puppet] - 10https://gerrit.wikimedia.org/r/181550 [09:35:31] <_joe_> mutante: what did come out of that? [09:35:49] <_joe_> YuviPanda: ^^ [09:35:57] we are supposed to stop using it :p [09:36:15] <_joe_> what was the vuln? 
[09:36:49] _joe_: I left a comment earlier about not calling it MediawikiNotFoundError, since that sounds like Mediawiki itself isn’t found :D [09:36:56] MWNoSuchPage? MWPageNotFound? [09:37:03] it does weird things when you use lists, let me find a slide [09:37:10] everybody loves slides: http://events.ccc.de/congress/2014/Fahrplan/system/attachments/2542/original/the-perl-jam-netanel-rubin-31c3.pdf [09:37:19] <_joe_> YuviPanda: mh, that sounds good to me btw [09:37:36] what does? the current name? [09:37:51] it would just be clearer if we explicitly say what was not found [09:37:56] _joe_: https://www.youtube.com/watch?v=gweDBQ-9LuQ&t=4m55s [09:38:22] that's from 4m55s, where he shows the "expected vs. reality" column [09:38:26] <_joe_> YuviPanda: PageNotFound [09:38:34] sure [09:38:36] that sounds good [09:39:21] if you send the same parameter twice, like &bar=a&bar=b bar will become a list [09:39:43] <_joe_> mutante: which is exactly like in php [09:39:52] <_joe_> well, a bit better than php [09:40:28] <_joe_> I am unimpressed by this talk so far [09:40:39] _joe_: well, yes, except for the part where lists are not really lists [09:40:52] morning all [09:40:59] <_joe_> these are all things any perl programmer who was exposed to programming perl should know [09:41:14] _joe_: {a=>@list}, where @list=(b,c,d) becomes {a=>b, c=>d}? wat?
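The two behaviours discussed above (a repeated query parameter turning into a list, and a list being flattened into key/value pairs inside a hash constructor) can be mimicked in Python. This is only a rough analogy under stated assumptions: `perl_style_hash` is a hypothetical helper written for this illustration, not anything from Perl or CGI.pm, and Python's `parse_qs` always returns lists, unlike the context-dependent Perl behaviour the talk describes.

```python
from urllib.parse import parse_qs

# A repeated query parameter yields a multi-element list.
params = parse_qs("foo=x&bar=a&bar=b")
print(params["bar"])  # ['a', 'b']


def perl_style_hash(*flat):
    """Hypothetical helper: read a flat argument list as key/value pairs,
    roughly the way a Perl hash constructor reads a flattened list."""
    it = iter(flat)
    return dict(zip(it, it))


values = ["b", "c", "d"]
# In Perl, {a => @list} flattens @list into the surrounding list, so the
# constructor sees a, b, c, d and pairs them up. The explicit * below is
# the Python spelling of that implicit flattening.
print(perl_style_hash("a", *values))  # {'a': 'b', 'c': 'd'}
```

The difference worth noticing is that Python requires the explicit `*` to splat a list, whereas the Perl construct in the talk does it implicitly.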
[09:41:55] <_joe_> valhallasw`cloud: it is _not_ strange if you know how perl works [09:42:17] <_joe_> I'm not saying it is /sane/ or /good/, only it's not strange [09:42:46] the fact that one can come up with an explanation based on 'how perl works' doesn't mean it's not strange [09:43:13] but if we assume it's reasonable behavior [09:43:21] then returning a list from anything is not [09:43:37] <_joe_> my point is, once you understand how lists are treated in perl, it's not strange [09:43:46] <_joe_> it's internally consistent [09:43:58] <_joe_> unlike php casting magic, for example [09:44:04] http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-1572 [09:44:16] <_joe_> the vuln is funny btw [09:45:24] _joe_: well, apparently no-one gets it. [09:45:51] maybe people 'get' it when they see the vulnerability, but they sure don't see it when they write the code. [09:46:13] <_joe_> valhallasw`cloud: my point is, I know how %myhash = (a, list, of, things); would work, and I expected it the way he presented [09:47:44] _joe_: sure. But then I don't see how it's ever reasonable to have a function return a list. [09:47:53] and certainly not 'sometimes a value and sometimes a list' [09:48:02] but maybe arrays didn't exist when CGI:: was written? [09:48:03] <_joe_> it's the "context" :P [09:48:44] <_joe_> it's one of the trickiest and slickest things in perl [09:50:31] I'm sorry, but having $dbh->quote($cgi->param('user')); behave as it does is just absurd. [09:50:50] I'm passing something as the first parameter, how does it suddenly overwrite the second?? [09:50:56] valhallasw`cloud: I think what _joe_ is saying is that if you write perl enough you get used to the absurd [09:50:58] that's just against all common sense [09:51:00] I suppose. 
[09:51:02] <_joe_> I am not saying it's sane (I stated that at the beginning) [09:51:04] there is also the part about escaping with DBI->quote() where you can just set the type to integer and escaping is disabled, allowing SQL injection [09:51:16] around 18m [09:51:23] mutante: well, that's the correct behavior for an integer [09:51:26] unrelatedish, but I’ll be happy if we get rid of all perl in ops/puppet :D [09:51:47] mutante: the issue is that if you pass a list to a function, that list is actually used as different function arguments [09:51:53] <_joe_> mutante: seeing it now, it's funny [09:52:00] mutante: basically, quote(*param( [09:52:08] quote(*param('user')) in python [09:52:47] valhallasw`cloud: yea, allowing the attacker to set type to integer [09:52:54] yep [09:54:18] the talk about "SS7" is also worth watching [09:54:47] free calls? :-p [09:55:06] oh, location detection [09:55:08] interesting [09:55:30] you could register a "premium" number, then redirect somebody else to your number, then call them and they pay you [09:55:33] (03PS5) 10Giuseppe Lavagetto: hiera: make puppet fail if the mwyaml backend fails to lookup [puppet] - 10https://gerrit.wikimedia.org/r/181550 [09:57:16] <_joe_> however, TL;DR; - if you ever bought anything on booking.com, go block your CC already :P [09:57:52] arrg [09:58:29] <_joe_> mutante: didn't you know booking is a perl-only shop?
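The `quote(*param('user'))` analogy from the discussion above can be spelled out as a small Python sketch. Everything here is a toy stand-in, assumed for illustration: `quote`, `param`, and the `SQL_INTEGER` value are hypothetical and do not reproduce the real DBI or CGI.pm code. The point is only that splatting a caller-controlled list lets a second submitted value land in the "type" argument.

```python
SQL_INTEGER = 4  # illustrative type code; treat the value as an assumption


def quote(value, data_type=None):
    """Toy quote(): values declared as integers are passed through unescaped."""
    if data_type is not None and str(data_type) == str(SQL_INTEGER):
        return str(value)  # no escaping, the caller promised an integer
    return "'" + str(value).replace("'", "''") + "'"


def param(query, name):
    """Toy param(): returns every value submitted for the given name."""
    return query.get(name, [])


# The attacker submits the parameter twice: user=<payload>&user=4
query = {"user": ["1 OR 1=1", "4"]}

# Splatting the list means the second value becomes data_type.
unsafe = quote(*param(query, "user"))
# Taking exactly one value keeps the type argument out of the attacker's hands.
safe = quote(param(query, "user")[0])

print(unsafe)  # 1 OR 1=1
print(safe)    # '1 OR 1=1'
```

In this sketch the fix is simply to take one scalar value explicitly before quoting, instead of passing along whatever list the request produced.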
[09:58:56] not really [09:59:31] <_joe_> I hope they have fully-encrypted cc with separate vaults [09:59:37] <_joe_> so it's hard to hack them [09:59:47] _joe_: if they don't, they're in big trouble with the CC agencies [10:00:10] <_joe_> valhallasw`cloud: well, I was saying I hope they have that done right [10:00:17] <_joe_> it's not always the case [10:00:27] <_joe_> but I won't say anything more than this :) [10:00:59] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: make puppet fail if the mwyaml backend fails to lookup [puppet] - 10https://gerrit.wikimedia.org/r/181550 (owner: 10Giuseppe Lavagetto) [10:03:46] (03PS2) 10Giuseppe Lavagetto: mediawiki: use role keyword in node defs, get rid of duplicate regexes [puppet] - 10https://gerrit.wikimedia.org/r/181596 [10:04:55] (03PS2) 10Dzahn: planets: add Varnish statement [puppet] - 10https://gerrit.wikimedia.org/r/181419 (owner: 10John F. Lewis) [10:08:39] (03CR) 10Dzahn: "wait, did we decide to install the cert on misc varnish and has it happened?" [puppet] - 10https://gerrit.wikimedia.org/r/181419 (owner: 10John F. Lewis) [10:09:26] godog: ^ so we are putting the cert on misc-web ? [10:10:12] ah, we did not but https://gerrit.wikimedia.org/r/#/c/181415/ [10:10:33] vs. comments from Faidon [10:11:32] mutante: no reply on https://phabricator.wikimedia.org/T60048 though the timing isn't ideal 23rd dec [10:11:34] (03CR) 10Dzahn: [C: 04-1] "we did not. it would still need https://gerrit.wikimedia.org/r/#/c/181415/" [puppet] - 10https://gerrit.wikimedia.org/r/181419 (owner: 10John F. Lewis) [10:12:02] I'd say "yes we want it" [10:12:06] godog: ah, yea. i was about to create subtasks for that one [10:12:08] one per service [10:12:25] nice [10:12:34] there is also the question what to do with contacts.. 
sigh [10:12:40] 3Beta-Cluster, operations: Beta servers can be badly misconfigured if mwyaml hiera backend fails - https://phabricator.wikimedia.org/T78408#954536 (10Joe) 5Open>3Resolved [10:12:45] first step will be finding out if it still has users [10:13:19] (03PS1) 10Springle: pool db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182773 [10:13:44] (03CR) 10Springle: [C: 032] pool db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182773 (owner: 10Springle) [10:13:48] (03Merged) 10jenkins-bot: pool db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182773 (owner: 10Springle) [10:14:33] !log springle Synchronized wmf-config/db-eqiad.php: pool db1057, warm up (duration: 00m 07s) [10:14:39] Logged the message, Master [10:15:09] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00334448160535 [10:20:18] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [10:21:55] (03PS1) 10Springle: depool db1061 db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182776 [10:22:33] (03CR) 10Springle: [C: 032] depool db1061 db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182776 (owner: 10Springle) [10:22:37] (03Merged) 10jenkins-bot: depool db1061 db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182776 (owner: 10Springle) [10:22:47] paravoid: are you against installing the planet SSL cert on misc-web? https://gerrit.wikimedia.org/r/#/c/181415/ you said once " Planet needs its own certificate anyway (second-level wildcard), so there is not much incentive here." 
but that would install it [10:23:13] re https://phabricator.wikimedia.org/T60048 [10:23:23] !log springle Synchronized wmf-config/db-eqiad.php: depool db1061 db1062 (duration: 00m 06s) [10:23:31] Logged the message, Master [10:24:24] 3operations: Kill manifests/realm.pp - https://phabricator.wikimedia.org/T85459#954538 (10Joe) a:3Joe [10:24:49] _joe_: ^ yay :) [10:25:19] <_joe_> YuviPanda: I am trying to use phab extensively as a task tracker, as mar.k asked us to do [10:25:23] <_joe_> it's not bad [10:25:27] yeah, it’s quite nice. [10:25:40] I’m using https://phabricator.wikimedia.org/project/board/939/query/all/ like how I used trello at the apps team [10:26:10] <_joe_> I should create my own board [10:26:11] <_joe_> :P [10:26:25] i'm missing a "refers to" link field to link one task to another without it being a real blocking task [10:26:32] _joe_: lame example https://phabricator.wikimedia.org/dashboard/view/6/ :] [10:26:37] also happy new year everyone [10:26:51] <_joe_> hey hashar, hi [10:27:46] (03PS1) 10Springle: move db1061 to s6, db1062 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/182777 [10:29:47] (03CR) 10Springle: [C: 032] move db1061 to s6, db1062 to s7 [puppet] - 10https://gerrit.wikimedia.org/r/182777 (owner: 10Springle) [10:35:21] mutante: edit the description [10:35:41] Not indexed by search and not twosided, though :-( [10:36:37] valhallasw`cloud: ok, thanks [10:37:15] (03CR) 10Filippo Giunchedi: "whoops, forgot to submit comments, PS2 contains the changes tho" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/181080 (owner: 10Filippo Giunchedi) [10:40:02] !log xtrabackup clone: db1037 to db1061, db1039 to db1062 [10:40:09] Logged the message, Master [10:40:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:53:01] <_joe_> springle: your change is unmerged on strontium, maybe a failed puppet-merge? 
[10:53:14] <_joe_> I can merge it if needed. [10:56:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:17:03] 3Wikimedia-SSL-related, operations: Put all zirconium vhosts behind misc varnish cluster - https://phabricator.wikimedia.org/T60048#954590 (10Dzahn) [11:20:04] _joe_: thanks [11:22:07] (03CR) 10Hashar: [C: 031] "I am not sure how gdash works, I guess we can get this change merged, apply it on the gdash host and see what happens. At worth, there wi" [puppet] - 10https://gerrit.wikimedia.org/r/166511 (https://bugzilla.wikimedia.org/65478) (owner: 10Nemo bis) [11:22:57] 3Wikimedia-SSL-related, operations: Put all zirconium vhosts behind misc varnish cluster - https://phabricator.wikimedia.org/T60048#610144 (10Dzahn) [11:23:43] 3Wikimedia-SSL-related, operations: Put all zirconium vhosts behind misc varnish cluster - https://phabricator.wikimedia.org/T60048#610144 (10Dzahn) [11:24:25] (03PS3) 10Dzahn: planets: add Varnish statement [puppet] - 10https://gerrit.wikimedia.org/r/181419 (owner: 10John F. Lewis) [11:25:02] (03PS4) 10Dzahn: cache: install the planet SSL cert on misc-web [puppet] - 10https://gerrit.wikimedia.org/r/181415 [11:25:24] (03PS4) 10Dzahn: planets: remove SSL stanza [puppet] - 10https://gerrit.wikimedia.org/r/181984 (owner: 10John F. Lewis) [11:26:02] (03PS3) 10Dzahn: planet: change dns to misc-web [dns] - 10https://gerrit.wikimedia.org/r/181985 (owner: 10John F. Lewis) [11:26:19] (03PS4) 10Dzahn: etherpad: remove SSL stanza [puppet] - 10https://gerrit.wikimedia.org/r/181413 (owner: 10John F. Lewis) [11:27:43] (03PS2) 10Dzahn: etherpad->misc-web-lb.eqiad [dns] - 10https://gerrit.wikimedia.org/r/181269 (owner: 10John F. Lewis) [11:27:53] (03PS2) 10Dzahn: etherpad: add Varnish misc config [puppet] - 10https://gerrit.wikimedia.org/r/181412 (owner: 10John F. Lewis) [11:29:57] (03CR) 10Nemo bis: "Ping springle et al." 
[puppet] - 10https://gerrit.wikimedia.org/r/178170 (owner: 10Nemo bis) [11:32:12] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#954619 (10Ciencia_Al_Poder) [11:36:59] (03PS1) 10Dzahn: bugzilla: remove bug-attachment 443 virtual host [puppet] - 10https://gerrit.wikimedia.org/r/182781 [11:37:58] (03PS2) 10Dzahn: bugzilla: remove bug-attachment 443 virtual host [puppet] - 10https://gerrit.wikimedia.org/r/182781 [11:38:29] 3operations: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#954624 (10mark) Let's update this ticket with sub tasks / blockers for the tasks mentioned below. Some of them are in old RT links of course, but let's change that into the newer Phab tickets... [11:42:08] (03PS1) 10Dzahn: move bug-attachments to misc-web [dns] - 10https://gerrit.wikimedia.org/r/182782 [11:47:53] (03PS4) 10Dzahn: planet: change dns to misc-web [dns] - 10https://gerrit.wikimedia.org/r/181985 (owner: 10John F. Lewis) [11:49:57] (03CR) 10Dzahn: "PS4: also move "planet" itself, not just the language subdomains. there are 2 apache templates in the planet module, planet.erb and planet" [dns] - 10https://gerrit.wikimedia.org/r/181985 (owner: 10John F. Lewis) [11:51:46] (03CR) 10Dzahn: [C: 04-1] "i updated the related DNS change to also switch the plain "planet" without a language over to misc-web. that vhost is in "planet.erb" vs. " [puppet] - 10https://gerrit.wikimedia.org/r/181984 (owner: 10John F. Lewis) [11:54:25] (03PS5) 10Dzahn: planets: remove SSL stanzas [puppet] - 10https://gerrit.wikimedia.org/r/181984 (owner: 10John F. Lewis) [11:56:07] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#954640 (10Aklapper) Are you sure this is not intentional? 
Last MediaWiki software deployment was 2014-12-17, and that's on purpose: https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap [11:57:16] (03CR) 10Jforrester: [C: 04-1] "-1: You need to update the commit message to say you're pointing to Special:BlankPage." [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [11:59:59] (03CR) 10Dzahn: [C: 032] "just a comment" [puppet] - 10https://gerrit.wikimedia.org/r/177376 (owner: 10Reedy) [12:02:22] 3operations: Document the new puppet and hiera role function - https://phabricator.wikimedia.org/T84976#954656 (10Joe) 5Open>3Resolved [12:07:21] (03CR) 10Dzahn: "i think we can handle the number of tickets that are actual list deletions and it seems good to have them documented as phabricator ticket" [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [12:13:12] (03PS3) 10Dzahn: base: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170477 (owner: 10John F. Lewis) [12:14:09] (03CR) 10Dzahn: [C: 031] base: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170477 (owner: 10John F. Lewis) [12:18:23] hmm [12:18:27] we should change the /topic [12:18:30] to point to phab [12:18:32] instead of rt :) [12:18:38] James_F: also, what does ‘on product duty’ mean? 
[12:21:32] At this point it means his name is James [12:21:41] I don't believe anyone else has ever been on product duty [12:21:49] heh [12:22:48] 3Engineering-Community, operations, WMF-Legal: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#954736 (10Dzahn) T84818 needs a NDA confirmation please [12:23:59] 3Engineering-Community, operations, WMF-Legal: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#954742 (10Dzahn) also T85170 please [12:25:23] 3WMF-NDA-Requests, operations: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#940607 (10Dzahn) afaict this needs legal to confirm the NDA exists and has been counter-signed [12:30:12] 3Wikimedia-Mailing-lists, operations: Upgrade Mailman to 2.1.15 - https://phabricator.wikimedia.org/T52864#954782 (10Dzahn) >>! In T52864#954340, @MZMcBride wrote: > Can we just install Mailman 3 instead? https://lists.debian.org/debian-devel/2014/05/msg00502.html [12:31:11] YuviPanda: Yeah, it means if you're about to do something big (e.g. switch off an extension on a wiki), talk to someone in Product to sanity-check first, and for the last 6 (9?) months that's been me unceasingly. [12:31:23] 3Wikimedia-Mailing-lists, operations: Upgrade Mailman to 2.1.15 - https://phabricator.wikimedia.org/T52864#954784 (10Dzahn) "We (upstream) very definitely expect that people will want to run existing MM2.1 installations in parallel with MM3, at least for a while or while they're testing out the transition." htt... [12:32:20] James_F: ah, right. [12:32:21] ok [12:32:37] YuviPanda: At some point I hope to convince someone else to take on the duty. :-) [12:32:44] :D [12:34:38] what's "the fundraising ticket system" [12:38:30] anyone here knows where to access icinga? 
https://icinga.wikimedia.org/icinga won't work for us here [12:40:20] anjeve: you'd need to sign a volunteer NDA (T655) to get added to an LDAP group and then you can login with your labs user [12:41:53] mutante: I think anjeve is with WMDE [12:42:00] didn’t we add a WMDE ldap group at some point? [12:42:40] we do have an LDAP group called "wmde" [12:43:09] login on icinga is granted to: [12:43:14] ops,wmf,nda [12:43:33] so the change would be to add "wmde" to those groups [12:43:40] or get all members of wmde proper nda [12:43:56] I was with WMDE, am with HPI now [12:44:10] we have a project working for Wikidata each year [12:44:29] Wikidata-quality - https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-quality [12:44:48] don't we use shinken for labs, yet? [12:44:50] it does not install the php and mysql packages though we checked the boxes on the special page [12:45:05] so we have this https://wikitech.wikimedia.org/wiki/Volunteer_NDA [12:45:10] http://shinken.wmflabs.org/ [12:45:16] but it's a draft and the process for phabricator is still being written [12:45:18] we hoped to get more insights into the running puppet manifests and their success [12:45:24] anjeve: In that case you probably don't want icinga [12:45:33] just log into the boxes and see what's up there [12:45:44] hoo: oh yeah, labs is shinken. guest/guest [12:45:46] Icinga will (at most) tell you stuff failed [12:45:46] oh, you want labs icinga? [12:46:03] http://icinga.wmflabs.org/ but it's broken [12:46:48] anjeve: oh, do you want notifications whenever puppet fails on that project? 
[12:47:02] I think I need a status quo firstly ;) [12:47:12] php is not installed though in the table above [12:47:18] https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-quality [12:47:36] and I don't know what to trigger to get php and the whole lamp stack and mw from gerrit [12:47:49] http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1420462061.956&target=wikidata-quality.wikidata-quality-playground.puppetagent.failed_events.value [12:48:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [12:48:08] anjeve: if you want MW you want to use https://wikitech.wikimedia.org/wiki/Labs-vagrant [12:48:34] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [12:50:54] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [12:52:53] YuviPanda: Other thing: [12:52:56] dig wikidata.beta.wmflabs.org [12:53:01] ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 [12:53:03] wtf [12:53:50] hmm [12:53:59] dig m.wikidata.beta.wmflabs.org works [12:54:17] hoo: when did this stop working? [12:54:26] Coren: ^ more labs DNS woes, unsure if these are related. [12:54:29] No idea, honestly [13:06:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:08:54] Wikimedia-Mailing-lists, operations: Upgrade Mailman to 2.1.15 - https://phabricator.wikimedia.org/T52864#954874 (faidon) >>! In T52864#954340, @MZMcBride wrote: > Can we just install Mailman 3 instead? Mailman 3 has not been released yet and have been in beta for a while. I've been following it a bit, it's... [13:12:09] YuviPanda: how about merging "shinken: Add ssh checks" ? [13:13:16] mutante: oh, must -2 that. for some reason I can’t actually ssh from shinken host to other hosts...
[13:13:40] (CR) Dzahn: [C: 1] Add wikimania2016.wikimedia.org to ServerAlias [puppet] - https://gerrit.wikimedia.org/r/181892 (owner: Reedy) [13:13:50] (CR) Yuvipanda: [C: -2] "For some reasons, shinken can't actually ssh to lots of labs hosts. Requires some network poking around before this can be merged." [puppet] - https://gerrit.wikimedia.org/r/181807 (owner: Yuvipanda) [13:14:03] YuviPanda: gotcha, 'k [13:14:57] (CR) Dzahn: [C: 2] Add wikimania2016 [dns] - https://gerrit.wikimedia.org/r/181891 (owner: Reedy) [13:24:50] (CR) Dzahn: "sounds like the "default" security group needs a hole for SSH from the shinken instance" [puppet] - https://gerrit.wikimedia.org/r/181807 (owner: Yuvipanda) [13:27:34] (CR) JanZerebecki: "If labsproxy does not set X-Forwarded-Proto it should. Yes change it so its closer to production." [puppet] - https://gerrit.wikimedia.org/r/181949 (owner: Hoo man) [13:31:31] (CR) Dzahn: "this by itself shouldn't do any harm but before the DNS change need to remove the SSL part from Apache config" [puppet] - https://gerrit.wikimedia.org/r/180248 (owner: Ottomata) [13:35:43] (Abandoned) Hashar: mwdeploy private key is only for production [puppet] - https://gerrit.wikimedia.org/r/179875 (owner: Hashar) [13:44:04] (PS1) Yuvipanda: Add base table name to indexed views [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/182799 [13:55:23] (CR) Hashar: "The jenkins-deploy user is not known upon boot which lock the instance on startup.
We could have the tmpfs mounted just like /tmp : owned " [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [13:55:43] (PS1) Dzahn: monitoring: fix a bunch of indentation warnings [puppet] - https://gerrit.wikimedia.org/r/182800 [13:56:33] (CR) Hashar: [C: 1] monitoring: fix a bunch of indentation warnings [puppet] - https://gerrit.wikimedia.org/r/182800 (owner: Dzahn) [13:57:15] (PS1) Yuvipanda: Make table columns dicts instead of lists [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/182801 [13:59:15] (PS2) Yuvipanda: Make table columns dicts instead of lists [software/labsdb-auditor] - https://gerrit.wikimedia.org/r/182801 [14:01:22] (PS1) Dzahn: ganglia_new: fix indentation warnings [puppet] - https://gerrit.wikimedia.org/r/182802 [14:02:04] (CR) Dzahn: [C: 2] add dev.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/181398 (owner: Dzahn) [14:03:25] (CR) Gilles: [C: 1] Make upload.wikimedia.org set Timing-Allow-Origin [puppet] - https://gerrit.wikimedia.org/r/181405 (owner: Unicodesnowman) [14:08:22] (PS4) Hashar: contint: tmpfs is now root:root and world writable [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [14:10:11] operations: Varnish: the lower the Age value, the slower the request - https://phabricator.wikimedia.org/T84980#954987 (Gilles) >>! In T84980#942109, @BBlack wrote: > The left axis is ms and the bottom is cache age? Yep >>! In T84980#942109, @BBlack wrote: > It would basically be a result of the "silo" desi... [14:10:44] (PS5) Hashar: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [14:13:48] (CR) Krinkle: [C: -1] "It was reduced from 512M to 128M.
I think we should keep it at 512M (matching what we already use on gallium and lanthanum and what I alre" [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [14:15:56] (PS5) Krinkle: contint: tmpfs is now root:root and world writable [puppet] - https://gerrit.wikimedia.org/r/173511 [14:16:19] (PS6) Krinkle: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [14:16:27] (CR) Krinkle: [C: 1] contint: tmpfs is now root:root and world writable [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [14:16:31] ops-core, operations, Phabricator: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#954999 (Dzahn) [14:16:33] operations: provide a database to test sanitizing Bugzilla db - https://phabricator.wikimedia.org/T85150#954997 (Dzahn) Open>Resolved thank you. that works. i can run the script now. resolving [14:19:23] ops-core, operations, Phabricator: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#955003 (Dzahn) now that we can write: Deleting product 'Security'... Deleting 0 bugs in security groups... Done DBD::mysql::db selectcol_arrayref failed: Table 'bugzilla.longdescs_tags'... [14:19:55] (CR) Hashar: "Yeah 512MB will probably do though it is certainly overkill."
[puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [14:22:42] (PS7) Hashar: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [14:26:36] ops-eqiad: virt1004 - https://phabricator.wikimedia.org/T85798#955008 (Cmjohnson) NEW [14:27:54] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Puppet has 1 failures [14:28:32] operations, ops-eqiad: virt1004 - https://phabricator.wikimedia.org/T85798#955030 (Cmjohnson) [14:39:30] Reedy: Comments on, https://gerrit.wikimedia.org/r/#/c/181546/ when you're free, please! [14:39:37] YuviPanda: I get an exception when provisioning vagrant as described [14:39:40] Wrapped exception: [14:39:41] invalid byte sequence in US-ASCII [14:39:42] Error: invalid byte sequence in US-ASCII at /vagrant/puppet/modules/hhvm/manifests/init.pp:1 on node wikidata-quality-playground.eqiad.wmflabs [14:39:58] converted the manifest to unix already [14:40:06] any idea? [14:40:11] anjeve: which instance is this? [14:40:42] wikidata-quality-playground [14:41:10] (CR) Hashar: "I have applied this on all labs instance (via child change https://gerrit.wikimedia.org/r/#/c/173512/ )." [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [14:41:19] ops-core, operations, Phabricator: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#955057 (Dzahn) it's from: ``` sub delete_deleted_comments { # Delete all comments tagged as 'deleted' my $comment_ids = $dbh->selectcol_arrayref("SELECT comment_id FROM longdescs_tags WHER...
[14:41:30] (CR) Hashar: [C: 1 V: 1] contint: tmpfs is now root:root and world writable [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [14:42:15] anjeve: looking [14:42:21] (CR) Hashar: [C: 1 V: 1] "Per Timo, made the tmpfs 512MBytes on the labs instance to match production." [puppet] - https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: Krinkle) [14:43:31] anjeve: I just ran ‘labs-vagrant provision’ and it seems to be running fine [14:43:36] let’s wait for it to finish... [14:45:54] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:20] YuviPanda: and? [14:50:50] anjeve: still running, let’s wait for it to finish... [14:56:30] (CR) Hashar: [V: 2] "I have rebooted integration-slave1001 and confirmed it comes back just fine." [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [14:57:33] ops-core, operations, Phabricator: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#955075 (Dzahn) our version: 4.4.5 - script says "Last validated against Bugzilla version 4.0" ---- 06:52 -!- Irssi: Join to #bugzilla was synced in 0 secs 06:55 < mutante> hi. i'm trying to use "s...
[15:02:14] (CR) Dzahn: "i agree that "roots" makes you expect they are actual roots, and a limitation to a non-root user would make me think it's an "-admins" gro" [puppet] - https://gerrit.wikimedia.org/r/182585 (owner: GWicke) [15:04:59] (CR) Dzahn: [C: 1] "+1 per "check other *-roots" groups in data.yaml, almost all are actual ALL ALL roots, except parsoid and mathoid (change it for mathoid a" [puppet] - https://gerrit.wikimedia.org/r/182585 (owner: GWicke) [15:11:56] Ops-Access-Requests, Analytics-Cluster: Access to Hadoop Cluster for Ananth Ramakrishnan (new contractor) - https://phabricator.wikimedia.org/T85229#955098 (Ottomata) Open>Resolved [15:22:52] (PS6) Hashar: contint: tmpfs is now root:root and world writable [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [15:23:32] (CR) Hashar: "The file {} directive would override the user/group/mode set by mount {} :-( Removed them from file {}" [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [15:26:42] WMF-NDA-Requests, operations: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#955118 (Qgil) a:LuisV_WMF Passing the ball. [15:33:29] (CR) Hashar: contint: tmpfs is now root:root and world writable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173511 (owner: Krinkle) [15:45:14] (PS1) Glaisher: Remove 'collectionsaveascommunitypage' from ruwiki users [mediawiki-config] - https://gerrit.wikimedia.org/r/182816 [15:49:22] (CR) JanZerebecki: [C: -1] "Missing something like https://gerrit.wikimedia.org/r/#/c/175595/ ?" [dns] - https://gerrit.wikimedia.org/r/182782 (owner: Dzahn) [15:50:23] (PS3) JanZerebecki: bugzilla: remove bug-attachment 443 virtual host [puppet] - https://gerrit.wikimedia.org/r/182781 (owner: Dzahn) [15:50:41] manybubbles, marktraceur, ^d: Who wants to SWAT this morning?
[15:50:44] * anomie would prefer not to [15:50:59] anomie: I got it [15:51:03] manybubbles: Ok! [15:51:10] Thanks manybubbles :) [15:51:19] np! [15:51:53] (PS1) Florianschmidtwelzow: mediawikiwiki: Add Api: and Skin: namespace to default searched namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/182818 [15:52:35] ops-codfw: ms-be2003.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T85591#955213 (Papaul) Hello Papaul, Thank you for contacting Dell Enterprise Support! The following information includes the applicable case and dispatch numbers related to our conversation: Service Request No.: 905... [15:54:50] (PS3) Giuseppe Lavagetto: mediawiki: use role keyword in node defs, get rid of duplicate regexes [puppet] - https://gerrit.wikimedia.org/r/181596 [15:54:52] (CR) GWicke: "@Daniel: The restart using upstart requires root (service parsoid restart). Upstart changes the user to parsoid in the process." [puppet] - https://gerrit.wikimedia.org/r/182585 (owner: GWicke) [15:55:54] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: use role keyword in node defs, get rid of duplicate regexes [puppet] - https://gerrit.wikimedia.org/r/181596 (owner: Giuseppe Lavagetto) [15:56:04] ops-codfw: ms-be2011.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T85445#955221 (Papaul) Hello Papaul, Thank you for contacting Dell Enterprise Support! The following information includes the applicable case and dispatch numbers related to our conversation: Service Request No.: 905... [15:58:03] RECOVERY - puppet last run on ms1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:58:40] puppet disabled all across the mw cluster? [15:58:42] that you _joe_?
[15:59:03] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [15:59:15] Project-Creators, ops-core, operations, ops-requests: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#955222 (chasemp) "mail" as a tag is meant to be generic. It would be given context in conjunction with other tags (preferably not of the 'label'... [15:59:45] hoo and edsanders|away: SWAT is coming, ready to support your patches? [16:00:01] paravoid: _joe_ just pinged me to tell me he is sick [16:00:05] manybubbles, anomie, ^d, marktraceur, edsanders: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150105T1600). [16:00:15] manybubbles: Yus! [16:00:23] manybubbles: me too! :P [16:00:31] springle: ping [16:00:44] (CR) Manybubbles: [C: 2] Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt) [16:00:47] I always get missed because I'm late at updating [[Deployments]] :/ [16:00:55] (PS1) Yuvipanda: maintain-replicas: Use consistent IF's to redact info [software] - https://gerrit.wikimedia.org/r/182819 [16:00:56] Coren: ^ [16:01:00] (Merged) jenkins-bot: Display links to Wikidata in the other project sidebar [mediawiki-config] - https://gerrit.wikimedia.org/r/181871 (owner: Tpt) [16:01:01] jesus [16:01:06] what has happened to me [16:01:09] * YuviPanda removes stray apostrophe [16:01:23] (PS2) Yuvipanda: maintain-replicas: Use consistent IFs to redact info [software] - https://gerrit.wikimedia.org/r/182819 [16:02:17] Huh. I wonder why I wrote those the other way 'round. [16:02:48] (CR) coren: [C: 1] "More consistent = easier to find bugs" [software] - https://gerrit.wikimedia.org/r/182819 (owner: Yuvipanda) [16:02:58] Coren: hmm, can we change the views in the db too? it’s messing with labsdb-auditor...
[16:03:10] (CR) Yuvipanda: [C: 2] maintain-replicas: Use consistent IFs to redact info [software] - https://gerrit.wikimedia.org/r/182819 (owner: Yuvipanda) [16:03:30] !log manybubbles Synchronized wmf-config/Wikibase.php: SWAT Display links to Wikidata in the other project sidebar (duration: 00m 06s) [16:03:31] hoo: ^^^^ [16:03:35] I can force a run for that table if you want. [16:03:36] Logged the message, Master [16:03:36] thanks! [16:03:44] Coren: yeah, that would be nice [16:04:52] Aw, eff. I had locally uncommitted changes to maintain-replicas. [16:05:11] bad Coren is bad :P [16:05:33] grrr [16:05:35] Should not be an issue, I think I can just discard them (I'm checking now). IIRC it was to do a partial run by hand. [16:06:42] Yeah, nothing substantive in there. [16:07:11] hoo: pong [16:07:29] (CR) Manybubbles: [C: 1] "Looks ok to me but I'm no expert in permissions. I'll swat in a few minutes." [mediawiki-config] - https://gerrit.wikimedia.org/r/182367 (owner: Glaisher) [16:07:30] springle: Any idea why we only have sites tables on some wikis? [16:09:04] springle: They are defined in core's tables.sql, so should be there [16:09:30] YuviPanda: ipblocks views are being updated [16:09:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:09:53] (This may take up to 10-15 minutes) [16:10:11] edsanders|away: so your submodule update isn't passing in jenkins - I'm going to skip it for now and investigate. I don't know of a good reason why its failing so I might override jenkins and just push it through. unless you don't come online in the next few minutes.
[16:10:13] Coren: \o/ cool [16:11:01] (CR) Manybubbles: [C: 2] Remove 'collectionsaveascommunitypage' from ruwiki users [mediawiki-config] - https://gerrit.wikimedia.org/r/182816 (owner: Glaisher) [16:11:07] (Merged) jenkins-bot: Remove 'collectionsaveascommunitypage' from ruwiki users [mediawiki-config] - https://gerrit.wikimedia.org/r/182816 (owner: Glaisher) [16:11:34] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: puppet fail [16:12:14] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT disable creating books in the wikipedia namespace (duration: 00m 06s) [16:12:15] Glaisher: ^^^ [16:12:23] Logged the message, Master [16:12:29] :D thanks [16:12:42] manybubbles: some tidy tests are broken in wmf branches [16:13:04] for some hhvm-related reason [16:13:12] i think it's known and ignored [16:13:31] MatmaRex: ah - k. I see its tidy and alder32 tests - I can ignore if someone is working on making it stop [16:13:42] (CR) Manybubbles: [C: 2] Create 'uploader' user group on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/182367 (owner: Glaisher) [16:13:44] manybubbles: as for what it does, it apparently fixes switching between wikitext and VE in mobile [16:13:48] (Merged) jenkins-bot: Create 'uploader' user group on kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/182367 (owner: Glaisher) [16:14:06] manybubbles: i don't think anyone's working, but i do think that no one currently cares [16:14:33] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT disable creating books in the wikipedia namespace AND shuffle some upload permissions on kowiki (duration: 00m 05s) [16:14:39] Logged the message, Master [16:14:40] Glaisher: that last one didn't take. this one got both. sorry ^^^ [16:15:28] oh.. I was wondering why it was visible at Listgrouprights [16:15:33] thanks :D [16:15:41] Glaisher: so you are all done! [16:15:52] right! :) [16:15:58] MatmaRex: ok cool.
you want to be the "verifier"? [16:16:06] looks like edsanders|away isn't going to come online [16:16:33] manybubbles: i suppose i could [16:16:40] manybubbles: btw, one question.. are SWATs done for wiki deletions? [16:16:54] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: puppet fail [16:16:54] Glaisher: idunno [16:17:18] Beta-Cluster, Labs-Team, operations: Core dumps fill up /var on labs instances - https://phabricator.wikimedia.org/T1259#955291 (greg) >>! In T1259#943783, @yuvipanda wrote: > Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disable them? @joe or @ori or @bd808 ? [16:17:21] I'm not sure why not but I haven't deleted a wiki before so I can't be sure 100% what is involved. [16:17:23] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:17:37] I saw that it was done for closures [16:17:43] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail [16:18:48] hoo: don't know, before my time. there is no schema change ticket for it, though new tables don't always get one. https://gerrit.wikimedia.org/r/#/c/23528/ mentions a transition from interwiki. perhaps relevant? [16:20:31] springle: Yeah... it's on the wiki where we're making use of it [16:20:34] but not on all [16:20:42] that could at some point strike us [16:21:04] Beta-Cluster, Labs-Team, operations: Core dumps fill up /var on labs instances - https://phabricator.wikimedia.org/T1259#955312 (bd808) >>! In T1259#955291, @greg wrote: >>>! In T1259#943783, @yuvipanda wrote: >> Also, all of these are hhvm - are the hhvm core dumps from beta useful at all, or should we disab... [16:21:21] !log manybubbles Synchronized php-1.25wmf13/extensions/VisualEditor/: SWAT fix switching between wikitext and VE on mobile (duration: 00m 14s) [16:21:28] Logged the message, Master [16:21:52] MatmaRex: ^^ [16:23:37] Hi all! [16:24:40] renoirb: hi!
[16:24:46] MatmaRex: can you check that that worked? [16:24:46] We get a lot of spam accounts on our wiki which is running on wmf/1.24wmf16 any advice on how to lower that spamming? [16:24:54] thanks manybubbles [16:25:09] MatmaRex: thanks for checking it even though it wasn't yours [16:25:22] Happy New year! My servers missed me, they told me they’ve been naughty and were spamming everywhere :( [16:25:54] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail [16:26:12] manybubbles: yeah, in a minute, sorry [16:27:19] renoirb: I'm not even sure which direction to point you in, sorry! I'm too specific.... [16:28:04] ops-core, operations: Deploy hhvm 3.3.1+dfsg1-1+wm1 to the production cluster - https://phabricator.wikimedia.org/T85812#955320 (Joe) NEW [16:28:17] manybubbles: looks like there is at least one more bug related to switching editors in mobile, and i currently can't get it to work, eh [16:28:45] MatmaRex: well, it isn't worse than before so I won't have to rollback [16:28:50] yep [16:28:51] that is a comfort [16:28:55] manybubbles: it works on betalabs and on enwiki (wmf12) [16:29:01] so there's something else broken in wmf13 [16:29:05] sneaky [16:29:10] probably fix to https://phabricator.wikimedia.org/T85480 needs backporting [16:30:17] * manybubbles done with SWAT [16:30:38] thanks for editing the deploy page, I just updated it with the full week's schedule [16:30:54] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:31:25] I’m looking at extensions that might help me, maybe I’m missing an extension that I should use. [16:32:36] renoirb: there's a page at mw.org specifically for that issue [16:32:40] lemmefindit [16:32:53] Glaisher, thanks! [16:32:59] https://www.mediawiki.org/wiki/Manual:Combating_spam [16:33:23] There’s a lot of pages about many issues.
Thing is which one is most recent+relevant [16:34:09] Glaisher are you aware of one that gives hints on particular wmf branch? [16:34:33] I should review all submodules in current wmf branch and update my own fork. https://github.com/webplatform/mediawiki-core [16:34:47] AbuseFilter, SpamBlacklist and ConfirmEdit works [16:34:47] all deployed to Wikimedia wikis [16:34:55] latest, that is [16:35:00] Thanks! [16:36:19] (PS1) Ottomata: Send ganglia-logtailer cron output to logfile [puppet] - https://gerrit.wikimedia.org/r/182826 [16:38:00] springle: Can you take care of creating that table on all wikis? [16:38:13] hoo: T85813 [16:38:19] awesome [16:38:20] when i'm more awake [16:38:26] https://gerrit.wikimedia.org/r/#/c/182421/3/sql/bounce_records.sql [16:38:33] Hi springle , remember https://gerrit.wikimedia.org/r/#/c/178170/ :) [16:38:48] springle: ^ do you see any reason to have a character limit on that index? [16:39:09] (my patch doesn't require being particularly awake :P) [16:39:20] not that it's going to matter, probably [16:40:23] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: puppet fail [16:42:41] hoo: maybe someone expected the table to get very large with many bounces from people with insanely long email addresses [16:42:56] so, no [16:43:16] :P [16:43:30] (PS4) Springle: Update cached article count monthly to avoid social unrest [puppet] - https://gerrit.wikimedia.org/r/178170 (owner: Nemo bis) [16:43:44] (CR) Ottomata: [C: 2] Send ganglia-logtailer cron output to logfile [puppet] - https://gerrit.wikimedia.org/r/182826 (owner: Ottomata) [16:44:07] (PS1) Ottomata: Remove analytics1003 rsync job [puppet] - https://gerrit.wikimedia.org/r/182829 [16:44:51] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:45:14] (CR) Ottomata: [C: 2] Remove analytics1003 rsync job [puppet] - https://gerrit.wikimedia.org/r/182829 (owner: Ottomata)
[16:45:37] (PS5) Springle: Update cached article count monthly to avoid social unrest [puppet] - https://gerrit.wikimedia.org/r/178170 (owner: Nemo bis) [16:46:12] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:46:41] (CR) Springle: [C: 2] Update cached article count monthly to avoid social unrest [puppet] - https://gerrit.wikimedia.org/r/178170 (owner: Nemo bis) [16:46:41] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: puppet fail [16:47:26] Nemo_bis: thanks for vslow [16:50:41] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 444.266663 [16:51:12] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: puppet fail [16:51:32] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [16:53:23] Thanks :) [16:53:51] RECOVERY - Varnishkafka Delivery Errors on cp3017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:56:55] Hi all and a happy new year! :) [16:57:02] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: puppet fail [16:57:42] hi silke [16:58:05] Is there anyone around who can give me a hint for the config of the LDAPAuthentication extension? [16:58:47] !log syslog events not being recorded in logstash as expected (apache2, hhvm) [16:58:52] Logged the message, Master [16:59:22] Mapping the MW security groups works but I don't know what to name the group all usual authenticated users go into.
[16:59:42] hi mark :) [16:59:43] (PS1) Giuseppe Lavagetto: use role for imagescalers as well [puppet] - https://gerrit.wikimedia.org/r/182831 [17:00:43] (PS3) Ottomata: Lower request topic_request_timeout_ms on all varnishkafkas to 2000 (2 seconds) [puppet] - https://gerrit.wikimedia.org/r/182469 [17:00:44] !log restarted logstash on logstash1001 to see if that will make syslog events come back [17:00:50] Logged the message, Master [17:00:53] and it di [17:00:55] *did [17:01:00] (CR) Ottomata: [C: 2 V: 2] Lower request topic_request_timeout_ms on all varnishkafkas to 2000 (2 seconds) [puppet] - https://gerrit.wikimedia.org/r/182469 (owner: Ottomata) [17:02:05] These internal logstash lockups are getting to be annoying. I can't wait to get more traffic logging via redis queues [17:04:59] _joe_: here? [17:06:22] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail [17:07:04] cscott: hi [17:07:40] (PS4) BryanDavis: beta: honor log sampling and levels for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181349 [17:07:54] (CR) BryanDavis: [C: 2] beta: honor log sampling and levels for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181349 (owner: BryanDavis) [17:07:58] (Merged) jenkins-bot: beta: honor log sampling and levels for logstash [mediawiki-config] - https://gerrit.wikimedia.org/r/181349 (owner: BryanDavis) [17:09:21] !log bd808 Synchronized wmf-config/logging-labs.php: Beta logging config change (b47ee787) (duration: 00m 06s) [17:09:25] Logged the message, Master [17:10:14] <_joe_> paravoid: yes [17:10:27] _joe_: lots of mw* puppet fail [17:10:42] at least mw1153-1160 [17:10:50] <_joe_> paravoid: imagescalers, https://gerrit.wikimedia.org/r/#/c/182831/ will fix that [17:10:58] ah, k [17:10:59] (PS2) Giuseppe Lavagetto: use role for imagescalers as well [puppet] - https://gerrit.wikimedia.org/r/182831 [17:11:14] jfyi :) [17:11:25] (CR) Giuseppe Lavagetto: [C:
2] use role for imagescalers as well [puppet] - https://gerrit.wikimedia.org/r/182831 (owner: Giuseppe Lavagetto) [17:12:45] Labs-Team, Beta-Cluster, operations: Core dumps fill up /var on labs instances - https://phabricator.wikimedia.org/T1259#955410 (greg) @Ori, ideas on how to manage these? Should you (or someone else close to HHVM) take a weekly gander at the dumps on beta or something else? Ideally we'd have auto bug reportin... [17:13:02] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.42453020252e-91 ottomata Ok. we are experimenting with different settings. We also might replace this node in the next few weeks. [17:13:47] Labs-Team, operations: stray files created in /etc/ssh/userkeys - https://phabricator.wikimedia.org/T85814#955415 (fgiunchedi) NEW [17:14:03] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.046 second response time [17:14:36] mh it is in trouble alright, taking a look [17:14:57] (CR) Ottomata: "Aye, the SSL config isn't actually in Apache, it is in Hue itself (which I think it Tomcat?). But, yes, this should be pretty easy. I wa" [puppet] - https://gerrit.wikimedia.org/r/180248 (owner: Ottomata) [17:15:01] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:15:03] (PS2) Ottomata: Set up hue.wikimedia.org backend on misc-web-lb [puppet] - https://gerrit.wikimedia.org/r/180248 [17:15:12] PROBLEM - RAID on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:32] PROBLEM - Graphite Carbon on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:15:42] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:16:21] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:16:32] PROBLEM - SSH on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:32] PROBLEM - dhclient process on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:16:39] (CR) Ottomata: [C: 2] Set up hue.wikimedia.org backend on misc-web-lb [puppet] - https://gerrit.wikimedia.org/r/180248 (owner: Ottomata) [17:16:52] PROBLEM - puppet last run on tungsten is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:17:12] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:22] RECOVERY - RAID on tungsten is OK: OK: optimal, 1 logical, 2 physical [17:17:32] RECOVERY - SSH on tungsten is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [17:17:32] RECOVERY - dhclient process on tungsten is OK: PROCS OK: 0 processes with command name dhclient [17:17:42] RECOVERY - Graphite Carbon on tungsten is OK: OK: All defined Carbon jobs are runnning. [17:18:01] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [17:18:42] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:19:22] wtf: http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [17:19:50] er, wrong channel, but, still: [17:19:51] [4e1ca5cd] /wiki/Main_Page InvalidArgumentException from line 88 of /srv/mediawiki/php-master/includes/libs/ObjectFactory.php: Provided specification lacks both factory and class parameters. [17:20:03] (PS1) BryanDavis: beta: capture $logstashHandler for closure [mediawiki-config] - https://gerrit.wikimedia.org/r/182839 [17:20:06] bd808: ^ smells like librarization?
:) [17:20:23] logging probably. working on it now [17:20:24] manybubbles, hi [17:20:29] bd808: ty [17:20:44] yeah: [17:20:45] #0 /srv/mediawiki/php-master/includes/debug/logger/monolog/Spi.php(220): ObjectFactory::getObjectFromSpec(NULL) [17:20:48] #1 /srv/mediawiki/wmf-config/logging-labs.php(149): MWLoggerMonologSpi->getHandler(NULL) [17:21:28] (03CR) 10BryanDavis: [C: 032] beta: capture $logstashHandler for closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182839 (owner: 10BryanDavis) [17:21:33] (03Merged) 10jenkins-bot: beta: capture $logstashHandler for closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182839 (owner: 10BryanDavis) [17:22:06] (possibly a big query locked the graphite webapp up, recovered by itself) [17:22:08] greg-g: Can I deploy two Wikidata backports now-ish? [17:22:12] !log bd808 Synchronized wmf-config/logging-labs.php: Beta logging config change (5b628827) (duration: 00m 06s) [17:22:15] Logged the message, Master [17:22:25] hoo: what are they? :) [17:22:27] is there still a problem with VE? 
[17:23:15] greg-g: https://gerrit.wikimedia.org/r/182750 and https://gerrit.wikimedia.org/r/182661 [17:23:16] edsanders: on mobile, yes [17:23:44] (a second bug, it seems) [17:23:58] hoo: doit [17:24:11] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:24:24] greg-g: Thanks, I will :) [17:25:42] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:28:26] (03PS1) 10Yuvipanda: labslamp: Add php5-cli [puppet] - 10https://gerrit.wikimedia.org/r/182840 [17:28:43] (03PS2) 10Yuvipanda: labslamp: Add php5-cli [puppet] - 10https://gerrit.wikimedia.org/r/182840 [17:29:52] (03PS1) 10Ottomata: Serve hue requests to backend analytics1027 on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/182842 [17:30:12] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:30:36] mutante: that look ok? ^^ [17:30:37] (03CR) 10jenkins-bot: [V: 04-1] Serve hue requests to backend analytics1027 on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/182842 (owner: 10Ottomata) [17:30:41] oop, guess not :p [17:31:11] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [17:31:18] (03PS2) 10Ottomata: Serve hue requests to backend analytics1027 on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/182842 [17:31:29] there we go ^ [17:31:31] (03CR) 10Yuvipanda: [C: 032] labslamp: Add php5-cli [puppet] - 10https://gerrit.wikimedia.org/r/182840 (owner: 10Yuvipanda) [17:34:31] (03CR) 10Dzahn: [C: 031] "i haven't used backend_options before because they were all just on port 80 but yea, this looks reasonable, just like the other examples a" [puppet] - 10https://gerrit.wikimedia.org/r/182842 (owner: 10Ottomata) [17:34:42] mutante, legal OKayed my NDA? [17:35:20] MWTidy still awry on hhvm ... 
:S [17:35:28] thanks mutante, yeah, that is actually a little strange, because there might be more than one service on a single node. i guess that means i might need multiple backends, but it look slike it infers the backend name from the hostname, so mehhh? [17:35:34] i'll deal with that when/if I hafta [17:35:38] (03PS3) 10Ottomata: Serve hue requests to backend analytics1027 on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/182842 [17:37:06] ottomata: yes it means we can't just have multiple backends on one host, we ran into that before [17:37:26] (03CR) 10Ottomata: [C: 032] Serve hue requests to backend analytics1027 on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/182842 (owner: 10Ottomata) [17:37:36] Krenair: not specifically, but this is based on you being a "WMF" user in the first place [17:38:10] Krenair: the LDAP group nda isn't the same as NDA for security bugs .. [17:38:11] mutante, I have the wmf group for that [17:38:16] I know [17:38:24] Krenair: yes, but that wouldnt give you login on icinga/graphite/ [17:38:36] yes it does [17:39:41] ah you are right. it was redundant to add you [17:40:27] I am told that my NDA has been signed by WMF, but haven't seen the signature myself. [17:40:41] Last time I asked Chris, it was waiting for legal review or something like that IIRC... [17:41:21] Krenair: yea, for that we have to wait for legal [17:41:35] maybe in the future it can be legalpad.wm [17:41:49] Okay. Please actually ask legal for the current status. 
[17:42:44] i'm not sure how to when not at office [17:42:53] let me try adding CCs [17:42:58] !log hoo Synchronized php-1.25wmf12/extensions/Wikidata/: Update Wikibase: Fix SpecialEntityData and enhance populateSitesTable (duration: 00m 14s) [17:43:03] Logged the message, Master [17:44:20] !log hoo Synchronized php-1.25wmf13/extensions/Wikidata/: Update Wikibase: Fix SpecialEntityData and enhance populateSitesTable (duration: 00m 24s) [17:44:24] Logged the message, Master [17:44:42] PROBLEM - configured eth on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:44:52] PROBLEM - DPKG on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:44:53] PROBLEM - puppet last run on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:01] PROBLEM - Disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:01] PROBLEM - RAID on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:02] PROBLEM - dhclient process on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:02] PROBLEM - salt-minion processes on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:45:32] PROBLEM - parsoid disk space on wtp1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:46:31] (03PS5) 10BryanDavis: monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 [17:46:38] 3ops-requests: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#955489 (10Aklapper) [17:47:41] (03CR) 10BryanDavis: monolog: honor log sampling and levels for logstash (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (owner: 10BryanDavis) [17:48:07] mutante: do you know if there is a way to force https at the varnish level? [17:48:13] with misc-lb? [17:50:02] (03CR) 10BryanDavis: [C: 04-2] "Blocking with -2 so I don't do anything dumb here. 
Should be safe to deploy after 1.25wmf14 hits group1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (owner: 10BryanDavis) [17:50:31] PROBLEM - Parsoid on wtp1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:49] (03CR) 10jenkins-bot: [V: 04-1] monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (owner: 10BryanDavis) [17:51:01] ottomata: with non-misc stuff, we've generally had the app-layer apache/service do that with a redirect [17:51:22] yeah, makes sense, i don't seem to have much control over that unfortunetly, am trying. [17:52:07] it's possible to redirect in varnish, but not recommended: https://www.varnish-cache.org/trac/wiki/VCLExampleRedirectInVCL [17:52:21] (because it's kind of an ill fit and a hack there) [17:53:04] basically you detect non-https and set a custom bogus error code, then in vcl_error you detect your custom bogus error code and issue the 3xx [17:53:38] (03PS6) 10BryanDavis: monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 [17:54:00] is ops-requests@ still usable or is there a new world with Phab? [17:54:03] ottomata: you would typically use this in the Apache layer: RewriteCond %{HTTP:X-Forwarded-Proto} !https [17:54:22] ottomata: as opposed to being able to just check protocol because now it just speaks http [17:54:46] cajoel, https://www.mediawiki.org/wiki/Phabricator/versus_RT#Submitting_requests_to_the_Operations_team [17:55:01] andre__: thx [17:55:02] ...though someone(TM) should update https://wikitech.wikimedia.org/wiki/RT [17:55:02] ottomata: ah right, not Apache.. yea [17:55:23] ha, mutante, i mean, I could set up another apache proxy on that box too :/ [17:55:31] that would solve any future multi backend problems too [17:55:33] but meh :( [17:55:54] it does say this at the top [17:56:52] ottomata: you could.. 
you would still save a public IP [17:57:39] this box is private anyway [18:09:50] (03PS1) 10Yuvipanda: Fix whitelist value being ignored when reading greylisted.yaml [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182847 [18:09:52] (03PS1) 10Yuvipanda: Add report that verifies view definitions [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182848 [18:09:53] legoktm: ^ parser is here :) [18:16:21] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:17:20] sigh, need to build package... [18:17:53] where do people usually build packages? [18:17:54] carbon? [18:18:41] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:19:35] I did it on a fresh labs host previously, but that may not be best [18:22:31] (03CR) 10OliverKeyes: "Sure, if anyone can point me to the relevant lines." [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [18:23:36] 3operations: Identify deficiencies in Nimsoft Cloud User Experience Monitor (formerly WatchMouse) - https://phabricator.wikimedia.org/T85829#955636 (10chasemp) 3NEW [18:23:38] greg-g: I have one more thing... and that's painful [18:23:42] well, a little [18:23:53] 3operations: Identify deficiencies in Nimsoft Cloud User Experience Monitor (formerly WatchMouse) - https://phabricator.wikimedia.org/T85829#955636 (10chasemp) [18:24:20] Apparently we need this: https://gerrit.wikimedia.org/r/182850 due to a configuration change that was SWATted earlier [18:24:39] but reverting the configuration is not really an option as it breaks various things [18:25:24] Tpt said that not many pages are affected, so we could possible defer that into tonight's SWAT [18:25:34] but I didn't plan to be around for that [18:27:51] chasemp: hmm, how do you transfer it from fresh labs host to carbon? 
[18:28:10] I scp'ed back to my host first [18:28:14] laptop [18:28:24] hmm [18:28:34] ideally I imagine we build in prod and use an autodeploy to carbon [18:28:42] but afaik there isn't any logic like taht aroud [18:29:02] oooh, I just need to backport from trusty [18:29:09] just a python package... [18:30:01] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:31:02] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:34:22] bblack, i'm searching through puppet, don't know much about this. [18:34:33] where are the (nginx?) ssl configs for misc-web-lb stuff? [18:35:39] specifically, i'm trying to find out if it sets X-Forwarded-Proto [18:36:19] it does I believe as we check it for SSL redirection with phab [18:36:30] that is exactly why I need it :) [18:37:32] ah ok. hmm, i can do this but I can't until I upgrade the cluster...another blocker :) thanks chasemp [18:38:01] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/phabricator/templates/phabricator-default.conf.erb;54932673a784d86c47ff820233ba70a5cc5e9413$15-17 [18:38:18] ottomata: ^ and np [18:39:29] nice I wondered if this worked [18:39:30] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/phabricator/templates/phabricator-default.conf.erb;HEAD$15-17 [18:39:31] yep [18:41:14] !log imported trusty pyparsing package into precise-wikimedia [18:41:19] Logged the message, Master [18:42:22] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:42:36] (03PS2) 10Yuvipanda: Add report that verifies view definitions [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182848 [18:43:32] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:45:32] ottomata: yeah in general you can assume X-F-P is correct and trustable (we set it in our SSL proxies, and we strip it if it didn't come from us) [18:46:46] (03CR) 10Yuvipanda: "I've verified that labsproxy *does* set XFP" [puppet] - 
10https://gerrit.wikimedia.org/r/181949 (owner: 10Hoo man) [18:46:52] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:48:01] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:50:04] interesting! [18:51:01] ottomata: some caches seems to have stopped reporting to ganglia temporarily today, related to the varnishkafka change? [18:51:09] (03CR) 10Faidon Liambotis: [C: 04-2] "Does Special:BlankPage hit the database? Does it use the parsercache? Does it use memcache? What kind of Cache-Control headers does it sen" [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [18:51:28] godog: shouldn't be. i haven't turned anything off [18:51:58] ottomata: yeah I was surprised too, not even all caches [18:52:09] (03CR) 10OliverKeyes: "Actually, we have, last time I checked, two or three different lists for trusted XFFs ;p." [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [18:52:22] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:53:29] (03PS1) 10Ottomata: Use monitoring::graphite_threshold for varnishkafka delivery error check [puppet] - 10https://gerrit.wikimedia.org/r/182860 [18:53:32] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [18:57:21] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.016 second response time [18:57:51] uh, did I just do that with a really heavy graphite request? [18:58:31] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [18:58:34] ja [18:58:36] ha* [18:59:03] andrewbogott: what's the status with jessie + labs btw? [18:59:12] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [18:59:45] <_joe_> wtp1020 is in a bad state [18:59:52] paravoid: I have an instance that pretty much works, but lvm/jessie/systemd is a mess. I'll forward you the latest. 
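A minimal sketch of the app-layer HTTPS redirect discussed above (mutante's X-Forwarded-Proto check at 17:54:03, and bblack's note at 18:45:32 that the SSL proxies set the header), written as a Python WSGI wrapper rather than the Apache RewriteCond quoted in the channel. The names https_redirect and demo_app are hypothetical, and the sketch assumes the proxy in front always sets X-Forwarded-Proto; if it did not, every request would be redirected.

```python
def https_redirect(app):
    """Wrap a WSGI app: pass https requests through, redirect everything else."""
    def middleware(environ, start_response):
        # The backend only ever speaks plain http, so the original scheme is
        # taken from the header set by the TLS-terminating proxy (assumption).
        proto = environ.get("HTTP_X_FORWARDED_PROTO", "http").lower()
        if proto == "https":
            return app(environ, start_response)
        host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "localhost")
        path = environ.get("PATH_INFO", "/")
        query = environ.get("QUERY_STRING", "")
        location = "https://" + host + path + ("?" + query if query else "")
        start_response("301 Moved Permanently",
                       [("Location", location), ("Content-Length", "0")])
        return [b""]
    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real backend application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]
```

Wrapping the application once, e.g. application = https_redirect(demo_app), keeps the redirect out of Varnish, which matches the preference voiced in the channel for doing it at the app layer.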
[19:00:03] gwicke, cscott: https://phabricator.wikimedia.org/T76115 ? [19:00:36] andrewbogott: awesome! let me know, I'd like to tie up all of the loose ends soon and "announce" it for prod, so maybe it's better we do it for prod & labs at the same time [19:00:42] (03PS2) 10Ottomata: Use monitoring::graphite_threshold for varnishkafka delivery error check [puppet] - 10https://gerrit.wikimedia.org/r/182860 [19:01:02] PROBLEM - NTP on wtp1020 is CRITICAL: NTP CRITICAL: No response from NTP server [19:01:32] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:03:04] paravoid: I'm hoping that cscott reacts soon [19:03:11] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.006 second response time [19:04:18] ah, meeting! [19:12:42] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [19:13:51] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:14:34] !log Made the ruwikinews sites table entry on wikidatawiki use https URLs rather than protocol relative ones [19:14:38] Logged the message, Master [19:14:57] csteipp: ^ Worked around the problem now [19:15:02] but that's nasty, obviously [19:16:39] ah [19:18:14] 3OCG-General-or-Unknown, operations: OCG Queue Length Checks are unclear - https://phabricator.wikimedia.org/T76115#955854 (10GWicke) @cscott: Could you come up with a new reasonable threshold? [19:21:00] subbu: gwicke can you get an incident report out for the parsoid issues on saturday? :) [19:22:36] YuviPanda: I can try to hot-potato-route this to subbu ;) [19:24:00] gwicke, YuviPanda sure .. later this afternoon. [19:24:46] ha! 
seems to have worked ;P [19:27:31] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [19:27:40] 3Ops-Access-Requests: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#955891 (10Ottomata) [19:30:21] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 6.012 second response time [19:30:51] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:34:21] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [19:35:00] (03CR) 10Ori.livneh: "ping @chasemp (re: dzahn's comment)" [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [19:35:26] (03PS2) 10Rush: Revoke Brett Simmer's key [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [19:36:19] (03CR) 10Rush: "heyo! ping worked. So are we absenting his user or jsut removing the key? Description seems like absent is appropriate (forever) vs key" [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [19:36:30] 3Scrum-of-Scrums, RESTBase, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#955918 (10GWicke) @robh: What is the latest ETA for this hardware? 
[19:36:32] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:37:26] (03CR) 10Ori.livneh: "If it's all the same, I'd prefer just revoking the key, because that way, if we run into some super-thorny HHVM problem, we could choose t" [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [19:43:32] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [19:44:29] (03PS1) 10QChris: Follow pagecounts-all-sites move in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/182867 [19:44:57] (03CR) 10Ori.livneh: [C: 031] Use motd::script instead of File across the tree [puppet] - 10https://gerrit.wikimedia.org/r/182376 (owner: 10Faidon Liambotis) [19:47:11] (03CR) 10Ori.livneh: "ignore doesn't require the files to be present, so ignoring '00-header' / '99-footer' on Debian wouldn't be an issue. If it is important t" [puppet] - 10https://gerrit.wikimedia.org/r/182374 (owner: 10Faidon Liambotis) [19:48:23] (03CR) 10Ori.livneh: [C: 031] "Makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/182373 (owner: 10Faidon Liambotis) [19:48:56] (03PS2) 10Ottomata: Follow pagecounts-all-sites move in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/182867 (owner: 10QChris) [19:49:43] (03CR) 10Ottomata: [C: 032] Follow pagecounts-all-sites move in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/182867 (owner: 10QChris) [19:50:21] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [19:54:36] (03CR) 10Faidon Liambotis: "Ignoring 99-footer when it doesn't exist means that we wouldn't display motd.tail on those systems (Ubuntu trusty, Debian). We want to, th" [puppet] - 10https://gerrit.wikimedia.org/r/182374 (owner: 10Faidon Liambotis) [19:56:09] (03CR) 10Ori.livneh: [C: 031] "Nah (re: moving it to base). If uniformity is your goal, then I think this change makes sense." 
[puppet] - 10https://gerrit.wikimedia.org/r/182374 (owner: 10Faidon Liambotis) [19:57:02] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [19:57:34] what's up with the wtp* outages? nothing in the SAL. [19:58:11] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:03:04] What exactly is the difference between marking a user as absent vs. removing them entirely anyway? [20:05:26] (03CR) 10Rush: "I think I'm ok w/ this (as in I'll merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/182347 (owner: 10Ori.livneh) [20:05:50] absent does remove them in the puppet sense [20:06:09] but in teh global wmf environment sense the absenting has to actually be done so we have a "absent" dummy group [20:06:16] that can be applied everywhere and always [20:08:22] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [20:12:23] (03PS2) 10Awight: DO NOT DEPLOY BEFORE https://gerrit.wikimedia.org/r/#/c/182074/ Ugly URLs to override mobile redirect for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [20:14:03] (03CR) 10Awight: [C: 031] "I still don't fully understand how the URL rewrite magic is happening, but this looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [20:15:07] 3Scrum-of-Scrums, operations, RESTBase: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#955983 (10RobH) As Gabriel updates, we just pushed the order for mgmt approval today. We've only recently begun ordering HP systems, but our limited history shows about 3 weeks from order to delivery.... 
[20:15:24] (03PS3) 10Yuvipanda: Add report that verifies view definitions [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182848 [20:15:30] man i hate that [20:15:30] 3Scrum-of-Scrums, operations, RESTBase: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#955984 (10RobH) a:3RobH [20:15:34] i reallyyyyy hate that [20:15:49] im going to end up having to turn off ping for my name. [20:16:03] same [20:16:09] chasemp: cant we disable it? [20:16:14] i mean, this is our channel man [20:16:17] lets just disable it. [20:16:42] I have no knowledge of wikibugs bot, valhallasw`cloud i think is the owner [20:16:42] it's not a phab thing but a labs thing [20:16:49] ahh.. [20:16:50] wikibugs [20:17:08] https://www.mediawiki.org/wiki/wikibugs?redirect=no [20:17:19] robh: you can /ignore it, or we can move all ops- bugs somewhere else [20:17:19] YuviPanda: heeeeeelllpppp [20:17:29] moving them back to mediawiki-dev is unreasonable, I think [20:17:29] can we just not output ops bugs? [20:17:40] yes, we can. [20:17:43] valhallasw`cloud: let’s do #wikimedia-operations-spam? [20:17:45] robh: call beatles... [20:17:54] I say just disable it for TASLS [20:17:55] or something like that [20:17:56] tasks [20:18:03] if/when our git commits are phab, then it can [20:18:05] YuviPanda: why? it's already in -feed anyway [20:18:06] but it does that for gerrit [20:18:13] valhallasw`cloud: but feed is unusablybig [20:18:22] YuviPanda: are you saying you like having the output? [20:18:26] robh: I find the task notifications qutie usefull [20:18:30] you stand alone, much like the cheese! [20:18:42] robh: you can of course also tell your client to not show highlights from wikibugs ;- [20:18:43] ;-) [20:18:45] (im pretty sure the cheese stands alone may be an odd US midwest thing) [20:19:00] valhallasw`cloud: well, chase also hates it [20:19:04] so yuvi is outnumbered now! [20:19:25] damn it we should have had this discussion in ops meeting! 
[20:19:49] robh: I wanted to bring it here but I didn’t do it myself but was gently surprised one morning by valhallasw`cloud I think :) [20:19:59] it’s also much less spammy than icinga-wm [20:20:09] well, i'd like it gone and i think me simply blocking it isnt ideal [20:20:21] as it may have valid info for use across channels [20:20:28] I suggest you fight on it, then YuviPanda submits a patch and we deploy that :-) [20:20:38] well, im about to give up and block it [20:20:58] and done. trying to get folks to change things back to the way it was before is too much work [20:21:04] chasemp: we lose. [20:21:06] there's also a /dev/null option which sends ops bugs to -feed (everything goes to feed anyways) but no where else. [20:21:31] ...bah, where the hell does one block someone in limechat. [20:21:42] valhallasw`cloud: who asked for this change? [20:21:57] robh: /ignore [20:22:17] robh: I did. You ops people were spamming mediawiki-dev with irrelevant changes ;-) [20:22:29] so why cant it just be turned off? [20:22:36] we didnt ask for it, simply dont feed its spam into channel [20:23:09] robh: as I've mentioned a gazillion times, IT CAN, but I see one person saying 'keep it' and one person 'kill it' [20:23:34] the status quo was it didnt exist [20:23:39] you guys changed it so it does [20:23:46] id say it defaults back to not existing in such a case [20:23:55] robh: well, we didn’t really have ops things on bugzilla before, and RT didn’t make this possible... [20:24:13] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:24:18] if its simply 'do whatever the hell you want' i could just kickban the wikibugs =p [20:24:21] I *do* find it useful - I’ve often jumped on unrelated tickets because I spend a lot of time on IRC and see something pass by [20:24:30] robh: you can just ignore it :) [20:24:31] adding more spam to the channel isnt ideal. 
[20:24:39] you could just use email instead of irc [20:24:40] robh: you don’t see it, I do. win-win :) [20:25:18] thats the same arguement for why no one tackles cron spam [20:25:26] 'if you auto delete it its not a problem' [20:25:43] its adding to the signal noise ratio, its just a matter of view on which side [20:25:50] on which we wont agree [20:25:54] and val agrees with you [20:25:57] so i lose ;p [20:26:02] YAY! [20:26:04] :D [20:26:24] I would remove it too if we are counting votes :) [20:26:26] but meh [20:27:06] valhallasw`cloud: so now 2 to 1 [20:27:31] robh: is there a good way to ignore it in limechat? [20:27:39] chasemp: /ignore wikibugs [20:27:41] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [20:27:56] YuviPanda: that also ignores it in other channels, which is less-than-ideal [20:28:09] that’s truel [20:28:13] my irssi solution is to 'semi-ignore' bots: they don't highlight me, and they appear in gray [20:29:17] I think limechat sucks in this regard [20:30:06] textual can ignore per-channel [20:31:04] so i dont want to ignore it across all channels [20:31:15] i just dont want it spamming this channel. [20:31:21] its 2 ops to 1 to disable [20:31:25] so... [20:31:42] https://gerrit.wikimedia.org/r/182880 moves it to a different channel [20:32:31] although, I’d still like it here, and honestly think being passively aware of things going on in the #operations project is useful. [20:32:43] thats what the firehose of email is for! [20:32:56] I don’t want a firehose of emails. IRC is realtime, email is not. [20:33:11] Isn't email watching all operations tasks a highly restricted thing? [20:33:21] but, this has taken up enough time :) [20:33:22] * YuviPanda lets it go [20:33:53] robh: by that we should also move icinga-wm out of here. [20:34:12] it was discussed and decided that since its directly merging that it would stay [20:34:23] where every single task update on every single ops ticket would be a LOT more than every single patchset. 
[20:34:36] and if most of ops agrees with you, cool, i lose [20:34:44] but if you wanna change status quo, you start the ops list email ;P [20:34:45] well, let me email ops@ [20:34:51] (see my email about local uploads!) [20:35:02] YuviPanda: i apologize, i should have started it there [20:35:04] robh: can you poke the ops list about that new IRC channel ? :-] [20:35:07] rather than complaining bitterly [20:35:26] i realize i was a bit too confrontrational possibly [20:35:29] =P [20:35:37] possibly. [20:35:54] well, like as i sit here at my table, i wasnt raising my voice or typing hard or anythign [20:36:01] but it dosnt convey to irc like that. [20:36:13] "I hate you and your face and your pants" - robh 2015 [20:36:23] AND YOUR FUCKING SHOES [20:36:24] joke’s on robh, I’m not wearing pants. [20:36:26] =] [20:36:36] YuviPanda: thats not a safe thing man [20:36:42] thats the downhill slope of remote work [20:36:51] keep on pants and regularly bathe or all is lost. [20:36:53] robh: it’s too hot to wear pants. 26C [20:37:09] shorts = pants [20:37:19] noooooooooooo [20:37:24] "lower body out garments" [20:37:30] outer even [20:37:30] rephase [20:37:42] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:37:51] when you dont change out of what you slept in for the past week, your remote work schedule has gone too far [20:38:08] robh: My biggest problem is that I go to sleep usually at 8AM... [20:38:31] PROBLEM - Varnishkafka Delivery Errors on cp3016 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 702.108337 [20:38:41] "For the purposes of this discussion Lower Body Outer Garments (herefore referred to as Pants) will be assumed to be garments rendered from perishable materials worn on the lower portion of the body assuming gravitational and orientation constants" [20:38:46] i'd need blackout curtains and earplugs to sleep at 8am. [20:39:02] hrmm [20:39:13] chasemp: do polyster pants count? 
[20:39:15] are they perishable? [20:39:19] hmm, probably [20:39:24] let me check with my counsel [20:39:24] i wanna argue against nonperishable pants but i dont think i can [20:39:31] i mean for them [20:39:37] but on a cosmic timeline, its all perishable... [20:40:21] heh [20:40:40] that is the legalese way [20:40:59] their secret is, it's always ambiguous [20:41:10] ‘are these pants perishable?’ ‘maybe' [20:41:12] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [20:41:22] ‘is SSH on wtp1020 ok?’ ‘maybe' [20:41:32] RECOVERY - Varnishkafka Delivery Errors on cp3016 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:41:46] 3Parsoid, operations: Parsoid should use SO_REUSEADDR when it binds to its port - https://phabricator.wikimedia.org/T75395#956056 (10GWicke) >>! In T75395#834249, @GWicke wrote: > @yuvipanda: We don't have root on those boxes, so that would need to be somebody in ops. See also: https://gerrit.wikimedia.org/r/#/... [20:41:52] YuviPanda: server answer: leave me alone! [20:43:26] jgage: regarding ipsec ... [20:43:45] isn't having site-to-site vpn enough ? [20:44:07] (03CR) 10coren: [C: 031] "The sticky makes it reasonable." 
[puppet] - 10https://gerrit.wikimedia.org/r/173511 (owner: 10Krinkle) [20:44:32] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:47:52] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [20:48:47] 3ops-codfw: where to put the netapp (nas1) in codfw - https://phabricator.wikimedia.org/T84796#956088 (10Papaul) All Nas servers moved to storage [20:50:36] * valhallasw`cloud prods wikibugs [20:50:53] oh, I pulled before the change was actually merged [20:50:58] should be fixed now [20:52:21] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [20:53:31] PROBLEM - Varnishkafka Delivery Errors on cp3004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 892.233337 [20:56:41] RECOVERY - Varnishkafka Delivery Errors on cp3004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:57:34] (03CR) 10coren: [C: 031] "She works" [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) (owner: 10Krinkle) [20:57:49] (03PS8) 10Krinkle: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [20:59:02] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [21:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150105T2100). [21:00:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [21:05:31] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:10:15] YuviPanda: https://gerrit.wikimedia.org/r/182173 ? 
:) [21:12:41] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:14:04] (03CR) 10Yuvipanda: [C: 031] Don't use logrotate for the wikidata dump logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [21:14:25] hoo: one minor nit, but seems ok. I can merge now, but I guess it’ll be better for both of us if I merge tomorrow so we have some time to debug?p [21:15:45] YuviPanda: How would that work with find? [21:16:01] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [21:17:18] We could work on the actual last modified time of the files [21:17:29] hoo: you could do something like ‘delete everything that hasn’t been modified in about 1 month' [21:17:30] yeah [21:17:39] I like that better than a loop + ls [21:17:59] mh... I really like my "just keep $n" thing [21:18:20] hoo: well, that’s fine too :) I did give it a +1. [21:20:33] does anyone know why wtp1020 is down? /cc gwicke [21:20:42] https://ganglia.wikimedia.org/latest/?c=Parsoid%20eqiad&h=wtp1020.eqiad.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2 [21:21:03] i was about to sync new code for deploy and saw that [21:23:04] YuviPanda, _joe_ ^ [21:26:06] (03CR) 10Legoktm: [C: 04-1] Make table columns dicts instead of lists (031 comment) [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182801 (owner: 10Yuvipanda) [21:26:46] (03CR) 10Legoktm: [C: 031] Add base table name to indexed views [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182799 (owner: 10Yuvipanda) [21:26:58] YuviPanda: this code needs tests :| [21:27:24] legoktm: yeah, and flake/tox [21:27:31] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:27:32] PROBLEM - Varnishkafka Delivery Errors on cp3015 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 158.375 [21:27:32] subbu: hmm, not sure. 
am about to go off now, though :( [21:27:51] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 51.083332 [21:27:51] PROBLEM - Varnishkafka Delivery Errors on cp3004 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 663.041687 [21:27:58] YuviPanda, ok. [21:28:17] (03CR) 10Awight: [C: 031] Add comment about not redirecting ugly URLs [puppet] - 10https://gerrit.wikimedia.org/r/182141 (owner: 10AndyRussG) [21:28:21] legoktm: not sure how testable that code is, though. it mostly runs queries against things... [21:28:31] any other ops around? [21:28:58] (03CR) 10Awight: [C: 031] Remove unused mobile CentralNotice URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182132 (owner: 10AndyRussG) [21:29:03] (03CR) 10Legoktm: [C: 031] "I only looked at the greylisted.yaml changes." [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182848 (owner: 10Yuvipanda) [21:29:57] subbu: I wasn't able to log in earlier [21:30:05] possibly a full disk [21:30:15] I think roots should still be able to log in [21:30:41] RECOVERY - Varnishkafka Delivery Errors on cp3015 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:31:02] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:31:02] RECOVERY - Varnishkafka Delivery Errors on cp3004 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:32:36] greg-g, ^ need some help. 
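Editor's sketch of the two log-cleanup approaches weighed above for the wikidata dump logs (deleting files not modified in about a month, versus hoo's "just keep $n"). This is a minimal Python illustration, not the code under review in the Gerrit change; the function names, the directory argument, and the 30-day figure are assumptions made for the example.

```python
import time
from pathlib import Path


def prune_older_than(directory, max_age_days, now=None):
    """Delete regular files whose mtime is older than max_age_days.

    Returns the names of the removed files in sorted order.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


def prune_keep_newest(directory, keep):
    """Keep the `keep` most recently modified files and delete the rest.

    Returns the names of the removed files in sorted order.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    removed = []
    for path in files[keep:]:
        path.unlink()
        removed.append(path.name)
    return sorted(removed)
```

The age-based variant stops deleting once dumps stop being produced, while the keep-newest variant always bounds the file count; which trade-off fits is the judgment call made in the review.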
[21:33:28] bblack: since you're chatting over in that other channel, I'm pinging you re subbu's request for help [21:33:43] (03PS2) 10Yuvipanda: Fix whitelist value being ignored when reading greylisted.yaml [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182847 [21:33:45] (03PS3) 10Yuvipanda: Make table columns dicts instead of lists [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182801 [21:33:47] (03PS4) 10Yuvipanda: Add report that verifies view definitions [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182848 [21:34:00] gwicke: subbu I can’t log in as root either. [21:34:04] ssh doesn’t respond [21:34:45] network has been completely dead for a few hours now [21:35:02] PROBLEM - Varnishkafka Delivery Errors on cp3018 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 373.016663 [21:35:12] PROBLEM - SSH on wtp1020 is CRITICAL: Server answer: [21:35:27] pong, looking at wtp1020 [21:35:42] bblack, thanks [21:36:24] sweet :) [21:37:18] looks to me like a runaway process has basically hosed the machine, imho [21:37:26] I can't yet get in to see, but the graphs read like that [21:37:35] stuck on disk i/o at that [21:37:43] (03CR) 10Hashar: [C: 031] Move composer.json into repository root [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180589 (owner: 10Legoktm) [21:38:04] ~20 mins ago? [21:38:11] RECOVERY - Varnishkafka Delivery Errors on cp3018 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:38:18] is there some obvious correlated action there? did someone deploy/run something there 20 mins ago? [21:38:32] https://ganglia.wikimedia.org/latest/?c=Parsoid%20eqiad&h=wtp1020.eqiad.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2 .. seems longer than that. [21:38:51] bblack, and no deploys .. right now is our deploy window. [21:39:15] oh yeah I'm looking at the wrong graph. 
all the rest of the above applies, but it was more like 4 hours ago [21:39:22] (well right graph, but wrong timescale) [21:40:10] could be a disk / filesystem failure as well [21:40:26] !log hard rebooted wtp1020, unresponsive in every way [21:40:34] Logged the message, Master [21:42:52] PROBLEM - Host wtp1020 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:22] RECOVERY - SSH on wtp1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:43:34] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1015.68335 [21:43:34] RECOVERY - Host wtp1020 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [21:43:52] RECOVERY - parsoid disk space on wtp1020 is OK: DISK OK [21:43:52] RECOVERY - configured eth on wtp1020 is OK: NRPE: Unable to read output [21:43:52] RECOVERY - DPKG on wtp1020 is OK: All packages OK [21:44:02] RECOVERY - Disk space on wtp1020 is OK: DISK OK [21:44:03] RECOVERY - RAID on wtp1020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:44:22] RECOVERY - dhclient process on wtp1020 is OK: PROCS OK: 0 processes with command name dhclient [21:44:22] RECOVERY - salt-minion processes on wtp1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:44:31] some strange messages during startup. it's back online for now, but I'm still looking at the node and there's a good chance there's a hardware issue. [21:44:38] (03CR) 10Legoktm: [C: 031] Make table columns dicts instead of lists [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/182801 (owner: 10Yuvipanda) [21:44:58] bblack, should i go ahead with our scheduled deploy? [21:46:01] RECOVERY - Parsoid on wtp1020 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.030 second response time [21:46:08] if you have time, give me like 10 minutes to look around [21:46:12] sure. 
[21:46:41] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:49:46] seems wtp1020 and others were reinstalled to trusty on Nov 4 [21:50:06] there might be something odd/wrong about the new install, but it's hard to put a finger on exactly what yet [21:50:24] subbu: in any case, go ahead for now, I'll just keep looking at it on the side [21:50:42] yes, all nodes were upgraded then. [21:50:47] k. thx. [21:50:54] there are many strange kernel errors related to storage, pci hardware, IRQs. like bad bios settings and/or kernel bugs and/or who knows what :) [21:51:12] but it does seem to be functional right now, and was up until 4 hours ago [21:53:22] ^d: about? [21:55:44] <^d> chasemp: For about the next 5 minutes, sup? [21:55:57] see -devtools twentyafterfour asked the question already :) [21:56:02] need force push on a repo to migrate history [21:56:07] if you are in a position to do it [21:56:18] <^d> ack'd and done/. [21:56:23] thanks! [21:56:27] <^d> np. [21:57:01] (03PS1) 10Ori.livneh: Update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/182898 [21:57:14] other wtp1020 hints: [21:57:15] Jan 3 16:33:54 wtp1020 kernel: [5190073.282954] salt-minion invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [21:57:22] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:53] ^ something was salted out about 5 hours before now, yesterday, which ate up all memory? [21:58:30] not long after: [21:58:30] Jan 3 17:02:33 wtp1020 kernel: [5191791.308041] nodejs invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [21:59:09] nodejs oomkills seem pretty common in its logs, actually.
probably app-level memleak [21:59:35] also, nodejs is buggy in general, there are several: [21:59:36] Dec 28 12:45:51 wtp1020 kernel: [4657578.661161] nodejs[26851]: segfault at 2e7929b00058 ip 00007fb978b765f8 sp 00007fff90200980 error 4 in libv8.so.3.14.5[7fb9789e1000+3de000] [22:00:12] and then for good measure, there's also these questionable hardware-related messages on this box as well, e.g.: [22:00:15] Dec 28 20:59:00 wtp1020 kernel: [4687190.192151] do_IRQ: 0.158 No irq handler for vector (irq -1) [22:00:35] [ 19.712962] mei_me 0000:00:16.0: initialization failed. [22:00:51] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.072 second response time [22:01:04] and way back during the grub bootup, there was this as well: [22:01:05] error: diskfilter writes are not supported. [22:01:25] ^ which seems to be related to problems with grub and how the boot partition is set up wrt to raid / lvm / md [22:01:42] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:21] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:39] Is there a way to configure wgDebugLogGroups to log only user creation and know what is the ip address of the originating visitor? [22:03:11] bblack, we had a bad page that sent parsoid into high cpu load .. on saturday. [22:03:21] so, some of those from jan3 might be from that incident. [22:03:26] ok [22:04:52] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:04:53] bblack, if wtp1017, 1019, 1023 don't recover on their own, they might need a nodejs kill and service restart [22:05:06] looks 1008 did come back up. [22:05:44] we have a 5 min. timeout .. but, it is probably close to that now. 
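Editor's sketch: the claim above that nodejs oom-kills are "pretty common" in wtp1020's logs can be checked by tallying which process triggered each `invoked oom-killer` kernel message. The Python below is an illustrative assumption, not a tool used in the channel; the regex is modelled only on the two log lines quoted above, and `count_oom_invocations` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: [5190073.282954] salt-minion invoked oom-killer: gfp_mask=..."
# The pattern is inferred from the lines quoted in the log above.
OOM_RE = re.compile(r"kernel: \[\s*[\d.]+\] (?P<proc>\S+) invoked oom-killer")


def count_oom_invocations(lines):
    """Count oom-killer invocations per triggering process name."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts
```

Feeding it the lines of a kernel log (for example via `open(path)`) gives a per-process tally. Note that the process that invoked the oom-killer is not necessarily the one consuming the memory, so the counts are a hint rather than a diagnosis.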
[22:05:51] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [22:06:53] !log deployed parsoid version 0e2997d2 [22:06:59] Logged the message, Master [22:07:00] subbu: looking at 1017, the machine seems overall healthy, but there are several nodejs procs running 100% cpu [22:07:30] I take that back, just one of them is consistently locking up a CPU [22:07:47] mutante: did you ever hear anything back about the .wiki domain names? [22:07:53] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND [22:07:56] 23513 parsoid 20 0 2585284 1.562g 6220 R 99.9 5.0 64:28.06 nodejs [22:08:01] yes .. it should timeout and get restarted on its own .. we still have page titles we don't handle well [22:08:02] sucking up insane memory too... [22:08:10] 1.5GB resident [22:08:54] this looks exactly like the pattern that probably killed the other node [22:08:59] it's almost out of free/cache memory now [22:09:24] should I just kill the runaway nodejs? [22:09:39] yes please. [22:10:21] (side note: server.js needs SIGTERM/INT handlers?) [22:10:29] bblack, maybe even sudo killall -9 nodejs; sudo service parsoid start on wtp1017 and wtp1019 [22:10:32] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [22:10:39] ah, 1019 recovered [22:10:40] I think someone mentioned that in the ops meeting, that you guys were already looking at that [22:10:57] we'll take a look. [22:11:11] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [22:11:25] * subbu will test the deploy now [22:12:10] kill on 1023 as well [22:12:15] *killed [22:13:05] in both cases, there were still parsoid-related nodejs processing running after the kills [22:13:12] do I actually need to explicitly start? [22:13:26] looks like they are all up. [22:13:33] i guess upstart takes care of it. 
[22:13:38] it kinda looks like they already had new services yeah [22:14:46] all looking good .. edit tests are looking clean [22:14:47] thanks bblack [22:14:51] what happened with esams? [22:15:44] did something happen with esams? [22:15:55] yeah [22:17:01] (03PS1) 10Rush: phab adding security extension [puppet] - 10https://gerrit.wikimedia.org/r/182934 [22:17:24] (03PS1) 10Catrope: Unbreak Parsoid URL config in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182935 [22:20:32] (03Abandoned) 10Hashar: Fix dependencies for tox 'cover' env [debs/pybal] - 10https://gerrit.wikimedia.org/r/173914 (owner: 10Hashar) [22:20:43] (03CR) 10Cmcmahon: [C: 031] "It would be nice to prevent this from happening also" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182935 (owner: 10Catrope) [22:22:37] (03CR) 1020after4: [C: 031] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/182934 (owner: 10Rush) [22:23:35] (03CR) 10Rush: [C: 032] phab adding security extension [puppet] - 10https://gerrit.wikimedia.org/r/182934 (owner: 10Rush) [22:23:59] gwicke, subbu: is a nodejs version update on the horizon, by any chance? [22:24:23] i'd be interested in having 0.11.13 or newer, for [22:25:17] we just upgraded to 0.10 .. we should play around and run tests with the newer version and see .. i think we were holding out for 0.12 at one point. [22:25:59] i am heading to a coffee shop and escaping my cold house ... back in 15. [22:27:08] ori: as I said a while ago, those flames aren't actually as useful as V8 profiler info for locating JS code hotspots [22:27:22] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: puppet fail [22:27:57] gwicke: do you have the output of your tests anywhere?
[22:28:16] ori: just try perf with v8 [22:28:20] you'll see what I mean [22:28:51] (you also see it in the netflix post if you look closely) [22:29:05] (03PS1) 10Rush: phab libext lock files per extension repo [puppet] - 10https://gerrit.wikimedia.org/r/182938 [22:30:04] (03CR) 1020after4: [C: 031] phab libext lock files per extension repo [puppet] - 10https://gerrit.wikimedia.org/r/182938 (owner: 10Rush) [22:30:51] (03CR) 10Rush: [C: 032] phab libext lock files per extension repo [puppet] - 10https://gerrit.wikimedia.org/r/182938 (owner: 10Rush) [22:34:46] (03PS1) 10Rush: phab change lockfile name on git::install [puppet] - 10https://gerrit.wikimedia.org/r/182941 [22:36:02] (03CR) 10Rush: [C: 032] "hopefully this is the last one" [puppet] - 10https://gerrit.wikimedia.org/r/182941 (owner: 10Rush) [22:37:20] uh yeah phab just went down [22:37:23] but I'm on it [22:38:40] i already miss my pretty dashboard and feel lost without it. [22:38:51] yay back \o/ [22:38:54] for now [22:38:57] temp fixed it [22:39:01] a lib reference has wrong path [22:49:14] (03PS1) 10Rush: phab load-libraries is meant to be an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/182945 [22:52:21] (03PS2) 10Rush: phab load-libraries is meant to be an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/182945 [22:53:22] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:55:19] (03CR) 1020after4: [C: 031] phab load-libraries is meant to be an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/182945 (owner: 10Rush) [22:55:28] (03CR) 10Rush: [C: 032] phab load-libraries is meant to be an array of strings [puppet] - 10https://gerrit.wikimedia.org/r/182945 (owner: 10Rush) [22:57:32] phab is 504'ing for me [22:57:53] chasemp: ^ [22:58:10] at this moment or a bit ago? [22:58:15] seems ok now [22:58:23] chasemp: now [22:58:31] cached bad state? 
[22:58:33] wfm now [22:58:40] does anyone else see a phab error? [22:58:56] I did, when you noticed it chasemp, but not since you hot fixed it [22:58:59] i was getting phab exceptions earlier, then everything was OK, now i'm getting 504 errors from nginx [22:59:14] nginx is the public proxy [22:59:21] aye [22:59:22] so it's possible it cached an error page? [22:59:30] though I thought it didn't do that in this case [22:59:37] kind of mixed up in fixing things atm [22:59:41] hmm [22:59:44] ah looks OK now [22:59:47] will check this when I put things in a reasonable state [22:59:54] chasemp: wfm now [23:01:51] (03PS3) 10Rush: Add the SecurityPolicyEventListener to phabricator config [puppet] - 10https://gerrit.wikimedia.org/r/182380 (owner: 1020after4) [23:01:59] (03CR) 10Rush: [C: 032 V: 032] Add the SecurityPolicyEventListener to phabricator config [puppet] - 10https://gerrit.wikimedia.org/r/182380 (owner: 1020after4) [23:05:15] argh [23:05:40] chasemp i'm still getting the 504 error when trying to hit https://phabricator.wikimedia.org/tag/team-practices/board/ [23:05:57] and it actually seems to time out (eg it seems to take forever to load before displaying the error) [23:06:07] k [23:06:42] chasemp: but accessing the board from this URL works fine: https://phabricator.wikimedia.org/project/board/56/ [23:07:13] that smells like that sprint extension bug [23:07:35] https://phabricator.wikimedia.org/T78208 [23:07:38] awjr: ^ [23:08:50] huh that does sound similar (although i didn't see a fancy stack trace, just a 504 error from nginx) [23:10:32] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [23:18:27] (03PS1) 10Ori.livneh: Re-enable xhprof for single-request profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182951 [23:19:42] AaronSchulz: ^ [23:20:36] greg-g: ping [23:20:36] gwicke: You sent me a contentless ping. This is a contentless pong.
Please provide a bit of information about what you want and I will respond when I am around. [23:20:43] gwicke: yo [23:20:46] hey [23:21:02] I was just wondering whether incident reports still go to wikitech [23:21:08] I saw that you started some on phab [23:21:30] the report on wikitech, the bugs in phab [23:21:31] but there doesn't seem to be one tag for them [23:21:48] no, and that decision might have been over kill (one project per report) [23:22:37] what's the rationale for continuing the reports on the wiki? [23:22:59] Tradition! :p [23:23:05] prose [23:23:07] hehe [23:23:36] though, I guess if there's a project per incident, then the project page could include the prose in the description, but that seems... awkward [23:23:47] it would be easier to follow all incidents if there was a single project for them [23:23:58] you could also do a master/tracking bug, but an incident report isn't ever "resolved" really [23:24:42] the follow-up can be resolved though [23:24:52] I'm fine with reverting my decision of OPPI (one project per incident), but not sure about not using wikitech for the prose/graphs/etc [23:25:01] yeah, the bug reports/action items yeah [23:25:15] but the "this is what happened, here are the graphs/logs" isn't a bug/task [23:25:41] we use phab for designs etc as well [23:25:46] where? [23:26:01] example: https://phabricator.wikimedia.org/T75955 [23:26:23] will that ever be closed? 
[23:26:38] yes, it will be once implemented [23:26:44] (actually pretty close to that) [23:26:49] incident reports (the prose) won't be :) [23:27:03] but, anywho, gotta run, bbiab, massage time :) [23:27:13] kk [23:27:32] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:30:05] (03PS2) 10Ori.livneh: Re-enable xhprof for single-request profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182951 [23:31:32] yes again please hold [23:31:38] (03CR) 10Aaron Schulz: [C: 031] Re-enable xhprof for single-request profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182951 (owner: 10Ori.livneh) [23:32:34] phab looks broken [23:33:25] /cc chasemp [23:33:32] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 473 bytes in 0.313 second response time [23:33:42] * bd808 came here to see if that was a known thing [23:33:56] [Core Exception/PhutilBootloaderException] Include of '/srv/phab/libext/security/__phutil_library_init__.php' failed! 
[23:33:59] yes looking [23:34:03] it's all kinds of weird [23:35:51] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 16512 bytes in 0.325 second response time [23:40:30] (03PS1) 10Rush: phab src suffix for security library [puppet] - 10https://gerrit.wikimedia.org/r/182954 [23:42:25] (03CR) 10Rush: [C: 032] phab src suffix for security library [puppet] - 10https://gerrit.wikimedia.org/r/182954 (owner: 10Rush) [23:43:22] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: Puppet has 1 failures [23:44:32] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:55:44] (03CR) 10Ori.livneh: [C: 032] Re-enable xhprof for single-request profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182951 (owner: 10Ori.livneh) [23:55:57] (03Merged) 10jenkins-bot: Re-enable xhprof for single-request profiling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182951 (owner: 10Ori.livneh)