[00:01:58] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [00:02:30] uh oh [00:04:31] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time [00:09:47] is anyone looking at tungsten? [00:13:49] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [00:15:06] (03CR) 10Dzahn: "superseded by Ie790fd2e3b607e92 , already does the same thing but in phab module instead of here" [puppet] - 10https://gerrit.wikimedia.org/r/169303 (owner: 10Dzahn) [00:15:13] (03Abandoned) 10Dzahn: add a phabricator check to LVS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/169303 (owner: 10Dzahn) [00:17:48] (03PS1) 10Ori.livneh: hhvm: make HHVM's working directory be /var/tmp/hhvm [puppet] - 10https://gerrit.wikimedia.org/r/169630 [00:18:37] (03Abandoned) 10Ori.livneh: hhvm: make HHVM's working directory be /var/tmp/hhvm [puppet] - 10https://gerrit.wikimedia.org/r/169627 (owner: 10Ori.livneh) [00:18:59] (03CR) 10Ori.livneh: [C: 032] hhvm: make HHVM's working directory be /var/tmp/hhvm [puppet] - 10https://gerrit.wikimedia.org/r/169630 (owner: 10Ori.livneh) [00:32:28] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: puppet fail [00:38:18] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: puppet fail [00:43:18] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: puppet fail [00:50:03] puppet failure on hhvms is me, fixing [00:51:28] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [00:51:59] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 351 seconds [00:52:08] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 356 seconds [00:52:18] (03PS1) 10Ori.livneh: Fix-up for I777ae49ca [puppet] - 10https://gerrit.wikimedia.org/r/169635 [00:52:49] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:00] (03PS2) 10Ori.livneh: Fix-up for I777ae49ca [puppet] - 10https://gerrit.wikimedia.org/r/169635 [00:53:01] dpkg-dev? [00:53:02] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [00:53:02] for hhvm? [00:53:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I777ae49ca [puppet] - 10https://gerrit.wikimedia.org/r/169635 (owner: 10Ori.livneh) [00:53:24] also, there's a ori dotfile change there [00:53:41] paravoid: the latter explained in the commit message [00:53:56] no please don't do that [00:54:04] bundle completely unrelated changes like that [00:54:08] the former: to unpack the source package. i'd prefer it if there was an hhvm-src package instead (i requested it in the RT ticket) but haven't seen action on that [00:54:22] nah, -src packages are a bad idea [00:54:24] uncommon too [00:54:31] because they are a bad idea :) [00:54:50] why are they a bad idea? because there's already such a thing as a source package? 
[00:54:58] that for starters [00:55:03] they're hard to build as well [00:55:20] .deb is not the right format for hhvm's source either [00:55:48] linux-source-* for example does exist [00:56:01] for people that want to build their own kernel [00:56:07] but ships a .tar.gz [00:56:40] i want the source so i can see context in gdb [00:56:50] I know [00:57:18] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [00:58:33] apt-get source is fine I think [00:58:43] (with dpkg-dev, granted) [00:59:30] the only downside to that is that we ensure => present on the HHVM package, so we can manage when package upgrades happen [00:59:37] which means that you can't make the exec subscribe to the package [00:59:47] why do you need to puppetize it? [01:00:06] puppetize what? the source package? [01:00:18] yes [01:00:28] also -dbg packages, but let's say I can see that [01:00:30] why not? it should be available on each server [01:00:35] the new hhvm-dbg is 300M btw [01:00:40] why? [01:01:12] (03PS1) 10Dzahn: phab - remove duplicate check command [puppet] - 10https://gerrit.wikimedia.org/r/169637 [01:01:24] because we're not always immediately able to perceive a problem generically, and reproduce it on separate environments. initial investigation is often tied to the specific servers on which the problem initially manifests. the current leak is a good example. [01:01:55] for most issues you can just take a corefile [01:01:58] (03CR) 10Dzahn: [C: 032] phab - remove duplicate check command [puppet] - 10https://gerrit.wikimedia.org/r/169637 (owner: 10Dzahn) [01:02:13] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:02:17] or isolate it to one server you're debugging in [01:02:33] but why start with a default attitude of not having debugging tools handy? [01:02:55] well I didn't say to debugging tools [01:03:01] no* [01:03:28] but 300M hhvm-dbg, hhvm source extracted..., we're kinda overdoing it [01:03:58] what is your criterion for overdoing, here? [01:04:36] Filesystem Size Used Avail Use% Mounted on [01:04:36] /dev/sda1 211G 26G 175G 13% / [01:04:40] I don't think we have /any/ other package in production weighting more than a few dozen MB; maybe hadoop [01:04:51] so? [01:05:00] similarly, no source for any other software of ours, not even PHP [01:05:09] that's unfortunate, in the case of php [01:05:10] and possibly not even mediawiki at some point in the future, aiui :) [01:05:57] tim equipped the app servers with some helpful tools for debugging PHP [01:06:17] I'm aware [01:06:20] and we lost them mostly because the people migrating the setup weren't aware of them or weren't sure how to port them [01:07:05] how do you envision updating HHVM to new versions? [01:07:31] major or minor releases? [01:07:40] minor sounds fine [01:08:06] the workflow giuseppe and i have adopted is basically this: [01:08:30] first, the package is just scp'd to osmium for some initial testing / sanity checks [01:08:41] if it looks good, giuseppe uploads it to apt, and we upgrade labs [01:08:51] (03PS3) 10Dzahn: RT - puppetize /etc/aliases for phab redirects [puppet] - 10https://gerrit.wikimedia.org/r/168733 [01:09:05] we watch it for a while, depending on how big the delta is [01:09:26] then apply it to prod gradually with salt [01:09:37] ok [01:10:05] are you worried that having all servers retrieve a 300mb package all at once would overwhelm the apt server? 
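The "apt-get source" route that ori and paravoid converge on above would look roughly like the sketch below on an app server. It is only an illustration: the package version directory, the presence of a deb-src entry for the repository carrying the hhvm package, and the attach-to-running-process step are assumptions, not taken from the actual hosts.

    # assumes a deb-src line for the repo that ships the hhvm package
    sudo apt-get install dpkg-dev             # provides dpkg-source, used to unpack
    apt-get source hhvm                       # fetch and unpack the source package
    # point gdb at the unpacked tree so backtraces show surrounding source lines
    gdb --directory="$PWD/hhvm-3.3.0" -p "$(pidof -s hhvm)"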
[01:10:10] we wouldn't be doing it all at once, anyway [01:11:10] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%): [01:11:12] among other things [01:11:18] dunno, it's not exactly great [01:11:24] I don't mind it all that much [01:11:56] feel like glancing at my latest change up there? [01:12:03] * ori looks [01:12:10] but I also don't feel very motivated to jump through extra hoops (such as keeping the source installed and in sync with binaries) [01:12:33] btw, debugging RelWithDeb was broken upstream before my patch [01:12:37] I wonder how facebook deals with this [01:13:53] (03CR) 10Ori.livneh: "This file would be extremely easy to template, allowing the 'includer' of the class to pass in aliases as a hash parameter." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168733 (owner: 10Dzahn) [01:14:53] * ori doesn't know [01:15:05] hopes that the RT->phab migration is a singular event though [01:15:28] mutante: oh, is the RT module going to be nuked afterwards? [01:15:48] in that case, I think it's OK not to bother templating it, but you should still specify u/g/o for the file resource [01:15:52] i'm not sure when, first i will become read-only [01:16:05] yes, ok [01:17:43] (03PS4) 10Dzahn: RT - puppetize /etc/aliases for phab redirects [puppet] - 10https://gerrit.wikimedia.org/r/168733 [01:18:38] mutante: 0444, by convention :P [01:19:45] well, i'm doing what i see on the server, in order to not introduce any change :p [01:20:20] (03PS5) 10Dzahn: RT - puppetize /etc/aliases for phab redirects [puppet] - 10https://gerrit.wikimedia.org/r/168733 [01:20:38] arr, say it.. it's missing a "this file managed by puppet" line :) [01:21:21] (03PS6) 10Dzahn: RT - puppetize /etc/aliases for phab redirects [puppet] - 10https://gerrit.wikimedia.org/r/168733 [01:24:56] (03CR) 10Ori.livneh: [C: 031] RT - puppetize /etc/aliases for phab redirects [puppet] - 10https://gerrit.wikimedia.org/r/168733 (owner: 10Dzahn) [01:26:06] mutante: actually, paravoid weaned me off of those ('managed by puppet') [01:26:11] because it can actually become untrue [01:26:16] in which case it is actively misleading [01:26:39] i think you actually debugged one such case a while ago, an apache ports.conf that said it was managed by puppet but wasn't [01:26:59] i think "This file was provisioned by Puppet" may be a good compromise [01:27:16] there's also "mailalias" btw [01:28:34] mailalias { 'ops-request': ensure => present, recipient => 'ops-requests', } [01:28:41] oh! [01:28:50] :) [01:29:02] it won't automatically purge unmanaged aliases [01:29:05] well then .. thank you :) i'll check it out [01:29:08] but it might be suitable for your use [01:29:36] paravoid: were you still undecided about https://gerrit.wikimedia.org/r/#/c/167020/ , btw? [01:30:02] Hi... Is there some way for one wiki to talk to another server-side? I want to write PHP code that talks to Meta 8p [01:30:06] actually i think that is exactly what i want it to do .. NOT touch unmanaged things, but append a few extra ones to the file [01:30:20] * AndyRussG rolls eyes and looks innocent [01:30:28] ori: I thought we agreed to do it in a stage? [01:30:46] or am I confusing it with something else, I don't remember... 
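Back on the HHVM upgrade workflow ori described above ("apply it to prod gradually with salt"), the gradual rollout might look something like this; the target patterns, batch size and hostnames are purely illustrative and are not the commands actually used.

    # canary hosts first (mw1114/mw1189 are named elsewhere in this log and are
    # used here only as examples)
    salt -L 'mw1114.eqiad.wmnet,mw1189.eqiad.wmnet' cmd.run 'apt-get -y install hhvm'
    # then the rest of the fleet in small batches instead of all at once
    salt --batch-size 10% 'mw1*' pkg.install hhvm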
[01:30:52] what the hell, i was sure i did that [01:30:57] * ori is confused [01:31:04] https://gerrit.wikimedia.org/r/#/c/167835/ [01:31:05] heh you did [01:31:23] oh, i should just abandon the other one then [01:31:30] AndyRussG: yea, for example file_get_contents() [01:31:43] and I should review this one :) [01:31:57] (03Abandoned) 10Ori.livneh: Apt::Conf['no-recommends'] -> Package <| provider == 'apt' |> [puppet] - 10https://gerrit.wikimedia.org/r/167020 (owner: 10Ori.livneh) [01:32:49] right, I started reviewing it [01:32:55] and then realized it's a bit more complex [01:33:46] mutante: ahh thanks, fantastic, where does that live? (grepping just came up empty-handed..) [01:34:04] and that I'd need to babysit it [01:34:08] when deploying it :) [01:34:28] oic it's a php thing... [01:34:36] mailalias is even a native puppet type.. duh :) thx paravoid [01:35:04] paravoid: there's always the keyholder patch :P [01:35:09] :P [01:35:15] it's not actually time-sensitive [01:35:19] i'm just impatient [01:35:24] don't worry about it [01:35:25] AndyRussG: yea, all it was is saying "you can fetch remote files in PHP" [01:35:33] (03PS2) 10Faidon Liambotis: Add ::apt to stage => first [puppet] - 10https://gerrit.wikimedia.org/r/167835 (owner: 10Ori.livneh) [01:35:39] (03CR) 10Faidon Liambotis: [C: 032] Add ::apt to stage => first [puppet] - 10https://gerrit.wikimedia.org/r/167835 (owner: 10Ori.livneh) [01:35:43] let's see [01:35:48] uhoh [01:35:49] (famous last words) [01:35:57] * ori fastens seat-belts [01:36:02] mutante: ah hmm thanks, what about calling another wiki's actual code, kinda like an API request? [01:36:36] AndyRussG: well, mediawiki does have an API [01:36:58] AndyRussG: http://www.mediawiki.org/wiki/API:Main_page#A_simple_example [01:38:05] Yes I know... But can I make PHP of one WMF wiki call the API of another WMF wiki server-side? [01:38:38] yes [01:38:41] there's a class for that [01:38:47] \o/ woo :) [01:38:58] * ori digs it up [01:39:03] AndyRussG: just make an API request using MWHttpRequest... [01:39:33] (Exec[/usr/bin/apt-get update] => Class[Apt::Update] => Stage[first] => Stage[main] => Class[Passwords::Root] => User[root] => File[/usr/local/bin/apt2xml] => Class[Apt] => Stage[first]) [01:39:37] yeah [01:39:46] I suspected something like that [01:39:48] AndyRussG: https://www.mediawiki.org/wiki/API:Calling_internally [01:40:19] but that only works for the same wiki [01:40:26] (03PS1) 10Faidon Liambotis: Revert "Add ::apt to stage => first" [puppet] - 10https://gerrit.wikimedia.org/r/169643 [01:40:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "Add ::apt to stage => first" [puppet] - 10https://gerrit.wikimedia.org/r/169643 (owner: 10Faidon Liambotis) [01:40:45] (before we get flooed from warnings) [01:41:07] AndyRussG: if you need to call another wiki, you can just make an http request. see for an example [01:42:10] weee thanks so much ori legoktm mutante :D [01:42:15] paravoid: what was it about Apt::Conf['no-recommends'] -> Package <| provider == 'apt' |> that rubbed you the wrong way again? 
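As for the API pointer AndyRussG was given above, the "simple example" style of request is an ordinary HTTP call, which the server-side options mentioned (file_get_contents or MWHttpRequest) issue in the same way; the specific queries below are illustrations only.

    # ask another wiki's action API for general site information, as JSON
    curl -s 'https://meta.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json'
    # the same idea fetches content, e.g. the current wikitext of a page
    curl -s 'https://meta.wikimedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Main_Page&format=json'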
[01:42:17] * AndyRussG hugs everyone [01:42:35] it adds a depedency on every single package basically [01:42:44] enlarges the dependency tree and the catalog [01:42:56] it's basically exactly why stages exist :) [01:43:30] oh, right [01:44:21] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [01:51:44] (03PS7) 10Dzahn: RT - add mail aliases [puppet] - 10https://gerrit.wikimedia.org/r/168733 [01:53:23] (03CR) 10Dzahn: [C: 032] RT - add mail aliases [puppet] - 10https://gerrit.wikimedia.org/r/168733 (owner: 10Dzahn) [01:55:11] (03CR) 10Dzahn: "yep, it added them just fine and didn't overwrite the entire file" [puppet] - 10https://gerrit.wikimedia.org/r/168733 (owner: 10Dzahn) [02:02:31] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [03:25:00] (03PS1) 10Tim Starling: Increase HHVM server thread count [puppet] - 10https://gerrit.wikimedia.org/r/169649 [03:26:43] (03CR) 10Tim Starling: [C: 032] Increase HHVM server thread count [puppet] - 10https://gerrit.wikimedia.org/r/169649 (owner: 10Tim Starling) [03:36:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [03:38:59] !log upgraded mw1114 to custom package with patch from https://phabricator.wikimedia.org/T820#16428 applied [03:39:12] Logged the message, Master [03:49:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [05:22:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [05:28:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [05:58:43] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures [06:16:52] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:26:23] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:27:04] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:27:52] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:03] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:03] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:33] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail [06:28:52] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.343 second response time [06:28:52] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail [06:28:52] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [06:29:04] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail [06:29:12] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.258 second response time [06:29:13] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [06:29:22] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [06:29:22] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: puppet fail [06:29:22] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: puppet fail [06:29:53] PROBLEM - puppet last run on amssq35 is 
CRITICAL: CRITICAL: Puppet has 1 failures [06:30:53] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:26] it's like the changing of the guards in front of buckingham palace [06:31:33] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:53] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:02] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:03] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:09] at this point, if you guys fix this, i think i might miss it [06:32:12] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:12] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:22] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] <_joe_> hi, I'm mod passenger [06:32:53] <_joe_> I am a clueless piece of ruby glue you run your infrastructure on [06:33:18] welcome [06:33:26] <_joe_> ciao Nemo_bis [06:33:31] 'ao [06:42:37] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:42:51] there's always a straggler in ever litter [06:42:55] *every [06:45:07] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:45:07] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:16] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:26] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:45:27] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently 
enabled, last run 10 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:47:07] (03PS1) 10Gage: logstash: hadoop: disable output temporarily [puppet] - 10https://gerrit.wikimedia.org/r/169655 [06:47:17] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [06:47:27] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:47:38] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:47:46] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:57] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:48:18] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [06:49:57] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:52:02] (03CR) 10Gage: [C: 032] "Current output is too high for existing storage. I will minimize & reenable; we will scale logstash service." [puppet] - 10https://gerrit.wikimedia.org/r/169655 (owner: 10Gage) [06:53:26] (03CR) 10Springle: "Adding mysql::password_file sounds logical. We could probably use it places that currently touch, or expect, /root/.my.cnf." [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [06:56:28] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:16:21] !log repooled mw1189 w/patched hhvm () [07:16:29] Logged the message, Master [07:18:18] (03CR) 10Giuseppe Lavagetto: "+1 to making this a general-purpose class... 
but then, what about a more general mysql::config::file module that works similarly to what t" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [07:27:51] (03CR) 10Springle: "If mysql::config::file means auto generating /etc/my.cnf, then I'll be complaining and stonewalling on production boxes :-) Plain text con" [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [07:34:19] (03CR) 10Giuseppe Lavagetto: "yep I wasn't so foolish as to suggest a full my.cnf for a prod server to be done this way :)" [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [07:34:33] <_joe_> springle: did you mistake me for a dev? :) [07:34:37] _joe_: hehe ;) just checking [07:37:16] _joe_: so bascially an INI file generator? [07:37:24] that doesn't even need to be mysql:: [07:37:35] <_joe_> springle: we already have _that_ [07:37:44] <_joe_> springle: lemme find that for you [07:37:50] so.. why do we need another? [07:37:53] * springle confused [07:38:12] <_joe_> not the generator, a define specialized for mysql files maybe [07:38:25] <_joe_> springle: I'll bake something up to show it [07:38:32] <_joe_> it's easier done than explained [07:38:58] :) [07:39:04] <_joe_> modules/wmflib/lib/puppet/parser/functions/php_ini.rb basically does what we need [07:39:06] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqiad:xe-4/2/1 (Giglinx/Zayo, ETYX/084858//ZYO) {#1062} [10Gbps MPLS]BR [07:39:11] <_joe_> modulo sections [07:39:18] <_joe_> which are not used in php [07:40:12] <_joe_> springle: our mysql module is puppetlabs one? [07:40:59] <_joe_> *the [07:41:04] was once, i think. doubt it's current. it isn't used for production, save for a hook or two in old coredb [07:41:14] <_joe_> oh ok [07:41:15] i really dont know who else uses it [07:41:24] <_joe_> so, where should I put such a resource? [07:41:36] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [07:42:47] <_joe_> springle: it's used all over analytics puppet resources [07:43:02] best to see what otto wants, then [07:43:21] <_joe_> manifests/misc/statistics.pp is responsible for most of the used-only-once code we have [07:43:36] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [07:43:56] <_joe_> jgage: is this you? 
^^ [07:44:26] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 [07:44:26] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 31, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 92, uinitializing_shards: 0, unumber_of_data_nodes: 2} [07:44:36] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 2, uunassigned_shards: 31, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 92, uinitializing_shards: 0, unumber_of_data_nodes: 2} [07:46:38] not sure if i caused that alert but i am tryign to solve that problem [07:47:47] <_joe_> ok thanks [07:48:00] <_joe_> I'm going to have the power out for ~ 2 hours in a few [07:48:06] i ok [07:48:12] s/i // [07:48:46] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 112, initializing_shards: 2, number_of_data_nodes: 3 [07:48:53] phew [07:48:55] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 3 [07:49:00] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 121, initializing_shards: 1, number_of_data_nodes: 3 [07:49:44] i did a delete by query in attempt to recover some disk space. doesn't seem to have worked though. [07:51:45] <_joe_> jgage: you probably need to redo/compact indices [07:53:15] hm ok. *looks that up* [08:23:06] (03PS2) 10KartikMistry: Update Debian package to upstream r57689 [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/168760 [09:05:41] Reedy: I'm assuming https://gerrit.wikimedia.org/r/#/c/169294/ is already deployed? 
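The "redo/compact indices" suggestion from _joe_ above maps, in Elasticsearch 1.x terms, to an _optimize call that merges out the documents removed by the delete-by-query; disk space only comes back once those deleted docs are expunged. The index name, host and port here are illustrative.

    curl -XPOST 'http://localhost:9200/logstash-2014.09.30/_optimize?only_expunge_deletes=true'
    # then confirm per-index disk usage actually dropped
    curl -s 'http://localhost:9200/_cat/indices?v'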
[09:12:43] (03CR) 10Filippo Giunchedi: [C: 031] "dependent change seems to have been deployed, will merge this later today" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169295 (owner: 10Chad) [09:14:11] (03Abandoned) 10Filippo Giunchedi: eqiad-prod: reduce weight on ms-be1013/1014/1015 to help shed some load [software/swift-ring] - 10https://gerrit.wikimedia.org/r/166544 (owner: 10Filippo Giunchedi) [09:20:51] (03PS1) 10Filippo Giunchedi: new script: swift-add-machine [software/swift-ring] - 10https://gerrit.wikimedia.org/r/169662 [09:21:12] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] new script: swift-add-machine [software/swift-ring] - 10https://gerrit.wikimedia.org/r/169662 (owner: 10Filippo Giunchedi) [09:42:19] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet last ran 14433 seconds ago, expected 14400 [09:52:32] (03CR) 10Alexandros Kosiaris: [C: 032] "On a side note, this class is only used in two places. dns::recursor::statistics (which is only used on nescio) and mailman (sodium). It a" [puppet] - 10https://gerrit.wikimedia.org/r/169561 (owner: 10Dzahn) [09:55:53] akosiaris: i was looking into putting nginx there myself with mutante last night [09:56:47] and now that this is merged, i'd love if you can please handle https://gerrit.wikimedia.org/r/#/c/169571/ as well. Thanks akosiaris ! [09:59:21] <_joe_> eh I thought about moving to nginx when I moved everything to a module [10:00:22] yeah, we should kill the 2-3 instances of lighttpd that we got with fire ... [10:00:25] <_joe_> but that seemed too many things at once [10:00:40] <_joe_> akosiaris: AFAIK we use lighty in labs as well [10:01:25] really? ... sigh [10:03:09] we do [10:03:26] there was a long mail thread on labs-l [10:06:48] akosiaris: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Web_services [10:19:20] akosiaris: yeah, lighty is kind of the backbone of toollabs stuff [10:20:44] ok so lighttpd is in heavy use in labs. That's fine... 
production however is a different story [10:20:59] anyway, everything on its own time [10:40:41] PROBLEM - puppet last run on ssl1004 is CRITICAL: CRITICAL: Puppet has 1 failures [10:58:40] RECOVERY - puppet last run on ssl1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [11:11:58] poor lighty [11:12:02] what do y'all have against it [11:48:13] (03PS1) 10Alexandros Kosiaris: Fix bacula rspec tests [puppet] - 10https://gerrit.wikimedia.org/r/169679 [11:48:15] (03PS1) 10Alexandros Kosiaris: Modularize backups.pp [puppet] - 10https://gerrit.wikimedia.org/r/169680 [12:01:36] !log xtrabackup clone db1007 to db2029 [12:01:37] !log elastic1001, elastic1008 and elastic1013 powering down to replace ssds RT7779 [12:01:43] Logged the message, Master [12:01:49] Logged the message, Master [12:05:51] PROBLEM - Host elastic1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:06:11] PROBLEM - Host elastic1008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:06:31] PROBLEM - Host elastic1013 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:09:58] ACKNOWLEDGEMENT - DPKG on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:09:58] ACKNOWLEDGEMENT - Disk space on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:09:58] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 Chris Johnson upgrading ssds [12:09:58] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.108:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson upgrading ssds [12:09:58] ACKNOWLEDGEMENT - NTP on elastic1001 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson upgrading ssds [12:09:58] ACKNOWLEDGEMENT - RAID on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:09:59] ACKNOWLEDGEMENT - SSH on elastic1001 is CRITICAL: Connection timed out Chris Johnson upgrading ssds [12:09:59] ACKNOWLEDGEMENT - check configured eth on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:10:00] ACKNOWLEDGEMENT - check if dhclient is running on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:10:00] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:10:01] ACKNOWLEDGEMENT - puppet last run on elastic1001 is CRITICAL: Timeout while attempting connection Chris Johnson upgrading ssds [12:10:37] ACKNOWLEDGEMENT - DPKG on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - Disk space on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - NTP on elastic1008 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson Upgrading ssds [12:10:39] 
ACKNOWLEDGEMENT - RAID on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - SSH on elastic1008 is CRITICAL: Connection timed out Chris Johnson Upgrading ssds [12:10:39] ACKNOWLEDGEMENT - check configured eth on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:40] ACKNOWLEDGEMENT - check if dhclient is running on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:40] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:10:41] ACKNOWLEDGEMENT - puppet last run on elastic1008 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - DPKG on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - Disk space on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.10:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - NTP on elastic1013 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson Upgrading ssds [12:11:04] ACKNOWLEDGEMENT - RAID on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:05] ACKNOWLEDGEMENT - SSH on elastic1013 is CRITICAL: Connection timed out Chris Johnson Upgrading ssds [12:11:05] ACKNOWLEDGEMENT - check configured eth on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:06] ACKNOWLEDGEMENT - check if dhclient is running on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:06] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:11:07] ACKNOWLEDGEMENT - puppet last run on elastic1013 is CRITICAL: Timeout while attempting connection Chris Johnson Upgrading ssds [12:18:40] akosiaris: can you please pm me list of hosts lower than 12.04 ? [12:21:47] matanya: no need for pm really. sodium, nickel (pretty much ready to be replaced), nescio and ms1004 [12:21:53] so down to 3 soon :-) [12:22:03] Yay! thanks :) [12:24:14] akosiaris: on top of this, tampa is 100% out? no netapp there anymore ? [12:26:23] (03PS2) 10Giuseppe Lavagetto: hiera: mediawiki-based backend for labs [puppet] - 10https://gerrit.wikimedia.org/r/168984 [12:26:38] yes tampa is 100% out, and the netapp is being moved to codfw [12:26:45] (03PS1) 10Alexandros Kosiaris: Remove PMTPA from icinga::nsca::firewall [puppet] - 10https://gerrit.wikimedia.org/r/169685 [12:29:37] thanks akosiaris so what would happen with /srv/home_pmtpa on bast1001 ? [12:29:55] should it be changed to codfw ? 
[12:31:19] I've already fixed that [12:32:48] it is the old historic home from pmtpa and should be mounted on bast1001 as is for people to access it if they want [12:33:04] at some point we will delete it I suppose, but not yet [12:34:53] (03CR) 10Alexandros Kosiaris: [C: 032] Remove PMTPA from icinga::nsca::firewall [puppet] - 10https://gerrit.wikimedia.org/r/169685 (owner: 10Alexandros Kosiaris) [12:35:00] so : class { 'nfs::netapp::home': [12:35:00] mountpoint => '/srv/home_pmtpa', [12:35:00] mount_site => 'pmtpa', [12:35:06] is correct. thanks [12:35:59] yes it is [12:43:41] RECOVERY - Host elastic1001 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [12:49:15] <^demon|away> cmjohnson: Ah, just saw the ack's on elastic* for the ssd swaps, coolio. [12:49:22] <^demon|away> I'm around far too early if you need a hand from our side. [12:50:43] PROBLEM - Host elastic1001 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:01] RECOVERY - Host elastic1001 is UP: PING OK - Packet loss = 0%, RTA = 5.28 ms [12:51:08] thanks...just getting ready to boot up and install now [12:51:47] (03CR) 10Alexandros Kosiaris: [C: 032] "This is fine, but the repo misses the upstream/0.1+svn_57689 tag" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/168760 (owner: 10KartikMistry) [12:57:03] akosiaris: Fixed^ Thanks. [12:58:48] thnaks [12:58:51] thanks* [12:59:01] me again, sorry for nagging today akosiaris , can i add a ferm rule without a port? i.e any ? [12:59:07] !log disabling puppet on elastic1017 and 1018 [12:59:13] Logged the message, Master [12:59:31] matanya: yeah sure [12:59:43] you probably need to use the ferm::rule define and not the ferm::service [12:59:52] RECOVERY - Host elastic1008 is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms [12:59:57] who, the ferm modules doesn't hint at that :) [13:00:02] and know a little bit about ferm in general [13:00:41] you mean it is missing docs ? [13:00:42] something like: saddr (0.0.0.0/0) proto tcp dport (ssh) ACCEPT [13:00:42] RECOVERY - Host elastic1013 is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms [13:00:56] yes, saying that in a nice way [13:01:02] it does indeed [13:01:22] but what you just pointed out, does have a port (ssh) in the rule [13:01:32] !log powering down/replacing elastic1017 and elastic1018 [13:01:38] Logged the message, Master [13:01:48] so in order to avoid an XY problem, what are you trying to do matanya ?
[13:02:03] PROBLEM - puppet last run on elastic1017 is CRITICAL: Timeout while attempting connection [13:02:22] replace iptable classes in misc/udp2log [13:02:31] i worte the code, but have no way to test it [13:03:05] ACKNOWLEDGEMENT - DPKG on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:05] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39 Chris Johnson Replacing the server [13:03:05] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.39:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson Replacing the server [13:03:05] ACKNOWLEDGEMENT - NTP on elastic1017 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson Replacing the server [13:03:05] ACKNOWLEDGEMENT - RAID on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:06] ACKNOWLEDGEMENT - SSH on elastic1017 is CRITICAL: Connection refused Chris Johnson Replacing the server [13:03:06] ACKNOWLEDGEMENT - check configured eth on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:07] ACKNOWLEDGEMENT - check if dhclient is running on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:07] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:08] ACKNOWLEDGEMENT - puppet last run on elastic1017 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:30] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.40:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson Replacing the server [13:03:30] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1018 is CRITICAL: Timeout while attempting connection Chris Johnson Replacing the server [13:03:51] PROBLEM - Host elastic1017 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:04:12] PROBLEM - Host elastic1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:06:42] PROBLEM - Host elastic1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:07:22] PROBLEM - Host elastic1013 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:01] RECOVERY - Host elastic1008 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [13:08:59] (03PS1) 10Matanya: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 [13:09:02] RECOVERY - Host elastic1013 is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [13:09:05] akosiaris: this ^ [13:09:40] (03CR) 10jenkins-bot: [V: 04-1] udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [13:10:28] (03PS2) 10Matanya: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 [13:11:26] !log lowered redundancy on logstash from 3 way to 2 way [13:11:32] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 46: active_shards: 92: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [13:11:32] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 46: active_shards: 92: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [13:11:32] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 46: active_shards: 92: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [13:11:36] Logged the message, Master [13:13:16] matanya: Seems the same if you take into account the DEFAULT DROP of base::firewall [13:14:03] akosiaris: so this change is useless ? [13:14:32] no, obviously not [13:14:51] I meant the functionality is the same [13:15:17] which is what you wanted [13:16:09] (03CR) 10Chad: Decom lsearchd pool 5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/169295 (owner: 10Chad) [13:16:58] <^d> godog: That thing is a mess ^ [13:17:04] <^d> How pool 5 even works I don't even. [13:17:31] oh, thanks. i'm slow today :) [13:19:00] ^d: haha just saw the full duplication below [13:19:09] (03PS1) 10Chad: Revert "Stop using lsearchd pool 5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169693 [13:19:24] <^d> But actually the dependent change broke lsearchd on those wikis. [13:19:33] <^d> pool5 is unpuppetized, mostly. [13:19:34] <^d> Figures. [13:19:45] (03CR) 10Chad: [C: 032 V: 032] Revert "Stop using lsearchd pool 5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169693 (owner: 10Chad) [13:20:36] !log demon Synchronized wmf-config/lucene-production.php: unbreak lsearchd for commons, enwikitionary, etc (duration: 00m 04s) [13:20:42] (03PS3) 10Matanya: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 [13:20:42] Logged the message, Master [13:22:19] sigh, so not a full subset? [13:22:51] <^d> Yeah. [13:22:53] !log lowered replication on logstash's template for new indexes from 3 way to 2 way [13:22:53] <^d> Seems like [13:22:57] Logged the message, Master [13:23:11] godog and ^d: part of the problem with lsearchd is that its config is just all fucked up [13:23:37] hahaha [13:23:40] <^d> Yes. [13:23:59] <^d> This is why we try to avoid touching it. Everything breaks. [13:24:10] <^d> I was trying to chip off a tiny edge of the crap from the edge of shit mountain. [13:24:14] <^d> But no, couldn't do that either. 
[13:24:15] looks like we're up to dismantle it all together [13:27:47] (03PS1) 10Cmjohnson: Adding new elastic search mgmt dns and correcting WMF to wmf discrepencies [dns] - 10https://gerrit.wikimedia.org/r/169694 [13:28:32] (03CR) 10Alexandros Kosiaris: udp2log: replace iptables with ferm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [13:28:59] (03CR) 10Alexandros Kosiaris: [C: 04-1] udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [13:29:57] akosiaris: it can be merged yet anyway because of hosts missing include base::firewall [13:30:01] *can't [13:30:29] matanya: yup [13:31:18] (03PS4) 10Matanya: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 [13:32:16] PROBLEM - puppet last run on search1017 is CRITICAL: CRITICAL: Puppet has 1 failures [13:33:44] (03Abandoned) 10Chad: Remove search-pool5 LVS entries, exists no more [dns] - 10https://gerrit.wikimedia.org/r/169300 (owner: 10Chad) [13:33:52] (03Abandoned) 10Chad: Decom lsearchd pool 5 [puppet] - 10https://gerrit.wikimedia.org/r/169295 (owner: 10Chad) [13:39:50] (03CR) 10Cmjohnson: [C: 032] Adding new elastic search mgmt dns and correcting WMF to wmf discrepencies [dns] - 10https://gerrit.wikimedia.org/r/169694 (owner: 10Cmjohnson) [13:40:20] ottomata: i think i'll really need your help with this one [13:41:06] ottomata: fyi elastic1001,1008,1013 have the basic install. I removed all the old salt-keys and puppet certs. [13:41:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] udp2log: replace iptables with ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169691 (owner: 10Matanya) [13:42:35] (03PS5) 10Matanya: udp2log: replace iptables with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169691 [13:42:51] the only question now is what hosts need base::firewall [13:43:19] i guess fluorine, oxygen, erbium and some analytics [13:43:23] probably more [13:43:37] eeef [13:43:43] i am not excited about this! [13:43:53] share with me :) [13:43:54] we are trying to get rid of udp2log! (i know i have been saying that for over a year now) [13:44:25] it is one of those systems that i prefer not to touch too much, because it is fragile, and people get really upset when it breaks [13:44:27] in the mean time, i'm trying to get rid of iptables.pp and this is blocking me :) [13:44:30] and, we are getting rid of it [13:45:20] join #gsoc [13:45:25] (oops, sorry) [13:50:07] RECOVERY - puppet last run on search1017 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:56:08] akosiaris: hey! halfak wants to use the postgres db from labs, wondering how much capacity it has... [13:56:13] in terms of storage capacity? [13:58:47] YuviPanda: hola [13:58:55] kart_: hey [13:59:05] YuviPanda: Is it possible to get CPU/RAM usage in Beta? [13:59:12] for specific instances? [13:59:13] kart_: graphite.wmflabs.org :) [13:59:46] YuviPanda: around 4T as soon as we clear it up from the osm mirror (which is no longer used) [14:00:03] YuviPanda: ah. What was the credential? :P [14:00:59] kart_: no credentials/ [14:01:02] performance wise it is not going to be great and he might have neighbors in the future but for now it would be pretty much only him [14:01:12] halfak: ^ [14:01:25] akosiaris: can you make him an account?
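Two hedged sketches related to the udp2log/ferm change being iterated on above, neither taken from the actual review: matanya's "no way to test it" problem can be narrowed by rendering the ruleset without applying it, and the "what hosts need base::firewall" question can be approximated by grepping the puppet tree (the checkout path and grep patterns are hypothetical).

    # on a test host carrying the new rules: print the iptables commands ferm
    # would run, without touching the live ruleset
    ferm --noexec --lines /etc/ferm/ferm.conf
    iptables-save                             # snapshot the current ruleset for comparison

    # rough inventory: what already includes base::firewall vs. the udp2log hosts
    cd /path/to/operations-puppet
    git grep -l 'base::firewall' manifests/ modules/ | sort
    git grep -l 'udp2log' manifests/ modules/ | sort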
[14:01:54] akosiaris: Please :) [14:01:58] kart_: you can also use http://tools.wmflabs.org/nagf [14:02:29] kart_: https://tools.wmflabs.org/nagf/?project=deployment-prep for betalabs [14:02:53] akosiaris, thanks! [14:03:49] YuviPanda: I 've been meaning to ask you that. You had some code around to make it as easy as populating the mysql dbs for labs, right ? [14:03:51] YuviPanda: thanks! [14:03:59] wanna push it forward ? [14:04:18] akosiaris, How do I find out the connection details for postgres? [14:04:35] halfak: you ain't gonna be able to connect yet [14:05:08] akosiaris: sure, I can. Need to figure out where to run it, but that's about it... [14:05:13] Oh totally. I was just hoping to leave myself some notes for when I got back to this later. :) [14:05:21] akosiaris: not this week though, I'm 'off', and only around totally as a volunteer :) [14:05:31] not that that's different from other time, but probably won't do too much work.. [14:05:50] akosiaris: but yeah, if we think we're ready to open it up to everyone I'm totally up for it. [14:06:12] YuviPanda: we do think so (me thinks so). What you will need of me ? [14:06:25] akosiaris: access credentials for postgres... [14:06:36] akosiaris: if having access / root on the machine is good enough, I'll have it next week... [14:07:03] ok, take your vacation, I 'll create an account for the tool to create users and mail you [14:07:12] halfak: mind waiting till next week ? [14:07:20] akosiaris, I can. [14:07:24] thanks :) [14:07:35] no, thank you :D [14:08:10] YuviPanda: thanks for the graphite links as well, they are gonna prove really helpful :-) [14:08:13] akosiaris: cool :) do mail me whenever, I might get a head start. [14:08:25] akosiaris: yeah :D Krinkle|detached wrote nagf which is quite nice too [14:11:52] ^d: later this evening (when I'm actually taking my day off :P) can you ban the current masters so they are empty? [14:12:17] when we rebuild the "new" masters I'd like to be able to bring the old masters down right away [14:12:22] and we can do that if they are empty [14:12:29] <^d> We could go ahead and start now. [14:12:35] we've got plenty of extra capacity [14:12:37] sure [14:12:47] <^d> I'll do them 1 by 1 instead of all 3 at once. [14:13:06] we'll have banned 9 nodes out of 31 at that point - still plenty of capacity [14:13:18] ^d: I don't think it matter if you do them one by one or all 3 at once [14:13:31] <^d> Yeah probably, sure. [14:13:36] oh - do you want to make your setting for concurrent moves persistent? it seems pretty stable. [14:15:04] <^d> All 3 banned. [14:15:12] <^d> Yeah, I'll whip up a puppet change for it. [14:15:13] (03PS3) 10Ottomata: Require 2 ACKs from kafka brokers per default [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [14:16:15] thanks! [14:17:26] <^d> Oh duh, we just need it in persistent. [14:18:06] <^d> https://phabricator.wikimedia.org/P47 [14:21:17] (03CR) 10Ottomata: [C: 032] Require 2 ACKs from kafka brokers per default [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [14:21:44] !log set request.required.acks = 2 for all varnishkafkas [14:21:51] Logged the message, Master [14:22:56] (03PS1) 10Chad: Only show comment when section exists [puppet] - 10https://gerrit.wikimedia.org/r/169703 [14:23:41] (03PS1) 10Cmjohnson: Changing dhcpd entries for elastic1017-19 [puppet] - 10https://gerrit.wikimedia.org/r/169704 [14:24:22] manybubbles:, ^d, ok if i bring up 1017-1019? 
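In raw Elasticsearch 1.x API terms, "banning" the old masters and making the concurrent-moves setting persistent both go through the cluster settings endpoint, roughly as below. The node names, the particular setting key and the value are guesses for illustration; the real values are in the P47 paste, which is not reproduced in this log.

    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1002,elastic1007,elastic1014"
      },
      "persistent": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 6
      }
    }'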
[14:24:30] ottomata: fine by me [14:24:43] they are currently still banned so they won't grab shards anyway [14:24:45] oh, cmjohnson, did we change netboot stuff for those 3? [14:24:46] (03PS2) 10Cmjohnson: Changing dhcpd entries for elastic1017-19 [puppet] - 10https://gerrit.wikimedia.org/r/169704 [14:24:48] ah [14:24:49] haha [14:24:54] hahah [14:24:56] we can validate them and then unban then [14:24:58] haha? [14:25:16] as I was asking the question cmjohnson made the commit i was looking for [14:25:59] heh [14:26:22] ottomata: did you see the ping about the other 3 (1001,1008,1013) [14:26:29] yes [14:26:31] (03CR) 10Cmjohnson: [C: 032] Changing dhcpd entries for elastic1017-19 [puppet] - 10https://gerrit.wikimedia.org/r/169704 (owner: 10Cmjohnson) [14:26:38] going to do these ones first, if you don't mind [14:26:55] wait, i think so, cmjohnson, you said they had been removed and hds upgraded? [14:27:17] yes...i did the base install and removed all the old salt keys and puppet certs [14:27:36] you will need to add new [14:28:06] k cool [14:29:17] (03PS1) 10Glaisher: Raise account creation throttle at cawiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169708 (https://bugzilla.wikimedia.org/72611) [14:29:19] oh, cmjohnson, hypethreading? [14:29:29] on, or should I do that? [14:29:30] yep..i remembered [14:29:31] cool! [14:29:32] danke [14:31:24] i think those are not your fault [14:31:27] oops [14:31:32] wrong chat [14:38:22] manybubbles: i'm running puppet on elastic1001,1008,1013 now [14:38:32] shoudl I mark them as master elligible now? [14:38:33] ottomata: did you merge that one? [14:38:43] ? [14:38:44] ottomata: yeah - if your running puppet you should merge it [14:38:53] https://gerrit.wikimedia.org/r/#/c/169550/ [14:38:58] k will do that first [14:39:23] manybubbles: should I wait then? [14:39:27] before getting elasticsearch up on those? 
[14:39:47] ottomata: nah - its fine [14:39:52] you can set them up [14:40:00] ok [14:40:05] (03PS2) 10Ottomata: Add three more master node to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/169550 (owner: 10Manybubbles) [14:40:06] the other nodes should be empty pretty soon and we can bring them down [14:41:01] (03CR) 10Ottomata: [C: 032] Add three more master node to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/169550 (owner: 10Manybubbles) [14:43:31] PROBLEM - DPKG on elastic1013 is CRITICAL: Connection refused by host [14:43:32] PROBLEM - Disk space on elastic1008 is CRITICAL: Connection refused by host [14:43:32] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.108:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [14:43:42] PROBLEM - Disk space on elastic1013 is CRITICAL: Connection refused by host [14:43:42] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [14:43:52] PROBLEM - RAID on elastic1001 is CRITICAL: Connection refused by host [14:43:52] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [14:43:52] PROBLEM - ElasticSearch health check for shards on elastic1008 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.140:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [14:44:02] PROBLEM - ElasticSearch health check for shards on elastic1013 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.10:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [14:44:11] PROBLEM - check configured eth on elastic1001 is CRITICAL: Connection refused by host [14:44:14] PROBLEM - RAID on elastic1008 is CRITICAL: Connection refused by host [14:44:31] RECOVERY - Disk space on elastic1008 is OK: DISK OK [14:44:31] RECOVERY - DPKG on elastic1013 is OK: All packages OK [14:44:41] RECOVERY - Disk space on elastic1013 is OK: DISK OK [14:45:02] PROBLEM - puppet last run on elastic1001 is CRITICAL: CRITICAL: Puppet has 2 failures [14:45:06] RECOVERY - RAID on elastic1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:45:11] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [14:45:11] RECOVERY - check configured eth on elastic1001 is OK: NRPE: Unable to read output [14:45:11] RECOVERY - RAID on elastic1008 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:45:22] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: puppet fail [14:45:43] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [14:46:02] ^ acknowledged, puppet is bringing these up [14:46:51] RECOVERY - ElasticSearch health check for shards on elastic1001 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 25, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 66, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 25 [14:46:51] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 25: number_of_data_nodes: 25: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 66: initializing_shards: 0: unassigned_shards: 0 [14:47:03] RECOVERY - puppet last run on elastic1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:47:15] RECOVERY - ElasticSearch health check for shards on elastic1008 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 28, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 66, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 28 [14:47:19] there they go! [14:47:21] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 28: number_of_data_nodes: 28: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 66: initializing_shards: 0: unassigned_shards: 0 [14:47:22] manybubbles: ^d [14:47:32] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [14:47:33] RECOVERY - puppet last run on elastic1013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:47:41] RECOVERY - ElasticSearch health check for shards on elastic1013 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 28, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 66, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 28 [14:47:53] <^d> wheee [14:48:05] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 28: number_of_data_nodes: 28: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 66: initializing_shards: 0: unassigned_shards: 0 [14:48:15] elastic1007 10.64.32.139 57 35 1.50 d m elastic1007 [14:48:15] elastic1002 10.64.0.109 73 35 1.78 d * elastic1002 [14:48:16] elastic1014 10.64.48.11 20 35 0.23 d m elastic1014 [14:48:16] elastic1008 10.64.32.140 2 33 0.26 d m elastic1008 [14:48:16] elastic1013 10.64.48.10 0 33 0.23 d m elastic1013 [14:48:16] elastic1001 10.64.0.108 1 33 0.23 d m elastic1001 [14:49:06] yeah [14:49:33] I'll unban the new nodes [14:49:42] RECOVERY - Host elastic1019 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [14:49:46] <^d> es-tool has made this so much easier. [14:49:47] <^d> :p [14:50:40] (03PS2) 10Alexandros Kosiaris: osm export the expired tile list [puppet] - 10https://gerrit.wikimedia.org/r/169242 [14:50:53] is this the addition of the new boxes? [14:50:56] manybubbles, marktraceur, ^d: Who wants to SWAT today? [14:51:48] <^d> mark: New boxes mostly went in yesterday, this is adding remaining 3 new ones to replace 17-19 that we're retiring. [14:51:59] ah cool [14:52:01] <^d> And started swapping out ssds on 3 of the old boxes. 
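The health checks and the per-node table above come straight from Elasticsearch's HTTP API; a minimal sketch of the equivalent manual checks (host is illustrative, any cluster member will answer):

    # overall cluster state: green/yellow/red plus shard counts
    curl -s 'http://localhost:9200/_cluster/health?pretty'

    # per-node view: heap %, load, d = data node, m = master-eligible, * = current master
    curl -s 'http://localhost:9200/_cat/nodes?v'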
[14:52:01] (03CR) 10Alexandros Kosiaris: [C: 032] osm export the expired tile list [puppet] - 10https://gerrit.wikimedia.org/r/169242 (owner: 10Alexandros Kosiaris) [14:52:02] PROBLEM - RAID on elastic1019 is CRITICAL: Connection refused by host [14:52:02] PROBLEM - puppet last run on elastic1019 is CRITICAL: Connection refused by host [14:52:02] PROBLEM - puppet disabled on elastic1019 is CRITICAL: Connection refused by host [14:52:02] PROBLEM - check configured eth on elastic1019 is CRITICAL: Connection refused by host [14:52:02] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.41 [14:52:29] (03PS3) 10Alexandros Kosiaris: Backup user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/168981 [14:52:38] (03CR) 10Alexandros Kosiaris: [C: 032] Backup user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/168981 (owner: 10Alexandros Kosiaris) [14:52:42] PROBLEM - check if dhclient is running on elastic1019 is CRITICAL: Connection refused by host [14:52:42] PROBLEM - DPKG on elastic1019 is CRITICAL: Connection refused by host [14:52:43] PROBLEM - Disk space on elastic1019 is CRITICAL: Connection refused by host [14:53:19] anomie: not it [14:53:57] <^d> anomie: It's only 1 patch so I can if you guys can't, but preferably not it. [14:54:12] PROBLEM - Puppet freshness on elastic1019 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:08:08 UTC [14:54:29] * anomie will take it unless marktraceur wants it [14:55:25] !log started rolling shards back to elastic1001, elastic1008, and elastic1013 after hard drive upgrade [14:55:31] Logged the message, Master [14:56:01] aude: Ping for SWAT in 4 minutes [14:56:02] anomie: All yours [14:56:03] ottomata: it looks like 1001 isn't hyper threading? [14:56:08] I have phone internet this morning [14:56:18] Because the cable guy is supposed to show up sometime in the next two hours [14:56:18] anomie: ping me too [14:56:19] here [14:56:28] i.e. he'll show up four hours from now [14:56:51] PROBLEM - NTP on elastic1013 is CRITICAL: NTP CRITICAL: Offset unknown [14:56:52] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [14:56:52] PROBLEM - NTP on elastic1008 is CRITICAL: NTP CRITICAL: Offset unknown [14:57:26] hm, I didn't check them, cmjohnson? [14:57:34] hyperthreading on 1001,1008,1013? [14:57:40] ottomata: should I push the shards back off of 1001, 1008 and 1013 for ht bounce? [14:58:03] oh cmjohnson ran out for a bit [14:58:06] manybubbles: i'd say yes [14:58:26] i asked cmjohnson if he turned ht on, and he said yes, but he might have been answering only about 1017-1019 [14:58:41] if you get the shards off, i'll reboot one (or all) and check (and turn it on if it iisn't) [15:00:05] manybubbles, anomie, ^d, marktraceur, aude: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141029T1500). 
[15:00:09] * anomie begins SWAT [15:00:14] aude: I'll do yours first [15:00:16] !log started moving shard off of elastic1001, elastic1008, and elastic1013 so we can bounce them to enable hyper threading [15:00:22] ok [15:00:23] Logged the message, Master [15:01:33] ottomata: none of them are empty yet [15:02:11] RECOVERY - NTP on elastic1013 is OK: NTP OK: Offset -0.01706254482 secs [15:02:12] RECOVERY - NTP on elastic1008 is OK: NTP OK: Offset -0.02334403992 secs [15:02:31] k, lemme know when [15:04:31] PROBLEM - NTP on elastic1019 is CRITICAL: NTP CRITICAL: No response from NTP server [15:06:28] (03PS1) 10Aude: Re-enable hhvm beta feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 [15:07:04] !log Killed old (pre 1.25) l10nupdate cache dirs from tin:/var/lib/l10nupdate [15:07:10] Logged the message, Master [15:07:37] !log anomie Synchronized php-1.25wmf5/extensions/Wikidata: SWAT: Fix WikiData "add links" widget JS error [[gerrit:169700]] (duration: 00m 15s) [15:07:38] aude: ^ test please [15:07:43] Glaisher: You're next [15:07:44] Logged the message, Master [15:07:45] doing [15:07:45] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: puppet fail [15:07:51] k [15:08:01] (03PS2) 10Anomie: Raise account creation throttle at cawiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169708 (https://bugzilla.wikimedia.org/72611) (owner: 10Glaisher) [15:08:51] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [15:08:54] ottomata: all of them are sitting there holding 3 shards [15:09:08] rataher - they are trying to move them away but its taking a while [15:09:18] hmk [15:10:03] anomie: looks good [15:10:09] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169708 (https://bugzilla.wikimedia.org/72611) (owner: 10Glaisher) [15:10:16] (03Merged) 10jenkins-bot: Raise account creation throttle at cawiki temporarily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169708 (https://bugzilla.wikimedia.org/72611) (owner: 10Glaisher) [15:10:27] (03PS1) 10Reedy: Clone/update skins directory during l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/169715 (https://bugzilla.wikimedia.org/67154) [15:10:35] !log anomie Synchronized wmf-config/throttle.php: SWAT: Raise account creation throttle at cawiki temporarily [[gerrit:169708]] (duration: 00m 09s) [15:10:35] Glaisher: ^ I suppose there's no way to test that [15:10:40] Logged the message, Master [15:10:46] ottomata: meh - its ok - you can bounce 1013 [15:10:47] mhm [15:10:48] thanks [15:10:54] * anomie is done with SWAT [15:11:11] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: puppet fail [15:11:13] thanks [15:11:32] Is that a record for non-empty SWATs? [15:11:32] :) [15:11:40] ottomata: in fact you can do them all if you are ready [15:11:49] marktraceur: Probably not. One with all config changes would easily be faster [15:11:54] Oh, true [15:11:58] (03PS1) 10Reedy: Add skins to wgLocalisationUpdateRepositories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169716 (https://bugzilla.wikimedia.org/67154) [15:12:05] so is ES happier with the new hardware? [15:12:09] Apply 'em all, get 'em on tin, sync-dir wmf-config/ [15:12:21] If they were all in one file it'd be even better [15:12:36] manybubbles: ok... 
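A rough sketch of the config-only SWAT flow being described above (pull the merged mediawiki-config change onto tin's staging copy and sync it out); the exact paths and sync invocations are from memory rather than authoritative:

    # on tin, after the mediawiki-config change has been +2'd and merged
    cd /srv/mediawiki-staging && git pull

    # push a single config file out to the cluster
    sync-file wmf-config/throttle.php 'SWAT: raise account creation throttle at cawiki'

    # or several config changes in one go
    sync-dir wmf-config/ 'SWAT: config changes'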
[15:12:37] InitialiseCommonSettings.php [15:12:43] * bd808 looks at logstash hosts to find why only parsoid messages are making it to the index [15:12:49] I wonder if jouncebot would like to be able to track stuff like !startdeploy SWAT; !enddeploy SWAT [15:12:50] Even one at a time, config merges take seconds versus 10 minutes for MediaWiki changes [15:12:58] True [15:13:00] <^d> mark: Yeah, I'd say so. Much more breathing room. [15:13:14] <^d> We were able to take 9 nodes down for maintenance without suffering. [15:13:40] manybubbles: do I have to be nice to elasticsearch in some way? [15:13:43] or can I just reboot? [15:13:52] ottomata: just reboot when you want [15:14:02] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: puppet fail [15:14:21] <^d> Reedy: CommonInitialisedSettings.php? [15:14:36] k, rebooting 1013 [15:15:19] !log Restarted logstash on logstash1001. No MW events were being added to the index. [15:15:24] Logged the message, Master [15:15:32] bd808: weird [15:15:52] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [15:15:59] manybubbles: It gets stuck sometimes. And sadly really doesn't log anything helpful about why [15:16:09] lame [15:16:32] I hope this goes away when I kill off log2udp packet relay as input [15:16:47] PROBLEM - Host elastic1013 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:18:52] (03CR) 10Alexandros Kosiaris: [C: 032] Clone/update skins directory during l10nupdate [puppet] - 10https://gerrit.wikimedia.org/r/169715 (https://bugzilla.wikimedia.org/67154) (owner: 10Reedy) [15:18:58] akosiaris: thanks! :) [15:18:59] !log Restarted logstash on logstash1002 to fix OCG and hadoop log events not being recorded [15:18:59] bd808: not sure if the parsoid folks mentioned this, but there is a field that contains a copy of the entire info as json in the gelf output that we could dropto save space; it's added by the gelf-stream library [15:19:06] Logged the message, Master [15:19:36] bd808: the full_message field [15:19:53] gwicke: I just noticed that. Seems like an easy fix [15:20:37] we were considering pushing a patch upstream to avoid adding it in the first place, but for now stripping it might be quicker [15:20:48] !log reedy Purged l10n cache for 1.25wmf2 [15:20:53] Logged the message, Master [15:21:12] !log reedy Purged l10n cache for 1.25wmf3 [15:21:18] Logged the message, Master [15:21:57] Can someone rm -rf /srv/mediawiki-staging/operations on tin please? [15:22:02] (it's empty bar the .git dir) [15:22:04] ottomata: that machine is having trouble coming back up? [15:22:19] * Reedy looks which WMF dirs can die [15:22:27] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: puppet fail [15:23:01] ottomata: ah - much better [15:23:08] RECOVERY - Host elastic1013 is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [15:23:37] !log unbanned elastic1013 now that it is back with hyper threading on [15:23:43] Logged the message, Master [15:23:59] ok cool [15:24:00] yeah, [15:24:11] i will do the other two now [15:24:13] can I do them at the same time? [15:24:15] manybubbles? 
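The ban/unban steps above go through es-tool here, which presumably wraps Elasticsearch's allocation-exclusion setting; a hedged sketch of doing it by hand (the IP is one of the nodes mentioned above, and an empty string clears the exclusion):

    # drain shards off a node before rebooting it
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
      {"transient": {"cluster.routing.allocation.exclude._ip": "10.64.48.10"}}'

    # watch the shards relocate away
    curl -s 'http://localhost:9200/_cluster/health?pretty' | grep relocating_shards

    # "unban" once the node is back and healthy
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
      {"transient": {"cluster.routing.allocation.exclude._ip": ""}}'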
[15:25:03] ottomata: sure - yeah [15:25:14] (03PS1) 10Alexandros Kosiaris: osm: Fix typo introduced in 2d96ee8 [puppet] - 10https://gerrit.wikimedia.org/r/169720 [15:26:13] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Fix typo introduced in 2d96ee8 [puppet] - 10https://gerrit.wikimedia.org/r/169720 (owner: 10Alexandros Kosiaris) [15:27:36] ok, rebooting 1001 and 1008 [15:28:37] (03PS1) 10Ottomata: Add defines for working with mysql config files, and mysql client settings [puppet] - 10https://gerrit.wikimedia.org/r/169722 [15:28:50] PROBLEM - Host elastic1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:28:51] PROBLEM - Host elastic1008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:30:27] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:30:32] (03PS2) 10Ottomata: Add defines for working with mysql config files, and mysql client settings [puppet] - 10https://gerrit.wikimedia.org/r/169722 [15:31:17] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.002 second response time [15:31:18] (03CR) 10Ottomata: "Ok, here is an attempt:" [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [15:31:57] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:32:17] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.013 second response time [15:32:36] reedy@tin:/srv/mediawiki-staging$ rm -rf php-1.24wmf21/.git/modules/extensions/Wikidata/rr-cache/1c5f30e1e932c16581d9da4f7a9e510910cba134/preimage [15:32:36] rm: cannot remove `php-1.24wmf21/.git/modules/extensions/Wikidata/rr-cache/1c5f30e1e932c16581d9da4f7a9e510910cba134/preimage': Permission denied [15:32:36] reedy@tin:/srv/mediawiki-staging$ ls -al php-1.24wmf21/.git/modules/extensions/Wikidata/rr-cache/1c5f30e1e932c16581d9da4f7a9e510910cba134/preimage [15:32:36] -rw-rw-r-- 1 ori wikidev 48576 Sep 12 00:35 php-1.24wmf21/.git/modules/extensions/Wikidata/rr-cache/1c5f30e1e932c16581d9da4f7a9e510910cba134/preimage [15:32:36] wut [15:33:04] (03CR) 10Ottomata: "Strangely enough, this template already existed in the mysql module. It looks like it was from puppet labs. I did a halfhearted googling" [puppet] - 10https://gerrit.wikimedia.org/r/169722 (owner: 10Ottomata) [15:33:14] ? [15:33:23] why can't I delete those? 
[15:33:28] There's a stack of them [15:33:52] (03PS1) 10Reedy: Remove php-1.24wmf(19|2[01]) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169723 [15:34:06] (03CR) 10Reedy: [C: 032] Remove php-1.24wmf(19|2[01]) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169723 (owner: 10Reedy) [15:34:14] (03Merged) 10jenkins-bot: Remove php-1.24wmf(19|2[01]) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169723 (owner: 10Reedy) [15:34:19] (03PS1) 10Alexandros Kosiaris: osm: Fix matching the rsync process via nrpe [puppet] - 10https://gerrit.wikimedia.org/r/169724 [15:35:07] !log uploaded apertium-apy_0.1+svn~57689-1 on apt.wikimedia.org [15:35:14] Logged the message, Master [15:35:30] !log deleted php-1.24wmf19 from mediawiki-installation [15:35:36] Logged the message, Master [15:35:55] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Fix matching the rsync process via nrpe [puppet] - 10https://gerrit.wikimedia.org/r/169724 (owner: 10Alexandros Kosiaris) [15:36:27] !log deleted php-1.24wmf20 from mediawiki-installation [15:36:32] Logged the message, Master [15:37:17] !log deleted php-1.24wmf21 from mediawiki-installation [15:37:23] Logged the message, Master [15:37:47] manybubbles: 1001 and 1008 shoudl be back up [15:37:48] RECOVERY - Host elastic1001 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [15:37:50] RECOVERY - Host elastic1008 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [15:38:01] <^d> I see them [15:38:14] heh [15:38:26] yeah [15:39:12] !log start moving shards back to elastic1001 and elastic1008 now that they are up with hyperthreading on [15:39:17] Logged the message, Master [15:40:46] <^d> 1, 8 and 13 won't take any until 2, 7 and 14 are done dumping theirs. [15:41:02] (03PS1) 10Manybubbles: Remove old elasticsearch masters [puppet] - 10https://gerrit.wikimedia.org/r/169725 [15:41:13] <^d> at least that's what i'm observing. [15:41:24] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:41:30] (03CR) 10Manybubbles: "Merge me anytime." [puppet] - 10https://gerrit.wikimedia.org/r/169725 (owner: 10Manybubbles) [15:42:29] ^d: its because elasticsearch won't perform any rebalancing until all the banned nodes are empty [15:42:33] <^d> yeah [15:42:46] and I believe all the shards on the banned nodes are already moving [15:43:27] <^d> they are. [15:49:41] (03CR) 10Chad: [C: 031] Remove old elasticsearch masters [puppet] - 10https://gerrit.wikimedia.org/r/169725 (owner: 10Manybubbles) [15:53:28] (03PS1) 10BryanDavis: logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 [15:53:30] (03PS1) 10BryanDavis: logstash: reformat gelf filter config [puppet] - 10https://gerrit.wikimedia.org/r/169728 [15:53:41] hey paravoid, did you(and magnus?) package a new librdkafka? [15:53:58] is it going to ubuntu/debian? [15:59:52] jgage: A couple smallish logstash tweaks for your consideration: https://gerrit.wikimedia.org/r/169727 & https://gerrit.wikimedia.org/r/169728 [16:00:15] I haven't tested either one in beta yet because I'm lazy and have to prep for a phone interview. :/ [16:01:03] bd808: thanks for that (the phone interview) [16:02:14] greg-g: :) np. I kind of like doing interviews. 
[16:03:12] ^d, manybubbles, i'm running puppet on 1017-1019 now [16:03:20] ottomata: cool [16:03:58] !log shutting down elasticsearch on elastic1017 - its empty and ready to have its disk upgraded/hyper threading enabled [16:04:05] Logged the message, Master [16:04:42] !log shutting down elasticsearch on elastic1014 - its empty and ready to have its disk upgraded/hyper threading enabled [16:04:46] Logged the message, Master [16:05:23] manybubbles: did you mean 1017? [16:05:36] <^d> yeah was about to say. [16:05:42] !log shutting down elasticsearch on elastic1007 - its empty and ready to have its disk upgraded/hyper threading enabled [16:05:47] Logged the message, Master [16:05:51] !log ignore my last log message about 1017 - typod [16:05:56] Logged the message, Master [16:06:04] RECOVERY - DPKG on elastic1019 is OK: All packages OK [16:06:05] RECOVERY - check if dhclient is running on elastic1019 is OK: PROCS OK: 0 processes with command name dhclient [16:06:24] RECOVERY - Disk space on elastic1019 is OK: DISK OK [16:06:44] RECOVERY - RAID on elastic1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [16:06:45] RECOVERY - check configured eth on elastic1019 is OK: NRPE: Unable to read output [16:06:45] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [16:07:05] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.11:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:07:35] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:07:54] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.139:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:07:58] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 29: number_of_data_nodes: 29: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [16:07:58] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [16:11:15] RECOVERY - ElasticSearch health check for shards on elastic1014 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 12, active_shards: 6093, initializing_shards: 1, number_of_data_nodes: 30 [16:12:06] <^d> manybubbles: That enwiki_general shard is taking its sweet time moving off 1002 :p [16:12:08] wtf puppet - I disabled you. [16:12:10] yeah [16:12:54] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [16:13:11] how I disable puppet? [16:13:25] there we go [16:13:29] --disable instead of disable [16:15:04] that also takes an argument to specify why FWIW [16:15:36] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.11:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:15:53] ottomata: noatime on elastic1018 [16:16:02] bwaaaHHH [16:16:03] thanks. 
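For the --disable confusion above: the flag also takes a reason string, which is shown to the next person who runs the agent and wonders why it is off. A quick sketch (the reason text is illustrative):

    puppet agent --disable "swapping SSDs / enabling hyperthreading on elastic10xx"
    # ...do the maintenance, then re-enable...
    puppet agent --enable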
sorry [16:16:15] you are a good double checker [16:16:44] ottomata: I try [16:16:51] elastic1017 and 1019 need it too [16:16:54] but 1019 isn't emtpy [16:17:16] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.41 [16:17:20] manybubbles: it isn't empty!? [16:17:23] i just did i! [16:17:25] it! [16:17:27] uh oh [16:17:32] it is back on now [16:17:35] what did I do? [16:17:36] !log shutting down elasticsearch on elastic1002 - its empty and ready to have its disk upgraded/hyper threading enabled [16:17:45] Logged the message, Master [16:17:46] ottomata: we just hadn't banned it [16:17:50] OHH [16:17:54] right because it wasn't online [16:18:05] welp, elasticsearch was down there for about 10 seconds then :/ [16:18:06] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [16:18:15] ottomata: its empty now [16:18:19] hah, ok [16:18:38] ah - I see. we're in yellow [16:18:40] its all good [16:19:50] (03CR) 10Gage: [C: 04-1] "full_message is where Hadoop stores Java stack traces, which we want. Can we move this into a nodejs/parsoid-specific section?" [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [16:20:15] * ^d puts his yellow hat on [16:20:26] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [16:20:30] !log elastic101[7-9] look good to me - adding them to the cluster [16:20:35] Logged the message, Master [16:20:37] icinga-wm: I know that. I shut it down intentionally [16:21:02] k [16:22:15] ^d: can you make es-tool take a hostname instead of an ip address? [16:22:21] in addition to, rather? [16:22:23] (03CR) 10BryanDavis: "Parsoid and OCG both just have junk there. Are there other useful things in the Hadoop version? Maybe we could pick all the good things ou" [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [16:22:24] <^d> Yeah, I was just thinking that. [16:22:41] I do a lot of `ifconfig | grep inet`, copy, paste [16:24:44] RECOVERY - NTP on elastic1019 is OK: NTP OK: Offset -4.708766937e-05 secs [16:28:57] ottomata: this went great! [16:30:43] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 28: number_of_data_nodes: 28: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [16:34:58] COOL [16:35:08] so, manybubbles, status? [16:35:26] is one of the reinstalled hosts now the master? [16:35:52] ah, i see your email [16:40:33] (03CR) 10Dzahn: "dependency has been merged (thanks Alex). should not be used anymore now." [puppet] - 10https://gerrit.wikimedia.org/r/169571 (owner: 10Matanya) [16:42:19] (03CR) 10Dzahn: "Alex, thanks! and yea, agree, i think we all want to get rid of the old webserver class as well" [puppet] - 10https://gerrit.wikimedia.org/r/169561 (owner: 10Dzahn) [16:43:12] (03CR) 10JanZerebecki: [C: 031] "Yes, please." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [16:43:53] (03CR) 10Gage: [C: 031] "Actually it looks like stack traces are now in their own field called StackTrace. 
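On the `ifconfig | grep inet` copy/paste above: until es-tool takes hostnames, a couple of one-liners that print just the address it wants (the hostname is illustrative):

    dig +short elastic1014.eqiad.wmnet
    getent hosts elastic1014.eqiad.wmnet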
full_message seems to be just a copy of short_message, s" [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [16:46:39] (03CR) 10Dzahn: "when glancing at the bug it's reopened as "not yet fixed"? update?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [16:47:26] (03PS1) 10Reedy: Add mai to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/169736 [16:47:58] (03Abandoned) 10Reedy: Add mai to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/169736 (owner: 10Reedy) [16:48:33] (03CR) 10Reedy: [C: 031] Add 'mai' to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/169011 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [16:48:39] (03CR) 10Dzahn: "duplicate of.. lol. you're quicker" [dns] - 10https://gerrit.wikimedia.org/r/169736 (owner: 10Reedy) [16:49:48] (03PS2) 10Glaisher: Add 'mai' to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/169011 (https://bugzilla.wikimedia.org/72346) [16:50:39] (03CR) 10Dzahn: [C: 032] "the language is "Maithili" spoken in India and Nepal http://en.wikipedia.org/wiki/Maithili_language" [dns] - 10https://gerrit.wikimedia.org/r/169011 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [16:58:08] oo, manybubbles, i need to enable 1019 in pybal [16:58:10] s'ok to do so? [16:59:47] ottomata: regarding HT...i was only talking about 1017-19...the others I thought you did the other day [17:01:05] sorry, nope, will have to do those as we reinstall them [17:01:25] oh..okay [17:02:17] (03PS1) 10Reedy: Fix private typo [puppet] - 10https://gerrit.wikimedia.org/r/169744 [17:03:01] enabling hyperthreading? yay. [17:06:01] cmjohnson, ottomata: are you aware of enabled HT on any other nodes? [17:06:11] (03PS2) 10Ori.livneh: Re-enable hhvm beta feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [17:06:16] (03CR) 10Ori.livneh: [C: 032] Re-enable hhvm beta feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [17:06:32] (03Merged) 10jenkins-bot: Re-enable hhvm beta feature on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [17:07:24] !log ori Synchronized wmf-config/CommonSettings.php: I8dd62e2cc: Re-enable hhvm beta feature on Wikidata (duration: 00m 06s) [17:07:35] Logged the message, Master [17:08:03] (03PS2) 10Dzahn: analytics-privatedata-users - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/169744 (owner: 10Reedy) [17:08:12] (03CR) 10Dzahn: [C: 032] analytics-privatedata-users - Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/169744 (owner: 10Reedy) [17:08:43] ori: oh, nice (re wikidata+hhvm) [17:11:24] (03CR) 10GWicke: [C: 031] logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [17:13:27] (03CR) 10Subramanya Sastry: [C: 031] logstash: Drop full_message field from GELF messages [puppet] - 10https://gerrit.wikimedia.org/r/169727 (owner: 10BryanDavis) [17:14:26] ^d, should I enable 1019 in pybal? [17:16:50] csteipp: is that OAuth fix going to be lightning deployed, or will it await the train or somesuch? I'm hoping to get users testing my app today, if possible. [17:19:08] ragesoss: The new branch is being cut at 11 (40 mins?). It should have the fix, and mediawikiwiki is phase 0. 
[17:19:17] So should be resolved in about an hour :) [17:19:29] sweet [17:19:31] :) [17:21:46] PROBLEM - check if salt-minion is running on ocg1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:22:24] ragesoss: you might not know, but we moved the Thur deploy to Wed (starting this week) [17:22:57] greg-g: I was just looking at that. Serendipity! [17:23:06] :) [17:24:28] !log shutting down to replace ssds in elastic1002,1007,1014 [17:24:33] Logged the message, Master [17:29:04] PROBLEM - Host elastic1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:30:47] PROBLEM - Host elastic1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:31:23] PROBLEM - Host elastic1014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [17:31:27] ACKNOWLEDGEMENT - DPKG on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - Disk space on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - NTP on elastic1002 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - RAID on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:28] ACKNOWLEDGEMENT - SSH on elastic1002 is CRITICAL: Connection timed out Chris Johnson replacing ssds [17:31:29] ACKNOWLEDGEMENT - check configured eth on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:29] ACKNOWLEDGEMENT - check if dhclient is running on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:30] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:31:30] ACKNOWLEDGEMENT - puppet last run on elastic1002 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:04] ACKNOWLEDGEMENT - DPKG on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:04] ACKNOWLEDGEMENT - Disk space on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:04] ACKNOWLEDGEMENT - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 Chris Johnson replacing ssds [17:32:04] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.139:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health Chris Johnson replacing ssds [17:32:04] ACKNOWLEDGEMENT - NTP on elastic1007 is CRITICAL: NTP CRITICAL: No response from NTP server Chris Johnson replacing ssds [17:32:05] ACKNOWLEDGEMENT - RAID on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:05] ACKNOWLEDGEMENT - SSH on elastic1007 is CRITICAL: Connection timed out Chris Johnson replacing ssds [17:32:06] ACKNOWLEDGEMENT - check configured eth on elastic1007 is CRITICAL: 
Timeout while attempting connection Chris Johnson replacing ssds [17:32:06] ACKNOWLEDGEMENT - check if dhclient is running on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:07] ACKNOWLEDGEMENT - check if salt-minion is running on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:32:07] ACKNOWLEDGEMENT - puppet last run on elastic1007 is CRITICAL: Timeout while attempting connection Chris Johnson replacing ssds [17:35:14] PROBLEM - Disk space on ms-be3004 is CRITICAL: Timeout while attempting connection [17:39:43] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Puppet has 2 failures [17:39:43] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 3 failures [17:39:45] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 2 failures [17:39:45] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 3 failures [17:44:41] (03PS1) 10Glaisher: Initial configuration for Maithili Wikipedia (maiwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) [17:45:23] Glaisher: Are any more new wikis in the pipeline? [17:45:43] none that I'm aware of [17:50:22] (03CR) 10JanZerebecki: "@Dzahn: Yes, updated: https://bugzilla.wikimedia.org/show_bug.cgi?id=64415#c4" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169714 (owner: 10Aude) [17:54:23] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:55:33] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:56:24] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:56:34] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:00:05] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141029T1800). [18:00:22] * aude back [18:00:32] aude: Saw the cake? [18:01:24] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms [18:03:54] RECOVERY - Host elastic1014 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [18:04:35] mmmmm, cupcake! 
:D [18:05:26] Right :) [18:05:36] and the present is hhvm beta feature again :) [18:06:23] Already see hhvm edits happening :) [18:08:14] PROBLEM - Host elastic1002 is DOWN: CRITICAL - Plugin timed out after 15 seconds [18:08:34] RECOVERY - Host elastic1007 is UP: PING OK - Packet loss = 0%, RTA = 3.74 ms [18:09:29] the error logs look quiet [18:10:04] RECOVERY - Host elastic1002 is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [18:14:54] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:15:33] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [18:16:04] PROBLEM - Host elastic1014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [18:17:15] (03PS1) 10Reedy: add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169765 [18:17:17] (03PS1) 10Reedy: testwiki to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169766 [18:17:19] (03PS1) 10Reedy: wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169767 [18:17:21] (03PS1) 10Reedy: group0 to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169768 [18:17:50] (03CR) 10Reedy: [C: 032] add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169765 (owner: 10Reedy) [18:17:58] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169766 (owner: 10Reedy) [18:18:14] (03Merged) 10jenkins-bot: add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169765 (owner: 10Reedy) [18:18:22] (03Merged) 10jenkins-bot: testwiki to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169766 (owner: 10Reedy) [18:18:33] !log reedy Started scap: testwiki to 1.25wmf6 and build l10n cache [18:18:40] Logged the message, Master [18:21:24] RECOVERY - Host elastic1014 is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [18:28:05] PROBLEM - Host elastic1014 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:54] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:29:34] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [18:29:48] sync-common: 10% (ok: 24; fail: 0; left: 205) [18:31:54] ottomata: elastic1002/7/14 has base install, HT turned on. old puppet certs and salt-keys removed. [18:33:14] RECOVERY - Host elastic1014 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [18:33:31] cool, ok, ^d, shall I keep going? [18:35:40] Reedy: taking a while, eh? [18:35:52] 17 mins so far [18:35:57] sync-common: 65% (ok: 149; fail: 0; left: 80) [18:42:04] <^d> ottomata: 2, 7 and 14? I think so sure. 
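A sketch of what "the error logs look quiet" typically means checking here; the log host and file names are assumptions rather than something stated in the discussion:

    # on the central MediaWiki log host (assumed layout: /a/mw-log/*.log)
    tail -f /a/mw-log/fatal.log /a/mw-log/hhvm.log

    # or a quick count of new fatals since the config was synced
    grep -c 'Fatal error' /a/mw-log/hhvm.log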
[18:42:48] ja, 2, 7 14 [18:42:51] on it [18:47:04] !log reedy Finished scap: testwiki to 1.25wmf6 and build l10n cache (duration: 28m 30s) [18:47:10] Logged the message, Master [18:54:13] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169767 (owner: 10Reedy) [18:54:21] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169767 (owner: 10Reedy) [18:57:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf5 [18:57:23] Logged the message, Master [18:57:54] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169768 (owner: 10Reedy) [18:58:10] (03Merged) 10jenkins-bot: group0 to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169768 (owner: 10Reedy) [18:58:35] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf6 [18:58:40] Logged the message, Master [19:03:04] PROBLEM - puppet last run on elastic1007 is CRITICAL: CRITICAL: Puppet has 1 failures [19:03:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [19:03:55] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: Puppet has 2 failures [19:03:55] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.139:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:04:32] <^d> icinga-wm: go away, we know. [19:04:46] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.109 [19:04:47] * YuviPanda should implement a replacement for ircecho one of these days [19:04:55] PROBLEM - ElasticSearch health check for shards on elastic1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.109:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [19:04:56] one that understands 'go away' [19:08:44] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: puppet fail [19:09:06] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [19:10:19] ack on both [19:10:53] ori: "hackish fix" vs. "fixed this" ? :) [19:11:14] "hackish fix" points to some kind of better fix, doesn't it? [19:11:14] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:11:38] paravoid: yes, brett self-assigned and made it high-prio. but it's a good enough fix for me for now. [19:13:29] why is it hackish? [19:14:38] <^d> ottomata: Where we at? :) [19:15:15] oop, getting distracted [19:15:33] paravoid: Hackish because Tim said so. :) -- https://phabricator.wikimedia.org/T820#16428 [19:15:41] oh, ^d, i should enable 1019 in pybal, yes? [19:15:41] I know, that's what I'm asking [19:15:42] paravoid: because the way the extension was written, it's supposed to be create_node_object()'s responsibility to add the node to m_orphans [19:15:56] <^d> ottomata: Yeah, at some point. 
No rush on that part tho :) [19:16:02] not the caller's [19:16:16] but in this case the caller knows that create_node_object() won't, so it does anyway [19:16:24] ah [19:16:28] hm, ^d, i can do now, though, so it is the same as the others [19:16:36] <^d> okie dokie [19:16:54] RECOVERY - puppet last run on elastic1007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:17:22] <^d> Ok, I see 7 and 2 back [19:17:23] done [19:17:25] yeah, just now [19:17:28] !log upgraded HHVM to 3.3.0+dfsg1-1+wm1 [19:17:29] 1014 i couldnt log into [19:17:30] lemme try again [19:17:34] Logged the message, Master [19:17:35] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [19:17:35] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [19:17:45] RECOVERY - ElasticSearch health check for shards on elastic1002 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 30 [19:17:45] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:17:54] RECOVERY - ElasticSearch health check for shards on elastic1007 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 30, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 30 [19:18:02] ok then, I guess we can live with that [19:18:15] cmjohnson: is elastic1014 ok? [19:18:30] why...did it not come back up? [19:18:32] <^d> ottomata: Lets also merge https://gerrit.wikimedia.org/r/#/c/169725/ so we don't bring them back up as masters again. [19:18:32] I just wanted to be sure it didn't mean "hackish because it will leak under other circumstances" or something :) [19:18:39] paravoid: _joe_ and I were talking about re-basing our packages on 3.3.1 early next week, by which point i hope a proper fix for this will land [19:18:51] nod [19:18:51] <^d> (since we already did the master dance to 1/8/13) [19:19:04] I was hoping 3.3.1 would have a large portion of our patches as well, but meh :( [19:19:10] it does [19:19:31] did paul include them after all? [19:20:25] all of the ones that already landed in master, which was most of them [19:21:39] the pcre cache rewrite is in a bit of a limbo state: see https://reviews.facebook.net/D25515 for fascinating reading (actually fascinating, no sarcasm) [19:22:41] ^d, OOP, didn't realize that hadn't been done [19:22:50] (03PS2) 10Ottomata: Remove old elasticsearch masters [puppet] - 10https://gerrit.wikimedia.org/r/169725 (owner: 10Manybubbles) [19:23:04] <^d> Yeah if we do it now I can bounce those nodes again before we reload them with shards. 
[19:23:05] (03PS1) 10Reedy: Set a default for wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169779 (https://bugzilla.wikimedia.org/72525) [19:23:08] <^d> So they won't take back master. [19:23:12] Dereckson: ^^ [19:23:25] (03CR) 10Ottomata: [C: 032 V: 032] Remove old elasticsearch masters [puppet] - 10https://gerrit.wikimedia.org/r/169725 (owner: 10Manybubbles) [19:23:53] Thank you. [19:24:15] Dereckson: I thought we might aswell make it more obvious to view the ones that don't match the rest [19:24:15] ^d, running puppet there, then will bounce them [19:24:21] <^d> okie dokie [19:24:39] yeah, dunno what's up with 1014 though [19:24:45] cmjohnson: yeah, its not back [19:24:57] last i checked the console was busy too (but that was an hourish ago) [19:25:02] (03CR) 10Reedy: [C: 032] Set a default for wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169779 (https://bugzilla.wikimedia.org/72525) (owner: 10Reedy) [19:25:10] (03Merged) 10jenkins-bot: Set a default for wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169779 (https://bugzilla.wikimedia.org/72525) (owner: 10Reedy) [19:26:06] <^d> ottomata: Ok, I see them back and not master eligible. Fantastic! [19:26:36] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 15s) [19:26:43] Logged the message, Master [19:28:33] <^d> ottomata: Unbanned 02 and 07 so they can start taking shards again [19:28:52] great [19:30:20] ottomata: stuck in installer [19:30:31] will ping you once its fixed [19:30:39] (03PS1) 10Tpt: Revert "Set a default for wgProofreadPageNamespaceIds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169780 [19:31:12] <^d> cmjohnson: Ok, so we've got 3-6, 9-12 and 15-16 left. Each of those ranges are same rack right? We could do them in 3 batches then. [19:31:50] ^d yep....but we'll have to do tomorrow [19:32:12] <^d> Yeah, I'll start draining traffic from a group of those tonight though so they'll be ready by the morning. [19:32:19] <^d> Got any preference? [19:32:30] Reedy: we need to revert 169779, 250/252 are used for Page and Index for the new wikisources created for more than one year [19:32:42] (Tpt dixit) [19:33:31] (as the default value set by the extension) [19:33:48] no preference...just let me know [19:33:58] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [19:34:43] <^d> will do [19:36:22] <^d> cmjohnson, ottomata: I'm stepping away to grab lunch, back in ~15ish. All's quiet from our side right now. 
[19:36:43] okay 1014 will be up in a few mins [19:38:24] cool [19:38:32] i'll get 1014 up and then wait to do more til tomorrow [19:38:39] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [19:38:44] (03CR) 10Reedy: [C: 032] Revert "Set a default for wgProofreadPageNamespaceIds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169780 (owner: 10Tpt) [19:38:52] (03Merged) 10jenkins-bot: Revert "Set a default for wgProofreadPageNamespaceIds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169780 (owner: 10Tpt) [19:39:28] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 16s) [19:39:33] Logged the message, Master [19:42:57] ottomata: all yours [19:43:02] danke [19:43:25] (03PS1) 10Tpt: Adds an explicit default for $wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 [19:43:29] (03CR) 10jenkins-bot: [V: 04-1] Adds an explicit default for $wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 (owner: 10Tpt) [19:44:19] (03PS2) 10Tpt: Adds an explicit default for $wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 [19:45:13] (03CR) 10Reedy: [C: 032] Adds an explicit default for $wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 (owner: 10Tpt) [19:45:20] (03Merged) 10jenkins-bot: Adds an explicit default for $wgProofreadPageNamespaceIds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 (owner: 10Tpt) [19:45:54] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 16s) [19:47:53] (03CR) 10Dereckson: "Follow-up:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169779 (https://bugzilla.wikimedia.org/72525) (owner: 10Reedy) [19:48:27] (03CR) 10Dereckson: "Follow-up: Reverted commit fixed and merged in I27632aa04215c05c40c9a44f921e6e0a1aff319e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169780 (owner: 10Tpt) [19:48:32] <^d> nom nom nom [19:49:37] (03CR) 10Dereckson: "This commit fixes I27632aa04215c05c40c9a44f921e6e0a1aff319e after an emergency revert in I6b92c02a91b803a744e4b41519e3beb732aa4be0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169781 (owner: 10Tpt) [19:53:14] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:56:15] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [19:59:18] (03PS1) 10BryanDavis: Fix ip address for beta redis master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169789 [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141029T2000). [20:01:05] (03CR) 1020after4: "Can we even test this in labs? does labs use the same front-end proxy setup??" 
[puppet] - 10https://gerrit.wikimedia.org/r/168509 (owner: 1020after4) [20:01:10] on it jouncebot [20:01:37] (03CR) 10BryanDavis: "$ dig -x 10.68.16.146" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169789 (owner: 10BryanDavis) [20:04:54] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.11 [20:05:04] PROBLEM - ElasticSearch health check for shards on elastic1014 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.11:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [20:05:07] <^d> oh shove it icinga-wm [20:05:22] (03CR) 10John F. Lewis: [C: 031] Fix ip address for beta redis master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169789 (owner: 10BryanDavis) [20:06:04] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:06:04] RECOVERY - ElasticSearch health check for shards on elastic1014 is OK: OK - elasticsearch status production-search-eqiad: status: green, number_of_nodes: 31, unassigned_shards: 0, timed_out: False, active_primary_shards: 2033, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6094, initializing_shards: 0, number_of_data_nodes: 31 [20:07:09] !log updated Parsoid to version 4e21bdb6fccc377468fd3d1cbc656fb64464ea78 [20:07:15] Logged the message, Master [20:08:02] <^d> ottomata: 1014 looks good [20:08:12] arlolra, that version should be the parsoid repo version not the deploy repo version. [20:08:18] cool [20:08:20] you can edit the server admin log directly. [20:08:25] :( [20:08:46] https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:46] ok fixed [20:10:19] <^d> ottomata: Ok, I think we're all done for today :) I'll go ahead and ban 3-6 now for tomorrow morning. [20:10:23] <^d> Thanks for all your help!! [20:10:46] cool, sounds good [20:14:52] Anybody on tin with time to do a prod no-op merge for me to fix a beta bug? https://gerrit.wikimedia.org/r/#/c/169789/ [20:15:26] (03CR) 10Reedy: [C: 032] Fix ip address for beta redis master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169789 (owner: 10BryanDavis) [20:15:35] (03Merged) 10jenkins-bot: Fix ip address for beta redis master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169789 (owner: 10BryanDavis) [20:15:48] thx Reedy [20:16:07] !log reedy Synchronized wmf-config/mc-labs.php: noop for prod (duration: 00m 17s) [20:16:13] Logged the message, Master [20:17:44] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:25] PROBLEM - Parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:50] ^ subbu [20:19:03] oh.. hmm .. [20:19:24] i'd check ganglia to see if they're overloaded [20:19:34] they are not. [20:19:54] i have that page open .. looks like they maybe stuck (transient)? .. will watch. [20:20:08] something funky happened, though. from the graphs it looks like parsoid may have been restarted a little bit ago? [20:20:18] we deployed new code .. [20:20:25] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [20:20:38] that is what we are monitoring now .. added timeouts to kill stuck processes .. but, we may have to revert that. 
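Spelled out, the manual revert path used above (git deploy has no revert subcommand, so the previous state is checked out and re-synced); the repository path is assumed and the tag is a placeholder, not the actual one used:

    # on tin, in the Parsoid deploy repository (path assumed)
    cd /srv/deployment/parsoid/deploy
    git deploy start
    git checkout <previous-deploy-tag>
    git deploy sync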
[20:21:01] that's a third parsoid host; i'd revert now and ask questions later :P [20:21:12] arlolra, let us revert. [20:21:18] k [20:23:26] gwicke, ori, git deploy revert doesn't work? [20:23:41] deploy: error: argument : invalid choice: 'revert' (choose from 'abort', 'finish', 'help', 'report', 'service', 'start', 'sync') [20:24:13] bd808, ^ [20:24:34] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:37] i guess i will revert the manual way by checking out the old reversion and syncing. [20:24:57] subbu: yeah that's what you do. There will be an old tag you can checkout [20:25:06] subbu: I'd assume that is how it was supposed to be reverted [20:25:12] poor parsoids ... dying a slow death [20:25:13] git deploy start; git checkout ; git deploy sync [20:26:46] yes. syncing right now. [20:27:44] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.006 second response time [20:28:14] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.017 second response time [20:28:39] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.022 second response time [20:28:46] !log reverted parsoid to version 617e9e61b625f25d79dfaab08830c396537be632 (due to stuck processes) [20:28:48] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.029 second response time [20:28:54] Logged the message, Master [20:33:01] springle: lots of ' Unknown database ' errors spamming the logs for various servers and dbs [20:43:09] bd808, our jenkins job for php parser tests is failing with PHP Fatal error: Interface 'Psr\Log\LoggerInterface' not found in /srv/ssd/jenkins-slave/workspace/parsoidsvc-php-parsertests/src/mediawiki/core/includes/debug/logger/Logger.php on line 46 .. [20:43:43] i see a wikitech-l thread about it now. [20:43:55] subbu: It needs to clone the mediawiki/vendor repo too. Wikidata is having similar issues [20:44:32] * bd808 is trying to fix too many things at once [20:44:55] subbu: What's the job name? I'll look at the the JJB config [20:45:04] one sec .. let me find it. [20:45:19] parsoidsvc-php-parsertests [20:46:27] ack. custom to the max [20:48:52] subbu: You need to add cloning of mediawiki/vendor after /srv/deployment/integration/slave-scripts/bin/mw-core-get.sh is run. That will fix is for now. YOu should open a bug for hashar too because that will need more fixes in the future I fear. [20:49:29] * bd808 will actually open a meta bug about these errors [20:53:14] ok .. let me find the repo .. [20:54:18] subbu: https://bugzilla.wikimedia.org/show_bug.cgi?id=72700 to track the problem [20:56:51] (03CR) 10Dzahn: [C: 04-1] "/srv/mediawiki/common/ does not exist on terbium" [software] - 10https://gerrit.wikimedia.org/r/163769 (owner: 10Reedy) [20:58:21] thanks. [20:59:10] bd808, cd $MW_INSTALL_PATH; git clone https://git.wikimedia.org/git/mediawiki/vendor.git will do it? or is there a different dir within the install? [20:59:38] (03CR) 10GWicke: "Will these paths work with the Jenkins update stuff? I don't know much about how that works; Antoine should know." [puppet] - 10https://gerrit.wikimedia.org/r/169622 (owner: 10Catrope) [21:00:04] yurik: Dear anthropoid, the time has come. Please deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141029T2100). 
[21:00:59] (03PS2) 10Reedy: Swap to /srv/mediawiki [software] - 10https://gerrit.wikimedia.org/r/163769 [21:01:27] (03PS3) 10Reedy: Swap to /srv/mediawiki [software] - 10https://gerrit.wikimedia.org/r/163769 [21:02:44] (03PS1) 10Reedy: Write ganglia temp file to /tmp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 [21:03:06] (03CR) 10Dzahn: [C: 032] "yep, merging because the files exist in that place. doesn't mean i understand why we have dblist in 2 places and none of them is puppet. i" [software] - 10https://gerrit.wikimedia.org/r/163769 (owner: 10Reedy) [21:06:28] (03CR) 10Dzahn: [C: 031] "if that solves the caching on "dbtree" which is slow now.. yes, please" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169866 (owner: 10Reedy) [21:11:09] subbu: The clone should do it. You want to end up with mediawiki/vendor.git cloned to $IP/vendor [21:11:56] bd808, i also see a lot of chatter in #qa .. so, if the scripts get fixed, that wll do it .. should i wait for that to happen? [21:14:43] ^d, around? [21:14:56] <^d> What's up? [21:15:18] ^d, was looking at your email re updating ext, not sure what i was doing wrong wrt git patching [21:16:53] <^d> What was missing was a change to mediawiki/core on the corresponding wmf/* branch for the submodule update. [21:18:10] ^d, i do these steps: git co wmf/1.25wmf6 && git add extension/... && git commit && git review. On tin i do git pull && git submodule update extension/... [21:18:29] subbu: If you can wait a bit hopefully we will magically fix things [21:18:32] <^d> Weird, that sounds right. [21:18:49] ^d, the extension is on master branch though [21:19:10] <^d> That shouldn't matter too much. [21:19:16] <^d> Maybe it's just wmf3 that was messed up? [21:19:18] <^d> https://phabricator.wikimedia.org/P48 [21:19:29] bd808, yes, that works. i like magic. [21:19:50] ^d, sec, about to commit a new patch, you can check [21:29:23] yurikR: heya, so, given the timing (and zerowiki being in phase0), is there a reason we shouldn't just have you ride the train for your code updates? [21:30:00] pushing patches out to an hours old branch seems... weird in this case [21:31:01] * greg-g goes into the last 1:1 of the day.... [21:33:17] greg-g, let me figure out this release first. I'm not really against the train ride, might be worth a try [21:39:15] (03CR) 10Dzahn: [C: 032] firewall: remove, unused [puppet] - 10https://gerrit.wikimedia.org/r/169571 (owner: 10Matanya) [21:41:40] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [21:50:13] (03CR) 10Dzahn: "i would just do it at this point" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [21:51:27] (03CR) 10Dzahn: "status unclear. +1 or not?" [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [21:51:31] (03CR) 10Reedy: "Just my comment to fix up, and to a static array as per Timo I guess" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [21:52:16] (03CR) 10Dzahn: "do it" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [21:53:14] (03CR) 10Dzahn: "removing self" [puppet] - 10https://gerrit.wikimedia.org/r/117698 (owner: 10Matanya) [21:53:45] (03CR) 10Dzahn: "should probably have comment from Coren now" [puppet] - 10https://gerrit.wikimedia.org/r/111387 (owner: 10Jeremyb) [21:54:25] (03CR) 10Mark Bergsma: [C: 04-1] "I don't like the mw* glob used on this, we need something more sophisticated than that." 
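To make the submodule exchange above concrete, the core-side bump on the deployment branch looks roughly like this; the extension name and commit are placeholders, not taken from the discussion:

    # in a mediawiki/core checkout, on the deployment branch
    git checkout wmf/1.25wmf6
    (cd extensions/SomeExtension && git fetch && git checkout <wanted-commit>)
    git add extensions/SomeExtension      # stages the new submodule pointer
    git commit -m "Update SomeExtension to <wanted-commit> for wmf/1.25wmf6"
    git review
    # on tin afterwards: git pull && git submodule update extensions/SomeExtension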
[puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [21:54:27] (03CR) 10Dzahn: "nothing will ever happen without a corresponding ticket" [puppet] - 10https://gerrit.wikimedia.org/r/122621 (owner: 10Reedy) [21:59:40] (03PS1) 10Ori.livneh: Add tmpreaper module w/ tmpreaper::reap resource [puppet] - 10https://gerrit.wikimedia.org/r/169935 [21:59:46] ^ paravoid, fyi [22:00:50] (03PS3) 10Reedy: Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 [22:01:00] (03CR) 10jenkins-bot: [V: 04-1] Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:01:02] (03CR) 10Reedy: Allow faux-renaming/database remapping (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:01:35] (03PS4) 10Reedy: Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 [22:01:43] (03CR) 10jenkins-bot: [V: 04-1] Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:02:31] (03PS5) 10Reedy: Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 [22:04:43] !log git-deploy: Deploying integration/slave-scripts a6a23ac1ec [22:04:50] Logged the message, Master [22:05:00] bd808: ^ [22:05:10] yurikR: everything ok? I just saw https://gerrit.wikimedia.org/r/#/c/169928/ [22:05:42] greg-g, i just realized that you were absolutely right - there is no point to release because it just got released :) [22:05:50] :) :) [22:05:57] Krinkle: Running https://integration.wikimedia.org/ci/job/mwext-Wikibase-client-tests/7216/console [22:06:02] I don't get told that very often. I'm going to quote you on that. [22:06:05] :P [22:06:07] and it failed :( [22:06:41] greg-g, bask in the glory, it won't last ;) Unless i was +2ing some additional stuff into master at the last moment, it should work otk [22:06:41] ok [22:06:48] oh. new script not there yet. [22:07:09] yurikR: /me nods [22:07:18] bd808: That one runs in wikidata-jenkins2 in labs, won't be updated until puppet runs [22:07:25] aude: How do you update scripts on wikidata-jenkins2? [22:07:30] yurikR: you can of course do SWAT deploys as needed, but, let me know what you think about next week (if you want this window still) [22:07:33] uhhh [22:07:35] aude: bd808: don't [22:07:45] Well, maybe with root. [22:08:00] let me verify in prod first since those are already deployed [22:08:10] https://integration.wikimedia.org/ci/job/parsoidsvc-php-parsertests/2773/console [22:08:19] https://integration.wikimedia.org/ci/job/parsoidsvc-php-parsertests/2776/console [22:08:21] passes :) [22:08:39] bd808: labs slaves don't get synced from prod git-deploy. [22:08:41] i would have thought puppet runs regularly on the instances [22:08:45] 30 min [22:08:51] it was merged 5min ago [22:09:03] never had to do anything manually, although i didn't set them up [22:09:12] I forgot that they use a puppet hack instead of trebuchet [22:09:14] (03CR) 10Faidon Liambotis: [C: 04-1] "Why do this and have the directories be cleaned during puppet runs, instead of relying on the package's support for that (cron.daily, /etc" [puppet] - 10https://gerrit.wikimedia.org/r/169935 (owner: 10Ori.livneh) [22:09:48] Hm.. wikidata-jenkins2 are not part of the integration project [22:09:56] so in that case I don't know. could be anything. [22:10:28] wikidata custom stuff..
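For context on r/169935 and the objection to it: tmpreaper deletes files that have not been touched within a given age. A rough sketch of the kind of explicit invocation a puppet-managed tmpreaper::reap resource would wrap, as opposed to relying on the package's own daily cron job; the directory and age here are illustrative, not taken from the patch:

    tmpreaper --test --mtime 7d /var/tmp               # dry run: list files not modified in the last 7 days
    tmpreaper --mtime --protect '*.lock' 7d /var/tmp   # actually delete them, sparing lock files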
[22:10:34] But \o/ for copy-n-paste + educated guesses :) [22:10:44] Krinkle: i think we use the same puppet classes etc. [22:10:47] OK [22:10:50] I can't access them though [22:10:57] we can add you [22:11:11] #331376 {main} [22:11:12] wow [22:11:17] Can you grant me ssh into those instances? They are linked to jenkins prod instance for slave launch. I can access it via that but would rather do it the right way. [22:11:19] largest stacktrace, ever [22:11:27] !log Re-running setZoneAccess.php for swift [22:11:33] Logged the message, Master [22:11:34] hoo: ??? [22:11:38] rerunning puppet manually on other slaves now [22:11:43] aude: renumber thing [22:11:43] (03PS6) 10Reedy: Allow faux-renaming/database remapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 [22:11:45] (03PS1) 10Reedy: Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 [22:11:46] it recursed that often [22:11:49] aaaah [22:11:56] (03CR) 10jenkins-bot: [V: 04-1] Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 (owner: 10Reedy) [22:12:03] subbu: Your test is green again. :) [22:12:49] thanks! :) [22:12:51] Krinkle: granted [22:13:25] they are marked as puppet stale [22:13:33] probably means we have to update manually :( [22:13:55] puppet stale usually means a local puppetmaster [22:14:23] aude: bd808: puppet natural run just started a few seconds ago [22:14:25] I'll let it finish :) [22:14:28] Should be good after that [22:14:28] (03PS2) 10Reedy: Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 [22:14:31] ok [22:14:44] aude: looks good so far regarding Q30 [22:14:50] hoo: yay [22:15:07] Will look again tomorrow and close the bug if no new things pop up [22:15:30] But still... a 300k+ line stack trace is massive :D [22:15:57] still a lot of GC cache entry warnings [22:15:59] https://tools.wmflabs.org/nagf/?project=wikidata-build [22:16:00] not good [22:16:19] aude: What exactly do you mean? [22:16:49] hitting gc [22:16:49] aude: bd808: https://integration.wikimedia.org/ci/job/mwext-Wikibase-client-tests/7217/console [22:17:16] nothing bad should happen with gc, but still shouldn't be hitting it so often [22:17:26] Krinkle: sweet. [22:17:40] Krinkle: looking good [22:23:42] (03CR) 10Reedy: "So this is now just the implementation. Taken the renaming of chapcomwiki to affcomwiki into a dependant patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [22:25:12] (03PS1) 10Reedy: chapcomwiki -> affcomwiki [puppet] - 10https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) [22:25:19] (03PS3) 10Reedy: Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 (https://bugzilla.wikimedia.org/39482) [22:48:37] (03PS1) 10Dzahn: dynamicproxy - disabled SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169949 [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem, RoanKattouw: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141029T2300). Please do the needful. [23:00:26] I'll do it [23:00:41] !log restarting nginx on cp1044 [23:00:47] Logged the message, Master [23:05:29] (03CR) 10John F. Lewis: [C: 031] "Looks good but, topic change perhaps?"
[puppet] - 10https://gerrit.wikimedia.org/r/169949 (owner: 10Dzahn) [23:06:05] (03PS2) 10Dzahn: dynamicproxy - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/169949 [23:07:16] (03CR) 10JanZerebecki: "To avoid any doubt: My +1 still stands." [puppet] - 10https://gerrit.wikimedia.org/r/166406 (owner: 10Christopher Johnson (WMDE)) [23:13:05] !log catrope Synchronized php-1.25wmf6/extensions/VisualEditor: SWAT (duration: 00m 04s) [23:13:11] Logged the message, Master [23:13:13] MaxSem: All yours [23:13:49] (03CR) 10Ori.livneh: "@paravoid: /etc/cron.daily/tmpreaper isn't great: it forces --ctime, --mtime-dir, and --symlinks on you, and it doesn't let you specify di" [puppet] - 10https://gerrit.wikimedia.org/r/169935 (owner: 10Ori.livneh) [23:35:24] !log maxsem Synchronized php-1.25wmf5/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:35:30] Logged the message, Master [23:38:54] !log maxsem Synchronized php-1.25wmf6/extensions/MobileFrontend/: (no message) (duration: 00m 07s) [23:38:59] Logged the message, Master [23:39:41] * greg-g glares at "(no message)"
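On the dynamicproxy SSLv3 change (r/169949), a quick handshake test of the kind one might run against the proxy once the patch is live; the hostname below is a placeholder, not a real endpoint from the log:

    openssl s_client -connect proxy.example.org:443 -ssl3 < /dev/null   # should fail to negotiate once SSLv3 is disabled
    openssl s_client -connect proxy.example.org:443 -tls1 < /dev/null   # should still succeed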
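The "(no message)" entries greg-g glares at come from syncing without a log message; assuming the syncs were done with scap's sync-dir helper, the difference is only the optional message argument, roughly:

    sync-dir php-1.25wmf6/extensions/MobileFrontend 'Update MobileFrontend (SWAT)'   # logged with the given message
    sync-dir php-1.25wmf6/extensions/MobileFrontend                                  # logged as "(no message)"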