[00:00:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T0000). Please do the needful. [00:00:04] no, don't merge it right before you go to sleep [00:00:10] check the notifications commands for now? [00:00:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:00:17] I'm not going to sleep at least for another 30m [00:00:20] yeah, doing that now [00:00:21] let's talk more about the IRC bot before we keep reinventing it [00:00:29] that's a pattern [00:00:36] puppet's running on neon [00:00:41] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:00:42] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:00:43] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:00:50] i'll look at those ^ [00:01:08] I don't think we re-invented IRC, tbh. They're all different use cases for different tools... [00:01:14] (03PS4) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:01:43] icinga-wm uses ircecho, grrrit-wm uses nodejs because it needed async code, wikibugs uses python because valshallaw is better than me in writing async python code :) [00:01:54] SWAT is empty :) [00:02:09] so maybe that means we shouldn't keep writing new ones from scratch but use a stable one [00:02:19] what exactly are you proposing? [00:03:54] (03PS5) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:04:11] one irc bot to rule them all? [00:04:31] chasemp: I vaguely remember you talking about it :) [00:04:48] grrrit-wm will be subsumed by current wikibugs2 when we migrate to phab. [00:04:55] to rule most I think is the best I can forsee at this time [00:05:04] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:05:09] there is jouncebot and who knows what else [00:05:14] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [00:05:19] I wouldn't be surprised by a dozen irc bots [00:05:29] (03PS1) 10Ori.livneh: Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 [00:05:31] well, how exactly would you integrate jouncebot into wikibugs2? [00:05:33] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:05:34] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:05:34] (03PS6) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:05:35] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:05:36] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:05:36] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:05:39] having one tool do 500 things isn't good either [00:06:10] maybe I don't understand it well enough, what does jouncebot do [00:06:14] that would be hard to integrate [00:06:25] reads a wikitech page, announces things [00:06:56] in my mind that's a pretty simple use case for inclusion [00:07:07] into where? 
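[editor's note] Jouncebot's job, as described just above, is tiny: read a wikitech page and announce what it finds. A toy sketch of that shape, assuming the page title, API endpoint and filter string — this is not jouncebot's actual code:

```python
# Toy sketch only: fetch a wikitech page and "announce" it, roughly what
# jouncebot is described as doing above. Page title, API endpoint and the
# "deploycal-item" filter are assumptions for illustration.
import requests

API = "https://wikitech.wikimedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch raw wikitext of a page via the MediaWiki API."""
    resp = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    })
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]["*"]

def announce(lines):
    # A real bot would push these to IRC; printing stands in for that here.
    for line in lines:
        print(line)

if __name__ == "__main__":
    text = fetch_wikitext("Deployments")
    # Naive filter for lines that look like deployment-window entries.
    announce(l for l in text.splitlines() if "deploycal-item" in l)
```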
[00:07:10] but the wall for me came when people declared they like the lots-of-bots [00:07:25] hmm, actually [00:07:34] logmsgbot + friends should probably be integrated together [00:07:57] (03CR) 10Rush: [C: 032] phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 (owner: 10Rush) [00:08:41] the idea that every service should have it's own irc bot I find weird [00:08:44] but that may be just me [00:09:04] it's mostly a historical artifact, and I agree that putting things into one is a good idea. [00:09:21] it's just that... I don't have the time to do it. [00:09:40] yep I think it's never a priority, which meh, is what it is [00:09:43] yeah [00:10:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:10:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:10:14] I think all the current bots were also written on volunteer time, and there it's always simpler to write a bot than a overarching bot framework that can take in things from everywhere. [00:10:20] YuviPanda: there's a bug somewhere for combining logmsgbot + morebots because they would get separated due to netsplits. [00:10:31] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:10:31] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:10:32] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:10:33] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 242 seconds ago with 0 failures [00:10:39] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:10:41] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:10:53] legoktm: yup, what we should actually do is to have one 'Events pipeline' for everything in prod and plug all IRC bots from prod into that one [00:10:58] and have it present itself as one bot [00:11:01] yes [00:11:03] that also logs to SAL, etc. [00:11:05] or at least my thought as well [00:11:12] with very nice rulesets as to what goes where [00:11:15] and things just report json to it [00:11:20] and it figures out how to put them where [00:11:29] <^demon|lunch> manybubbles: I've got it. [00:12:00] YuviPanda: I think logmsgbot and morebots are the only prod IRC bots. (excluding irc.wm.o) [00:12:09] legoktm: true. [00:12:25] chasemp: mutante I think we also prefer to keep bots that don't need to be running in prod not running in prod... [00:12:26] Is morebots even in prod still? [00:12:28] so more people can fix them [00:12:41] legoktm: well, if it's noting down changes from the commandline in prod... [00:12:50] morebots is actually in labs [00:12:50] I am a logbot running on tools-exec-14. [00:12:50] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [00:12:50] To log a message, type !log . [00:12:53] :) [00:12:55] :) [00:13:00] logmsgbot is the only one in prod then [00:13:03] ah, right [00:13:07] https://wikitech.wikimedia.org/wiki/Morebots [00:13:48] Aaand logmsgbot is ircecho, which you want to use right? 
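[editor's note] The "events pipeline" idea floated just above — services report JSON to one place, a single bot works out which channel it goes to and logs to SAL — gets fleshed out later in this log as http -> redis -> irc. A minimal sketch of the consuming half (redis queue to IRC), with the queue name, field names and the IRC send all assumed for illustration; the real code ends up in the ircnotifier repo linked further down:

```python
# Minimal sketch of the redis -> IRC half of the pipeline discussed here.
# Field names (from/channels/message), the queue name and the send_to_irc
# stub are assumptions; the real bot uses an IRC library (irc3).
import json
import redis

QUEUE = "irc-notify"          # assumed list name, must match the producer
r = redis.StrictRedis(host="localhost", port=6379, db=0)

def send_to_irc(channel, text):
    """Placeholder: a real relay hands this line to its IRC connection."""
    print("[{}] {}".format(channel, text))

def relay_forever():
    while True:
        # Blocking pop: wait until a producer LPUSHes an event
        # (the lpush + blocking-pop pattern mentioned further down).
        _, raw = r.brpop(QUEUE)
        event = json.loads(raw.decode("utf-8"))
        line = "{}: {}".format(event["from"], event["message"])
        for channel in event.get("channels", []):
            send_to_irc(channel, line)

if __name__ == "__main__":
    relay_forever()
```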
[00:13:59] :P [00:15:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:15:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:15:12] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:15:13] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:15:18] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:15:18] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:15:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:15:20] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:16:11] legoktm: :P [00:16:23] legoktm: if we build this pie-in-the-sky 'event-notification service' then we'll just send json to it [00:16:37] oh, icinga-wm is in prod. [00:18:33] legoktm: yeah, it is. shinken-wm will be in labs tho [00:20:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:20:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:20:33] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:20:34] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:20:35] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:20:36] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:20:36] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:20:37] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:20:38] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:22:25] (03PS1) 10Rush: phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 [00:22:34] (03CR) 10jenkins-bot: [V: 04-1] phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 (owner: 10Rush) [00:22:40] (03PS2) 10Rush: phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 [00:23:53] (03CR) 10Rush: [C: 032] phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 (owner: 10Rush) [00:24:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:25:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:25:38] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:25:39] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:25:40] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:25:41] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 290 seconds ago with 0 failures [00:25:43] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [00:25:43] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:25:44] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:25:45] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:27:06] mutante: chasemp fwiw, me and legoktm just started brainstorming writing this 'event notification pipeline' for labs :) [00:28:17] awesome :) [00:28:37] realized it's going to be super simple, and we already have salvageable parts [00:29:26] redis pubsub was my previous approach [00:29:35] yeah [00:29:40] we'll have http -> redis -> irc [00:29:47] with simple auth tokens [00:30:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:30:03] so 
services hit the http service with a simple json (from, token, channels, message) [00:30:07] then it puts them in redis [00:30:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:30:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:30:12] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:30:13] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:30:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [00:30:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:30:15] and then another script just puts them up on irc [00:30:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:30:16] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:30:22] why not just have services put them in redis directly? [00:30:24] chasemp: I usually prefer lpush + rbpop [00:30:25] why the http frontend? [00:30:27] chasemp: authentication [00:30:32] this is labs [00:30:36] oh labs [00:30:36] ah [00:31:19] chasemp: wikibugs and wikibugs2 is already organized similarly, there's redis subscriber + irc pusher, so we'll reuse it [00:31:24] chasemp: and then a simple flask app should take care of the rest. [00:31:37] chasemp: biggest problem now is getting me and legoktm to agree on a name :) [00:31:52] ircnotifier [00:32:17] if you want full pedantic compliance, yaib [00:32:30] phpwikibot2 [00:32:40] is it written in php? [00:32:42] nope [00:32:43] :D [00:32:49] troll yuvi [00:32:58] (03PS2) 10Springle: m2-master switch to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/172498 [00:33:04] I mean, even if the code quality ends up being bad, people can still be relieved it's not php [00:33:23] <^d> botoid? [00:33:36] we're going to call the project ircnotifier, since it's not wiki specific [00:34:12] and then we can bikeshed the bot name [00:34:14] https://github.com/yuvipanda/ircnotifier [00:34:16] legoktm: ^ [00:34:25] okay [00:34:36] (03CR) 10Springle: [C: 032] m2-master switch to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/172498 (owner: 10Springle) [00:34:45] legoktm: want me to steal the code from wikibugs or wikibugs2? [00:34:50] 2 [00:34:51] legoktm: wikibugs2, I think [00:34:52] yeah [00:35:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:35:10] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:35:34] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:35:35] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:35:36] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:35:37] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 293 seconds ago with 0 failures [00:35:37] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 300 seconds ago with 0 failures [00:35:38] legoktm: python3? :) [00:35:42] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 278 seconds ago with 0 failures [00:35:43] of course [00:35:43] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:37:08] legoktm: :D [00:37:09] (03PS2) 10Ori.livneh: Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 [00:37:48] (03CR) 10Ori.livneh: "The key first needs to be generated and added to the private repository. See RT 8857." 
[puppet] - 10https://gerrit.wikimedia.org/r/172919 (owner: 10Ori.livneh) [00:38:54] legoktm: can't co-routines be instance methods? [00:38:59] legoktm: why is redisrunner a function? [00:39:05] they probably can be. [00:39:11] I was trying to keep it similar to wikibugs [00:39:43] tch tch [00:40:04] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:40:21] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:40:21] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:40:22] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:40:23] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:40:24] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:45:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:45:12] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:45:26] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:45:27] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:45:27] legoktm: initial code :) https://github.com/yuvipanda/ircnotifier [00:45:27] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:45:28] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:45:34] legoktm: has one hugeass FIXME tho ;) [00:46:07] lol [00:46:09] !log thulium - Could not intern from pson: expected value in object at '"[PHP]\n\n; puppet:t'! [00:46:13] Logged the message, Master [00:47:19] legoktm: ;) [00:48:14] irc3 probably has a method to join a channel if you're not in it already [00:49:27] legoktm: yeah [00:49:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:50:09] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:50:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:50:18] something is up with fundraising [00:50:23] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:50:24] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:50:24] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:50:27] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:50:37] ori: yeah, mutante is looking into it [00:51:28] earlier the puppet certs on fr master have been deleted accidentally [00:51:34] they have all been resigned already [00:51:43] the error on thulium i saw last is now gone [00:51:49] notice: Finished catalog run in 3.65 seconds [00:51:57] BUT.. it's suspiciously fast? [00:52:59] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 346 seconds [00:53:00] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 348 seconds [00:53:11] let's call Jeff? [00:53:17] it's not too late on the east coast [00:54:36] eek [00:54:49] has someone called already? 
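[editor's note] On the "can't co-routines be instance methods?" question from earlier in this stretch (00:38): they can. A tiny standalone illustration in modern async/await syntax (the 2014-era wikibugs code used asyncio's generator-based coroutines); nothing here is taken from the actual wikibugs/ircnotifier source:

```python
# Tiny illustration that a coroutine can be a plain instance method.
import asyncio

class Relay:
    def __init__(self, name):
        self.name = name
        self.queue = asyncio.Queue()

    async def producer(self):
        for i in range(3):
            await self.queue.put("event {}".format(i))

    async def consumer(self):
        for _ in range(3):
            item = await self.queue.get()
            print(self.name, "got", item)

async def main():
    relay = Relay("demo")
    await asyncio.gather(relay.producer(), relay.consumer())

asyncio.run(main())
```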
[00:54:51] other jeff [00:54:52] I haven't [00:54:54] * jgage looks up his number [00:55:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:55:03] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 227 seconds ago with 0 failures [00:55:14] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [00:55:15] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:55:16] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:55:16] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:55:17] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:55:24] got it, calling.. [00:55:26] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:55:38] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:55:57] it's 3am here, I can't be of much help [00:56:03] err: Could not retrieve catalog from remote server: Could not intern from pson: expected value in object at '"# puppet/templates/'! [00:56:09] not thinking very clearly [00:56:15] this is a bit random now [00:56:18] paravoid: i never let that stop me! [00:56:20] (j/k) [00:56:21] one run finishes, the next does that [00:56:38] ori: well you know, I do that too, but I woke up early today :) [00:57:11] left him a voicemail [01:00:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:00:11] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:00:12] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:00:13] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [01:00:17] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:00:19] notice: Finished catalog run in 5.22 seconds [01:02:21] * Starting puppet master [fail] [01:02:34] * master is running [01:04:57] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:05:16] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 296 seconds ago with 0 failures [01:05:17] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:05:17] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 250 seconds ago with 0 failures [01:05:20] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 92 seconds ago with 0 failures [01:06:07] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.48 [01:08:31] hm [01:08:32] java 46746 elasticsearch 448u IPv6 68528286 0t0 TCP *:9300 (LISTEN) [01:08:35] java 46746 elasticsearch 1376u IPv6 68417825 0t0 TCP *:9200 (LISTEN) [01:09:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:10:24] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:10:25] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:10:26] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:10:37] it compiles catalogs for nodes. payments1001-1003 seem ok, payments1004 is not, wth [01:11:46] elasticsearch is yellow but appears to be recovering, i'll keep an eye on it [01:14:52] paravoid: yt? 
[01:15:03] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:15:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:15:18] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:15:19] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 67 seconds ago with 0 failures [01:16:44] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:17:01] stat1001, that's not even related.. come on [01:18:24] (03PS1) 10Springle: Revert "m2-master switch to dbproxy1002". All services switched cleanly except eventlogging, which needs firewall changes. So let them switch back and try another day. [dns] - 10https://gerrit.wikimedia.org/r/172926 [01:18:28] * YuviPanda checks on stat1001 [01:18:34] it runs fine :p [01:18:55] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:19:06] (03CR) 10Springle: [C: 032] Revert "m2-master switch to dbproxy1002". All services switched cleanly except eventlogging, which needs firewall changes. So let them switc [dns] - 10https://gerrit.wikimedia.org/r/172926 (owner: 10Springle) [01:19:58] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:20:11] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:20:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:21:24] hi [01:21:29] :) [01:22:04] chasemp: ^ redis relay is done. working on the http / auth part now [01:25:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:25:19] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:25:20] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:25:21] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:26:46] Jeff_Green: hey, so it's inconsistent a bit, sometimes they run and sometimes they dont [01:26:52] maybe stored configs? [01:27:12] stored configs? [01:27:21] on the puppet master [01:27:36] see failures like this odd one: [01:27:43] there's nothing fancy going on, just straight puppet lameness [01:27:57] Could not intern from pson: expected value in object at '"# puppet/templates/'! [01:28:21] or [01:28:22] ugh [01:28:25] puppet. [01:28:25] Could not intern from pson: expected value in object at '"[PHP]\n\n; puppet:t'! [01:28:39] but then, run it multiple times, and see them only sometimes [01:28:43] and the other times it finishes a run [01:28:54] you've restarted the master? 
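[editor's note] A rough sketch of the "http / auth part" mentioned a few lines up: accept a JSON payload of (from, token, channels, message), check the token, strip it, and LPUSH the rest for the relay to pick up. The route, token table and queue name are assumptions for illustration, not the actual ircnotifier code:

```python
# Sketch of the producer side: HTTP in, token check, LPUSH to redis.
# Route, token table and queue name are illustrative assumptions.
import json
from flask import Flask, request, jsonify
import redis

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379, db=0)
QUEUE = "irc-notify"                      # must match the relay's list name
TOKENS = {"shinken": "s3cret"}            # hypothetical per-service tokens

@app.route("/notify", methods=["POST"])
def notify():
    event = request.get_json(force=True)
    sender = event.get("from", "")
    if TOKENS.get(sender) != event.get("token"):
        return jsonify(error="bad token"), 403
    # Drop the token before queueing; the relay only needs the rest.
    payload = {k: event.get(k) for k in ("from", "channels", "message")}
    r.lpush(QUEUE, json.dumps(payload))
    return jsonify(queued=True), 202

if __name__ == "__main__":
    app.run(port=8080)
```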
[01:29:11] that too, see errors in log [01:29:30] how about we start totally fresh [01:29:51] and move aside boron:/var/lib/puppet and the same on each client [01:29:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:30:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:30:10] it was * Starting puppet master [fail] [01:30:15] but at the same time [01:30:35] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [01:30:37] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:30:44] the only thing I can think of is that the stored data on the master is corrupt [01:30:50] * master is running [01:31:22] that's what i meant when i said stored configs, yea [01:31:31] suggests truncation [01:31:36] so the puppetstoredconfigclean.rb thing we have in prod [01:32:01] can't just use it here though, because "Invalid db adapter " [01:32:02] i'd like to help, is there a host i can run puppet with verbose/debug on? [01:32:14] one that you guys aren't currently working on, i mean [01:32:21] ori: it's all frack hosts [01:32:37] payments1001 for example [01:32:45] mutante: i vote we just blow away /var/lib/puppet everywhere and start over [01:32:49] it's not that many hosts [01:33:08] Jeff_Green: ironically, sounds like the same thing was the root cause [01:33:27] I thought he just trashed the stored certs [01:33:36] not the entire dir [01:34:01] it was that find command [01:34:03] another possibility--purge everything *but* the stored certs [01:34:22] that, afaict find /var/lib/puppet/ssl -type f -exec rm {} \; [01:34:28] yeah [01:34:36] but yea, just ./ssl/ [01:34:39] that's a recipe for stabbing yourself in the eye [01:34:43] stopping puppetmaster [01:35:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:35:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:35:07] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:35:08] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 233 seconds ago with 0 failures [01:35:24] Jeff_Green: are the nodes usually that fast in fr when running puppet? [01:35:33] just taking 3 or 5 seconds or something [01:35:34] yes [01:35:38] wow, ok [01:35:53] that's what happens when you rip out all the broken package version tracking [01:35:54] :-) [01:35:57] so why does it work sometimes [01:36:06] i can see the master compiling catalogs for nodes [01:36:08] because puppet is manufactured broken [01:36:18] heh [01:36:19] i've never seen this behavior before though [01:36:36] testing a theory here... 
one sec [01:38:00] ok here's what I just did for lutetium, let's see if it gets happier: [01:38:07] stopped puppetmaster and lutetium's puppet [01:38:32] on lutetium rm -fr /var/lib/puppet/everything_but_ssl_dir [01:38:59] on boron rm -fr /var/lib/puppet/{everything related to lutetium but the pem file} [01:39:04] start puppet and puppetmaster [01:39:18] runs clean on the first try, hopefully it will stay that way [01:39:57] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:40:07] doing samarium next [01:40:22] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:40:22] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:40:23] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:41:27] Jeff_Green: that sounds like a good plan, deleting ./yaml/node/lutetium* ./yaml/facts/lutetium etc.. yea [01:41:44] one can hope [01:41:54] is that what the prod *rb thing does? [01:42:46] that deletes from the database [01:43:00] oh, right, we have the db layer there [01:43:06] i purposely left that out here [01:43:19] it first tries to figure out from master config [01:43:28] it it uses sqlite or mysql or postgres [01:43:35] yeah [01:43:40] so when you run that in fr, you get "Invalid adapter" [01:43:43] right [01:44:00] it's even installed? [01:44:01] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.49 [01:44:08] Jeff_Green: i copied it to /tmp manually [01:44:12] oic [01:44:33] i decided not to try to use all the bells and whistles for puppet in frack [01:44:34] trying to achieve kind of the same thing, make master forget all things about one node, then run it again [01:44:39] right [01:45:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:45:02] samarium done [01:45:08] backup4001 next [01:45:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [01:45:20] ha boron [01:45:22] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:45:23] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:45:24] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:45:24] guess I'd better do boron [01:45:25] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [01:45:26] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [01:45:44] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:45:56] arr, yea, it came back [01:47:21] boron done [01:48:23] doing indium [01:48:30] mutante: ok [01:48:49] I think we should to stop puppetmaster while we do them, so let's do a couple at a time [01:49:25] lemme know when you're ready to start indium up, and I'll restart puppetmaster [01:49:51] hm i tried to check out backup4001 but ssh times out and the root password on record doesn't work via serial console :\ [01:50:01] jgage: ha [01:50:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:50:13] it's not technically in frack, but it's a fundraising host [01:50:15] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 154 seconds ago with 0 failures [01:50:15] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:50:16] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [01:50:23] PROBLEM - check_puppetrun on payments1002 is CRITICAL: 
CRITICAL: puppet fail [01:50:36] oh! i didn't know fundraising had a box at ulsfo. [01:50:52] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:50:53] is the master up on boron? [01:50:55] backup4001 is an interim offsite-backup box we deployed as a stopgap between killing pmtpa and having a second dc [01:51:01] err: Could not send report: Connection refused - connect(2) [01:51:02] cool [01:51:04] mutante: down. ready for it to come up? [01:51:22] i just rebooted all the hadoop workers and nothing exploded and nobody noticed, woo [01:51:30] jgage: nice [01:51:36] i think so yea, i just deleted the yaml files for indium [01:51:52] mutante: master is up [01:53:27] Jeff_Green: i think we just have to delete those 2 files on the master [01:53:40] ok [01:53:44] yaml/facts/ and yaml/nodes/ [01:53:54] backup4001 didn't have its key exchange done yet [01:55:13] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [01:55:20] indium finished a run. payments1002 did as well [01:55:26] great [01:55:32] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:55:33] only deleted files on master, did not even bother on node [01:55:33] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 107 seconds ago with 0 failures [01:55:34] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [01:55:36] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:55:46] mutante: interesting, ok [01:56:20] backup4001 is taking forever, maybe the cross-colo latency [01:56:49] doing db1008 [01:58:29] tellurium.. finished run.. but i didn't do anything [01:58:55] i dunno, the case where they flap between working and failing is really odd [01:58:55] icinga-wm: come one, just recover? [01:59:08] it should have received 3 OKs already.. [01:59:26] it doesn't work like prod does fwiw [01:59:42] what makes neon receives it really? [01:59:59] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 225 seconds ago with 0 failures [02:00:04] there's a puppet plugin that's run as part of a suite on cron [02:00:13] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 71 seconds ago with 0 failures [02:00:14] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 237 seconds ago with 0 failures [02:00:18] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [02:00:19] looking better [02:00:39] no nrpe, no puppet db, no snmp [02:00:44] Jeff_Green: it sends a UDP? [02:00:51] it's tcp [02:00:56] ah, ok [02:00:57] nsca [02:01:05] gotcha, yea, i saw the iptables rule [02:01:17] the config is /etc/nagios_nsca.conf [02:01:51] config is just straight shell with args for calling the plugins [02:03:19] looking considerably better, now lets hope the fix sticks [02:03:46] i see, it looks at the puppet state file [02:03:54] ya [02:04:14] yes, it does look pretty good so far [02:05:01] also, nothing in Icinga is CRIT that isn't older than 11h [02:05:41] arg, wait, pay-lvs1002 just popped up [02:06:14] have we done that one already? [02:06:22] i did not, i think no [02:06:32] ok. doing [02:07:28] grr. I did pay-lvs1001 by accident [02:07:30] fail. [02:08:54] grr. 
the pson error [02:09:03] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.50 [02:10:00] <^d> ^ that's me [02:11:21] pay-lvs1002 running clean again [02:11:26] cool [02:12:59] pay-lvs1001 broke [02:13:46] i wonder if this could have to do with a run happening with the master down? [02:13:48] but you just fixed it accidentally [02:14:10] i removed the local state files, yeah [02:14:16] it recovered the second time [02:14:19] yea, well, trying to run but no master, you get refused [02:14:21] which is also fail [02:14:28] i suppose [02:14:46] maybe if you drop the master mid-run, it doesn't recover well [02:15:48] this is really wierd [02:18:22] RECOVERY - swift-object-replicator on ms-be2010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:18:44] !log LocalisationUpdate completed (1.25wmf7) at 2014-11-13 02:18:44+00:00 [02:18:52] Logged the message, Master [02:20:05] Jeff_Green: arr, silicon and samarium.. sigh [02:20:13] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [02:20:26] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [02:20:27] hmmmmmmm. [02:20:31] what is going on. [02:20:51] stopping master [02:24:58] PROBLEM - swift-object-replicator on ms-be2010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:25:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:25:35] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [02:26:37] puppetmaster starting [02:28:13] (03PS6) 10Springle: Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 (owner: 10Reedy) [02:28:33] (03CR) 10Springle: [C: 032] Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 (owner: 10Reedy) [02:30:00] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-13 02:30:00+00:00 [02:30:04] Logged the message, Master [02:30:05] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:30:06] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 166 seconds ago with 0 failures [02:30:07] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 156 seconds ago with 0 failures [02:30:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [02:30:30] 3$*#(& [02:31:07] did we just not do them? [02:31:17] (03PS3) 10Springle: mysql_wmf - autoload layout and lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:31:22] since we figured out the working method i mean [02:31:33] I blew away everything on the master [02:31:45] hrmmmm [02:31:50] but I didn't do anything on the client side for these yet [02:32:07] puppet runs clean on payments1003 byhand [02:32:18] i'm starting to wonder if something else happened [02:32:45] (03CR) 10Springle: [C: 032] mysql_wmf - autoload layout and lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:33:26] springle, thanks 😻 - http://codepoints.net/U+1F63B [02:35:05] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:35:20] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 101 seconds ago with 0 failures [02:35:23] something is seriously broken here... [02:36:28] what happened to gerrit-wm [02:36:29] grrrit-wm: ? [02:36:29] speak up! 
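[editor's note] For reference, the passive-check pattern Jeff described a little earlier (around 02:00): a cron-run plugin inspects puppet's local state and ships the result over NSCA to the icinga host, with /etc/nagios_nsca.conf doing the wiring. A rough sketch of that shape, assuming PyYAML and puppet's standard last_run_summary.yaml; the thresholds, fields and service name are illustrative, not the real check_puppetrun plugin:

```python
# Rough shape of a passive puppet check: read puppet's run summary,
# decide OK/CRITICAL, and emit a line suitable for piping into send_nsca.
import socket
import sys
import time

import yaml   # PyYAML

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 3600  # seconds before a run counts as stale (assumed threshold)

def check():
    try:
        with open(SUMMARY) as f:
            summary = yaml.safe_load(f) or {}
        failures = summary.get("events", {}).get("failure", 0)
        last_run = summary.get("time", {}).get("last_run", 0)
    except (IOError, yaml.YAMLError) as exc:
        return 2, "CRITICAL: cannot read state file: {}".format(exc)

    age = int(time.time()) - int(last_run)
    if failures:
        return 2, "CRITICAL: puppet fail ({} failures)".format(failures)
    if age > MAX_AGE:
        return 2, "CRITICAL: last run {}s ago".format(age)
    return 0, "OK: last run {}s ago with 0 failures".format(age)

if __name__ == "__main__":
    code, message = check()
    # NSCA passive service results are tab-separated: host, service, status, output.
    sys.stdout.write("{}\t{}\t{}\t{}\n".format(
        socket.gethostname(), "check_puppetrun", code, message))
    sys.exit(code)
```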
[02:36:50] sometimes it skips a message, doesnt it [02:36:57] I'm going to reboot boron, not because I think it will fix this problem, but it's due for kernel updates [02:37:08] noticed it before, rare but happens [02:37:26] mutante: are you ok to lose your boron session? [02:37:45] Jeff_Green: yea, logged off [02:38:00] here goes [02:38:04] i was tryin to see things in bash_history, but not reallly [02:38:23] i ran a package checksum check, nothing showed up as changed [02:38:37] it almost feels network-y [02:38:54] (03CR) 10Dzahn: "grrit-wm, did you forget something?" [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:38:58] springle: ^ see [02:39:10] except for the fact that everything else networky appears to be perfect [02:39:11] it missed a message [02:39:52] several [02:40:07] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:43:28] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/172799/" [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: 10Dereckson) [02:45:03] (03CR) 10Dzahn: [C: 04-2] "sorry, i added it here and forgot you had made this before. https://gerrit.wikimedia.org/r/#/c/172899/2" [dns] - 10https://gerrit.wikimedia.org/r/172442 (owner: 10John F. Lewis) [02:45:14] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [02:45:42] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:46:20] Jeff_Green: i wonder where the PHP came from in errors like expected value in object at '"[PHP]\n\n; puppet:t'! [02:47:09] the only thing I can think of is that it's in the content of files that are supposed to be transferred [02:48:10] ok, yea, that would fit the changing message [02:50:05] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [02:50:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:52:31] !log beta puppet freshness - UNKNOWN: No valid datapoints found .. since 13d [02:52:36] Logged the message, Master [02:53:37] barium had never really recovered [02:54:01] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=barium&service=check_puppetrun [02:54:39] as opposed to tellurium, different pattern [02:54:42] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=tellurium&service=check_puppetrun [02:54:54] tellurium is flapping [02:55:06] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:55:33] ok this is interesting [02:56:21] i have a theory [02:56:24] time [03:00:15] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 295 seconds ago with 0 failures [03:02:49] ls [03:06:30] i'm going to kill the puppet monitoring [03:06:39] and stop puppet everywhere and completely start over [03:10:17] Jeff_Green: ok, you dont actually have to kill it, you can just disable notifications and ack it [03:10:31] and i gotta get some food, too [03:10:33] that's a lot of hosts to deal with [03:10:46] meh, i can do it quickly, multi checkbox, just a sec [03:10:53] it's already killed [03:10:56] ok [03:12:12] i'll run for now. good luck. 
it must be related to some other deleted files on the master [03:20:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [03:40:17] ok puppet cert exchange is totally redone and everything is working [03:41:03] I disabled all the manifests, I'm starting to suspect bad config somewhere-I've seen stuff like this happen funky variable expansion in templates [03:41:07] but i'm done for the night... [03:42:15] whew! sorry to phone you. [04:00:20] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:00:40] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:00:44] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:17:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 13 04:17:35 UTC 2014 (duration 17m 34s) [04:19:25] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [04:38:01] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [04:58:20] Who is taking care of Varnish in Beta/Production? [04:58:49] We need to setup Varnish for ContentTraslation, something like, https://www.mediawiki.org/wiki/Content_translation/Setup#Backend_Services [04:59:01] Any starting pointers are welcome :) [04:59:06] YuviPanda: ^ ? [05:00:46] <^d> !log elasticsearch: set phabricatormain's index.auto_expand_replicas to 0-2 like production wikis (was hardcoded @ 1 replica) [05:08:30] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4273 MB (3% inode=94%): [05:47:59] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [06:07:29] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:28:19] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:48] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:19] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:29] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:11] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:53] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:09] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:34:19] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:34:19] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:36:14] meh [06:36:29] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:36:38] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:36:49] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:40:30] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [06:45:29] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [06:46:08] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:18] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:53:59] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures [07:11:48] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:15:13] kart_: what sort of content does the cxserver serve? 
[07:32:38] ori: content is articles, also see, https://www.mediawiki.org/wiki/Content_translation#mediaviewer/File:CX_ArchitectureV1.svg [07:32:50] http requests. [07:41:26] you'll need help from ops. the standard pattern is to first set up a virtual IP address ('service ip') that will be shared by all cxserver hosts by having a machine running LVS load balance connections [07:43:07] <_joe_> ori: I think that's already sorted out, I think kart_ has been working with alex [07:44:02] well, he's asking about it, and it's confusing enough that i needed to have it explained to me a few times before it stuck [07:44:12] <_joe_> it is :) [07:44:30] so i'll risk redundancy! :P [07:45:18] kart_: then you define a varnish backend, which you can think of as giving that IP address a name that you can use in the URL routing configuration for varnish [07:46:20] then you write some VCL (=Varnish Configuration Language) code to map certain URL patterns to that backend, usually by testing for string prefix matching or more complicated regex matching [07:47:25] it ends up looking something like: if if (req.url ~ "^/cx") { set req.backend = cxserver; } [07:48:48] and then you live happily ever after [07:57:05] ori: Thank you! [07:58:15] ori: true that alex also mentioned that it is confusing. [07:59:05] kart_: one question though.. whyyyyyyyyyyy nodejs [08:01:20] good question! [08:01:30] https://www.mediawiki.org/wiki/Content_translation/Technical_Architecture#Scalability gives simple answer [08:02:08] It is easy to write node.js client for things we're working (apertium, dictionaries, third party services) too. [08:05:13] kart_: heh, probably not me :) [08:06:22] <_joe_> kart_: I don't really see how node can ensure scalability that any other platform can guarantee [08:07:03] <_joe_> but I mean, if you enjoy using it, fine :) [08:07:10] :D [08:07:35] _joe_: re: https://bugzilla.wikimedia.org/show_bug.cgi?id=73263, poked :) also mmodel is supposed to help as well... [08:07:44] * YuviPanda is going to continue with shinken in the meantime [08:07:48] <_joe_> I'd use python+gevent all the time if I really need something noblocking (meaning, I am basically writing a proxy in front of what does the real work) [08:08:11] no asyncio? [08:08:55] <_joe_> YuviPanda: I am used to gevent, I like the greenlet-based async programming, and I hate twisted [08:09:01] <_joe_> (hi, pybal) [08:09:03] heh :) [08:09:11] <_joe_> never tried asyncio tbh [08:09:13] I've never really done any proper async code ou7tside of js [08:09:38] although wikibugs and 2 (and the new 'one irc relay to rule them all (in labs)' me and legoktm are writing) is also asyncio [08:09:44] <_joe_> nodejs async model (callback-based async) is _horrible_ IMO [08:10:17] kart_: have a look at and tell me how impressive node.js's cluster module looks then ;) [08:10:22] <_joe_> but I mean, it's a matter of taste [08:10:32] <_joe_> ori: ahah :) [08:10:54] <_joe_> erlang, as long as you don't touch mnesia with a stick, is great [08:11:24] <_joe_> mnesia is horribly fragile and hard to recover, in my experience [08:12:51] i've never used mnesia, tbh [08:13:25] also, anyone to +1 https://gerrit.wikimedia.org/r/#/c/172916/? [08:15:08] <_joe_> ori: well my erlang now is so rusty I could probably manage to write a multiprocess factorial calculator, and not much more :P [08:15:33] ori: nice link. 
[08:17:43] hehe, 'rust'y [08:18:03] <_joe_> kart_: erlang/OTP is used for serious business like telecommunications, so it works better than most of the hipster tools we play with today [08:24:40] _joe_: Aren't we serious too? :) I agree on part about hipster tools. [08:25:17] <_joe_> kart_: well, I sometimes feel like the web is the "good enough" industry [08:25:45] <_joe_> we need things to be fast and work on large scale, but they are most of the times "almost good" [08:26:08] <_joe_> and also, we move at a pace that almost prevents "perfect" or "soundly proven" [08:27:04] <_joe_> think of all the NoSQL datastores; that's the world of "almost good" for instance [08:27:53] <_joe_> can it serve a gazzillion of read/writes per second? then it's good even if it's not really consistent [08:32:01] (03PS2) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [08:32:02] (03PS1) 10Ori.livneh: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 [08:36:03] <_joe_> ori: good catch [08:37:02] probably not one that we've run into / will run into, but ¯\_(ツ)_/¯ [08:41:46] <_joe_> ori: that's because "and" is tricky in python [08:42:42] <_joe_> and reduce as well :) [08:43:32] yeah guido hates reduce and wanted to remove it for py3 [08:43:49] but the masses revolted [08:43:58] http://www.artima.com/weblogs/viewpost.jsp?thread=98196 [08:44:32] "So now reduce(). This is actually the one I've always hated most, because, apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what's actually being fed into that function before I understand what the reduce() is supposed to do." [08:45:30] (03PS3) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [08:45:56] <_joe_> ori: actually, you could use reduce there by swapping x and y I guess [08:46:20] <_joe_> yes [08:46:38] i think the all / isinstance is clearer [08:46:44] <_joe_> it is [08:47:21] <_joe_> but let me check one thing' [08:47:22] _joe_: and swapping X and Y would still not work for the case of an empty string somewhere in the list [08:47:30] <_joe_> ori: true [08:47:49] * YuviPanda shops around https://gerrit.wikimedia.org/r/#/c/172916/ again, simple code move into a module [08:48:26] <_joe_> YuviPanda: in 10 mins, when I am done playing with python :) [08:48:31] ah, cool [08:48:46] YuviPanda: why move it to a module? [08:48:56] it's used only there, and is icinga specific code [08:49:08] as a 'role', it can not live anywhere where icinga isn't installed [08:49:19] <_joe_> ori: what about the empty list? [08:49:34] <_joe_> with reduce() it raises a TypeError [08:49:42] <_joe_> with your version it returns True [08:50:13] <_joe_> not sure if that's ok [08:51:15] well, an empty list has no none-string items [08:51:18] but also no string items [08:51:19] so dunno [08:51:41] as a 'role', it can not live anywhere where icinga isn't installed <-- why not? [08:51:46] <_joe_> ori: btw, val = ["", 'abc'] -> reduce(lambda x, y: type(y) == str and y, val) and all(isinstance(x, str) for x in val) both return true [08:52:04] ori: because it requires files that icinga itself writes to do notifications [08:52:17] that role isn't the general ircecho one, it's the specific icinga-wm one [08:52:23] _joe_: yes, but not for val = ['abc', ''] :) [08:52:39] <_joe_> yeah so, do we really want that? 
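[editor's note] The behaviour being debated just above, spelled out. This mirrors the two expressions quoted in the discussion (the reduce-based check and the all()/isinstance() replacement), not the surrounding pybal _getConfigStringList code:

```python
# The reduce-based check effectively only reflects the *last* element, so a
# trailing empty string makes the whole thing falsy and an empty list raises;
# the all()/isinstance() version only answers "are these all strings?".
from functools import reduce

def reduce_check(val):
    return reduce(lambda x, y: type(y) == str and y, val)

def all_check(val):
    return all(isinstance(x, str) for x in val)

print(reduce_check(["", "abc"]), all_check(["", "abc"]))   # 'abc' True -> both truthy
print(reduce_check(["abc", ""]), all_check(["abc", ""]))   # ''    True -> they disagree
print(all_check([]))                                       # True
try:
    reduce_check([])
except TypeError as exc:
    print("reduce on []:", exc)                            # raises without an initial value
```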
[08:53:02] <_joe_> yes we do [08:54:03] YuviPanda: i don't know why but what you're saying doesn't compute, probably the fault of the hour being late and my brain being mush -- i'll let _joe_ review, sorry [08:54:09] ori: :) ok [08:54:35] (03CR) 10Giuseppe Lavagetto: "The code is way clearer this way; it just introduces one small change:" [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [08:54:53] 1. icinga writes things that should be on IRC to a file on the local filesystem, 2. ircecho tails that particular file, puts them on IRC, 3. hence, they need to share a local filesystem, 4. hence they need to be on same machine, 5. hence this can't be used anywhere other than neon atm [08:56:42] <_joe_> YuviPanda: I agree. What you want is to add 'require role::icinga' inside role::ircecho [08:56:58] <_joe_> not making it impossible to have an icinga instance without ircecho [08:57:05] ah, hmm.. [08:57:24] it's not actually role::ircecho, we use ircecho elsewhere [08:57:28] and will shortly use it for shinken too [08:57:56] so how about I keep the class as is, make a role::icinga::ircecho, and then include that in neon? [08:58:13] <_joe_> YuviPanda: ok then let me dig a little deeper [08:58:42] I'm personally ok with what we have right now, since if and when we want an icinga without ircecho we can easily move that one line include out [08:58:43] YAGNI etc [08:59:16] <_joe_> YuviPanda: we do use role::echoirc anywhere else? [08:59:18] nope [08:59:27] <_joe_> ok so my proposal stands [08:59:40] <_joe_> (I just got the class name wrong) [08:59:56] hmm [09:00:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "From IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [09:00:48] I'm trying to understand why. [09:01:14] * YuviPanda tries to find other examples [09:02:07] I mean, if we wanted an icinga without NRPE [09:02:19] we'd have to remove that line from the role [09:02:23] should we instead have role::nrpe? [09:02:29] <_joe_> no [09:02:42] so how's this different? [09:02:44] <_joe_> I mean, another way of doing this [09:03:00] <_joe_> is to add a has_ircecho class variable to role::icinga [09:03:10] aha! yes, that'd be good. [09:03:13] <_joe_> well, nrpe is a fundamental part of icinga [09:03:26] well, you can just do all checks via graphite :) [09:03:33] <_joe_> YuviPanda: then use hiera to override the default (true) wherever you don't want ircecho [09:03:36] <_joe_> :) [09:03:50] <_joe_> YuviPanda: don't make me say what I think about that :P [09:04:03] <_joe_> (I understand why you did it btw [09:04:06] _joe_: what do you think about that? I was hoping you would (and ori would) when I wrote that email. [09:04:17] it does introduce two SPOFs (shinken and graphite) [09:04:21] <_joe_> YuviPanda: "it's a necessary evil) [09:04:26] ah, heh :) [09:04:31] <_joe_> oh god I can't type this morning [09:04:34] hehe :) [09:04:51] <_joe_> the fat cat sitting on my arm may have something to do with it [09:04:51] alternatively we can open up NRPE ports on all labs hosts machines to the wide internet, what can possibly go wrong? :) [09:04:57] mmm, cats [09:05:17] <_joe_> YuviPanda: haven't you seen my cat looking in the camera during hangouts? [09:05:22] no, I haven't! [09:05:27] <_joe_> it happens quite regularly [09:05:30] I've been far too nomadic to have a pet [09:05:36] _joe_: I'll look closely the next time! [09:05:40] <_joe_> eheh [09:05:57] * YuviPanda is planning on going to CCC CAMP next year [09:06:11] <_joe_> when is that? middle of august again? 
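[editor's note] To make the icinga/ircecho coupling spelled out above concrete (icinga writes lines to a local file, ircecho tails that file and relays it, hence they must share a filesystem), a stripped-down sketch of the tail-and-relay shape. The file path and channel are made-up examples, and real ircecho is more involved (multiple files, reconnects, etc.):

```python
# Stripped-down version of the "tail a file, echo to IRC" coupling.
import time

LOGFILE = "/var/log/icinga/irc.log"   # hypothetical file icinga writes to
CHANNEL = "#wikimedia-operations"

def send_to_irc(channel, line):
    """Placeholder for an actual IRC client send."""
    print("{} <- {}".format(channel, line))

def tail(path):
    with open(path) as f:
        f.seek(0, 2)                  # start at end of file, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

if __name__ == "__main__":
    for line in tail(LOGFILE):
        send_to_irc(CHANNEL, line)
```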
[09:06:22] yeah [09:06:29] <_joe_> or "the time I'm supposed to be on vacation with my family" [09:06:31] after Wikimania [09:06:33] hahah :) [09:07:03] * YuviPanda is planning on may - French WM Hackathon, June/Jul - UK, Aug - Wikimania/CAMP next year [09:07:09] <_joe_> and I don't think my step daughter would forgive me for taking her to yet-another-geek-camp [09:07:30] heh [09:07:32] <_joe_> isn't wikimania in mexico this year? [09:07:36] yes [09:07:37] it is [09:07:49] I foresee I'll be paying for a lot of my flights next year. [09:09:04] <_joe_> YuviPanda: what in the varnish role depends on ganglia implicitly? [09:09:44] _joe_: line 89, varnish/manifests/instance.pp [09:09:55] <_joe_> in the module, right? [09:09:58] yeah [09:10:01] <_joe_> not the scary role::cache [09:10:10] _joe_: in role::cache there's a direct include [09:10:18] <_joe_> man [09:10:20] that includes varnish/manifests/monitoring/ganglia.pp [09:10:25] which defines an exec [09:10:35] and the vanrnish/manifests/instance.pp (in the module!) depends on that exec [09:10:41] <_joe_> so varnish::instance depends from varnish::monitoring::ganglia [09:10:47] <_joe_> but it's not required there? [09:10:49] yup, but through role::cache :) [09:10:50] indeed [09:10:56] <_joe_> LOLWTF [09:11:00] my reaction, yes [09:11:39] I've a sinking feeling also that it's tip of the iceberg, and there's a lot more we haven't seen yet because puppet doesn't complain about more than one AST / syntax thing at a time [09:11:40] <_joe_> YuviPanda: I tried to do a radical refactor of role::cache using hiera, and role::cache won [09:11:46] heh [09:12:10] <_joe_> so I decided we need to do that in small steps [09:12:15] there's of course, the shitty way of fixing this, which is to do an if to check if that exec is defined and make the before => only if it is [09:12:40] <_joe_> YuviPanda: let me take a look [09:12:46] ok [09:13:20] <_joe_> the solution is to invert the logic [09:13:28] ah, right [09:13:34] <_joe_> as you can have a varnish::instance without monitoring [09:13:37] <_joe_> not the opposite [09:13:39] yup [09:13:49] ok, that was simple [09:14:00] * YuviPanda also likes after type ordering than before type ordering [09:14:13] <_joe_> not so much :) we have multiple services [09:14:28] <_joe_> so we need to use tags here, or anchors [09:16:47] <_joe_> lemme experiment a little [09:17:02] * YuviPanda admits to not knowing much about varnish nor ganglia [09:17:18] <_joe_> (we also have a gmond restart that is at the bottom of that file) [09:17:32] sigh [09:17:45] <_joe_> YuviPanda: I'm starting to think we need a config switch, but lemme try to get fancy first [09:18:02] :) ok [09:22:39] (03PS2) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [09:23:18] (03PS3) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [09:27:54] <_joe_> YuviPanda: so, reducing our problem, we actually have this: https://phabricator.wikimedia.org/P72 [09:28:29] <_joe_> we want to change this so that it's the class puppetsucks that ensures it's working after those two dude invocations, right? [09:28:48] yup [09:33:29] <_joe_> YuviPanda: now refresh the paste, I've pasted a second version using tags and the spaceship operator [09:33:50] <_joe_> dear lord, puppets DSL _is_ a joke [09:34:03] hmm, did i just make a user to disapper from the wikis ? 
:/ [09:34:23] hmm, that should work, I think [09:34:42] legoktm: I fear i need your urgent help [09:36:46] matanya: he's probably sleeping, tho [09:36:52] (03PS1) 10Gilles: Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 [09:37:17] YuviPanda: very bad :( looks like a huge bug in renameuser [09:37:23] aw, dman [09:37:25] *damn [09:38:27] a user with 75k + edits disappeared from the wikis [09:38:43] what happened to his edits? [09:38:53] I wonder [09:39:18] I hope i didn't break the sql with this [09:40:26] matanya: PM me which username? [09:50:20] matanya: I think this can wait for legoktm or csteipp to wake up. No data loss :) [09:54:57] matanya: YuviPanda: if anyone decides something *has* been lost, get onto dbstore1001 and STOP SLAVE 's1'. That's the 24h delayed replication box [09:55:08] or rather, STOP ALL SLAVES [09:55:16] thanks springle [09:55:23] to get he.wiki [09:55:24] look like the data is there [09:55:36] springle: I checked analytics-store, I see the data [09:55:44] but the actual display on wiki is a delayed job [09:55:58] ok :) [10:05:38] springle: am now [10:08:58] (03PS1) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [10:09:11] <_joe_> YuviPanda: ^^ [10:09:16] _joe_: want me to cherry pick on labs and test? [10:09:29] <_joe_> YuviPanda: well this won't really resolve your problem [10:09:34] true [10:09:47] <_joe_> so maybe I'll create a dependent patch now [10:09:51] well, this in addition to the other patch might make things slightly better, or at least uncover new issues [10:10:00] other patch -> my earlier one making the includion of ganglia conditional [10:10:10] <_joe_> yeah [10:10:28] <_joe_> working on it [10:10:41] <_joe_> (I'd try to do that as cleanly as possible [10:10:47] ok :) [10:12:22] <_joe_> oh, one thing I never thought about - hiera and class inheritance in puppet [10:12:55] _joe_: btw, this plus the role::cache patch actually made puppet run on deployment-cache-bits01 [10:13:03] it's catching up on a month's worth of stuff now [10:13:14] <_joe_> YuviPanda: yea I expected that to be the case [10:13:29] <_joe_> but I'm more interested in not breaking prod tbh [10:13:42] <_joe_> so I will ask and wait for a few people to review this [10:13:43] yeah [10:13:46] definitely [10:13:56] but cherry-picking will unbreak deployment-prep [10:14:02] and I'll closely follow to make sure it remains unbroken [10:14:03] <_joe_> good! [10:14:26] <_joe_> bbiab, I need a break [10:17:38] _joe_: fwiw, it only fixed one host, uncovered lots more :) [10:18:01] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/varnish.py] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:10 [10:18:11] is somewhat weird, since ganglia.pp should't be included at all [10:18:37] * YuviPanda goes to eat food [10:23:01] !log upload mediawiki-math to trusty too (RT #5270) [10:23:09] Reedy_: ^ [10:29:36] <_joe_> godog: woot [10:30:05] <_joe_> YuviPanda: that's because you didn't remove it everywhere [10:30:34] there were other places? 
[10:30:40] * YuviPanda will look after food [10:41:25] _joe_: well trusty-mediawiki I should say, but you get the idea :) [11:03:31] !log Killing Jenkins due to a deadlock [11:06:08] bah [11:08:21] !log Killing Jenkins due to a deadlock [11:08:24] pff [11:08:26] morebots: ping [11:08:28] Logged the message, Master [11:08:28] I am a logbot running on tools-exec-06. [11:08:28] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [11:08:28] To log a message, type !log . [11:08:42] !log Killed Jenkins due to a deadlock [11:08:45] Logged the message, Master [11:09:07] !log resurrected morebots in #wikimedia-operations (see [[Morebots]]). [11:09:10] Logged the message, Master [11:12:51] (03PS1) 10Giuseppe Lavagetto: role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 [11:13:01] <_joe_> YuviPanda: ^^ [11:14:20] ah, hmm [11:14:27] is there a way to set hiera info for all of labs at once? [11:15:27] hmm, one way would be to default to false and set it to true in prod hiera [11:15:45] we could also just set it to false in deployment-prep [11:15:52] which is the only place this rule is used anyway [11:18:39] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [11:21:51] <_joe_> YuviPanda: for now just set it in deployment-prep [11:21:58] yup, that's what I did [11:22:17] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=134263&oldid=133195 [11:22:20] hopefully I got that one right [11:22:36] cherry picking now [11:24:58] _joe_: yup, that's better :) [11:25:39] (03Abandoned) 10Yuvipanda: cache: Don't setup ganglia monitoring on labs [puppet] - 10https://gerrit.wikimedia.org/r/172776 (https://bugzilla.wikimedia.org/73263) (owner: 10Yuvipanda) [11:25:49] hmm, [11:25:50] or not [11:25:51] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/varnish.py] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:10 [11:25:54] still, at parsoidcache [11:26:13] why is that still being included... [11:26:23] I wonder if it's on the right puppetmaster. [11:26:57] how do I find the puppetmaster of a machine? [11:27:00] * YuviPanda digs [11:27:18] yup [11:27:21] it's on the wrong puppetmaster [11:27:22] lol [11:27:27] * YuviPanda files bug [11:28:34] filed https://bugzilla.wikimedia.org/show_bug.cgi?id=73357 [11:40:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not against it per se, but what is the use case? It's not like we got a single machine not having SSH on port 22. Not sure it is worth it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [11:42:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] ssh server: make ListenAddress configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [11:43:37] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. 
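For context, the two changes being cherry-picked above roughly amount to the following shape. The hiera key and resource names are illustrative assumptions (the real ones live in the linked Gerrit patches), but the idea is the one _joe_ described: the role decides whether ganglia monitoring is included at all, and the ganglia class orders itself after varnish::instance instead of each instance depending on the ganglia exec:

    # role layer: ganglia becomes opt-out, e.g. deployment-prep setting
    # "has_ganglia: false" on its Hiera: wiki page
    class role::cache::base {
        $has_ganglia = hiera('has_ganglia', true)    # hypothetical key name

        if $has_ganglia {
            include varnish::monitoring::ganglia
        }
    }

    # module layer: invert the dependency, since a varnish::instance without
    # monitoring is valid but monitoring without an instance is not
    class varnish::monitoring::ganglia {
        exec { 'varnish-gmond-config':
            command => '/usr/local/bin/generate-varnish-gmond-config',    # stand-in
        }

        # the "spaceship operator": collect every declared varnish instance
        # and order the monitoring exec after all of them, instead of each
        # instance carrying a before/require on an exec that may not exist
        Varnish::Instance <| |> -> Exec['varnish-gmond-config']
    }

The conditional include is what the Hiera:Deployment-prep edit above toggles; the collector is one way to express the "tags or anchors" point when more than one instance is declared on a host.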
This will be useful for when we drop root logins :-)" [puppet] - 10https://gerrit.wikimedia.org/r/172804 (owner: 10Dzahn) [11:51:32] (03CR) 10Alexandros Kosiaris: [C: 031] remove nickel's public IP [dns] - 10https://gerrit.wikimedia.org/r/172819 (owner: 10Dzahn) [12:08:04] (03PS2) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [12:08:06] (03PS2) 10Giuseppe Lavagetto: role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 [12:12:27] (03PS1) 10Yuvipanda: nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 [12:12:48] _joe_: think you'll have time to CR ^ and the previous ircecho patch? [12:13:53] <_joe_> YuviPanda: sure [12:14:00] cool, thanks :D [12:14:22] _joe_: all should be no-ops [12:14:30] https://gerrit.wikimedia.org/r/#/c/172916/ is the other one [12:15:16] (03CR) 10Giuseppe Lavagetto: [C: 031] icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [12:16:51] (03PS4) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [12:16:56] (03CR) 10Giuseppe Lavagetto: [C: 031] nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 (owner: 10Yuvipanda) [12:17:37] (03CR) 10Yuvipanda: [C: 032] nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 (owner: 10Yuvipanda) [12:19:00] (03CR) 10Yuvipanda: [C: 032] icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [12:27:11] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [12:27:28] (03CR) 10Mark Bergsma: varnish: make varnish::instance not depend on ganglia (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:28:57] (03CR) 10Mark Bergsma: varnish: make varnish::instance not depend on ganglia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:30:29] (03CR) 10Mark Bergsma: [C: 031] role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 (owner: 10Giuseppe Lavagetto) [12:32:26] <_joe_> mark: it won't work I guess. meh [12:34:30] ok, run on neon went well [12:34:35] I'm off now, brb later [12:41:11] PROBLEM - swift-object-auditor on ms-be1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:42:45] ^ expected, also ms-be1005 is coming up [12:43:18] yeah I should probably !log that :) [12:44:59] indeedly [12:45:21] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:45:29] !log investigating high iops on swift eqiad with paravoid, stopped object-auditor on ms-be1005 and ms-be1015 [12:45:33] Logged the message, Master [12:49:24] (03CR) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:50:46] I'm not sure why a cronjob is better than puppet here :) [12:52:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The concepts look fine to me, various comments inline. We also need a role class however." 
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [12:57:08] (03PS3) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [13:10:41] (03CR) 10Mark Bergsma: [C: 031] varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [13:14:35] (03CR) 10Mark Bergsma: "Just add a len(list) > 0 check I'd say." [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [13:16:00] (03CR) 10Mark Bergsma: [C: 031] Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [13:34:02] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:12] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57914 bytes in 0.536 second response time [14:12:42] RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:15:14] (03PS1) 10Giuseppe Lavagetto: start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 [14:15:16] (03PS1) 10Giuseppe Lavagetto: removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 [14:15:59] <_joe_> paravoid: as promised :P [14:16:22] (03CR) 10Hoo man: [C: 04-1] "We agreed to change this so that it applies to group 0 wikis only (for now)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [14:21:01] (03PS2) 10Giuseppe Lavagetto: start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 [14:21:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 (owner: 10Giuseppe Lavagetto) [14:21:45] (03PS1) 10Filippo Giunchedi: swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 [14:22:14] paravoid: ^ [14:23:22] (03CR) 10Faidon Liambotis: [C: 032] swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 (owner: 10Filippo Giunchedi) [14:26:33] (03PS1) 10Manybubbles: Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 [14:32:25] (03PS1) 10Jforrester: Enable VisualEditor by default on Tagalog Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172993 (https://bugzilla.wikimedia.org/73365) [14:33:25] (03PS2) 10Giuseppe Lavagetto: removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 [14:37:04] (03CR) 10Giuseppe Lavagetto: [C: 032] removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 (owner: 10Giuseppe Lavagetto) [14:37:34] (03PS2) 10Filippo Giunchedi: swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 [14:37:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 (owner: 10Filippo Giunchedi) [14:37:57] (03PS1) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:38:33] _joe_: there are the hieradata merges pending, good to go? 
[14:38:42] <_joe_> yep [14:38:48] done [14:38:49] <_joe_> I had puppet-merge open now [14:38:58] hehe me too [14:41:52] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: puppet fail [14:42:12] !log hashar: restarting Jenkins and Zuul [14:42:12] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: puppet fail [14:42:15] Logged the message, Master [14:43:19] (03PS2) 10Giuseppe Lavagetto: monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 [14:43:24] !log hashar: restarted zuul-merger on gallium [14:43:26] (03CR) 10jenkins-bot: [V: 04-1] monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [14:43:27] Logged the message, Master [14:44:42] (03PS1) 10Jforrester: Follow-up I50cb3ed: Enable VisualEditor as a Beta Feature on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 [14:44:46] (03CR) 10jenkins-bot: [V: 04-1] Follow-up I50cb3ed: Enable VisualEditor as a Beta Feature on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [14:45:42] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: puppet fail [14:45:49] (03PS2) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:45:52] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail [14:46:04] (03CR) 10jenkins-bot: [V: 04-1] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:46:32] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: puppet fail [14:46:43] jenkins is broken [14:46:44] <_joe_> mmmh [14:46:46] Jenkins is biting us https://gerrit.wikimedia.org/r/172987 [14:46:47] <_joe_> yes [14:46:59] everything [14:47:18] jenkins wigging out [14:47:19] https://gerrit.wikimedia.org/r/#/c/172994/ [14:47:22] not registered [14:47:24] I see I'm not the only one :) [14:47:37] <_joe_> and... I forgot the imagescalers [14:47:51] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail [14:48:42] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [14:49:13] (03CR) 10Rush: [C: 032 V: 032] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:50:04] <_joe_> puppet failures are my fault, fixing them [14:50:09] <_joe_> they are harmless btw [14:50:29] any objects to a minor deployment to zero portal extension? 
[14:50:41] (sync to master) [14:51:31] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [14:51:34] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: puppet fail [14:52:22] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: puppet fail [14:53:04] (03PS3) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:53:11] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: puppet fail [14:53:53] (03PS1) 10Giuseppe Lavagetto: hiera: re-add info lost in the cleaning earlier [puppet] - 10https://gerrit.wikimedia.org/r/173000 [14:54:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: re-add info lost in the cleaning earlier [puppet] - 10https://gerrit.wikimedia.org/r/173000 (owner: 10Giuseppe Lavagetto) [14:54:12] (03PS4) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:54:51] (03PS5) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:55:33] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: puppet fail [14:55:42] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: puppet fail [14:55:54] (03CR) 10Rush: [C: 032 V: 032] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:56:02] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: puppet fail [14:56:12] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: puppet fail [14:56:42] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: puppet fail [14:57:03] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: puppet fail [14:57:21] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: puppet fail [14:57:32] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [14:57:44] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: puppet fail [14:57:44] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: puppet fail [14:57:51] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: puppet fail [14:58:11] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail [14:58:21] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: puppet fail [14:58:30] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail [14:58:31] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail [14:58:31] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: puppet fail [14:58:44] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail [14:58:46] hmmm [14:58:52] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [14:59:11] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: puppet fail [14:59:18] <_joe_> yeah I did some error, and jenkins not working... 
[14:59:22] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: puppet fail [14:59:22] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: puppet fail [14:59:24] <_joe_> akosiaris: I am fixing it [14:59:30] <_joe_> but it's no harm to prod [14:59:32] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: puppet fail [14:59:35] <_joe_> puppet plainly fails [14:59:37] yeah, no worries [14:59:41] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail [14:59:42] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: puppet fail [14:59:46] just making sure :-) [14:59:48] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: puppet fail [15:00:01] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: puppet fail [15:00:08] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail [15:00:12] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: puppet fail [15:00:22] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: puppet fail [15:00:31] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: puppet fail [15:00:33] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: puppet fail [15:00:43] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [15:00:52] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: puppet fail [15:01:11] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [15:01:11] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: puppet fail [15:01:21] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: puppet fail [15:01:23] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: puppet fail [15:01:34] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [15:01:41] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: puppet fail [15:01:42] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2734 bytes in 0.165 second response time [15:02:09] <_joe_> akosiaris: not understanging _why_ it failed btw [15:02:13] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: puppet fail [15:02:13] PROBLEM - check if phabricator taskmaster is running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args PhabricatorTaskmasterDaemon [15:02:21] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: puppet fail [15:02:22] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: puppet fail [15:02:22] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: puppet fail [15:02:32] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: puppet fail [15:02:33] is it safe to sync-dir an extension? 
[15:02:42] there are all these puppet fails [15:02:51] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: puppet fail [15:02:52] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: puppet fail [15:02:52] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: puppet fail [15:03:02] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:03:06] yurikR: yes [15:03:07] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: puppet fail [15:03:07] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [15:03:07] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: puppet fail [15:03:11] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: puppet fail [15:03:13] thx [15:03:21] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: puppet fail [15:03:22] RECOVERY - check if phabricator taskmaster is running on iridium is OK: PROCS OK: 20 processes with regex args PhabricatorTaskmasterDaemon [15:03:22] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: puppet fail [15:03:26] <_joe_> akosiaris: got it! [15:03:33] :-) [15:03:34] (03PS1) 10Giuseppe Lavagetto: hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 [15:03:42] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: puppet fail [15:03:43] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail [15:03:48] dman [15:03:49] damn [15:03:51] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: puppet fail [15:03:52] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: puppet fail [15:03:52] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 16422 bytes in 0.395 second response time [15:03:53] a single p ? [15:04:07] phab maint is me guys [15:04:13] sorry I forgot to silence [15:04:13] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: puppet fail [15:04:13] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: puppet fail [15:04:17] shame donuts at the next outing [15:04:21] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: puppet fail [15:04:28] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [15:04:28] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: puppet fail [15:04:32] !log phabricator upgrades T1203 [15:04:35] Logged the message, Master [15:04:35] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: puppet fail [15:04:41] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: puppet fail [15:04:48] ahah... 
I wasn't aware of shame donuts [15:05:02] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: puppet fail [15:05:08] (03PS2) 10Giuseppe Lavagetto: hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 [15:05:09] shame donuts is a tradition from works past [15:05:13] I always felt it was appropriate [15:05:22] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: puppet fail [15:05:26] <_joe_> yeah it was shame cornetti here [15:05:27] /ignore -time 600 -pattern 'puppet last run' icinga-wm [15:05:49] for y'all irssi users [15:05:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 (owner: 10Giuseppe Lavagetto) [15:05:54] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:06:00] !log yurik Synchronized php-1.25wmf8/extensions/ZeroPortal: updatidng ZeroPortal to master (duration: 01m 13s) [15:06:02] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail [15:06:02] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: puppet fail [15:06:02] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [15:06:03] Logged the message, Master [15:06:12] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: puppet fail [15:06:13] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: puppet fail [15:06:31] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [15:06:42] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [15:06:51] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [15:06:52] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [15:06:56] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: puppet fail [15:07:01] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:07:02] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: puppet fail [15:07:12] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: puppet fail [15:07:21] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [15:07:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: puppet fail [15:07:52] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: puppet fail [15:08:11] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail [15:08:22] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail [15:08:23] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:08:31] <_joe_> recoveries they are a comin [15:08:31] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [15:08:47] <_joe_> chasemp: I have some tons of shame donuts to take to you guys [15:08:51] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [15:08:56] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1126 
is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [15:09:16] !log rolling restart of object-auditor in swift codfw/eqiad to pick up changes [15:09:20] Logged the message, Master [15:09:52] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:10:41] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:10:51] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:11:31] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:11:44] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:11:53] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:14:02] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:14:21] RECOVERY - swift-object-auditor on ms-be1015 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:14:31] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:14:52] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:15:33] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:37] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 578 bytes in 0.048 second response time [15:15:42] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:16:13] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:16:31] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:16:32] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:33] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:16:41] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:16:54] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:17:01] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:17:02] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:17:11] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:17:51] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:17:53] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 
failures [15:17:53] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:17:53] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:17:54] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:18:01] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:18:21] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:18:22] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:18:31] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:18:41] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:18:42] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:18:51] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:52] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:19:01] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:19:02] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:21] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:19:23] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:19:31] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:42] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:20:01] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:20:12] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:20:22] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:20:22] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:20:31] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:20:41] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:20:52] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:20:52] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:21:02] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:21:11] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:21:22] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 
16 seconds ago with 0 failures [15:21:22] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:21:31] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:21:37] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:21:43] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:21:53] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:22:04] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:22:05] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:22:06] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:22:14] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:22:14] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:22:34] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:35] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:22:37] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:23:13] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:23:24] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:23:24] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:23:33] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:34] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:23:34] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:23:53] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:24:13] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:24:13] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:24:24] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:24:33] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:24:43] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:24:46] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:24:47] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:24:47] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is 
currently enabled, last run 9 seconds ago with 0 failures [15:24:54] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:24:59] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [15:25:16] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:25:24] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:25:25] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:25:35] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:44] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:26:03] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:26:14] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:26:15] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:26:15] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:26:16] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:26:16] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:26:23] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:26:33] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:26:34] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:26:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:27:05] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:27:06] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:27:06] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:27:24] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:27:26] !log hashar: deleted all content from https://doc.wikimedia.org/ :-( Will regenerate. 
[15:27:32] Logged the message, Master [15:27:54] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:28:03] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:28:29] (03PS1) 10Filippo Giunchedi: gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 [15:29:08] (03PS2) 10Filippo Giunchedi: gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 [15:29:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 (owner: 10Filippo Giunchedi) [15:32:44] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:09] (03PS1) 10Glaisher: Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) [15:35:53] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [15:36:59] <_joe_> I'm already looking [15:37:53] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 16440 bytes in 0.387 second response time [15:45:37] matanya: hey [15:45:47] hello legoktm [15:46:15] I'm reading the logs right now, is anything still broken? [15:47:14] yes legoktm : Tomer T is still missing here: https://meta.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=Tomer+T on he.wiki [15:47:17] hashar: on gallium manually removed /var/lib/puppet/state/agent_catalog_run.lock , prevented manual run of puppet on the host [15:48:09] matanya: was this a local renameuser or a global one? [15:48:15] local [15:48:22] i wanted to merge [15:48:22] ahh [15:48:24] ok [15:49:05] manybubbles: Should we assume you'll SWAT today? [15:49:49] legoktm: is that the answer? :D [15:50:30] matanya: there's a script to fix incomplete/broken renames, I'm figuring out how to use it right now :P [15:52:08] anomie: I'll do it! [15:54:44] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 1.00% above the threshold [60.0] [15:58:53] gi11es: around for swat deplor? [15:58:56] deploy [15:59:27] manybubbles: ready to deploy ice harpoons [15:59:46] gi11es: wonderful. waiting for the signal from jouncebot [15:59:58] matanya: did someone create the old username as a new account? [16:00:04] manybubbles, anomie, ^d, marktraceur, gi11es: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T1600). [16:00:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "agreed it'd be nice to have deduplication, one way to reduce noise is to let the check run on masters and master-candidates only perhaps? 
" [puppet] - 10https://gerrit.wikimedia.org/r/172527 (owner: 10Filippo Giunchedi) [16:00:08] (03CR) 10Manybubbles: [C: 032] Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 (owner: 10Gilles) [16:00:14] legoktm: not that i know of [16:00:26] I see one in the database [16:00:34] (03Merged) 10jenkins-bot: Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 (owner: 10Gilles) [16:00:38] user_id = ט [16:00:40] why i look at it, it says "username not registered" [16:00:44] er [16:00:45] *when [16:00:51] user_id = 249682 [16:01:15] gi11es: Harpoons don't work on comets or patches [16:01:15] matanya: https://he.wikipedia.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%AA%D7%A8%D7%95%D7%9E%D7%95%D7%AA/%D7%AA%D7%95%D7%9E%D7%A8_%D7%98 is the old user? [16:01:27] yes [16:01:27] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT revert JPG thumbnail chaining on all wikis except commons (duration: 00m 05s) [16:01:30] Logged the message, Master [16:01:31] ok [16:01:34] gi11es: ^^ [16:01:36] so someone created the username again [16:01:50] manybubbles: testing... [16:01:51] maybe he logged in ? [16:02:06] idk [16:02:16] matanya: I'll run the script after I find some food and once the deploy is over [16:02:25] thank you [16:03:28] (03PS6) 1001tonythomas: Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [16:04:24] manybubbles: deploy is great success [16:04:34] gi11es: wonderful! consider yourself SWATed [16:04:38] thanks! [16:09:46] (03CR) 10Manybubbles: [C: 032] Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) (owner: 10Glaisher) [16:09:53] (03Merged) 10jenkins-bot: Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) (owner: 10Glaisher) [16:10:36] Glaisher: around to verify? [16:10:36] manybubbles: Thanks! [16:10:44] not really testable :P [16:11:17] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT force summary when running checkuser query on all wikis (duration: 00m 04s) [16:11:22] Logged the message, Master [16:11:26] Glaisher: ah. 
well, ok [16:11:27] done then [16:11:39] :) [16:12:08] (03CR) 10Manybubbles: [C: 032] Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 (owner: 10Manybubbles) [16:12:16] (03Merged) 10jenkins-bot: Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 (owner: 10Manybubbles) [16:13:35] !log manybubbles Synchronized wmf-config/CirrusSearch-common.php: SWAT reenable accelerated regex search (regex search still disabled) (duration: 00m 03s) [16:13:38] Logged the message, Master [16:14:22] !log manybubbles Synchronized wmf-config/CirrusSearch-production.php: SWAT reenable regex search now that it will not crash elasticsearch (duration: 00m 04s) [16:14:25] Logged the message, Master [16:15:52] (03PS14) 10GWicke: Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [16:16:27] !log manybubbles Synchronized php-1.25wmf8/extensions/CirrusSearch/: SWAT update cirrussearch to fix slow prefix queries (duration: 00m 05s) [16:16:31] Logged the message, Master [16:16:32] (03CR) 10GWicke: "Alex, thanks for your review." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [16:16:35] !log logstash elasticsearch cluster is pretty messed up. logstash1002 has lost shards for all indices except for today, and it's master for that one. [16:16:38] Logged the message, Master [16:18:06] (03PS3) 10Giuseppe Lavagetto: monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 [16:18:50] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [16:20:23] <_joe_> let's see what I broke this time :/ [16:20:39] breaker alert :P [16:21:10] !log disk utilization is 94% on logstash1002, 92% on logstash1001 and 91% on logstash1003. Too much data in indices even with replica count bumped down to 1 for the small disks we have today. [16:21:12] Logged the message, Master [16:21:30] <^d> bd808: Ouch. Can we do anything to help? [16:22:19] ^d: I think I'm just going to start killing older indices. Drop us down to 20 days of retention instead of 31 [16:22:30] <^d> Hmm [16:23:10] selective deleting would take more disk unfortunately since they have to be compacted to gc [16:23:24] <^d> I was thinking of pulling a production elastic* box over to logstash* duty for the time being. [16:24:19] oh. how about I just drop replica count to 0 for closed indexes? [16:24:45] That would free a bunch of disk [16:24:49] <^d> That'd work too :) [16:25:02] without getting rid of the data completely [16:25:20] and short of a full disk crash shouldn't be too risky [16:29:53] * manybubbles is done with SWAT [16:32:48] (03PS7) 1001tonythomas: Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [16:36:57] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:43:10] (03PS8) 1001tonythomas: Make BounceHandler extension work on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [16:44:01] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/168622/ :) [16:44:14] we are installing on test2wiki ! Looks ready to go ? [16:45:06] !log preparing to upgrade analytics1026 to trusty [16:45:10] Logged the message, Master [16:45:46] tonythomas: ok by me, you've already got the http endpoint configured? 
[16:46:01] jgage: yt? [16:46:11] !log dropped replica count to 0 for logstash indices from 2014-10-14 through 2014-10-29. See https://phabricator.wikimedia.org/P73 for the commands. [16:46:13] Logged the message, Master [16:46:33] waiting for legoktm / Reedy_ to merge https://gerrit.wikimedia.org/r/#/c/172322/7/ ! [16:46:44] ok [16:47:31] legoktm: we can get that one done today ? [16:48:22] tonythomas: I don't know, depends on whether Reedy_ can get a deployment window for it since this is a new extension [16:48:57] yeah :( [16:49:27] (03PS1) 10Alexandros Kosiaris: rrdcached tuning [puppet] - 10https://gerrit.wikimedia.org/r/173032 [16:49:45] !log restarted elasticsearch on logstash1002 [16:49:48] Logged the message, Master [16:50:20] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:50:50] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:50:51] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:51:09] matanya: does everything look fine now? [16:54:48] (03PS1) 10Jgreen: re-add aluminium DNS [dns] - 10https://gerrit.wikimedia.org/r/173033 [16:54:51] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:03] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:26] (03PS1) 10Jgreen: add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 [16:55:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:35] (03CR) 10jenkins-bot: [V: 04-1] add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 (owner: 10Jgreen) [16:55:37] Can some opsen ack the icinga checks on the logstash hosts please? 
[16:55:55] It will be a bit thrashy as the cluster rebalances [16:56:51] (03Abandoned) 10Jgreen: re-add aluminium DNS [dns] - 10https://gerrit.wikimedia.org/r/173033 (owner: 10Jgreen) [16:57:20] !log dropped replica count to 0 for logstash indices from 2014-10-30 and 2014-10-31. [16:57:22] Logged the message, Master [16:57:29] (03Abandoned) 10Jgreen: add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 (owner: 10Jgreen) [16:58:34] Who is taking care of OCG software these days? cscott? [16:58:52] bd808: yup [16:58:58] what's up? [16:59:27] The elasticsearch behind logstash is quite frequently logging that records are coming with a "details.log.raw" field >32k [16:59:30] (03PS1) 10Jgreen: add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 [16:59:38] and refusing to save those records [16:59:39] (03CR) 10jenkins-bot: [V: 04-1] add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [16:59:56] It's not a problem but I thought you might want to know that it happening [17:00:18] bd808: hm, yeah, i save the metabook.json file for debugging in various places, and for some of the wikibooks that can be quite large [17:00:33] bd808: but that explains why i often can't find that information in logstash when i'm looking for it [17:00:44] This is probably something that we can fix in the mapping for the index [17:00:52] (03CR) 10Jgreen: [C: 032 V: 032] "overriding irrelevant lint check" [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [17:01:27] bd808: is it dropping the entire log entry in that case? it would be nice if it just dropped the 'raw' field, which is redundant anyway [17:02:00] It seems to be rejecting the whole document [17:02:07] (03PS1) 10Jgreen: re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 [17:02:16] (03CR) 10jenkins-bot: [V: 04-1] re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 (owner: 10Jgreen) [17:05:17] (03Abandoned) 10Jgreen: re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 (owner: 10Jgreen) [17:05:18] bd808: yeah, that matches my experience futilely searching for those logs [17:05:35] (03Abandoned) 10Jgreen: add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [17:05:50] bd808: i can probably truncate them on the sender side at, say, 24k. [17:06:07] It's apparently caused by the change I made to the schema to make raw values use the "doc_values" type [17:06:12] it would be just as frustrating if i was searching through the logs for it, but a little less mysterious. 
[17:06:40] Which I did for a particular field really (message.raw) but applied everywhere [17:06:56] I can fix that in the mapping template [17:07:05] * bd808 opens a bug [17:07:31] (03PS1) 10Jgreen: re^10-add aluminium to DNS [dns] - 10https://gerrit.wikimedia.org/r/173039 [17:08:20] (03CR) 10Jgreen: [C: 032 V: 031] re^10-add aluminium to DNS [dns] - 10https://gerrit.wikimedia.org/r/173039 (owner: 10Jgreen) [17:09:48] (03PS1) 10Jgreen: add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 [17:09:57] (03CR) 10jenkins-bot: [V: 04-1] add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 (owner: 10Jgreen) [17:10:18] (03PS2) 10Ori.livneh: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 [17:10:20] (03PS4) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [17:11:10] (03CR) 10Ori.livneh: [C: 032] "added check that val is not empty and added test for it in followup patch" [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [17:11:18] (03PS2) 10Jgreen: add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 [17:11:25] (03Merged) 10jenkins-bot: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [17:11:38] (03CR) 10Ori.livneh: [C: 032] Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [17:11:56] (03Merged) 10jenkins-bot: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [17:12:14] (03CR) 10Jgreen: [C: 032 V: 031] add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 (owner: 10Jgreen) [17:19:25] (03CR) 10Ori.livneh: "Why wouldn't we use Upstart? This is a trivial case, it'll be easy to port it to systemd if / when the day comes." [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [17:28:20] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [17:29:41] "if"? :) [17:29:59] paravoid: well, init.d scripts are hardly closer, no? [17:30:12] closer to what? [17:30:15] systemd [17:30:27] I don't get it? [17:30:34] re: ottomata's comment on that patch, "I think you will be hard pressed to convince ops to use upstart for this, but I could be wrong." [17:31:05] nah I don't mind it [17:31:05] i don't mind using init.d scripts instead, but it seems like a bizarre objection to me [17:31:27] that said, I think it's pretty crazy to ship a package that we made with an init script, then remove it in puppet and replace it with upstart [17:31:34] ok [17:31:37] it's our package, not a third-party one [17:31:51] so whatever you folks do, better do it in the package itself I would say [17:32:27] paravoid: is the package just for our use, or are there plans to usptream it? [17:32:52] I don't know but I doubt it's upstreamable at this point [17:32:55] it's not compatible with varnish 4 [17:33:01] oh [17:33:16] (afaik) [17:33:29] how bad would it be to just have another varnishncsa instance instead? [17:33:38] ? [17:34:05] the narrative arc here is that i'd like to provision an additional logging endpoint on bits for client-side profiling data [17:34:23] ok [17:34:24] hi! 
[17:34:30] * Nemo_bis waits for the plot twist [17:34:32] mark okayed the idea way back when (a year ago i think) but said we should wait for varnishkafka, which was in the works at the time [17:34:49] it rings a bell [17:34:51] if paravoid doesn't mind upstart, then go for it! [17:35:10] well, but so whatever you folks do, better do it in the package itself I would say [17:35:12] although, i'm not sure it is worth changing the .deb package for it, it would be hard to make the .deb package support multiple instances [17:35:13] no? [17:35:23] why would it be hard? [17:35:37] what would you name the default one? [17:35:42] just varnishkafka? [17:35:44] varnishkafka-default? [17:35:45] "default"? :P [17:35:46] lester! [17:35:48] or sebastian [17:35:50] ha [17:36:04] i'm more worried about the deployment [17:36:05] dunno I don't have any strong opinions here [17:36:09] not just about the naming, in general [17:36:14] even if the package was upstart [17:36:22] it just seems a bit crazy on our part to do something in a .deb then undo it in puppet [17:36:25] puppet would probably remove the default init file and put in its own [17:36:27] when we're the only ones using it :) [17:36:27] for multi instance [17:36:45] paravoid: tell that to the upstart / php.ini files in the hhvm package! :P [17:36:50] but I don't care all that much I should say [17:37:05] php.ini is config, so that's okay [17:37:10] hello, you have reached paravoid during his unopinionated time of day [17:37:11] :p [17:37:12] hhvm's upstart is pretty crazy on its own imho [17:37:19] but that's a different discussion altogether :) [17:37:21] ottomata: OK, so how about: do it via puppet for now, and then i'll work the change back into the deb [17:37:44] ori, if paravoid is ok with upstart in general, it is cool w me [17:37:51] i don't have a preference about the .deb using upstart over init.d [17:37:53] paravoid looooves upstart [17:38:03] even if puppet removes the init.d and then installs upstart [17:38:10] so, ok, ori, i will review that change harder then. [17:38:18] danke! [17:40:52] I think init.d scripts aren't DRY at all [17:40:59] and are always messy and usually buggy in some ways [17:41:11] yes [17:41:34] upstart is fine (although as you know ori, not a big fan of the upstart+start-stop-daemon combination) [17:41:37] and systemd is fine as well [17:41:59] systemd seems to be gaining traction and has some really cool features [17:42:04] i have not used systemd at all before, excited to try it one day... [17:42:19] wow, someone is saying something positive about systemd [17:42:22] I like upstart's config file better than ini files, but what can you do [17:42:38] there are things I don't like about upstart [17:42:44] can we use systemd in trusty? i saw some libs are installed, but it's not the full thing, right? [17:42:44] so, i'm excited to try an alternative [17:42:47] i know nothing more about it [17:42:47] and I find systemd's feature creep a bit worrying (PPP/PPPoE, wtf) [17:42:58] PPP/PPPoE -- wat? [17:43:00] ori: there is a PPA by the Ubuntu devs but I wouldn't do it... [17:43:01] logind, wtf [17:43:06] aye cool [17:43:14] ori: there's "networkd" nowadays, among other things [17:43:16] jgage, fyi, just in case, i'm working on the cluster upgrade to trusty [17:43:19] just did analytics1026 [17:43:22] going to do analytics1003 [17:43:24] awesome!
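To make the multi-instance idea concrete: an instanced Upstart job of the sort being discussed could look roughly like the following. A hedged sketch; the per-instance config path and the use of varnishkafka's -S config flag are assumptions about the wiring, not the actual patch:

    # /etc/init/varnishkafka.conf
    description "varnishkafka log producer (instanced)"

    # one job definition, many instances:
    #   start varnishkafka NAME=webrequest
    #   stop  varnishkafka NAME=webrequest
    instance $NAME
    respawn

    exec /usr/bin/varnishkafka -S /etc/varnishkafka/$NAME.conf

Puppet would then only have to render one config file per instance and start/stop the matching NAME, whether the job ships in the .deb or is dropped in by the module.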
[17:43:27] neither of those have any prod services on them [17:43:30] i rebooted all workers yesterday :) [17:43:32] did you get through the jvm restart [17:43:34] cool [17:43:37] any nodes you haven't done yet? [17:43:44] kafka, 1010 [17:43:49] ori: https://launchpad.net/~pitti/+archive/ubuntu/systemd [17:43:54] ok cool, i'll do those ones last anyway [17:43:56] "The trusty packages were removed, as they are way too incomplete" [17:44:07] blergh [17:44:12] oh well [17:44:15] next up: analytics1003, analytics1027, those might be all I do today [17:44:20] cool ok [17:44:21] (03PS2) 10Giuseppe Lavagetto: puppet: get rid of the nagios_group global variable [puppet] - 10https://gerrit.wikimedia.org/r/172531 [17:44:26] systemd-nspawn is pretty cool [17:44:32] btw, tracking here: [17:44:33] https://phabricator.wikimedia.org/T1200 [17:44:40] ok [17:44:44] maybe the feature creep will eventually produce upstartd [17:44:57] systemd makes me want to run screaming [17:44:59] chasemp: gave me a token! [17:45:02] i don't know what that means but thanks! [17:45:06] people are even talking about forking debian to avoid it [17:45:08] <_joe_> paravoid: systemd socket activation would help us with hhvm just fine I think, I'm experimenting on my sid with it [17:45:17] it's worth 10 million schrute bucks [17:45:22] jgage: just crazy people with no contributions whatsoever [17:45:23] chasemp never awarded me a token [17:45:39] no one is seriously saying that [17:45:42] although I wish they did [17:45:42] <_joe_> jgage: not "people", one italian dickhead [17:45:48] it'd put an end to that debate [17:45:57] <_joe_> and I just made a friend [17:46:20] i would like to be sold on systemd, but terrible scope creep and lots of new code.. i don't trust it [17:46:43] also not a lot of people know that, but half of the emails on the lists are being sent (or fueled) by a well-known troll [17:46:46] with multiple identities [17:46:55] http://geekfeminism.wikia.com/wiki/MikeeUSA [17:47:11] jgage: also linus hates systemd. his opinion counts a lot in my book. [17:47:41] yes [17:47:44] citation needed [17:48:01] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:48:15] who cares really [17:48:28] it's just a process that parses config files and spawns processes [17:48:31] and some other shit [17:48:36] paravoid: lmgtfy "linus systemd feud" [17:48:36] really, who cares [17:48:38] and touches network and authentication [17:48:56] I have only heard linus say that binary logs are bad, otherwise it seems "upstream" is not hostile towards systemd [17:48:57] large code base crammed down throats which is critical to linux operation.. yeah i don't trust it [17:49:06] and let's face it, binary logs, meh [17:49:21] Linus says: "I don't actually have any particularly strong opinions on systemd itself. I've had issues with some of the core developers that I think are much too cavalier about bugs and compatibility, and I think some of the design details are insane (I dislike the binary logs, for example), but those are details, not big issues."
[17:49:25] http://www.zdnet.com/linus-torvalds-and-others-on-linuxs-systemd-7000033847/ [17:49:34] !log preparing for trusty upgrade of analytics1003 [17:49:37] I don't read that as "hates" [17:49:41] Logged the message, Master [17:50:00] my tinfoil hat is telling me that the nsa would love for people to adopt this large not thoroughly tested codebase which has its fingers in auth and every process [17:50:07] and even if he did, I don't really care about what Linus thinks [17:50:14] he doesn't use/like Debian either, for starters :) [17:50:38] what is your argument in favor of systemd, paravoid? what advantages do you see? [17:50:49] even shuttleworth called systemd "hugely invasive and hardly justified" at one point. ;) [17:50:50] against what? [17:51:04] not against, for [17:51:06] paravoid: that's linus trying his best to be diplomatic. [17:51:14] what's the alternative? [17:51:21] what we have now [17:51:29] what we have where? [17:51:38] who's "we" [17:51:40] nevermind. [17:51:44] no, seriously [17:51:57] for pid1, Debian has SysV, Ubuntu has upstart [17:52:14] yes, so i'm asking why you support replacing those things with systemd [17:52:51] no one sane would advocate staying with SysV [17:53:01] I can go on and on about that, but that really wasn't the debate [17:53:18] everyone agreed that we need to ditch it and it was long overdue, and that's why Ubuntu ditched it anyway [17:53:23] years ago [17:53:52] (and no similar fuss was made, I should say) [17:54:04] that's an argument against sysv, not for systemd [17:54:13] so now the debate became upstart vs. systemd, as all the other contenders didn't even come close [17:54:42] and that debate was highly political, as there are corporate interests at play here as well [17:55:00] on the technical side, I think that each had its merits [17:55:33] having one init system across all distributions seems very appealing to me [17:56:05] implementation-wise, systemd gets quite a few things better (like using cgroups to track daemons instead of ptrace tracing) [17:56:13] as well as integrating well with containers [17:56:31] its socket activation is definitely more complete [17:57:53] (e.g. upstart doesn't support SOCK_DGRAM at all) [17:58:13] I tend to agree that systemd has a lot of upsides and is a definite improvement over upstart [17:58:24] gwicke: hey, I see a Parsoid deployment window today, but it's specified in PDT. Did you mean, 14:00 PST?
[18:00:06] agreed [18:00:11] neither upstart nor systemd are very good at this community thing [18:00:20] thanks paravoid, those are the first concrete statements in favor of systemd that i have run across [18:00:35] upstart was mostly designed behind closed doors and contributions required a CLA with Canonical [18:00:50] greg-g: I'm trying to grab a deployment window from 11:00-13:00 PST, for CentralNotice things. Holler if there's a reason not to do that! [18:00:51] systemd at least has a public, active repo and takes pull reqs [18:00:55] yup [18:01:27] it's just that upstream devs are a bit... arrogant [18:01:28] I really need to gather up my remaining anti-systemd thoughts and pare them down better to the parts that are really valid [18:01:38] (and which are the root causes) [18:01:56] today's LWN article was a nice touch I think [18:02:20] "In the end, it comes down to this: it just is not that important. It is just a system initialization utility" [18:02:36] "But even if systemd turns out to be a wrong turn in the end, the current level of anxiety and heat is not called for." [18:02:37] really the only advantages i see to ditching sysv is init config files instead of init scripts, and faster boot (don't care) [18:02:50] nope, that's not just it [18:03:09] off the top of my head, though, I tend to think most of it revolves around an incomplete and poorly-specified new "API" for daemons/software vs the traditional set of POSIX interfaces, and the whole "let's rewrite everything as part of the systemd software collection" thing, which may at least be in part to other softwares' unwillingness/inability to switch to that "spec" [18:03:11] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:03:21] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:03:35] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:04:31] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server [18:05:39] (on that specific front of the pseudo-API stuff, I tend to think upstart is even worse, but at least allows easy fallback to SysV / plain shellscript. 
But I also think the sysv/shell stuff was fine at that level of the problem, even though it created problems elsewhere in the grand design of an init system) [18:06:27] but this pressure the systemd community seems to feel to rewrite lots of tools within their repo is damning, and I think directly related [18:06:55] it's like they are trying to fuel this debate too [18:06:59] it should be *easy* for the authors of traditional implementations of those tools to patch things up to work brilliantly with or without systemd and do the things systemd needs done [18:07:16] or at least, easy for the systemd community to figure that out and send them patches [18:07:21] they could just as easily call it "daemon friends of systemd" or something [18:07:40] they're not really all that tied together with pid1 [18:07:50] yeah, sure [18:07:55] but can you adopt systemd without all its friends? [18:08:16] depends on which one [18:08:19] they're not pid1, but I think part of the reason for the daemon-friends is that their traditional variants don't play well with systemd's overall design. [18:08:31] Debian didn't switch to networkd for example [18:08:35] that they don't, and that it seems easier to rewrite than send them patches, seems bad. [18:08:38] cool [18:09:02] I have to say, timedated is kinda interesting [18:09:52] oh and binary logs, that's my other separate thing to hate about systemd. fuck binary logs. [18:10:35] that's just for the journal though [18:10:42] well, sort of [18:10:44] which currently isn't even on persistent storage, that's not its point [18:11:00] and can pipe to syslog if you really want to [18:11:07] trying to figure out what problem timedated solves [18:11:21] does this matter? no processes .. swift-object-replicator on ms-be2010, but runs on all others [18:11:26] jgage: well-defined interface to get/set the timezone, for starters [18:11:34] it is intended that the journal supplant the traditional syslog(3) interface. the fact that there are currently transition mechanisms where both the journal and e.g. rsyslogd are running and passing data between them seems transitional. [18:11:52] i guess dpkg-reconfigure tzdata isn't so great [18:11:54] instead of readlink("/etc/localtime") and hope for the best [18:12:09] well-defined for applications as well that is [18:12:14] systemd's design encourages software authors to stop doing things like posix daemonization and syslog(3), and start just writing foreground procs that spew to stderr and let systemd take care of the rest [18:12:41] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -0.0005984306335 secs [18:12:45] my first reaction to that is ew, but maybe it's actually good if it simplifies their code [18:13:09] well it does, if what you're doing is very simple [18:13:29] if you're doing tricky things, then the new interfaces don't cut it because they're incomplete and/or poorly-thought-out. [18:13:47] and by tricky things I assume you mean graceful restarts :) [18:13:51] and then there's the whole "if FreeBSD doesn't run systemd, I can't get rid of the complex code anyways" for portable software. [18:13:59] arr yeah [18:14:14] paravoid: and non-trivial socket stuff that can't use their socket activation. [18:14:25] I'm not a big fan of socket activation in general [18:14:33] I think it's very counter-intuitive to the sysadmin for starters [18:14:59] I actually got used to writing http://cr.yp.to/daemontools.html services at $DAYJOB-1. With enough exposure you can get used to anything.
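To illustrate the model being described here (a foreground process writing to stderr, with systemd handling daemonization, logging and, optionally, the listening socket), a minimal sketch; the unit names, binary path, flag and port are hypothetical:

    # example.service
    [Unit]
    Description=Example foreground daemon

    [Service]
    # no double-fork, no pidfile, no syslog(3): run in the foreground and let
    # stdout/stderr flow to the journal
    ExecStart=/usr/bin/example-daemon --foreground
    StandardOutput=journal
    StandardError=journal

    [Install]
    WantedBy=multi-user.target

    # example.socket -- optional socket activation: systemd owns the listener
    # and starts the service on the first connection
    [Socket]
    ListenStream=8080

    [Install]
    WantedBy=sockets.target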
[18:15:04] Without socket activation, fully converting a piece of software to the simple model where systemd does all the hard stuff is hard to get right. [18:15:25] (because the socket isn't available immediately, and thus one needs systemd-specific notification of readiness for dependent things, etc) [18:15:42] bd808: upstart/systemd killing all those process supervisors is one of the things that make me most happy about this transition [18:16:26] * bd808 nods [18:16:29] sure [18:16:45] I totally agree. and I really hate start-stop-daemon -based initscripts too [18:16:52] those were horrible hacks and always had problems [18:16:57] yup [18:17:53] dunno, I'm hopeful that as all distros switch to systemd, all these missing features will be ironed out [18:18:19] i'd like to adopt it *after* that :) [18:18:23] at the end of the day, the reason I refrain from joining in the public ranting against systemd is that my overall view comes down to "this is cool, but if I were doing it, there's 5% of it I would like to see done better", to which the natural response is "well why didn't you write systemd then?" :p [18:18:35] to which I don't have a good answer. you have to give them props for even trying. [18:19:10] ever used "puppet kick" ? https://docs.puppetlabs.com/references/3.6.2/man/kick.html [18:19:32] is that aliased to rm? [18:19:53] :p [18:20:14] puppet kick needs puppet agent to listen on a port :P [18:20:14] I was trying systemd-nspawn on my build box (which has like 7 chroots) [18:20:17] it was pretty cool [18:20:20] machinectl and all that [18:21:02] nice [18:22:10] legoktm: still no he.wiki in https://meta.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=Tomer+T [18:22:23] matanya: yeah, they need to go to special:centralauth and merge their account [18:22:34] oh [18:22:39] but it doesn't say not attached. [18:22:45] right [18:23:11] matanya: probably https://bugzilla.wikimedia.org/show_bug.cgi?id=71773 [18:23:51] ok, legoktm, can anything be done? [18:24:03] they can go to special:centralauth and merge their account :P [18:25:01] systemd-nspawn sounds interesting, but this section of the manpage is a bummer: [18:25:05] Note that even though these security precautions are taken systemd-nspawn is not suitable for secure container setups. Many of the security features may be circumvented and are hence primarily useful to avoid accidental changes to the host system from the container. [18:28:34] ori: yt? [18:36:34] (03CR) 10Ottomata: "Ok if we are going to override the init.d script with this, we should probably do some combination of:" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:36:42] paravoid: still there? want to talk about cassandra module change? [18:36:52] are you still in unopinionated mode? [18:37:09] how hard do you want me to change it to use a config hash rather than explicit parameters? [18:38:27] awight: possibly; none of the Parsoid folks are in SF, so it doesn't matter to them ;) /cc cscott subbu [18:39:23] i don't know what we're talking about ;) [18:39:29] me neither. [18:39:58] ^ /cc gwicke awight ;) [18:40:11] cscott: subbu: no big deal--I noticed your deploy window was specified in PDT, that's all. [18:40:15] which is passé... [18:40:32] awight: sorry, next time i'll make sure it's specified in EST. [18:40:33] ;) [18:40:48] "Deployment windows are 'pinned' to the time in San Francisco and thus the UTC time will change due to the United States observance of Daylight Savings Time as appropriate."
[18:40:52] so says the wikipage :) [18:41:13] awight: that calendar entry is a zombie, by the way. parsoid really only schedules monday and wednesday deploys, but a thursday window keeps getting copy-and-pasted into each new week's schedule somehow. [18:41:14] awight, ah you mean PDT vs PST? [18:41:26] well, the confusing part was only in the source: 1400PDT is actually 1300PST. [18:41:30] yep [18:41:40] cscott: oops [18:41:43] it's 4pm EST. [18:41:58] anyway, it was useful for this week, since i really did want a thursday window this week [18:42:08] hehe ok I will not linger, then [18:42:09] but i suspect that the zombie thursday will always be stuck in PDT ;) [18:42:11] deletedededed [18:42:15] oh [18:42:17] damnit [18:42:25] who's on first! [18:42:31] i dunno [18:42:42] undid [18:42:46] i dunno is on third [18:42:51] home plate! [18:43:16] greg-g: anyway, just make a mental note that parsoid hates both thursdays and daylight savings time in general ;) [18:43:17] * awight checks ages of all participants :) [18:43:37] * subbu is confused but it is midnight here .. so will hide behind that excuse [18:43:47] * subbu is in india [18:44:17] subbu: https://www.youtube.com/watch?v=kTcRRaXV-fg [18:44:20] https://www.youtube.com/watch?v=kTcRRaXV-fg [18:44:22] yep, that [18:44:29] subbu: watch and laugh :) [18:45:16] ;) [18:45:52] * subbu finds headphones and clicks [18:47:35] PROBLEM - Disk space on labsdb1006 is CRITICAL: DISK CRITICAL - free space: / 1763 MB (3% inode=97%): [18:48:40] hilarious :) [18:50:35] (03CR) 10Ottomata: [C: 04-1] "Ah, also, I think the ::monitoring class will need to be turned into a define, since it works with the specific varnishakafka.stats.json " (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:53:02] i should have written 'third base!' above instead of 'home plate' ;) [18:53:39] while i've got op's attention (greg-g, et al) -- i've got the private half of an ssh key for the npmtravis user i'd like to put on jenkins slaves [18:53:49] i understand that these things are put in the ops-private repo somehow? [18:54:50] hashar said he just copied the key for jenkins-mwext-sync manually to /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa on the jenkins-slaves for the ve sync task, because he didn't have access to ops-private [18:55:04] omg this was out of control .. [18:55:06] how would i go about doing things if I wanted to do it "properly"? [18:57:36] cscott: https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet [18:58:00] basically you would need help from someone in ops to add the keys to that repo [19:00:04] awight, AndyRussG: Dear anthropoid, the time has come. Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T1900). [19:00:21] (03PS1) 10QChris: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 [19:01:05] (03CR) 10jenkins-bot: [V: 04-1] Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [19:02:55] jgage: that sounds right. who in ops wants to help me? ;) [19:03:13] (03PS2) 10QChris: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 [19:03:21] jgage: we could maybe clean up/puppetize /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa for hashar at the same time [19:03:27] (03PS1) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. 
[puppet] - 10https://gerrit.wikimedia.org/r/173066 [19:04:12] cscott, is there a ticket for this issue? [19:05:19] i can make one. bugzilla or RT? https://bugzilla.wikimedia.org/show_bug.cgi?id=73334 covered the first part of the process. [19:06:05] if you could make an RT ticket that would be great, thanks! [19:06:13] ok. [19:06:42] (03PS2) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [19:06:49] * jgage looks forward to phab and the end of "which ticket system should i use?" [19:08:06] <^d> jgage: All of them! [19:13:12] RECOVERY - Disk space on labsdb1006 is OK: DISK OK [19:15:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 784 [19:16:00] lutetium alert is me, silencing... [19:16:32] done generating temp table, now enabling keys [19:16:37] err wrong window [19:16:40] Then in will be "in what N projects should I use" [19:23:06] jgage: RT #8866, for your pleasure [19:28:54] (03CR) 10Dzahn: "just fixing 3 lint warnings, described inline, so that puppet-lint now runs perfectly on init.pp" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:29:30] (03PS15) 10Dzahn: Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:34:37] (03CR) 10Dzahn: [C: 031] Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:36:11] Looks like beta-bits is having issues [19:36:11] http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector&* [19:36:20] greg-g, chrismcmalunch ^ [19:36:32] Error: 503, Service Unavailable at Thu, 13 Nov 2014 19:35:27 GMT [19:36:59] marktraceur: https://bugzilla.wikimedia.org/show_bug.cgi?id=73377#c4 [19:37:06] yeah, that [19:37:55] Seems likely [19:38:53] i copied it from -labs [19:38:57] there was a discussion about it earlier [19:39:00] yeah [19:43:35] (03PS1) 10Awight: testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 [19:43:44] (03CR) 10jenkins-bot: [V: 04-1] testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 (owner: 10Awight) [19:43:49] blah [19:44:30] (03PS2) 10Awight: testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 [19:45:05] (03CR) 10Awight: [C: 032] testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 (owner: 10Awight) [19:52:18] !log starting upgrade to trusty on analytics1023 [19:52:20] Logged the message, Master [19:53:36] * YuviPanda waves [19:54:13] hallo! [19:55:33] (03CR) 10Dzahn: [C: 032] "previous comments from Ori and Alex said it looks good too, minor nitpicks have been addressed, and this doesn't apply the module on anyth" [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:55:58] (03PS1) 10JanZerebecki: Change ru.wikinews.org to HTTPS only. [puppet] - 10https://gerrit.wikimedia.org/r/173078 [19:59:27] (03PS1) 10Yuvipanda: shinken: Setup IRC notification for shinken [puppet] - 10https://gerrit.wikimedia.org/r/173080 [20:00:23] (03Abandoned) 10John F. Lewis: add AAAA for uranium [dns] - 10https://gerrit.wikimedia.org/r/172442 (owner: 10John F. Lewis) [20:04:48] RECOVERY - Disk space on vanadium is OK: DISK OK [20:05:55] wait, did we build ircecho ourselves? 
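For the key request being ticketed above, the production-side convention is that someone in ops adds the private half to the private puppet repo and a public puppet resource pulls it from there. A hedged sketch of what that resource might look like; the private-repo path, file name and ownership are all assumptions, not the actual change:

    file { '/var/lib/jenkins/.ssh/npmtravis_id_rsa':
      ensure => present,
      owner  => 'jenkins',
      group  => 'jenkins',
      mode   => '0600',
      # served from the private puppet repo, which only ops can edit
      source => 'puppet:///private/ssh/npmtravis_id_rsa',
    }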
[20:05:58] oh dear [20:08:03] (03CR) 10Ottomata: Move Eventlogging logs underneath /srv, which has more free space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:08:08] (03CR) 10Jforrester: [C: 04-1] "Needs Parsoid item to be deployed first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [20:11:48] (03PS3) 10QChris: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 [20:13:54] (03PS2) 10Dzahn: ssh server: make listening port configurable [puppet] - 10https://gerrit.wikimedia.org/r/172799 [20:14:06] (03PS1) 10JanZerebecki: Change ru.wikinews.org to HTTPS only. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173083 [20:14:16] !log powering down logstash1003 for a few mins to add disks [20:14:21] Logged the message, Master [20:14:57] (03CR) 10Dzahn: "Alex, the reason is to be able to have a setup where gerrit can listen on 22, for that sshd can't listen on 22, the host has 2 IPs" [puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [20:16:12] (03PS3) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [20:16:21] PROBLEM - Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:51] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 25 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 49, uinitializing_shards: 4, unumber_of_data_nodes: 2} [20:17:08] (03PS2) 10Dzahn: ssh server: make ListenAddress configurable [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) [20:17:24] (03CR) 10Dzahn: ssh server: make ListenAddress configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [20:17:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 25 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 49, uinitializing_shards: 4, unumber_of_data_nodes: 2} [20:18:06] (03PS3) 10Dzahn: ssh server: make ListenAddress configurable [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) [20:18:09] (03PS4) 10QChris: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 [20:19:09] !log patched bugs 71111 and 71394 in wmf7 and wmf8 [20:19:14] Logged the message, Master [20:20:50] (03CR) 10QChris: Move Eventlogging logs underneath /srv, which has more free space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:24:52] RECOVERY - Host logstash1003 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [20:25:46] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [20:25:50] Logged the message, Master [20:26:01] (03CR) 10John F. Lewis: [C: 031] "Code looks good. The idea behind it is valid from a perspective." 
[puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [20:26:21] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 68, initializing_shards: 6, number_of_data_nodes: 3 [20:26:51] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 68, initializing_shards: 6, number_of_data_nodes: 3 [20:28:37] cmjohnson: If you can wait until the logstash1003 goes green before taking logstash1002 down that would be swell. [20:28:44] _joe_: lol, memcached role also is entangled with ganglia [20:28:53] yep [20:29:16] <_joe_> YuviPanda: oh man [20:29:18] YuviPanda: ganglia tentacles are everywhere [20:29:46] (03CR) 10Catrope: "Parsoid change is merged now, but will not be deployed until Monday (Nov 17)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [20:30:28] _joe_: yeaaah. [20:30:29] _joe_: sigh [20:32:33] !log Dropped replica count of all logstash indices except today to 0. Should make rolling restarts faster during hardware upgrade. [20:32:35] Logged the message, Master [20:32:42] (03CR) 10Dzahn: [C: 032] Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [20:33:31] !log awight Synchronized wmf-config: Enabling CentralNotice banner choice on testwiki (duration: 00m 04s) [20:33:34] Logged the message, Master [20:35:18] _joe_: doesn't seem as involved [20:35:19] greg-g: can u confirm that "sync-dir wmf-config" should have deployed globals to testwiki (among others)? [20:35:48] <_joe_> YuviPanda: tomorrow I'll merge the varnish patch if my tests with the cron script are ok [20:35:51] (03CR) 10Dzahn: "Notice: /Stage[main]/Misc::Statistics::Limn::Data::Jobs/Misc::Statistics::Limn::Data::Generate[flow]/Git::Clone[analytics/limn-flow-data]/" [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [20:36:23] (03CR) 10Andrew Bogott: "This works, and doesn't break anything on Precise." [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [20:36:45] !log awight Synchronized php-1.25wmf7/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [20:36:50] Logged the message, Master [20:37:14] (03CR) 10Dzahn: [C: 031] "Ori, better now?" [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:37:45] should, I believe, Reedy confirm me ^^ [20:38:00] (03PS1) 10Ori.livneh: Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 [20:38:11] mutante: just double checking, did someone ask you to merge [20:38:15] the flow cron job? [20:38:26] I'd like to install the CentralNotice "infrastructure mode" schema to testwiki... [20:39:05] cmjohnson: logstash1003 is done copying from logstash1002 so you can take 1002 down whenever you are ready.
[20:39:07] (03CR) 10Ori.livneh: [C: 031] fix up ordering for salt-minion package, config, service [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:39:10] ottomata: gerrit did [20:39:16] bd808...cool [20:39:18] doing so now [20:39:34] !log powering down logstash1002 to add disks [20:39:37] Logged the message, Master [20:39:41] gerrit asked you to merge it? [20:39:50] i'm pretty sure it won't work without some other dependency that flow was doing [20:39:51] PROBLEM - NTP on logstash1003 is CRITICAL: NTP CRITICAL: Offset unknown [20:39:51] ottomata: yea, you said yourself it's fine too [20:39:51] i'm not really sure [20:39:55] dan and S were coordinating [20:40:21] i think either the git clone will fail (because there is a missing repo), or the cron will fail (because the repo doesn't have the proper script?) not sure. [20:40:33] if it shouldn't be merged, please vote it down or add that comment [20:40:41] (03CR) 10Ori.livneh: [C: 031] Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:40:42] it was in my review queue [20:40:46] the cronjob got added fine [20:40:46] HMMM [20:40:54] will it run! that is the question [20:40:56] it won't hurt anything [20:41:01] it might just cronspam, i really don't know [20:41:09] was waiting for the go ahead from dan/S on that one [20:41:14] last I heard dan told me to just wait [20:41:35] i will vote down from now on! i've never had this happen before :p [20:42:04] what never happened before? [20:42:21] confused [20:42:31] PROBLEM - Host logstash1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:44] (03PS1) 10Dzahn: Revert "Add cron job that generates flow statistics" [puppet] - 10https://gerrit.wikimedia.org/r/173089 [20:43:15] um, i guess i've never had someone else merge a change [20:43:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 31, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 41, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:43:40] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 31, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 41, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:43:45] usually people will review, but let the folks who wrote the change do the merging [20:43:57] ottomata: it's a change by somebody who cant merge [20:44:09] mmm, originally :/ [20:44:39] (03CR) 10Dzahn: [C: 032] "per Ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/173089 (owner: 10Dzahn) [20:44:41] Where is the testwiki error log? [20:44:46] (03CR) 10John F. Lewis: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [20:44:51] RECOVERY - NTP on logstash1003 is OK: NTP OK: Offset -0.004023194313 secs [20:44:56] mutante: it's ok [20:45:02] not a big deal [20:45:05] heh. I never noticed the spurious 'u' chars from python unicode strings in the icinga alerts for elasticsearch before. [20:45:31] ha, probably we should have just checked with S before reverting, it might have been ok.
i just thought i was waiting for someone to tell me it was time to merge [20:46:40] ottomata: i can only know what it says on gerrit and that said it should now work [20:47:04] ottomata: millimetric cant merge himself.. so i dont know how it would work [20:47:12] ori: is this you? [20:47:13] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass trusted_group to Class[Keyholder] on node i-0000010b.eqiad.wmflabs [20:47:20] ori: deployment-bastion [20:47:24] (03PS2) 10Ori.livneh: Allow multiple instances [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 [20:47:37] YuviPanda: i'll look [20:48:06] YuviPanda: trusted_group is getting passed, not sure why it's barfing [20:48:23] aye, i guess, i didn't realize other opsies would look at it! not sure who added you, its ok, its my fault, i shoulda -1ed [20:48:40] ori: I wonder if it's rebase failure on the master [20:48:41] RECOVERY - Host logstash1002 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [20:48:47] that's how uncommon it is people look at other people's patches :/ [20:48:53] "never happened before" [20:49:33] bd808 coming back...lmk when it's good to get the 1001 [20:49:40] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 61, initializing_shards: 1, number_of_data_nodes: 3 [20:49:50] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 61, initializing_shards: 1, number_of_data_nodes: 3 [20:49:51] (03CR) 10Ori.livneh: "@ottomata: did everything except the monitoring bit" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [20:50:26] mutante: btw, no pressure, but do you have a sense for how long it would take to fulfill the mwdeploy key request in rt? [20:50:27] ori: aha, can't rebase because your change conflicts with the local hacks on betalabs around the mw role [20:50:29] cmjohnson: Good to go any time for taking logstash1001 offline. [20:50:30] (i saw that you took that one) [20:50:41] okay [20:50:48] bd808: your patches are conflicting with ori's new ones :) [20:50:53] fun [20:51:02] !log powering down logstash1001 to add disks [20:51:02] YuviPanda: {{sofixit}} [20:51:05] Logged the message, Master [20:51:30] bd808: ori I'm taking a shot now [20:51:37] thanks! [20:51:49] YuviPanda: which hack patch is conflicting? [20:52:00] ok, I've no idea how any of this works... [20:52:04] or why the local patch exists... [20:52:18] bd808: Applying: [LOCAL HACK] Bug 65591: User['mwdeploy'] shell => /bin/bash [20:52:26] ah. 
[20:53:11] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:12] That keeps puppet from creating a local mwdeploy user on beta hsots [20:53:14] *hosts [20:53:18] ah, hmm [20:53:29] If Ori is changing that upstream then it should be ok to remove [20:53:41] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 27 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 25, utimed_out: False, uactive_primary_shards: 26, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 37, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:53:44] bd808: ori can either of you look? [20:54:00] YuviPanda: link to ori's patch? [20:54:08] I have an interview in 5 minutes [20:54:14] I'm not sure which one conflicts [20:54:17] ah [20:54:20] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 27 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 25, utimed_out: False, uactive_primary_shards: 26, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 37, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:55:01] YuviPanda: https://github.com/wikimedia/operations-puppet/commit/802c7568a627cea53438425a0c450f93fd22e273 [20:55:25] YuviPanda: We need the homedir for mwdeploy changed in labs ldap to match that [20:55:31] Then drop the hack patch [20:55:38] then merge should work [20:56:14] If you don't change the ldap homedir things will get messed up [20:56:34] because puppet will create a local user [20:56:50] and that won't work with the NFS4 shares [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T2100). Please do the needful. [21:00:40] ACKNOWLEDGEMENT - puppet last run on analytics1023 is CRITICAL: CRITICAL: Puppet has 1 failures ottomata I accidentally upgraded zookeeper here when I upgraded the OS to Trusty. I want to wait a day before I continue with the ZK upgrade to make sure all is fine, and then I will fix puppet. This should be gone by tomorrow. [21:01:09] jouncebot: ok, let's break things! [21:02:31] RECOVERY - Host logstash1001 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [21:03:31] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 62, initializing_shards: 0, number_of_data_nodes: 3 [21:04:01] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 62, initializing_shards: 0, number_of_data_nodes: 3 [21:04:06] ori: what about disabling or removing the instance that the .deb installs? [21:04:10] re. varnishkafka chang3e [21:05:48] (03CR) 10Ottomata: "What about removing the instance that the .deb installs? I.e. the init.d script? 
If we do that, we can just remove the default file then" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [21:06:34] ori: i will do it today [21:08:14] (03PS2) 10Yuvipanda: shinken: Setup IRC notification for shinken [puppet] - 10https://gerrit.wikimedia.org/r/173080 [21:12:00] is there any way to confirm I've deployed a MediaWiki configuration global correctly? [21:12:21] awight: eval.php [21:12:28] legoktm: woot! thanks [21:12:32] (03CR) 10Andrew Bogott: Get betalabs localsettings.js file from deploy repo (just like prod) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [21:13:17] (03Abandoned) 10Andrew Bogott: RT: allow login via LDAP [puppet] - 10https://gerrit.wikimedia.org/r/80577 (owner: 10Faidon Liambotis) [21:14:50] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:31] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:33] (03PS1) 10Awight: fix typo in global variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173100 [21:15:51] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:57] !log updated Parsoid to version dabff010 [21:15:59] Logged the message, Master [21:16:16] cscott: can u ping me when you're finished deploying? I'm trying to sneak in one last wmf-config change. [21:16:30] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:39] awight: sure. [21:16:41] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:31] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:34] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:19:31] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [21:20:14] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [21:20:21] (03PS1) 10Dzahn: add new mediawiki deployment public key [puppet] - 10https://gerrit.wikimedia.org/r/173103 [21:21:11] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.043 second response time [21:21:55] (03CR) 10Dzahn: [C: 032] add new mediawiki deployment public key [puppet] - 10https://gerrit.wikimedia.org/r/173103 (owner: 10Dzahn) [21:22:50] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [21:22:50] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.062 second response time [21:22:53] (03PS5) 10Ottomata: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [21:23:07] (03CR) 10Ottomata: [C: 032 V: 032] Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [21:23:22] (03CR) 10Dzahn: [C: 032] Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 (owner: 10Ori.livneh) [21:23:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [21:23:46] ottomata: merge conflict on puppetmaster, doing both,k [21:23:49] (03PS3) 10Ottomata: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:23:54] RECOVERY 
- Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.040 second response time [21:23:56] (03PS2) 10Ottomata: Link EventLogging logs into /var/log/eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/172884 (owner: 10QChris) [21:24:02] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.043 second response time [21:24:09] ottomata: don't merge that one! [21:24:21] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time [21:24:25] not the link? [21:24:33] not the link. [21:24:35] ok [21:24:44] we need to cleanup the existing directory before. [21:24:57] But the logrotate thing that I screwed up yesterday would be gret. [21:25:04] s/gret/great/ [21:25:05] this one? [21:25:05] https://gerrit.wikimedia.org/r/#/c/172707/3/modules/eventlogging/files/logrotate [21:25:43] that would be ok to merge, but i was refering to [21:25:44] https://gerrit.wikimedia.org/r/#/c/173065/ [21:26:09] ok [21:26:17] (03PS3) 10Ottomata: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [21:26:49] (03CR) 10QChris: [C: 04-1] "We need to clean up the old /var/log/eventlogging before" [puppet] - 10https://gerrit.wikimedia.org/r/172884 (owner: 10QChris) [21:28:28] (03CR) 10Ottomata: [C: 032] Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [21:31:32] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures [21:35:20] (03Abandoned) 10Ottomata: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:39:24] (03Restored) 10QChris: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:39:38] (03PS4) 10QChris: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) [21:40:48] awight: ok, parsoid's done [21:41:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:41:21] cscott: thanks! [21:41:32] (03CR) 10Awight: [C: 032] fix typo in global variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173100 (owner: 10Awight) [21:41:51] greg-g: /me is sneaking back to tin for one last config change [21:42:01] shneaky shneaky [21:42:19] (03CR) 1020after4: [C: 031] "looks good but I guess we need to wait to merge this until migration is underway?" [puppet] - 10https://gerrit.wikimedia.org/r/172471 (owner: 10Dzahn) [21:42:31] !log awight Synchronized wmf-config: Enabling CentralNotice banner choice on testwiki, take 2 (duration: 00m 06s) [21:42:33] Logged the message, Master [21:49:51] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:53:42] greg-g: grr, we broke mobile on mediawiki and labs with this morning's deployment, pushing a fix now... 
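The "Retain 90 days" change above amounts to a logrotate stanza along these lines. A minimal sketch, assuming the /srv/log/eventlogging path from the earlier patches; the real template may use different options:

    /srv/log/eventlogging/*.log {
        daily
        rotate 90
        compress
        delaycompress
        missingok
        notifempty
    }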
[21:54:03] so much breakage today [21:54:12] (mobile broke beta cluster today as well) [21:55:51] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [22:02:25] (03CR) 10Ottomata: [C: 032] Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [22:03:52] !log failed over hadoop namenode to analytics1004 [22:03:54] Logged the message, Master [22:04:01] OOO [22:04:03] it's happening! [22:04:03] :) [22:04:13] :D [22:04:36] On purpose for the upgrade? [22:05:16] separate from os upgrade, just upgrading jvm [22:05:52] Jvm upgrade is great too :-D [22:05:57] :) [22:06:36] still getting a lot of read requests on analytics1010 but according to the log it's handling them correctly, not sure if i should wait for those to die off before restarting the service [22:07:21] jgage: I am hitting hive a bit ... that would last several more hours. [22:07:37] but they are not important. So in case something dies, it's ok. [22:08:33] cool, ok. [22:09:26] loadavg on analytics1004 is suspiciously low although i see namenode log doing the right thing, i guess the load remaining on 1010 is because of resourcemanager which is not HA in this version of hadoop [22:10:14] PROBLEM - Hadoop NameNode Primary Is Active on analytics1010 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby [22:11:15] oops, didn't schedule downtime fast enough [22:11:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:12:24] !log restarted EventLogging jobs that write to disk, to pick up config changes [22:12:28] Logged the message, Master [22:12:35] hm, yeah, jgage, the namenode doesn't really do that much [22:12:44] unless it needs to access lots of files or file metadata [22:12:45] in hdfs [22:13:15] back in a bit, working intermittently for a few more hours, but i'm kinda half here [22:13:16] byeee [22:13:32] ciao [22:13:38] !log awight Synchronized php-1.25wmf7/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [22:14:13] (03PS1) 10EBernhardson: Share parsoid cookie forwarding config for VE/Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 [22:14:16] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 04s) [22:14:19] Logged the message, Master [22:15:06] (03CR) 10EBernhardson: "looking for a +1 before putting this up for SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 (owner: 10EBernhardson) [22:16:22] qchris: i'm going to restart resourcemanager now. not sure whether that will interrupt or just stall your hive queries. [22:16:36] Sure. Let's find out. [22:16:39] ok! [22:16:56] done [22:17:05] I did not notice anything. [22:17:20] Now I did! [22:17:27] heh [22:17:28] Everything crashed hard :-D [22:17:29] shinken-wm: hi? [22:17:32] aw :( [22:17:39] java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1409078537822_77062' doesn't exist in RM. [22:17:40] well, now we know what happens when i do that!
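The NameNode failover !logged above is normally driven with hdfs haadmin. A hedged sketch; the NameNode service IDs here are made up, the real ones come from dfs.ha.namenodes.* in hdfs-site.xml:

    # make the standby (analytics1004) active, demoting the current active NameNode
    sudo -u hdfs hdfs haadmin -failover analytics1010-nn analytics1004-nn
    # confirm which NameNode is now active
    sudo -u hdfs hdfs haadmin -getServiceState analytics1004-nn

As the chat notes, the ResourceManager has no such standby in this Hadoop version, so restarting it kills running YARN applications outright, which is exactly the ApplicationNotFoundException seen above.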
[22:17:45] Right :-) [22:18:02] hadoop 2.5 / cdh 5.2 has HA resourcemanager [22:19:11] !log hadoop: analytics1010 is again active namenode [22:19:15] Logged the message, Master [22:19:20] RECOVERY - Hadoop NameNode Primary Is Active on analytics1010 is OK: Hadoop.NameNode.FSNamesystem.tag_HAState OKAY: active [22:25:14] (03PS1) 10Ori.livneh: keyholder: fix `keyholder` script [puppet] - 10https://gerrit.wikimedia.org/r/173177 [22:26:10] (03CR) 10Dzahn: "20after4: thanks for the review, is the script going to handle also bugs.wikimedia.org ?" [puppet] - 10https://gerrit.wikimedia.org/r/172471 (owner: 10Dzahn) [22:26:24] greg-g: ok... done unbreaking :~$ [22:26:54] (03CR) 10Ori.livneh: [C: 032] keyholder: fix `keyholder` script [puppet] - 10https://gerrit.wikimedia.org/r/173177 (owner: 10Ori.livneh) [22:27:21] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:29:07] (03CR) 10Ori.livneh: "Ottomata: I don't remove it, but in init.pp I ensure the service is stopped and disabled." [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [22:31:06] awight: for now.... [22:31:13] (03CR) 10Rush: [C: 031] "I reviewed this (did not run the tests) but from what I understand there is no harm here to overall pybal and ori's explaination makes sen" [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:31:29] chasemp, mutante: thanks! [22:31:46] (03CR) 10Ori.livneh: [C: 032] Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:32:02] (03Merged) 10jenkins-bot: Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:32:51] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:00] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:36] hmm [22:33:44] _joe_: i have an interview in 10 mins, not sure how much i'd be able to look at that [22:33:57] <_joe_> mmmh [22:34:06] <_joe_> I'll take a look, then go to bed [22:35:08] <_joe_> very high load, it seems [22:35:46] TimStarling: any chance you can look? I have an interview in ten minutes [22:35:54] and it's very late for _joe_ [22:36:01] this is re: mw1114 alerts above [22:36:14] ok [22:36:16] not the nicest way to say "good morning", but.. :/ [22:36:17] thanks [22:37:15] (03Abandoned) 10Dzahn: nickel: remove ganglia, re-add in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/172862 (owner: 10Dzahn) [22:38:42] <_joe_> ori: all of API is going down [22:40:35] TimStarling: just in case you didn't see that last message from _joe_ re the API cluster going down :/ [22:40:49] (03PS1) 10Dzahn: remove nickel from puppet [puppet] - 10https://gerrit.wikimedia.org/r/173186 [22:41:12] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [22:41:31] (03CR) 10Jforrester: "Looks fine from a VE perspective. One minor suggestion about comments." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 (owner: 10EBernhardson) [22:44:07] (03PS2) 10EBernhardson: Share parsoid cookie forwarding config for VE/Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 [22:50:01] bd808: i guess i'm going to make an ext4 raid0 mounted on /var/lib/elasticsearch to copy elastic10xx hosts, unless you feel something different would be appropriate? 
(for logstash) [22:51:51] jgage: Sounds good to me, but we need to make sure we copy over the data that is currently in /var/lib/elasticsearch. I dropped the replica count to 0 for most indices so we can't afford to lose any data [22:52:14] yep, will do [22:52:22] \o/ [22:52:31] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:54] i'll make the volume, mount it on /mnt, shut down elasticsearch, copy the data, remount in the proper place, then start elastic again [22:54:38] lots of search stuff being depooled on lvs1003 seems you guys know? search1020.eqiad.wmnet [22:55:07] (03CR) 10Chad: Add read only configuration for ElasticSearchTTMServer (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 (owner: 10Nikerabbit) [22:55:27] <^d> chasemp: search1020 doesn't matter. [22:55:47] 19 and 16 too but good enough for me :) [22:56:13] <^d> search1001-1006 matter because they serve enwiki. [22:56:24] <^d> 1017, 1018 matter because they serve prefix searching. [22:56:35] <^d> Rest can flap until they die :p [22:57:25] <^d> 19, 16 and 20 are all in pool 4. [22:58:31] (03CR) 10Ottomata: "I think we should remove it, no?" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh)
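Written out, the disk swap described above looks roughly like this. A hedged sketch; the member devices and md number are assumptions, and the new array also needs an fstab (or mdadm.conf) entry to survive a reboot:

    # build the raid0 array and filesystem
    mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
    mkfs.ext4 /dev/md2

    # copy the existing data with elasticsearch stopped, then swap the mount
    mount /dev/md2 /mnt
    service elasticsearch stop
    rsync -a /var/lib/elasticsearch/ /mnt/
    umount /mnt
    mount /dev/md2 /var/lib/elasticsearch
    service elasticsearch start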
go ahead and remove public ip along with site.pp" [dns] - 10https://gerrit.wikimedia.org/r/172819 (owner: 10Dzahn) [23:36:42] (03CR) 10Dzahn: [C: 032] remove nickel from puppet [puppet] - 10https://gerrit.wikimedia.org/r/173186 (owner: 10Dzahn) [23:39:03] !log kaldari Synchronized wmf-config/mobile.php: Adding WikiGrok A/B test start and end times (duration: 00m 03s) [23:39:07] Logged the message, Master [23:39:50] (03PS1) 10Dzahn: remove nickel from ganglia server aliases [puppet] - 10https://gerrit.wikimedia.org/r/173197 [23:40:34] (03CR) 10Dzahn: [C: 032] remove nickel from ganglia server aliases [puppet] - 10https://gerrit.wikimedia.org/r/173197 (owner: 10Dzahn) [23:49:20] <^d> RoanKattouw: Were you taking swat today because you have half the patches? [23:49:42] <^d> (if not I can totally do it) [23:50:46] ^d: I was going to initially, but I think it'll be hard [23:50:53] I'm in the middle of some conversations that will take >10 mins [23:51:11] <^d> I can do it, no worries :) [23:51:20] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:51:28] <^d> bleh, stupid gitblit. [23:51:32] (03PS1) 10Ori.livneh: Correct Travis CI URL in README.md [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 [23:51:48] (03CR) 10Ori.livneh: [C: 032] "README-only" [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 (owner: 10Ori.livneh) [23:52:04] (03Merged) 10jenkins-bot: Correct Travis CI URL in README.md [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 (owner: 10Ori.livneh) [23:52:34] <^d> !log restarted gitblit on antimony [23:52:37] Logged the message, Master [23:54:40] <^d> ebernhardson: ping for swat in ~5m [23:55:04] !log nickel - remove from puppet,salt,icinga,stop services... [23:55:07] Logged the message, Master [23:55:11] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57893 bytes in 0.497 second response time [23:55:15] ^d: kk [23:55:30] <^d> Sweet, now we just need an S and we'll be set :) [23:57:35] <^d> ebernhardson: merged your two patches, prepping submodule updates now. [23:58:53] <^d> RoanKattouw: I'm doing you last since I don't have links for yours yet. [23:59:14] Oh crap! [23:59:18] I promised I'd build those and totally forgot [23:59:20] Will build them now [23:59:22] <^d> kk