[00:00:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T0000). Please do the needful. [00:00:04] no, don't merge it right before you go to sleep [00:00:10] check the notifications commands for now? [00:00:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:00:17] I'm not going to sleep at least for another 30m [00:00:20] yeah, doing that now [00:00:21] let's talk more about the IRC bot before we keep reinventing it [00:00:29] that's a pattern [00:00:36] puppet's running on neon [00:00:41] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:00:42] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:00:43] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:00:50] i'll look at those ^ [00:01:08] I don't think we re-invented IRC, tbh. They're all different use cases for different tools... [00:01:14] (03PS4) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:01:43] icinga-wm uses ircecho, grrrit-wm uses nodejs because it needed async code, wikibugs uses python because valshallaw is better than me in writing async python code :) [00:01:54] SWAT is empty :) [00:02:09] so maybe that means we shouldn't keep writing new ones from scratch but use a stable one [00:02:19] what exactly are you proposing? [00:03:54] (03PS5) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:04:11] one irc bot to rule them all? [00:04:31] chasemp: I vaguely remember you talking about it :) [00:04:48] grrrit-wm will be subsumed by current wikibugs2 when we migrate to phab. [00:04:55] to rule most I think is the best I can forsee at this time [00:05:04] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:05:09] there is jouncebot and who knows what else [00:05:14] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [00:05:19] I wouldn't be surprised by a dozen irc bots [00:05:29] (03PS1) 10Ori.livneh: Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 [00:05:31] well, how exactly would you integrate jouncebot into wikibugs2? [00:05:33] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:05:34] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:05:34] (03PS6) 10Rush: phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 [00:05:35] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:05:36] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:05:36] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:05:39] having one tool do 500 things isn't good either [00:06:10] maybe I don't understand it well enough, what does jouncebot do [00:06:14] that would be hard to integrate [00:06:25] reads a wikitech page, announces things [00:06:56] in my mind that's a pretty simple use case for inclusion [00:07:07] into where? 
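[editor's note] Jouncebot's job, as described just above, is tiny: read a wikitech page and announce what it finds. A toy sketch of that shape, assuming the page title, API endpoint and filter string — this is not jouncebot's actual code:

```python
# Toy sketch only: fetch a wikitech page and "announce" it, roughly what
# jouncebot is described as doing above. Page title, API endpoint and the
# "deploycal-item" filter are assumptions for illustration.
import requests

API = "https://wikitech.wikimedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch raw wikitext of a page via the MediaWiki API."""
    resp = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
    })
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]["*"]

def announce(lines):
    # A real bot would push these to IRC; printing stands in for that here.
    for line in lines:
        print(line)

if __name__ == "__main__":
    text = fetch_wikitext("Deployments")
    # Naive filter for lines that look like deployment-window entries.
    announce(l for l in text.splitlines() if "deploycal-item" in l)
```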
[00:07:10] but the wall for me came when people declared they like the lots-of-bots [00:07:25] hmm, actually [00:07:34] logmsgbot + friends should probably be integrated together [00:07:57] (03CR) 10Rush: [C: 032] phab email pipe cleanup and allow maint [puppet] - 10https://gerrit.wikimedia.org/r/172915 (owner: 10Rush) [00:08:41] the idea that every service should have it's own irc bot I find weird [00:08:44] but that may be just me [00:09:04] it's mostly a historical artifact, and I agree that putting things into one is a good idea. [00:09:21] it's just that... I don't have the time to do it. [00:09:40] yep I think it's never a priority, which meh, is what it is [00:09:43] yeah [00:10:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:10:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:10:14] I think all the current bots were also written on volunteer time, and there it's always simpler to write a bot than a overarching bot framework that can take in things from everywhere. [00:10:20] YuviPanda: there's a bug somewhere for combining logmsgbot + morebots because they would get separated due to netsplits. [00:10:31] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:10:31] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:10:32] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:10:33] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 242 seconds ago with 0 failures [00:10:39] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:10:41] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:10:53] legoktm: yup, what we should actually do is to have one 'Events pipeline' for everything in prod and plug all IRC bots from prod into that one [00:10:58] and have it present itself as one bot [00:11:01] yes [00:11:03] that also logs to SAL, etc. [00:11:05] or at least my thought as well [00:11:12] with very nice rulesets as to what goes where [00:11:15] and things just report json to it [00:11:20] and it figures out how to put them where [00:11:29] <^demon|lunch> manybubbles: I've got it. [00:12:00] YuviPanda: I think logmsgbot and morebots are the only prod IRC bots. (excluding irc.wm.o) [00:12:09] legoktm: true. [00:12:25] chasemp: mutante I think we also prefer to keep bots that don't need to be running in prod not running in prod... [00:12:26] Is morebots even in prod still? [00:12:28] so more people can fix them [00:12:41] legoktm: well, if it's noting down changes from the commandline in prod... [00:12:50] morebots is actually in labs [00:12:50] I am a logbot running on tools-exec-14. [00:12:50] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [00:12:50] To log a message, type !log . [00:12:53] :) [00:12:55] :) [00:13:00] logmsgbot is the only one in prod then [00:13:03] ah, right [00:13:07] https://wikitech.wikimedia.org/wiki/Morebots [00:13:48] Aaand logmsgbot is ircecho, which you want to use right? 
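[editor's note] The "events pipeline" idea floated just above — services report JSON to one place, a single bot works out which channel it goes to and logs to SAL — gets fleshed out later in this log as http -> redis -> irc. A minimal sketch of the consuming half (redis queue to IRC), with the queue name, field names and the IRC send all assumed for illustration; the real code ends up in the ircnotifier repo linked further down:

```python
# Minimal sketch of the redis -> IRC half of the pipeline discussed here.
# Field names (from/channels/message), the queue name and the send_to_irc
# stub are assumptions; the real bot uses an IRC library (irc3).
import json
import redis

QUEUE = "irc-notify"          # assumed list name, must match the producer
r = redis.StrictRedis(host="localhost", port=6379, db=0)

def send_to_irc(channel, text):
    """Placeholder: a real relay hands this line to its IRC connection."""
    print("[{}] {}".format(channel, text))

def relay_forever():
    while True:
        # Blocking pop: wait until a producer LPUSHes an event
        # (the lpush + blocking-pop pattern mentioned further down).
        _, raw = r.brpop(QUEUE)
        event = json.loads(raw.decode("utf-8"))
        line = "{}: {}".format(event["from"], event["message"])
        for channel in event.get("channels", []):
            send_to_irc(channel, line)

if __name__ == "__main__":
    relay_forever()
```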
[00:13:59] :P [00:15:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:15:12] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:15:12] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:15:13] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:15:18] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:15:18] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:15:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:15:20] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:16:11] legoktm: :P [00:16:23] legoktm: if we build this pie-in-the-sky 'event-notification service' then we'll just send json to it [00:16:37] oh, icinga-wm is in prod. [00:18:33] legoktm: yeah, it is. shinken-wm will be in labs tho [00:20:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:20:15] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:20:33] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:20:34] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:20:35] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:20:36] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [00:20:36] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:20:37] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:20:38] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:22:25] (03PS1) 10Rush: phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 [00:22:34] (03CR) 10jenkins-bot: [V: 04-1] phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 (owner: 10Rush) [00:22:40] (03PS2) 10Rush: phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 [00:23:53] (03CR) 10Rush: [C: 032] phab remove bad save value [puppet] - 10https://gerrit.wikimedia.org/r/172921 (owner: 10Rush) [00:24:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:25:09] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:25:38] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:25:39] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:25:40] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:25:41] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 290 seconds ago with 0 failures [00:25:43] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [00:25:43] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:25:44] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:25:45] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:27:06] mutante: chasemp fwiw, me and legoktm just started brainstorming writing this 'event notification pipeline' for labs :) [00:28:17] awesome :) [00:28:37] realized it's going to be super simple, and we already have salvageable parts [00:29:26] redis pubsub was my previous approach [00:29:35] yeah [00:29:40] we'll have http -> redis -> irc [00:29:47] with simple auth tokens [00:30:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:30:03] so 
services hit the http service with a simple json (from, token, channels, message) [00:30:07] then it puts them in redis [00:30:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:30:11] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:30:12] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:30:13] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:30:14] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [00:30:15] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [00:30:15] and then another script just puts them up on irc [00:30:16] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [00:30:16] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:30:22] why not just have services put them in redis directly? [00:30:24] chasemp: I usually prefer lpush + rbpop [00:30:25] why the http frontend? [00:30:27] chasemp: authentication [00:30:32] this is labs [00:30:36] oh labs [00:30:36] ah [00:31:19] chasemp: wikibugs and wikibugs2 is already organized similarly, there's redis subscriber + irc pusher, so we'll reuse it [00:31:24] chasemp: and then a simple flask app should take care of the rest. [00:31:37] chasemp: biggest problem now is getting me and legoktm to agree on a name :) [00:31:52] ircnotifier [00:32:17] if you want full pedantic compliance, yaib [00:32:30] phpwikibot2 [00:32:40] is it written in php? [00:32:42] nope [00:32:43] :D [00:32:49] troll yuvi [00:32:58] (03PS2) 10Springle: m2-master switch to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/172498 [00:33:04] I mean, even if the code quality ends up being bad, people can still be relieved it's not php [00:33:23] <^d> botoid? [00:33:36] we're going to call the project ircnotifier, since it's not wiki specific [00:34:12] and then we can bikeshed the bot name [00:34:14] https://github.com/yuvipanda/ircnotifier [00:34:16] legoktm: ^ [00:34:25] okay [00:34:36] (03CR) 10Springle: [C: 032] m2-master switch to dbproxy1002 [dns] - 10https://gerrit.wikimedia.org/r/172498 (owner: 10Springle) [00:34:45] legoktm: want me to steal the code from wikibugs or wikibugs2? [00:34:50] 2 [00:34:51] legoktm: wikibugs2, I think [00:34:52] yeah [00:35:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:35:10] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:35:34] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:35:35] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [00:35:36] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:35:37] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 293 seconds ago with 0 failures [00:35:37] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 300 seconds ago with 0 failures [00:35:38] legoktm: python3? :) [00:35:42] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 278 seconds ago with 0 failures [00:35:43] of course [00:35:43] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:37:08] legoktm: :D [00:37:09] (03PS2) 10Ori.livneh: Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 [00:37:48] (03CR) 10Ori.livneh: "The key first needs to be generated and added to the private repository. See RT 8857." 
[puppet] - 10https://gerrit.wikimedia.org/r/172919 (owner: 10Ori.livneh) [00:38:54] legoktm: can't co-routines be instance methods? [00:38:59] legoktm: why is redisrunner a function? [00:39:05] they probably can be. [00:39:11] I was trying to keep it similar to wikibugs [00:39:43] tch tch [00:40:04] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:40:21] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:40:21] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:40:22] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:40:23] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:40:24] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:45:00] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:45:12] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:45:26] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:45:27] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:45:27] legoktm: initial code :) https://github.com/yuvipanda/ircnotifier [00:45:27] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:45:28] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:45:34] legoktm: has one hugeass FIXME tho ;) [00:46:07] lol [00:46:09] !log thulium - Could not intern from pson: expected value in object at '"[PHP]\n\n; puppet:t'! [00:46:13] Logged the message, Master [00:47:19] legoktm: ;) [00:48:14] irc3 probably has a method to join a channel if you're not in it already [00:49:27] legoktm: yeah [00:49:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:50:09] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [00:50:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [00:50:18] something is up with fundraising [00:50:23] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:50:24] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:50:24] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:50:27] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:50:37] ori: yeah, mutante is looking into it [00:51:28] earlier the puppet certs on fr master have been deleted accidentally [00:51:34] they have all been resigned already [00:51:43] the error on thulium i saw last is now gone [00:51:49] notice: Finished catalog run in 3.65 seconds [00:51:57] BUT.. it's suspiciously fast? [00:52:59] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 346 seconds [00:53:00] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 348 seconds [00:53:11] let's call Jeff? [00:53:17] it's not too late on the east coast [00:54:36] eek [00:54:49] has someone called already? 
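[editor's note] On the "can't co-routines be instance methods?" question from earlier in this stretch (00:38): they can. A tiny standalone illustration in modern async/await syntax (the 2014-era wikibugs code used asyncio's generator-based coroutines); nothing here is taken from the actual wikibugs/ircnotifier source:

```python
# Tiny illustration that a coroutine can be a plain instance method.
import asyncio

class Relay:
    def __init__(self, name):
        self.name = name
        self.queue = asyncio.Queue()

    async def producer(self):
        for i in range(3):
            await self.queue.put("event {}".format(i))

    async def consumer(self):
        for _ in range(3):
            item = await self.queue.get()
            print(self.name, "got", item)

async def main():
    relay = Relay("demo")
    await asyncio.gather(relay.producer(), relay.consumer())

asyncio.run(main())
```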
[00:54:51] other jeff [00:54:52] I haven't [00:54:54] * jgage looks up his number [00:55:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [00:55:03] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 227 seconds ago with 0 failures [00:55:14] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [00:55:15] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:55:16] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [00:55:16] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [00:55:17] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [00:55:24] got it, calling.. [00:55:26] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [00:55:38] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:55:57] it's 3am here, I can't be of much help [00:56:03] err: Could not retrieve catalog from remote server: Could not intern from pson: expected value in object at '"# puppet/templates/'! [00:56:09] not thinking very clearly [00:56:15] this is a bit random now [00:56:18] paravoid: i never let that stop me! [00:56:20] (j/k) [00:56:21] one run finishes, the next does that [00:56:38] ori: well you know, I do that too, but I woke up early today :) [00:57:11] left him a voicemail [01:00:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:00:11] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:00:12] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:00:13] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [01:00:17] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:00:19] notice: Finished catalog run in 5.22 seconds [01:02:21] * Starting puppet master [fail] [01:02:34] * master is running [01:04:57] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:05:16] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 296 seconds ago with 0 failures [01:05:17] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:05:17] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 250 seconds ago with 0 failures [01:05:20] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 92 seconds ago with 0 failures [01:06:07] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.48 [01:08:31] hm [01:08:32] java 46746 elasticsearch 448u IPv6 68528286 0t0 TCP *:9300 (LISTEN) [01:08:35] java 46746 elasticsearch 1376u IPv6 68417825 0t0 TCP *:9200 (LISTEN) [01:09:54] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:10:24] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:10:25] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:10:26] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:10:37] it compiles catalogs for nodes. payments1001-1003 seem ok, payments1004 is not, wth [01:11:46] elasticsearch is yellow but appears to be recovering, i'll keep an eye on it [01:14:52] paravoid: yt? 
[01:15:03] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:15:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:15:18] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:15:19] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 67 seconds ago with 0 failures [01:16:44] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [01:17:01] stat1001, that's not even related.. come on [01:18:24] (03PS1) 10Springle: Revert "m2-master switch to dbproxy1002". All services switched cleanly except eventlogging, which needs firewall changes. So let them switch back and try another day. [dns] - 10https://gerrit.wikimedia.org/r/172926 [01:18:28] * YuviPanda checks on stat1001 [01:18:34] it runs fine :p [01:18:55] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [01:19:06] (03CR) 10Springle: [C: 032] Revert "m2-master switch to dbproxy1002". All services switched cleanly except eventlogging, which needs firewall changes. So let them switc [dns] - 10https://gerrit.wikimedia.org/r/172926 (owner: 10Springle) [01:19:58] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:20:11] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:20:12] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:21:24] hi [01:21:29] :) [01:22:04] chasemp: ^ redis relay is done. working on the http / auth part now [01:25:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:25:19] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:25:20] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: puppet fail [01:25:21] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:26:46] Jeff_Green: hey, so it's inconsistent a bit, sometimes they run and sometimes they dont [01:26:52] maybe stored configs? [01:27:12] stored configs? [01:27:21] on the puppet master [01:27:36] see failures like this odd one: [01:27:43] there's nothing fancy going on, just straight puppet lameness [01:27:57] Could not intern from pson: expected value in object at '"# puppet/templates/'! [01:28:21] or [01:28:22] ugh [01:28:25] puppet. [01:28:25] Could not intern from pson: expected value in object at '"[PHP]\n\n; puppet:t'! [01:28:39] but then, run it multiple times, and see them only sometimes [01:28:43] and the other times it finishes a run [01:28:54] you've restarted the master? 
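[editor's note] A rough sketch of the "http / auth part" mentioned a few lines up: accept a JSON payload of (from, token, channels, message), check the token, strip it, and LPUSH the rest for the relay to pick up. The route, token table and queue name are assumptions for illustration, not the actual ircnotifier code:

```python
# Sketch of the producer side: HTTP in, token check, LPUSH to redis.
# Route, token table and queue name are illustrative assumptions.
import json
from flask import Flask, request, jsonify
import redis

app = Flask(__name__)
r = redis.StrictRedis(host="localhost", port=6379, db=0)
QUEUE = "irc-notify"                      # must match the relay's list name
TOKENS = {"shinken": "s3cret"}            # hypothetical per-service tokens

@app.route("/notify", methods=["POST"])
def notify():
    event = request.get_json(force=True)
    sender = event.get("from", "")
    if TOKENS.get(sender) != event.get("token"):
        return jsonify(error="bad token"), 403
    # Drop the token before queueing; the relay only needs the rest.
    payload = {k: event.get(k) for k in ("from", "channels", "message")}
    r.lpush(QUEUE, json.dumps(payload))
    return jsonify(queued=True), 202

if __name__ == "__main__":
    app.run(port=8080)
```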
[01:29:11] that too, see errors in log [01:29:30] how about we start totally fresh [01:29:51] and move aside boron:/var/lib/puppet and the same on each client [01:29:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:30:09] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:30:10] it was * Starting puppet master [fail] [01:30:15] but at the same time [01:30:35] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [01:30:37] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [01:30:44] the only thing I can think of is that the stored data on the master is corrupt [01:30:50] * master is running [01:31:22] that's what i meant when i said stored configs, yea [01:31:31] suggests truncation [01:31:36] so the puppetstoredconfigclean.rb thing we have in prod [01:32:01] can't just use it here though, because "Invalid db adapter " [01:32:02] i'd like to help, is there a host i can run puppet with verbose/debug on? [01:32:14] one that you guys aren't currently working on, i mean [01:32:21] ori: it's all frack hosts [01:32:37] payments1001 for example [01:32:45] mutante: i vote we just blow away /var/lib/puppet everywhere and start over [01:32:49] it's not that many hosts [01:33:08] Jeff_Green: ironically, sounds like the same thing was the root cause [01:33:27] I thought he just trashed the stored certs [01:33:36] not the entire dir [01:34:01] it was that find command [01:34:03] another possibility--purge everything *but* the stored certs [01:34:22] that, afaict find /var/lib/puppet/ssl -type f -exec rm {} \; [01:34:28] yeah [01:34:36] but yea, just ./ssl/ [01:34:39] that's a recipe for stabbing yourself in the eye [01:34:43] stopping puppetmaster [01:35:02] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:35:06] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:35:07] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:35:08] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 233 seconds ago with 0 failures [01:35:24] Jeff_Green: are the nodes usually that fast in fr when running puppet? [01:35:33] just taking 3 or 5 seconds or something [01:35:34] yes [01:35:38] wow, ok [01:35:53] that's what happens when you rip out all the broken package version tracking [01:35:54] :-) [01:35:57] so why does it work sometimes [01:36:06] i can see the master compiling catalogs for nodes [01:36:08] because puppet is manufactured broken [01:36:18] heh [01:36:19] i've never seen this behavior before though [01:36:36] testing a theory here... 
one sec [01:38:00] ok here's what I just did for lutetium, let's see if it gets happier: [01:38:07] stopped puppetmaster and lutetium's puppet [01:38:32] on lutetium rm -fr /var/lib/puppet/everything_but_ssl_dir [01:38:59] on boron rm -fr /var/lib/puppet/{everything related to lutetium but the pem file} [01:39:04] start puppet and puppetmaster [01:39:18] runs clean on the first try, hopefully it will stay that way [01:39:57] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:40:07] doing samarium next [01:40:22] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:40:22] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [01:40:23] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:41:27] Jeff_Green: that sounds like a good plan, deleting ./yaml/node/lutetium* ./yaml/facts/lutetium etc.. yea [01:41:44] one can hope [01:41:54] is that what the prod *rb thing does? [01:42:46] that deletes from the database [01:43:00] oh, right, we have the db layer there [01:43:06] i purposely left that out here [01:43:19] it first tries to figure out from master config [01:43:28] it it uses sqlite or mysql or postgres [01:43:35] yeah [01:43:40] so when you run that in fr, you get "Invalid adapter" [01:43:43] right [01:44:00] it's even installed? [01:44:01] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.49 [01:44:08] Jeff_Green: i copied it to /tmp manually [01:44:12] oic [01:44:33] i decided not to try to use all the bells and whistles for puppet in frack [01:44:34] trying to achieve kind of the same thing, make master forget all things about one node, then run it again [01:44:39] right [01:45:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [01:45:02] samarium done [01:45:08] backup4001 next [01:45:11] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [01:45:20] ha boron [01:45:22] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:45:23] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:45:24] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:45:24] guess I'd better do boron [01:45:25] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [01:45:26] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [01:45:44] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:45:56] arr, yea, it came back [01:47:21] boron done [01:48:23] doing indium [01:48:30] mutante: ok [01:48:49] I think we should to stop puppetmaster while we do them, so let's do a couple at a time [01:49:25] lemme know when you're ready to start indium up, and I'll restart puppetmaster [01:49:51] hm i tried to check out backup4001 but ssh times out and the root password on record doesn't work via serial console :\ [01:50:01] jgage: ha [01:50:13] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: puppet fail [01:50:13] it's not technically in frack, but it's a fundraising host [01:50:15] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 154 seconds ago with 0 failures [01:50:15] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:50:16] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [01:50:23] PROBLEM - check_puppetrun on payments1002 is CRITICAL: 
CRITICAL: puppet fail [01:50:36] oh! i didn't know fundraising had a box at ulsfo. [01:50:52] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:50:53] is the master up on boron? [01:50:55] backup4001 is an interim offsite-backup box we deployed as a stopgap between killing pmtpa and having a second dc [01:51:01] err: Could not send report: Connection refused - connect(2) [01:51:02] cool [01:51:04] mutante: down. ready for it to come up? [01:51:22] i just rebooted all the hadoop workers and nothing exploded and nobody noticed, woo [01:51:30] jgage: nice [01:51:36] i think so yea, i just deleted the yaml files for indium [01:51:52] mutante: master is up [01:53:27] Jeff_Green: i think we just have to delete those 2 files on the master [01:53:40] ok [01:53:44] yaml/facts/ and yaml/nodes/ [01:53:54] backup4001 didn't have its key exchange done yet [01:55:13] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [01:55:20] indium finished a run. payments1002 did as well [01:55:26] great [01:55:32] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [01:55:33] only deleted files on master, did not even bother on node [01:55:33] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 107 seconds ago with 0 failures [01:55:34] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [01:55:36] PROBLEM - check_puppetrun on db1008 is CRITICAL: CRITICAL: puppet fail [01:55:46] mutante: interesting, ok [01:56:20] backup4001 is taking forever, maybe the cross-colo latency [01:56:49] doing db1008 [01:58:29] tellurium.. finished run.. but i didn't do anything [01:58:55] i dunno, the case where they flap between working and failing is really odd [01:58:55] icinga-wm: come one, just recover? [01:59:08] it should have received 3 OKs already.. [01:59:26] it doesn't work like prod does fwiw [01:59:42] what makes neon receives it really? [01:59:59] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 225 seconds ago with 0 failures [02:00:04] there's a puppet plugin that's run as part of a suite on cron [02:00:13] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 71 seconds ago with 0 failures [02:00:14] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 237 seconds ago with 0 failures [02:00:18] RECOVERY - check_puppetrun on db1008 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [02:00:19] looking better [02:00:39] no nrpe, no puppet db, no snmp [02:00:44] Jeff_Green: it sends a UDP? [02:00:51] it's tcp [02:00:56] ah, ok [02:00:57] nsca [02:01:05] gotcha, yea, i saw the iptables rule [02:01:17] the config is /etc/nagios_nsca.conf [02:01:51] config is just straight shell with args for calling the plugins [02:03:19] looking considerably better, now lets hope the fix sticks [02:03:46] i see, it looks at the puppet state file [02:03:54] ya [02:04:14] yes, it does look pretty good so far [02:05:01] also, nothing in Icinga is CRIT that isn't older than 11h [02:05:41] arg, wait, pay-lvs1002 just popped up [02:06:14] have we done that one already? [02:06:22] i did not, i think no [02:06:32] ok. doing [02:07:28] grr. I did pay-lvs1001 by accident [02:07:30] fail. [02:08:54] grr. 
the pson error [02:09:03] PROBLEM - ElasticSearch health check on elastic1026 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.50 [02:10:00] <^d> ^ that's me [02:11:21] pay-lvs1002 running clean again [02:11:26] cool [02:12:59] pay-lvs1001 broke [02:13:46] i wonder if this could have to do with a run happening with the master down? [02:13:48] but you just fixed it accidentally [02:14:10] i removed the local state files, yeah [02:14:16] it recovered the second time [02:14:19] yea, well, trying to run but no master, you get refused [02:14:21] which is also fail [02:14:28] i suppose [02:14:46] maybe if you drop the master mid-run, it doesn't recover well [02:15:48] this is really wierd [02:18:22] RECOVERY - swift-object-replicator on ms-be2010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:18:44] !log LocalisationUpdate completed (1.25wmf7) at 2014-11-13 02:18:44+00:00 [02:18:52] Logged the message, Master [02:20:05] Jeff_Green: arr, silicon and samarium.. sigh [02:20:13] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [02:20:26] PROBLEM - check_puppetrun on silicon is CRITICAL: CRITICAL: puppet fail [02:20:27] hmmmmmmm. [02:20:31] what is going on. [02:20:51] stopping master [02:24:58] PROBLEM - swift-object-replicator on ms-be2010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [02:25:14] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:25:35] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [02:26:37] puppetmaster starting [02:28:13] (03PS6) 10Springle: Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 (owner: 10Reedy) [02:28:33] (03CR) 10Springle: [C: 032] Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 (owner: 10Reedy) [02:30:00] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-13 02:30:00+00:00 [02:30:04] Logged the message, Master [02:30:05] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:30:06] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 166 seconds ago with 0 failures [02:30:07] RECOVERY - check_puppetrun on silicon is OK: OK: Puppet is currently enabled, last run 156 seconds ago with 0 failures [02:30:15] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: puppet fail [02:30:30] 3$*#(& [02:31:07] did we just not do them? [02:31:17] (03PS3) 10Springle: mysql_wmf - autoload layout and lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:31:22] since we figured out the working method i mean [02:31:33] I blew away everything on the master [02:31:45] hrmmmm [02:31:50] but I didn't do anything on the client side for these yet [02:32:07] puppet runs clean on payments1003 byhand [02:32:18] i'm starting to wonder if something else happened [02:32:45] (03CR) 10Springle: [C: 032] mysql_wmf - autoload layout and lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:33:26] springle, thanks 😻 - http://codepoints.net/U+1F63B [02:35:05] PROBLEM - check_puppetrun on thulium is CRITICAL: CRITICAL: puppet fail [02:35:20] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 101 seconds ago with 0 failures [02:35:23] something is seriously broken here... [02:36:28] what happened to gerrit-wm [02:36:29] grrrit-wm: ? [02:36:29] speak up! 
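[editor's note] For reference, the passive-check pattern Jeff described a little earlier (around 02:00): a cron-run plugin inspects puppet's local state and ships the result over NSCA to the icinga host, with /etc/nagios_nsca.conf doing the wiring. A rough sketch of that shape, assuming PyYAML and puppet's standard last_run_summary.yaml; the thresholds, fields and service name are illustrative, not the real check_puppetrun plugin:

```python
# Rough shape of a passive puppet check: read puppet's run summary,
# decide OK/CRITICAL, and emit a line suitable for piping into send_nsca.
import socket
import sys
import time

import yaml   # PyYAML

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 3600  # seconds before a run counts as stale (assumed threshold)

def check():
    try:
        with open(SUMMARY) as f:
            summary = yaml.safe_load(f) or {}
        failures = summary.get("events", {}).get("failure", 0)
        last_run = summary.get("time", {}).get("last_run", 0)
    except (IOError, yaml.YAMLError) as exc:
        return 2, "CRITICAL: cannot read state file: {}".format(exc)

    age = int(time.time()) - int(last_run)
    if failures:
        return 2, "CRITICAL: puppet fail ({} failures)".format(failures)
    if age > MAX_AGE:
        return 2, "CRITICAL: last run {}s ago".format(age)
    return 0, "OK: last run {}s ago with 0 failures".format(age)

if __name__ == "__main__":
    code, message = check()
    # NSCA passive service results are tab-separated: host, service, status, output.
    sys.stdout.write("{}\t{}\t{}\t{}\n".format(
        socket.gethostname(), "check_puppetrun", code, message))
    sys.exit(code)
```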
[02:36:50] sometimes it skips a message, doesnt it [02:36:57] I'm going to reboot boron, not because I think it will fix this problem, but it's due for kernel updates [02:37:08] noticed it before, rare but happens [02:37:26] mutante: are you ok to lose your boron session? [02:37:45] Jeff_Green: yea, logged off [02:38:00] here goes [02:38:04] i was tryin to see things in bash_history, but not reallly [02:38:23] i ran a package checksum check, nothing showed up as changed [02:38:37] it almost feels network-y [02:38:54] (03CR) 10Dzahn: "grrit-wm, did you forget something?" [puppet] - 10https://gerrit.wikimedia.org/r/170479 (owner: 10Dzahn) [02:38:58] springle: ^ see [02:39:10] except for the fact that everything else networky appears to be perfect [02:39:11] it missed a message [02:39:52] several [02:40:07] RECOVERY - check_puppetrun on thulium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [02:43:28] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/172799/" [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: 10Dereckson) [02:45:03] (03CR) 10Dzahn: [C: 04-2] "sorry, i added it here and forgot you had made this before. https://gerrit.wikimedia.org/r/#/c/172899/2" [dns] - 10https://gerrit.wikimedia.org/r/172442 (owner: 10John F. Lewis) [02:45:14] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: puppet fail [02:45:42] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:46:20] Jeff_Green: i wonder where the PHP came from in errors like expected value in object at '"[PHP]\n\n; puppet:t'! [02:47:09] the only thing I can think of is that it's in the content of files that are supposed to be transferred [02:48:10] ok, yea, that would fit the changing message [02:50:05] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [02:50:16] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:52:31] !log beta puppet freshness - UNKNOWN: No valid datapoints found .. since 13d [02:52:36] Logged the message, Master [02:53:37] barium had never really recovered [02:54:01] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=barium&service=check_puppetrun [02:54:39] as opposed to tellurium, different pattern [02:54:42] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=tellurium&service=check_puppetrun [02:54:54] tellurium is flapping [02:55:06] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [02:55:33] ok this is interesting [02:56:21] i have a theory [02:56:24] time [03:00:15] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 295 seconds ago with 0 failures [03:02:49] ls [03:06:30] i'm going to kill the puppet monitoring [03:06:39] and stop puppet everywhere and completely start over [03:10:17] Jeff_Green: ok, you dont actually have to kill it, you can just disable notifications and ack it [03:10:31] and i gotta get some food, too [03:10:33] that's a lot of hosts to deal with [03:10:46] meh, i can do it quickly, multi checkbox, just a sec [03:10:53] it's already killed [03:10:56] ok [03:12:12] i'll run for now. good luck. 
it must be related to some other deleted files on the master [03:20:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [03:40:17] ok puppet cert exchange is totally redone and everything is working [03:41:03] I disabled all the manifests, I'm starting to suspect bad config somewhere-I've seen stuff like this happen funky variable expansion in templates [03:41:07] but i'm done for the night... [03:42:15] whew! sorry to phone you. [04:00:20] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:00:40] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:00:44] RECOVERY - ElasticSearch health check on elastic1026 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [04:17:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 13 04:17:35 UTC 2014 (duration 17m 34s) [04:19:25] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [04:38:01] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [04:58:20] Who is taking care of Varnish in Beta/Production? [04:58:49] We need to setup Varnish for ContentTraslation, something like, https://www.mediawiki.org/wiki/Content_translation/Setup#Backend_Services [04:59:01] Any starting pointers are welcome :) [04:59:06] YuviPanda: ^ ? [05:00:46] <^d> !log elasticsearch: set phabricatormain's index.auto_expand_replicas to 0-2 like production wikis (was hardcoded @ 1 replica) [05:08:30] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4273 MB (3% inode=94%): [05:47:59] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [06:07:29] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:28:19] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:48] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:19] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:29] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:11] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:53] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:29] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:09] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:34:19] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:34:19] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 45: active_shards: 62: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 29 [06:36:14] meh [06:36:29] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:36:38] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:36:49] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 29 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 29, utimed_out: False, uactive_primary_shards: 46, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 63, uinitializing_shards: 0, unumber_of_data_nodes: 3} [06:40:30] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [06:45:29] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [06:46:08] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:18] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:53:59] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures [07:11:48] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:15:13] kart_: what sort of content does the cxserver serve? 
[07:32:38] ori: content is articles, also see, https://www.mediawiki.org/wiki/Content_translation#mediaviewer/File:CX_ArchitectureV1.svg [07:32:50] http requests. [07:41:26] you'll need help from ops. the standard pattern is to first set up a virtual IP address ('service ip') that will be shared by all cxserver hosts by having a machine running LVS load balance connections [07:43:07] <_joe_> ori: I think that's already sorted out, I think kart_ has been working with alex [07:44:02] well, he's asking about it, and it's confusing enough that i needed to have it explained to me a few times before it stuck [07:44:12] <_joe_> it is :) [07:44:30] so i'll risk redundancy! :P [07:45:18] kart_: then you define a varnish backend, which you can think of as giving that IP address a name that you can use in the URL routing configuration for varnish [07:46:20] then you write some VCL (=Varnish Configuration Language) code to map certain URL patterns to that backend, usually by testing for string prefix matching or more complicated regex matching [07:47:25] it ends up looking something like: if if (req.url ~ "^/cx") { set req.backend = cxserver; } [07:48:48] and then you live happily ever after [07:57:05] ori: Thank you! [07:58:15] ori: true that alex also mentioned that it is confusing. [07:59:05] kart_: one question though.. whyyyyyyyyyyy nodejs [08:01:20] good question! [08:01:30] https://www.mediawiki.org/wiki/Content_translation/Technical_Architecture#Scalability gives simple answer [08:02:08] It is easy to write node.js client for things we're working (apertium, dictionaries, third party services) too. [08:05:13] kart_: heh, probably not me :) [08:06:22] <_joe_> kart_: I don't really see how node can ensure scalability that any other platform can guarantee [08:07:03] <_joe_> but I mean, if you enjoy using it, fine :) [08:07:10] :D [08:07:35] _joe_: re: https://bugzilla.wikimedia.org/show_bug.cgi?id=73263, poked :) also mmodel is supposed to help as well... [08:07:44] * YuviPanda is going to continue with shinken in the meantime [08:07:48] <_joe_> I'd use python+gevent all the time if I really need something noblocking (meaning, I am basically writing a proxy in front of what does the real work) [08:08:11] no asyncio? [08:08:55] <_joe_> YuviPanda: I am used to gevent, I like the greenlet-based async programming, and I hate twisted [08:09:01] <_joe_> (hi, pybal) [08:09:03] heh :) [08:09:11] <_joe_> never tried asyncio tbh [08:09:13] I've never really done any proper async code ou7tside of js [08:09:38] although wikibugs and 2 (and the new 'one irc relay to rule them all (in labs)' me and legoktm are writing) is also asyncio [08:09:44] <_joe_> nodejs async model (callback-based async) is _horrible_ IMO [08:10:17] kart_: have a look at and tell me how impressive node.js's cluster module looks then ;) [08:10:22] <_joe_> but I mean, it's a matter of taste [08:10:32] <_joe_> ori: ahah :) [08:10:54] <_joe_> erlang, as long as you don't touch mnesia with a stick, is great [08:11:24] <_joe_> mnesia is horribly fragile and hard to recover, in my experience [08:12:51] i've never used mnesia, tbh [08:13:25] also, anyone to +1 https://gerrit.wikimedia.org/r/#/c/172916/? [08:15:08] <_joe_> ori: well my erlang now is so rusty I could probably manage to write a multiprocess factorial calculator, and not much more :P [08:15:33] ori: nice link. 
[08:17:43] hehe, 'rust'y [08:18:03] <_joe_> kart_: erlang/OTP is used for serious business like telecommunications, so it works better than most of the hipster tools we play with today [08:24:40] _joe_: Aren't we serious too? :) I agree on part about hipster tools. [08:25:17] <_joe_> kart_: well, I sometimes feel like the web is the "good enough" industry [08:25:45] <_joe_> we need things to be fast and work on large scale, but they are most of the times "almost good" [08:26:08] <_joe_> and also, we move at a pace that almost prevents "perfect" or "soundly proven" [08:27:04] <_joe_> think of all the NoSQL datastores; that's the world of "almost good" for instance [08:27:53] <_joe_> can it serve a gazzillion of read/writes per second? then it's good even if it's not really consistent [08:32:01] (03PS2) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [08:32:02] (03PS1) 10Ori.livneh: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 [08:36:03] <_joe_> ori: good catch [08:37:02] probably not one that we've run into / will run into, but ¯\_(ツ)_/¯ [08:41:46] <_joe_> ori: that's because "and" is tricky in python [08:42:42] <_joe_> and reduce as well :) [08:43:32] yeah guido hates reduce and wanted to remove it for py3 [08:43:49] but the masses revolted [08:43:58] http://www.artima.com/weblogs/viewpost.jsp?thread=98196 [08:44:32] "So now reduce(). This is actually the one I've always hated most, because, apart from a few examples involving + or *, almost every time I see a reduce() call with a non-trivial function argument, I need to grab pen and paper to diagram what's actually being fed into that function before I understand what the reduce() is supposed to do." [08:45:30] (03PS3) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [08:45:56] <_joe_> ori: actually, you could use reduce there by swapping x and y I guess [08:46:20] <_joe_> yes [08:46:38] i think the all / isinstance is clearer [08:46:44] <_joe_> it is [08:47:21] <_joe_> but let me check one thing' [08:47:22] _joe_: and swapping X and Y would still not work for the case of an empty string somewhere in the list [08:47:30] <_joe_> ori: true [08:47:49] * YuviPanda shops around https://gerrit.wikimedia.org/r/#/c/172916/ again, simple code move into a module [08:48:26] <_joe_> YuviPanda: in 10 mins, when I am done playing with python :) [08:48:31] ah, cool [08:48:46] YuviPanda: why move it to a module? [08:48:56] it's used only there, and is icinga specific code [08:49:08] as a 'role', it can not live anywhere where icinga isn't installed [08:49:19] <_joe_> ori: what about the empty list? [08:49:34] <_joe_> with reduce() it raises a TypeError [08:49:42] <_joe_> with your version it returns True [08:50:13] <_joe_> not sure if that's ok [08:51:15] well, an empty list has no none-string items [08:51:18] but also no string items [08:51:19] so dunno [08:51:41] as a 'role', it can not live anywhere where icinga isn't installed <-- why not? [08:51:46] <_joe_> ori: btw, val = ["", 'abc'] -> reduce(lambda x, y: type(y) == str and y, val) and all(isinstance(x, str) for x in val) both return true [08:52:04] ori: because it requires files that icinga itself writes to do notifications [08:52:17] that role isn't the general ircecho one, it's the specific icinga-wm one [08:52:23] _joe_: yes, but not for val = ['abc', ''] :) [08:52:39] <_joe_> yeah so, do we really want that? 
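[editor's note] The behaviour being debated just above, spelled out. This mirrors the two expressions quoted in the discussion (the reduce-based check and the all()/isinstance() replacement), not the surrounding pybal _getConfigStringList code:

```python
# The reduce-based check effectively only reflects the *last* element, so a
# trailing empty string makes the whole thing falsy and an empty list raises;
# the all()/isinstance() version only answers "are these all strings?".
from functools import reduce

def reduce_check(val):
    return reduce(lambda x, y: type(y) == str and y, val)

def all_check(val):
    return all(isinstance(x, str) for x in val)

print(reduce_check(["", "abc"]), all_check(["", "abc"]))   # 'abc' True -> both truthy
print(reduce_check(["abc", ""]), all_check(["abc", ""]))   # ''    True -> they disagree
print(all_check([]))                                       # True
try:
    reduce_check([])
except TypeError as exc:
    print("reduce on []:", exc)                            # raises without an initial value
```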
[08:53:02] <_joe_> yes we do [08:54:03] YuviPanda: i don't know why but what you're saying doesn't compute, probably the fault of the hour being late and my brain being mush -- i'll let _joe_ review, sorry [08:54:09] ori: :) ok [08:54:35] (03CR) 10Giuseppe Lavagetto: "The code is way clearer this way; it just introduces one small change:" [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [08:54:53] 1. icinga writes things that should be on IRC to a file on the local filesystem, 2. ircecho tails that particular file, puts them on IRC, 3. hence, they need to share a local filesystem, 4. hence they need to be on same machine, 5. hence this can't be used anywhere other than neon atm [08:56:42] <_joe_> YuviPanda: I agree. What you want is to add 'require role::icinga' inside role::ircecho [08:56:58] <_joe_> not making it impossible to have an icinga instance without ircecho [08:57:05] ah, hmm.. [08:57:24] it's not actually role::ircecho, we use ircecho elsewhere [08:57:28] and will shortly use it for shinken too [08:57:56] so how about I keep the class as is, make a role::icinga::ircecho, and then include that in neon? [08:58:13] <_joe_> YuviPanda: ok then let me dig a little deeper [08:58:42] I'm personally ok with what we have right now, since if and when we want an icinga without ircecho we can easily move that one line include out [08:58:43] YAGNI etc [08:59:16] <_joe_> YuviPanda: we do use role::echoirc anywhere else? [08:59:18] nope [08:59:27] <_joe_> ok so my proposal stands [08:59:40] <_joe_> (I just got the class name wrong) [08:59:56] hmm [09:00:23] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "From IRC:" [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [09:00:48] I'm trying to understand why. [09:01:14] * YuviPanda tries to find other examples [09:02:07] I mean, if we wanted an icinga without NRPE [09:02:19] we'd have to remove that line from the role [09:02:23] should we instead have role::nrpe? [09:02:29] <_joe_> no [09:02:42] so how's this different? [09:02:44] <_joe_> I mean, another way of doing this [09:03:00] <_joe_> is to add a has_ircecho class variable to role::icinga [09:03:10] aha! yes, that'd be good. [09:03:13] <_joe_> well, nrpe is a fundamental part of icinga [09:03:26] well, you can just do all checks via graphite :) [09:03:33] <_joe_> YuviPanda: then use hiera to override the default (true) wherever you don't want ircecho [09:03:36] <_joe_> :) [09:03:50] <_joe_> YuviPanda: don't make me say what I think about that :P [09:04:03] <_joe_> (I understand why you did it btw [09:04:06] _joe_: what do you think about that? I was hoping you would (and ori would) when I wrote that email. [09:04:17] it does introduce two SPOFs (shinken and graphite) [09:04:21] <_joe_> YuviPanda: "it's a necessary evil) [09:04:26] ah, heh :) [09:04:31] <_joe_> oh god I can't type this morning [09:04:34] hehe :) [09:04:51] <_joe_> the fat cat sitting on my arm may have something to do with it [09:04:51] alternatively we can open up NRPE ports on all labs hosts machines to the wide internet, what can possibly go wrong? :) [09:04:57] mmm, cats [09:05:17] <_joe_> YuviPanda: haven't you seen my cat looking in the camera during hangouts? [09:05:22] no, I haven't! [09:05:27] <_joe_> it happens quite regularly [09:05:30] I've been far too nomadic to have a pet [09:05:36] _joe_: I'll look closely the next time! [09:05:40] <_joe_> eheh [09:05:57] * YuviPanda is planning on going to CCC CAMP next year [09:06:11] <_joe_> when is that? middle of august again? 
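[editor's note] To make the icinga/ircecho coupling spelled out above concrete (icinga writes lines to a local file, ircecho tails that file and relays it, hence they must share a filesystem), a stripped-down sketch of the tail-and-relay shape. The file path and channel are made-up examples, and real ircecho is more involved (multiple files, reconnects, etc.):

```python
# Stripped-down version of the "tail a file, echo to IRC" coupling.
import time

LOGFILE = "/var/log/icinga/irc.log"   # hypothetical file icinga writes to
CHANNEL = "#wikimedia-operations"

def send_to_irc(channel, line):
    """Placeholder for an actual IRC client send."""
    print("{} <- {}".format(channel, line))

def tail(path):
    with open(path) as f:
        f.seek(0, 2)                  # start at end of file, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

if __name__ == "__main__":
    for line in tail(LOGFILE):
        send_to_irc(CHANNEL, line)
```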
[09:06:22] yeah [09:06:29] <_joe_> or "the time I'm supposed to be on vacation with my family" [09:06:31] after Wikimania [09:06:33] hahah :) [09:07:03] * YuviPanda is planning on may - French WM Hackathon, June/Jul - UK, Aug - Wikimania/CAMP next year [09:07:09] <_joe_> and I don't think my step daughter would forgive me for taking her to yet-another-geek-camp [09:07:30] heh [09:07:32] <_joe_> isn't wikimania in mexico this year? [09:07:36] yes [09:07:37] it is [09:07:49] I foresee I'll be paying for a lot of my flights next year. [09:09:04] <_joe_> YuviPanda: what in the varnish role depends on ganglia implicitly? [09:09:44] _joe_: line 89, varnish/manifests/instance.pp [09:09:55] <_joe_> in the module, right? [09:09:58] yeah [09:10:01] <_joe_> not the scary role::cache [09:10:10] _joe_: in role::cache there's a direct include [09:10:18] <_joe_> man [09:10:20] that includes varnish/manifests/monitoring/ganglia.pp [09:10:25] which defines an exec [09:10:35] and the vanrnish/manifests/instance.pp (in the module!) depends on that exec [09:10:41] <_joe_> so varnish::instance depends from varnish::monitoring::ganglia [09:10:47] <_joe_> but it's not required there? [09:10:49] yup, but through role::cache :) [09:10:50] indeed [09:10:56] <_joe_> LOLWTF [09:11:00] my reaction, yes [09:11:39] I've a sinking feeling also that it's tip of the iceberg, and there's a lot more we haven't seen yet because puppet doesn't complain about more than one AST / syntax thing at a time [09:11:40] <_joe_> YuviPanda: I tried to do a radical refactor of role::cache using hiera, and role::cache won [09:11:46] heh [09:12:10] <_joe_> so I decided we need to do that in small steps [09:12:15] there's of course, the shitty way of fixing this, which is to do an if to check if that exec is defined and make the before => only if it is [09:12:40] <_joe_> YuviPanda: let me take a look [09:12:46] ok [09:13:20] <_joe_> the solution is to invert the logic [09:13:28] ah, right [09:13:34] <_joe_> as you can have a varnish::instance without monitoring [09:13:37] <_joe_> not the opposite [09:13:39] yup [09:13:49] ok, that was simple [09:14:00] * YuviPanda also likes after type ordering than before type ordering [09:14:13] <_joe_> not so much :) we have multiple services [09:14:28] <_joe_> so we need to use tags here, or anchors [09:16:47] <_joe_> lemme experiment a little [09:17:02] * YuviPanda admits to not knowing much about varnish nor ganglia [09:17:18] <_joe_> (we also have a gmond restart that is at the bottom of that file) [09:17:32] sigh [09:17:45] <_joe_> YuviPanda: I'm starting to think we need a config switch, but lemme try to get fancy first [09:18:02] :) ok [09:22:39] (03PS2) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [09:23:18] (03PS3) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [09:27:54] <_joe_> YuviPanda: so, reducing our problem, we actually have this: https://phabricator.wikimedia.org/P72 [09:28:29] <_joe_> we want to change this so that it's the class puppetsucks that ensures it's working after those two dude invocations, right? [09:28:48] yup [09:33:29] <_joe_> YuviPanda: now refresh the paste, I've pasted a second version using tags and the spaceship operator [09:33:50] <_joe_> dear lord, puppets DSL _is_ a joke [09:34:03] hmm, did i just make a user to disapper from the wikis ? 
:/ [09:34:23] hmm, that should work, I think [09:34:42] legoktm: I fear i need your urgent help [09:36:46] matanya: he's probably sleeping, tho [09:36:52] (03PS1) 10Gilles: Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 [09:37:17] YuviPanda: very bad :( looks like a huge bug in renameuser [09:37:23] aw, dman [09:37:25] *damn [09:38:27] a user with 75k + edits disappeared from the wikis [09:38:43] what happened to his edits? [09:38:53] I wonder [09:39:18] I hope i didn't break the sql with this [09:40:26] matanya: PM me which username? [09:50:20] matanya: I think this can wait for legoktm or csteipp to wake up. No data loss :) [09:54:57] matanya: YuviPanda: if anyone decides something *has* been lost, get onto dbstore1001 and STOP SLAVE 's1'. That's the 24h delayed replication box [09:55:08] or rather, STOP ALL SLAVES [09:55:16] thanks springle [09:55:23] to get he.wiki [09:55:24] look like the data is there [09:55:36] springle: I checked analytics-store, I see the data [09:55:44] but the actual display on wiki is a delayed job [09:55:58] ok :) [10:05:38] springle: am now [10:08:58] (03PS1) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [10:09:11] <_joe_> YuviPanda: ^^ [10:09:16] _joe_: want me to cherry pick on labs and test? [10:09:29] <_joe_> YuviPanda: well this won't really resolve your problem [10:09:34] true [10:09:47] <_joe_> so maybe I'll create a dependent patch now [10:09:51] well, this in addition to the other patch might make things slightly better, or at least uncover new issues [10:10:00] other patch -> my earlier one making the includion of ganglia conditional [10:10:10] <_joe_> yeah [10:10:28] <_joe_> working on it [10:10:41] <_joe_> (I'd try to do that as cleanly as possible [10:10:47] ok :) [10:12:22] <_joe_> oh, one thing I never thought about - hiera and class inheritance in puppet [10:12:55] _joe_: btw, this plus the role::cache patch actually made puppet run on deployment-cache-bits01 [10:13:03] it's catching up on a month's worth of stuff now [10:13:14] <_joe_> YuviPanda: yea I expected that to be the case [10:13:29] <_joe_> but I'm more interested in not breaking prod tbh [10:13:42] <_joe_> so I will ask and wait for a few people to review this [10:13:43] yeah [10:13:46] definitely [10:13:56] but cherry-picking will unbreak deployment-prep [10:14:02] and I'll closely follow to make sure it remains unbroken [10:14:03] <_joe_> good! [10:14:26] <_joe_> bbiab, I need a break [10:17:38] _joe_: fwiw, it only fixed one host, uncovered lots more :) [10:18:01] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/varnish.py] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:10 [10:18:11] is somewhat weird, since ganglia.pp should't be included at all [10:18:37] * YuviPanda goes to eat food [10:23:01] !log upload mediawiki-math to trusty too (RT #5270) [10:23:09] Reedy_: ^ [10:29:36] <_joe_> godog: woot [10:30:05] <_joe_> YuviPanda: that's because you didn't remove it everywhere [10:30:34] there were other places? 
[10:30:40] * YuviPanda will look after food [10:41:25] _joe_: well trusty-mediawiki I should say, but you get the idea :) [11:03:31] !log Killing Jenkins due to a deadlock [11:06:08] bah [11:08:21] !log Killing Jenkins due to a deadlock [11:08:24] pff [11:08:26] morebots: ping [11:08:28] Logged the message, Master [11:08:28] I am a logbot running on tools-exec-06. [11:08:28] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [11:08:28] To log a message, type !log . [11:08:42] !log Killed Jenkins due to a deadlock [11:08:45] Logged the message, Master [11:09:07] !log resurrected morebots in #wikimedia-operations (see [[Morebots]]). [11:09:10] Logged the message, Master [11:12:51] (03PS1) 10Giuseppe Lavagetto: role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 [11:13:01] <_joe_> YuviPanda: ^^ [11:14:20] ah, hmm [11:14:27] is there a way to set hiera info for all of labs at once? [11:15:27] hmm, one way would be to default to false and set it to true in prod hiera [11:15:45] we could also just set it to false in deployment-prep [11:15:52] which is the only place this rule is used anyway [11:18:39] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [11:21:51] <_joe_> YuviPanda: for now just set it in deployment-prep [11:21:58] yup, that's what I did [11:22:17] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&diff=134263&oldid=133195 [11:22:20] hopefully I got that one right [11:22:36] cherry picking now [11:24:58] _joe_: yup, that's better :) [11:25:39] (03Abandoned) 10Yuvipanda: cache: Don't setup ganglia monitoring on labs [puppet] - 10https://gerrit.wikimedia.org/r/172776 (https://bugzilla.wikimedia.org/73263) (owner: 10Yuvipanda) [11:25:49] hmm, [11:25:50] or not [11:25:51] Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/varnish.py] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:10 [11:25:54] still, at parsoidcache [11:26:13] why is that still being included... [11:26:23] I wonder if it's on the right puppetmaster. [11:26:57] how do I find the puppetmaster of a machine? [11:27:00] * YuviPanda digs [11:27:18] yup [11:27:21] it's on the wrong puppetmaster [11:27:22] lol [11:27:27] * YuviPanda files bug [11:28:34] filed https://bugzilla.wikimedia.org/show_bug.cgi?id=73357 [11:40:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not against it per se, but what is the use case? It's not like we got a single machine not having SSH on port 22. Not sure it is worth it." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [11:42:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] ssh server: make ListenAddress configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [11:43:37] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. 
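For context, the two changes being cherry-picked above roughly amount to the following shape. The hiera key and resource names are illustrative assumptions (the real ones live in the linked Gerrit patches), but the idea is the one _joe_ described: the role decides whether ganglia monitoring is included at all, and the ganglia class orders itself after varnish::instance instead of each instance depending on the ganglia exec:

    # role layer: ganglia becomes opt-out, e.g. deployment-prep setting
    # "has_ganglia: false" on its Hiera: wiki page
    class role::cache::base {
        $has_ganglia = hiera('has_ganglia', true)    # hypothetical key name

        if $has_ganglia {
            include varnish::monitoring::ganglia
        }
    }

    # module layer: invert the dependency, since a varnish::instance without
    # monitoring is valid but monitoring without an instance is not
    class varnish::monitoring::ganglia {
        exec { 'varnish-gmond-config':
            command => '/usr/local/bin/generate-varnish-gmond-config',    # stand-in
        }

        # the "spaceship operator": collect every declared varnish instance
        # and order the monitoring exec after all of them, instead of each
        # instance carrying a before/require on an exec that may not exist
        Varnish::Instance <| |> -> Exec['varnish-gmond-config']
    }

The conditional include is what the Hiera:Deployment-prep edit above toggles; the collector is one way to express the "tags or anchors" point when more than one instance is declared on a host.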
This will be useful for when we drop root logins :-)" [puppet] - 10https://gerrit.wikimedia.org/r/172804 (owner: 10Dzahn) [11:51:32] (03CR) 10Alexandros Kosiaris: [C: 031] remove nickel's public IP [dns] - 10https://gerrit.wikimedia.org/r/172819 (owner: 10Dzahn) [12:08:04] (03PS2) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [12:08:06] (03PS2) 10Giuseppe Lavagetto: role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 [12:12:27] (03PS1) 10Yuvipanda: nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 [12:12:48] _joe_: think you'll have time to CR ^ and the previous ircecho patch? [12:13:53] <_joe_> YuviPanda: sure [12:14:00] cool, thanks :D [12:14:22] _joe_: all should be no-ops [12:14:30] https://gerrit.wikimedia.org/r/#/c/172916/ is the other one [12:15:16] (03CR) 10Giuseppe Lavagetto: [C: 031] icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [12:16:51] (03PS4) 10Yuvipanda: icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 [12:16:56] (03CR) 10Giuseppe Lavagetto: [C: 031] nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 (owner: 10Yuvipanda) [12:17:37] (03CR) 10Yuvipanda: [C: 032] nagios_common: Parameterize path used for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/172980 (owner: 10Yuvipanda) [12:19:00] (03CR) 10Yuvipanda: [C: 032] icinga: Move ircecho code into module [puppet] - 10https://gerrit.wikimedia.org/r/172916 (owner: 10Yuvipanda) [12:27:11] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [12:27:28] (03CR) 10Mark Bergsma: varnish: make varnish::instance not depend on ganglia (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:28:57] (03CR) 10Mark Bergsma: varnish: make varnish::instance not depend on ganglia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:30:29] (03CR) 10Mark Bergsma: [C: 031] role::cache: make ganglia inclusion optional [puppet] - 10https://gerrit.wikimedia.org/r/172974 (owner: 10Giuseppe Lavagetto) [12:32:26] <_joe_> mark: it won't work I guess. meh [12:34:30] ok, run on neon went well [12:34:35] I'm off now, brb later [12:41:11] PROBLEM - swift-object-auditor on ms-be1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:42:45] ^ expected, also ms-be1005 is coming up [12:43:18] yeah I should probably !log that :) [12:44:59] indeedly [12:45:21] PROBLEM - swift-object-auditor on ms-be1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:45:29] !log investigating high iops on swift eqiad with paravoid, stopped object-auditor on ms-be1005 and ms-be1015 [12:45:33] Logged the message, Master [12:49:24] (03CR) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [12:50:46] I'm not sure why a cronjob is better than puppet here :) [12:52:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The concepts look fine to me, various comments inline. We also need a role class however." 
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [12:57:08] (03PS3) 10Giuseppe Lavagetto: varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 [13:10:41] (03CR) 10Mark Bergsma: [C: 031] varnish: make varnish::instance not depend on ganglia [puppet] - 10https://gerrit.wikimedia.org/r/172967 (owner: 10Giuseppe Lavagetto) [13:14:35] (03CR) 10Mark Bergsma: "Just add a len(list) > 0 check I'd say." [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [13:16:00] (03CR) 10Mark Bergsma: [C: 031] Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [13:34:02] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:12] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57914 bytes in 0.536 second response time [14:12:42] RECOVERY - swift-object-auditor on ms-be1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:15:14] (03PS1) 10Giuseppe Lavagetto: start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 [14:15:16] (03PS1) 10Giuseppe Lavagetto: removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 [14:15:59] <_joe_> paravoid: as promised :P [14:16:22] (03CR) 10Hoo man: [C: 04-1] "We agreed to change this so that it applies to group 0 wikis only (for now)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [14:21:01] (03PS2) 10Giuseppe Lavagetto: start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 [14:21:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] start rationalizing the hieradata directories by using regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/172988 (owner: 10Giuseppe Lavagetto) [14:21:45] (03PS1) 10Filippo Giunchedi: swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 [14:22:14] paravoid: ^ [14:23:22] (03CR) 10Faidon Liambotis: [C: 032] swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 (owner: 10Filippo Giunchedi) [14:26:33] (03PS1) 10Manybubbles: Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 [14:32:25] (03PS1) 10Jforrester: Enable VisualEditor by default on Tagalog Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172993 (https://bugzilla.wikimedia.org/73365) [14:33:25] (03PS2) 10Giuseppe Lavagetto: removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 [14:37:04] (03CR) 10Giuseppe Lavagetto: [C: 032] removing all the mw-relate overrides [puppet] - 10https://gerrit.wikimedia.org/r/172989 (owner: 10Giuseppe Lavagetto) [14:37:34] (03PS2) 10Filippo Giunchedi: swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 [14:37:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: throttle object-auditor [puppet] - 10https://gerrit.wikimedia.org/r/172990 (owner: 10Filippo Giunchedi) [14:37:57] (03PS1) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:38:33] _joe_: there are the hieradata merges pending, good to go? 
[14:38:42] <_joe_> yep [14:38:48] done [14:38:49] <_joe_> I had puppet-merge open now [14:38:58] hehe me too [14:41:52] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: puppet fail [14:42:12] !log hashar: restarting Jenkins and Zuul [14:42:12] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: puppet fail [14:42:15] Logged the message, Master [14:43:19] (03PS2) 10Giuseppe Lavagetto: monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 [14:43:24] !log hashar: restarted zuul-merger on gallium [14:43:26] (03CR) 10jenkins-bot: [V: 04-1] monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [14:43:27] Logged the message, Master [14:44:42] (03PS1) 10Jforrester: Follow-up I50cb3ed: Enable VisualEditor as a Beta Feature on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 [14:44:46] (03CR) 10jenkins-bot: [V: 04-1] Follow-up I50cb3ed: Enable VisualEditor as a Beta Feature on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [14:45:42] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: puppet fail [14:45:49] (03PS2) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:45:52] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail [14:46:04] (03CR) 10jenkins-bot: [V: 04-1] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:46:32] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: puppet fail [14:46:43] jenkins is broken [14:46:44] <_joe_> mmmh [14:46:46] Jenkins is biting us https://gerrit.wikimedia.org/r/172987 [14:46:47] <_joe_> yes [14:46:59] everything [14:47:18] jenkins wigging out [14:47:19] https://gerrit.wikimedia.org/r/#/c/172994/ [14:47:22] not registered [14:47:24] I see I'm not the only one :) [14:47:37] <_joe_> and... I forgot the imagescalers [14:47:51] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: puppet fail [14:48:42] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [14:49:13] (03CR) 10Rush: [C: 032 V: 032] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:50:04] <_joe_> puppet failures are my fault, fixing them [14:50:09] <_joe_> they are harmless btw [14:50:29] any objects to a minor deployment to zero portal extension? 
[14:50:41] (sync to master) [14:51:31] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [14:51:34] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: puppet fail [14:52:22] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: puppet fail [14:53:04] (03PS3) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:53:11] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: puppet fail [14:53:53] (03PS1) 10Giuseppe Lavagetto: hiera: re-add info lost in the cleaning earlier [puppet] - 10https://gerrit.wikimedia.org/r/173000 [14:54:10] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: re-add info lost in the cleaning earlier [puppet] - 10https://gerrit.wikimedia.org/r/173000 (owner: 10Giuseppe Lavagetto) [14:54:12] (03PS4) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:54:51] (03PS5) 10Rush: phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 [14:55:33] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: puppet fail [14:55:42] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: puppet fail [14:55:54] (03CR) 10Rush: [C: 032 V: 032] phab update tag [puppet] - 10https://gerrit.wikimedia.org/r/172994 (owner: 10Rush) [14:56:02] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: puppet fail [14:56:12] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: puppet fail [14:56:42] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: puppet fail [14:57:03] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: puppet fail [14:57:21] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: puppet fail [14:57:32] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [14:57:44] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: puppet fail [14:57:44] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: puppet fail [14:57:51] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: puppet fail [14:58:11] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: puppet fail [14:58:21] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: puppet fail [14:58:30] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail [14:58:31] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail [14:58:31] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: puppet fail [14:58:44] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail [14:58:46] hmmm [14:58:52] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [14:59:11] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: puppet fail [14:59:18] <_joe_> yeah I did some error, and jenkins not working... 
[14:59:22] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: puppet fail [14:59:22] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: puppet fail [14:59:24] <_joe_> akosiaris: I am fixing it [14:59:30] <_joe_> but it's no harm to prod [14:59:32] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: puppet fail [14:59:35] <_joe_> puppet plainly fails [14:59:37] yeah, no worries [14:59:41] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail [14:59:42] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: puppet fail [14:59:46] just making sure :-) [14:59:48] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: puppet fail [15:00:01] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: puppet fail [15:00:08] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail [15:00:12] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: puppet fail [15:00:22] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: puppet fail [15:00:31] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: puppet fail [15:00:33] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: puppet fail [15:00:43] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [15:00:52] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: puppet fail [15:01:11] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [15:01:11] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: puppet fail [15:01:21] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: puppet fail [15:01:23] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: puppet fail [15:01:34] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [15:01:41] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: puppet fail [15:01:42] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2734 bytes in 0.165 second response time [15:02:09] <_joe_> akosiaris: not understanging _why_ it failed btw [15:02:13] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: puppet fail [15:02:13] PROBLEM - check if phabricator taskmaster is running on iridium is CRITICAL: PROCS CRITICAL: 0 processes with regex args PhabricatorTaskmasterDaemon [15:02:21] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: puppet fail [15:02:22] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: puppet fail [15:02:22] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: puppet fail [15:02:32] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: puppet fail [15:02:33] is it safe to sync-dir an extension? 
[15:02:42] there are all these puppet fails [15:02:51] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: puppet fail [15:02:52] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: puppet fail [15:02:52] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: puppet fail [15:03:02] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:03:06] yurikR: yes [15:03:07] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: puppet fail [15:03:07] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [15:03:07] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: puppet fail [15:03:11] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: puppet fail [15:03:13] thx [15:03:21] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: puppet fail [15:03:22] RECOVERY - check if phabricator taskmaster is running on iridium is OK: PROCS OK: 20 processes with regex args PhabricatorTaskmasterDaemon [15:03:22] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: puppet fail [15:03:26] <_joe_> akosiaris: got it! [15:03:33] :-) [15:03:34] (03PS1) 10Giuseppe Lavagetto: hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 [15:03:42] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: puppet fail [15:03:43] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail [15:03:48] dman [15:03:49] damn [15:03:51] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: puppet fail [15:03:52] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: puppet fail [15:03:52] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 16422 bytes in 0.395 second response time [15:03:53] a single p ? [15:04:07] phab maint is me guys [15:04:13] sorry I forgot to silence [15:04:13] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: puppet fail [15:04:13] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: puppet fail [15:04:17] shame donuts at the next outing [15:04:21] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: puppet fail [15:04:25] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: puppet fail [15:04:28] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [15:04:28] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: puppet fail [15:04:32] !log phabricator upgrades T1203 [15:04:35] Logged the message, Master [15:04:35] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: puppet fail [15:04:41] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: puppet fail [15:04:48] ahah... 
I wasn't aware of shame donuts [15:05:02] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: puppet fail [15:05:08] (03PS2) 10Giuseppe Lavagetto: hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 [15:05:09] shame donuts is a tradition from works past [15:05:13] I always felt it was appropriate [15:05:22] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: puppet fail [15:05:26] <_joe_> yeah it was shame cornetti here [15:05:27] /ignore -time 600 -pattern 'puppet last run' icinga-wm [15:05:49] for y'all irssi users [15:05:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hiera: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/173003 (owner: 10Giuseppe Lavagetto) [15:05:54] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:06:00] !log yurik Synchronized php-1.25wmf8/extensions/ZeroPortal: updatidng ZeroPortal to master (duration: 01m 13s) [15:06:02] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail [15:06:02] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: puppet fail [15:06:02] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [15:06:03] Logged the message, Master [15:06:12] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: puppet fail [15:06:13] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: puppet fail [15:06:31] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail [15:06:32] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [15:06:42] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [15:06:51] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [15:06:52] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [15:06:56] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: puppet fail [15:07:01] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:07:02] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: puppet fail [15:07:12] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: puppet fail [15:07:21] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: puppet fail [15:07:31] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [15:07:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: puppet fail [15:07:52] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: puppet fail [15:08:11] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail [15:08:22] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail [15:08:23] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:08:31] <_joe_> recoveries they are a comin [15:08:31] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [15:08:47] <_joe_> chasemp: I have some tons of shame donuts to take to you guys [15:08:51] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [15:08:56] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1126 
is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail [15:09:01] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail [15:09:16] !log rolling restart of object-auditor in swift codfw/eqiad to pick up changes [15:09:20] Logged the message, Master [15:09:52] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:10:41] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:10:51] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:11:31] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:11:44] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:11:53] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:14:02] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:14:21] RECOVERY - swift-object-auditor on ms-be1015 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:14:31] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:14:52] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:15:33] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:37] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string Wikimedia and MediaWiki not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 578 bytes in 0.048 second response time [15:15:42] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:16:13] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:16:31] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:16:32] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:33] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:16:41] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:16:54] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:17:01] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:17:02] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:17:11] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:17:51] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:17:53] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 
failures [15:17:53] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:17:53] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:17:54] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:18:01] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:18:21] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:18:22] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:18:31] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:18:41] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:18:42] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:18:51] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:52] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:19:01] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:19:02] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:21] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:19:23] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:19:31] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:42] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:20:01] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:20:12] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:20:22] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:20:22] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:20:31] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:20:41] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:20:52] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:20:52] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:21:02] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:21:11] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:21:22] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 
16 seconds ago with 0 failures [15:21:22] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:21:31] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:21:37] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:21:43] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:21:53] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:22:04] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:22:05] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:22:06] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:22:14] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:22:14] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:22:34] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:22:35] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:22:37] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:23:13] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:23:24] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:23:24] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:23:33] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:34] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:23:34] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:23:53] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:24:13] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:24:13] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:24:24] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:24:33] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:24:43] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:24:46] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:24:47] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:24:47] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is 
currently enabled, last run 9 seconds ago with 0 failures [15:24:54] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:24:59] (03CR) 10Giuseppe Lavagetto: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [15:25:16] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:25:24] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:25:25] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:25:35] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:25:43] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:44] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:26:03] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:26:14] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:26:15] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:26:15] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:26:16] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:26:16] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:26:23] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:26:33] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:26:34] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:26:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:27:05] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:27:06] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:27:06] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:27:24] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:27:26] !log hashar: deleted all content from https://doc.wikimedia.org/ :-( Will regenerate. 
[15:27:32] Logged the message, Master [15:27:54] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:28:03] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:28:29] (03PS1) 10Filippo Giunchedi: gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 [15:29:08] (03PS2) 10Filippo Giunchedi: gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 [15:29:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: add swift iops dashboards [puppet] - 10https://gerrit.wikimedia.org/r/173010 (owner: 10Filippo Giunchedi) [15:32:44] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:35:09] (03PS1) 10Glaisher: Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) [15:35:53] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [15:36:59] <_joe_> I'm already looking [15:37:53] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 16440 bytes in 0.387 second response time [15:45:37] matanya: hey [15:45:47] hello legoktm [15:46:15] I'm reading the logs right now, is anything still broken? [15:47:14] yes legoktm : Tomer T is still missing here: https://meta.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=Tomer+T on he.wiki [15:47:17] hashar: on gallium manually removed /var/lib/puppet/state/agent_catalog_run.lock , prevented manual run of puppet on the host [15:48:09] matanya: was this a local renameuser or a global one? [15:48:15] local [15:48:22] i wanted to merge [15:48:22] ahh [15:48:24] ok [15:49:05] manybubbles: Should we assume you'll SWAT today? [15:49:49] legoktm: is that the answer? :D [15:50:30] matanya: there's a script to fix incomplete/broken renames, I'm figuring out how to use it right now :P [15:52:08] anomie: I'll do it! [15:54:44] RECOVERY - HHVM busy threads on mw1114 is OK: OK: Less than 1.00% above the threshold [60.0] [15:58:53] gi11es: around for swat deplor? [15:58:56] deploy [15:59:27] manybubbles: ready to deploy ice harpoons [15:59:46] gi11es: wonderful. waiting for the signal from jouncebot [15:59:58] matanya: did someone create the old username as a new account? [16:00:04] manybubbles, anomie, ^d, marktraceur, gi11es: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T1600). [16:00:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "agreed it'd be nice to have deduplication, one way to reduce noise is to let the check run on masters and master-candidates only perhaps? 
" [puppet] - 10https://gerrit.wikimedia.org/r/172527 (owner: 10Filippo Giunchedi) [16:00:08] (03CR) 10Manybubbles: [C: 032] Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 (owner: 10Gilles) [16:00:14] legoktm: not that i know of [16:00:26] I see one in the database [16:00:34] (03Merged) 10jenkins-bot: Revert "Enable JPG thumbnail chaining on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172960 (owner: 10Gilles) [16:00:38] user_id = ט [16:00:40] why i look at it, it says "username not registered" [16:00:44] er [16:00:45] *when [16:00:51] user_id = 249682 [16:01:15] gi11es: Harpoons don't work on comets or patches [16:01:15] matanya: https://he.wikipedia.org/wiki/%D7%9E%D7%99%D7%95%D7%97%D7%93:%D7%AA%D7%A8%D7%95%D7%9E%D7%95%D7%AA/%D7%AA%D7%95%D7%9E%D7%A8_%D7%98 is the old user? [16:01:27] yes [16:01:27] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT revert JPG thumbnail chaining on all wikis except commons (duration: 00m 05s) [16:01:30] Logged the message, Master [16:01:31] ok [16:01:34] gi11es: ^^ [16:01:36] so someone created the username again [16:01:50] manybubbles: testing... [16:01:51] maybe he logged in ? [16:02:06] idk [16:02:16] matanya: I'll run the script after I find some food and once the deploy is over [16:02:25] thank you [16:03:28] (03PS6) 1001tonythomas: Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [16:04:24] manybubbles: deploy is great success [16:04:34] gi11es: wonderful! consider yourself SWATed [16:04:38] thanks! [16:09:46] (03CR) 10Manybubbles: [C: 032] Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) (owner: 10Glaisher) [16:09:53] (03Merged) 10jenkins-bot: Set wgCheckUserForceSummary to true by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173015 (https://bugzilla.wikimedia.org/71457) (owner: 10Glaisher) [16:10:36] Glaisher: around to verify? [16:10:36] manybubbles: Thanks! [16:10:44] not really testable :P [16:11:17] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT force summary when running checkuser query on all wikis (duration: 00m 04s) [16:11:22] Logged the message, Master [16:11:26] Glaisher: ah. 
well, ok [16:11:27] done then [16:11:39] :) [16:12:08] (03CR) 10Manybubbles: [C: 032] Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 (owner: 10Manybubbles) [16:12:16] (03Merged) 10jenkins-bot: Reenable regexes now that we've fixed the plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172991 (owner: 10Manybubbles) [16:13:35] !log manybubbles Synchronized wmf-config/CirrusSearch-common.php: SWAT reenable accelerated regex search (regex search still disabled) (duration: 00m 03s) [16:13:38] Logged the message, Master [16:14:22] !log manybubbles Synchronized wmf-config/CirrusSearch-production.php: SWAT reenable regex search now that it will not crash elasticsearch (duration: 00m 04s) [16:14:25] Logged the message, Master [16:15:52] (03PS14) 10GWicke: Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [16:16:27] !log manybubbles Synchronized php-1.25wmf8/extensions/CirrusSearch/: SWAT update cirrussearch to fix slow prefix queries (duration: 00m 05s) [16:16:31] Logged the message, Master [16:16:32] (03CR) 10GWicke: "Alex, thanks for your review." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [16:16:35] !log logstash elasticsearch cluster is pretty messed up. logstash1002 has lost shards for all indices except for today, and it's master for that one. [16:16:38] Logged the message, Master [16:18:06] (03PS3) 10Giuseppe Lavagetto: monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 [16:18:50] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: move monitor_host to monitoring::host [puppet] - 10https://gerrit.wikimedia.org/r/172530 (owner: 10Giuseppe Lavagetto) [16:20:23] <_joe_> let's see what I broke this time :/ [16:20:39] breaker alert :P [16:21:10] !log disk utilization is 94% on logstash1002, 92% on logstash1001 and 91% on logstash1003. Too much data in indices even with replica count bumped down to 1 for the small disks we have today. [16:21:12] Logged the message, Master [16:21:30] <^d> bd808: Ouch. Can we do anything to help? [16:22:19] ^d: I think I'm just going to start killing older indices. Drop us down to 20 days of retention instead of 31 [16:22:30] <^d> Hmm [16:23:10] selective deleting would take more disk unfortunately since they have to be compacted to gc [16:23:24] <^d> I was thinking of pulling a production elastic* box over to logstash* duty for the time being. [16:24:19] oh. how about I just drop replica count to 0 for closed indexes? [16:24:45] That would free a bunch of disk [16:24:49] <^d> That'd work too :) [16:25:02] without getting rid of the data completely [16:25:20] and short of a full disk crash shouldn't be too risky [16:29:53] * manybubbles is done with SWAT [16:32:48] (03PS7) 1001tonythomas: Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [16:36:57] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [16:43:10] (03PS8) 1001tonythomas: Make BounceHandler extension work on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [16:44:01] Jeff_Green: https://gerrit.wikimedia.org/r/#/c/168622/ :) [16:44:14] we are installing on test2wiki ! Looks ready to go ? [16:45:06] !log preparing to upgrade analytics1026 to trusty [16:45:10] Logged the message, Master [16:45:46] tonythomas: ok by me, you've already got the http endpoint configured? 
[16:46:01] jgage: yt? [16:46:11] !log dropped replica count to 0 for logstash indices from 2014-10-14 through 2014-10-29. See https://phabricator.wikimedia.org/P73 for the commands. [16:46:13] Logged the message, Master [16:46:33] waiting for legoktm / Reedy_ to merge https://gerrit.wikimedia.org/r/#/c/172322/7/ ! [16:46:44] ok [16:47:31] legoktm: we can get that one done today ? [16:48:22] tonythomas: I don't know, depends on whether Reedy_ can get a deployment window for it since this is a new extension [16:48:57] yeah :( [16:49:27] (03PS1) 10Alexandros Kosiaris: rrdcached tuning [puppet] - 10https://gerrit.wikimedia.org/r/173032 [16:49:45] !log restarted elasticsearch on logstash1002 [16:49:48] Logged the message, Master [16:50:20] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:50:50] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:50:51] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: red, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 45, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 70, initializing_shards: 4, number_of_data_nodes: 3 [16:51:09] matanya: does everything look fine now? [16:54:48] (03PS1) 10Jgreen: re-add aluminium DNS [dns] - 10https://gerrit.wikimedia.org/r/173033 [16:54:51] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:03] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:26] (03PS1) 10Jgreen: add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 [16:55:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 15 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 11, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 61, uinitializing_shards: 4, unumber_of_data_nodes: 3} [16:55:35] (03CR) 10jenkins-bot: [V: 04-1] add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 (owner: 10Jgreen) [16:55:37] Can some opsen ack the icinga checks on the logstash hosts please? 
[16:55:55] It will be a bit thrashy as the cluster rebalances [16:56:51] (03Abandoned) 10Jgreen: re-add aluminium DNS [dns] - 10https://gerrit.wikimedia.org/r/173033 (owner: 10Jgreen) [16:57:20] !log dropped replica count to 0 for logstash indices from 2014-10-30 and 2014-10-31. [16:57:22] Logged the message, Master [16:57:29] (03Abandoned) 10Jgreen: add git-setup script for gerrit user convenience [dns] - 10https://gerrit.wikimedia.org/r/173034 (owner: 10Jgreen) [16:58:34] Who is taking care of OCG software these days? cscott? [16:58:52] bd808: yup [16:58:58] what's up? [16:59:27] The elasticsearch behind logstash is quite frequently logging that records are coming with a "details.log.raw" field >32k [16:59:30] (03PS1) 10Jgreen: add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 [16:59:38] and refusing to save those records [16:59:39] (03CR) 10jenkins-bot: [V: 04-1] add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [16:59:56] It's not a problem but I thought you might want to know that it happening [17:00:18] bd808: hm, yeah, i save the metabook.json file for debugging in various places, and for some of the wikibooks that can be quite large [17:00:33] bd808: but that explains why i often can't find that information in logstash when i'm looking for it [17:00:44] This is probably something that we can fix in the mapping for the index [17:00:52] (03CR) 10Jgreen: [C: 032 V: 032] "overriding irrelevant lint check" [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [17:01:27] bd808: is it dropping the entire log entry in that case? it would be nice if it just dropped the 'raw' field, which is redundant anyway [17:02:00] It seems to be rejecting the whole document [17:02:07] (03PS1) 10Jgreen: re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 [17:02:16] (03CR) 10jenkins-bot: [V: 04-1] re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 (owner: 10Jgreen) [17:05:17] (03Abandoned) 10Jgreen: re-re-re-add DNS for aluminium [dns] - 10https://gerrit.wikimedia.org/r/173036 (owner: 10Jgreen) [17:05:18] bd808: yeah, that matches my experience futilely searching for those logs [17:05:35] (03Abandoned) 10Jgreen: add gerrit setup script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173035 (owner: 10Jgreen) [17:05:50] bd808: i can probably truncate them on the sender side at, say, 24k. [17:06:07] It's apparently caused by the change I made to the schema to make raw values use the "doc_values" type [17:06:12] it would be just as frustrating if i was searching through the logs for it, but a little less mysterious. 
[17:06:40] Which I did for a particular field really (message.raw) but applied everywhere [17:06:56] I can fix that in the mapping template [17:07:05] * bd808 opens a bug [17:07:31] (03PS1) 10Jgreen: re^10-add aluminium to DNS [dns] - 10https://gerrit.wikimedia.org/r/173039 [17:08:20] (03CR) 10Jgreen: [C: 032 V: 031] re^10-add aluminium to DNS [dns] - 10https://gerrit.wikimedia.org/r/173039 (owner: 10Jgreen) [17:09:48] (03PS1) 10Jgreen: add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 [17:09:57] (03CR) 10jenkins-bot: [V: 04-1] add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 (owner: 10Jgreen) [17:10:18] (03PS2) 10Ori.livneh: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 [17:10:20] (03PS4) 10Ori.livneh: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 [17:11:10] (03CR) 10Ori.livneh: [C: 032] "added check that val is not empty and added test for it in followup patch" [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [17:11:18] (03PS2) 10Jgreen: add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 [17:11:25] (03Merged) 10jenkins-bot: Fix bug in MonitoringProtocol._getConfigStringList [debs/pybal] - 10https://gerrit.wikimedia.org/r/172949 (owner: 10Ori.livneh) [17:11:38] (03CR) 10Ori.livneh: [C: 032] Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [17:11:56] (03Merged) 10jenkins-bot: Tests for `pybal.monitor` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172805 (owner: 10Ori.livneh) [17:12:14] (03CR) 10Jgreen: [C: 032 V: 031] add git-setup gerrit config script for convenience [dns] - 10https://gerrit.wikimedia.org/r/173040 (owner: 10Jgreen) [17:19:25] (03CR) 10Ori.livneh: "Why wouldn't we use Upstart? This is a trivial case, it'll be easy to port it to systemd if / when the day comes." [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [17:28:20] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [17:29:41] "if"? :) [17:29:59] paravoid: well, init.d scripts are hardly closer, no? [17:30:12] closer to what? [17:30:15] systemd [17:30:27] I don't get it? [17:30:34] re: ottomata's comment on that patch, "I think you will be hard pressed to convince ops to use upstart for this, but I could be wrong." [17:31:05] nah I don't mind it [17:31:05] i don't mind using init.d scripts instead, but it seems like a bizarre objection to me [17:31:27] that said, I think it's pretty crazy to ship a package that we made with an init script, then remove it in puppet and replace it with upstart [17:31:34] ok [17:31:37] it's our package, not a third-party one [17:31:51] so whatever you folks do, better do it in the package itself I would say [17:32:27] paravoid: is the package just for our use, or are there plans to usptream it? [17:32:52] I don't know but I doubt it's upstreamable at this point [17:32:55] it's not compatible with varnish 4 [17:33:01] oh [17:33:16] (afaik) [17:33:29] how bad would it be to just have another varnishncsa instance instead? [17:33:38] ? [17:34:05] the narrative arc here is that i'd like to provision an additional logging endpoint on bits for client-side profiling data [17:34:23] ok [17:34:24] hi! 
[17:34:30] * Nemo_bis waits for the plot twist [17:34:32] mark okayed the idea way back when (a year ago i think) but said we should wait for varnishkafka, which was in the works at the time [17:34:49] it rings a bell [17:34:51] if paravoid doesn't mind upstart, then go for it! [17:35:10] well, but so whatever you folks do, better do it in the package itself I would say [17:35:12] although, i'm not sure it is worth changing the .deb package for it, it would be hard to make the .deb package support multiple instances [17:35:13] no? [17:35:23] why would it be hard? [17:35:37] what would you name the default one? [17:35:42] just varnishkafka? [17:35:44] varnishkafka-default? [17:35:45] "default"? :P [17:35:46] lester! [17:35:48] or sebastian [17:35:50] ha [17:36:04] i'm more worried about the deployment [17:36:05] dunno I don't have any strong opinions here [17:36:09] not just about the naming, in general [17:36:14] even if the package was upstart [17:36:22] it just seems a bit crazy on our part to do something in a .deb then undo it in puppet [17:36:25] puppet would probably remove the default init file and put in its own [17:36:27] when we're the only ones using it :) [17:36:27] for multi instance [17:36:45] paravoid: tell that to the upstart / php.ini files in the hhvm package! :P [17:36:50] but I don't care all that much I should say [17:37:05] php.ini is config, so that's okay [17:37:10] hello, you have reached paravoid during his unopinionated time of day [17:37:11] :p [17:37:12] hhvm's upstart is pretty crazy on its own imho [17:37:19] but that's a different discussion altogether :) [17:37:21] ottomata: OK, so how about: do it via puppet for now, and then i'll work the change back into the deb [17:37:44] ori, if paravoid is ok with upstart in general, it is cool w me [17:37:51] i don't have a preference about the .deb using upstart over init.d [17:37:53] paravoid looooves upstart [17:38:03] even if puppet removes the init.d and then installs upstart [17:38:10] so, ok, ori, i will review that change harder then. [17:38:18] danke! [17:40:52] I think init.d scripts aren't DRY at all [17:40:59] and are always messy and usually buggy in some ways [17:41:11] yes [17:41:34] upstart is fine (although as you know ori, not a big fan of the upstart+start-stop-daemon combination) [17:41:37] and systemd is fine as well [17:41:59] systemd seems to be gaining traction and has some really cool features [17:42:04] i have not used systemd at all before, excited to try it one day... [17:42:19] wow, someone is saying something positive about systemd [17:42:22] I like upstart's config file better than ini files, but what can you do [17:42:38] there are things I don't like about upstart [17:42:44] can we use systemd in trusty? i saw some libs are installed, but it's not the full thing, right? [17:42:44] so, i'm excited to try an alternative [17:42:47] i know nothing more about it [17:42:47] and I find systemd's feature creep a bit worrying (PPP/PPPoE, wtf) [17:42:58] PPP/PPPoE -- wat? [17:43:00] ori: there is a PPA by the Ubuntu devs but I wouldn't do it... [17:43:01] logind, wtf [17:43:06] aye cool [17:43:14] ori: there's "networkd" nowadays, among other things [17:43:16] jgage, fyi, just in case, i'm working on the cluster upgrade to trusty [17:43:19] just did analytics1026 [17:43:22] going to do analytics1003 [17:43:24] awesome!
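To make the multi-instance idea concrete: an instanced Upstart job of the sort being discussed could look roughly like the following. A hedged sketch; the per-instance config path and the use of varnishkafka's -S config flag are assumptions about the wiring, not the actual patch:

    # /etc/init/varnishkafka.conf
    description "varnishkafka log producer (instanced)"

    # one job definition, many instances:
    #   start varnishkafka NAME=webrequest
    #   stop  varnishkafka NAME=webrequest
    instance $NAME
    respawn

    exec /usr/bin/varnishkafka -S /etc/varnishkafka/$NAME.conf

Puppet would then only have to render one config file per instance and start/stop the matching NAME, whether the job ships in the .deb or is dropped in by the module.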
[17:43:27] neither of those have any prod services on them [17:43:30] i rebooted all workers yesterday :) [17:43:32] did you get through the jvm restart [17:43:34] cool [17:43:37] any nodes you haven't done yet? [17:43:44] kafka, 1010 [17:43:49] ori: https://launchpad.net/~pitti/+archive/ubuntu/systemd [17:43:54] ok cool, i'll do those ones last anyway [17:43:56] "The trusty packages were removed, as they are way too incomplete" [17:44:07] blergh [17:44:12] oh well [17:44:15] next up: analytics1003, analytics1027, those might be all I do today [17:44:20] cool ok [17:44:21] (03PS2) 10Giuseppe Lavagetto: puppet: get rid of the nagios_group global variable [puppet] - 10https://gerrit.wikimedia.org/r/172531 [17:44:26] systemd-nspawn is pretty cool [17:44:32] btw, tracking here: [17:44:33] https://phabricator.wikimedia.org/T1200 [17:44:40] ok [17:44:44] maybe the feature creep will eventually produce upstartd [17:44:57] systemd makes me want to run screaming [17:44:59] chasemp: gave me a token! [17:45:02] i don't know what that means but thanks! [17:45:06] people are even talking about forking debian to avoid it [17:45:08] <_joe_> paravoid: systemd socket activation would help us with hhvm just fine I think, I'm experimenting on my sid with it [17:45:17] it's worth 10 million schrute bucks [17:45:22] jgage: just crazy people with no contributions whatsoever [17:45:23] chasemp never awarded me a token [17:45:39] no one is seriously saying that [17:45:42] although I wish they did [17:45:42] <_joe_> jgage: not "people", one italian dickhead [17:45:48] it'd put an end to that debate [17:45:57] <_joe_> and I just made a friend [17:46:20] i would like to be sold on systemd, but terrible scope creep and lots of new code.. i don't trust it [17:46:43] also not a lot of people know that, but half of the emails on the lists are being sent (or fueled) by a well-known troll [17:46:46] with multiple identities [17:46:55] http://geekfeminism.wikia.com/wiki/MikeeUSA [17:47:11] jgage: also linus hates systemd. his opinion counts a lot in my book. [17:47:41] yes [17:47:44] citation needed [17:48:01] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:48:15] who cares really [17:48:28] it's just a process that parses config files and spawns processes [17:48:31] and some other shit [17:48:36] paravoid: lmgtfy "linus systemd feud" [17:48:36] really, who cares [17:48:38] and touches network and authentication [17:48:56] I have only heard linus say that binary logs are bad, otherwise it seems "upstream" is not hostile towards systemd [17:48:57] large code base crammed down throats which is critical to linux operation.. yeah i don't trust it [17:49:06] and let's face it, binary logs, meh [17:49:21] Linus says: "I don't actually have any particularly strong opinions on systemd itself. I've had issues with some of the core developers that I think are much too cavalier about bugs and compatibility, and I think some of the design details are insane (I dislike the binary logs, for example), but those are details, not big issues."
[17:49:25] http://www.zdnet.com/linus-torvalds-and-others-on-linuxs-systemd-7000033847/ [17:49:34] !log preparing for trusty upgrade of analytics1003 [17:49:37] I don't read that as "hates" [17:49:41] Logged the message, Master [17:50:00] my tinfoil hat is telling me that the nsa would love for people to adopt this large not thoroughly tested codebase which has its fingers in auth and every process [17:50:07] and even if he did, I don't really care about what Linus thinks [17:50:14] he doesn't use/like Debian either, for starters :) [17:50:38] what is your argument in favor of systemd, paravoid? what advantages do you see? [17:50:49] even shuttleworth called systemd "hugely invasive and hardly justified" at one point. ;) [17:50:50] against what? [17:51:04] not against, for [17:51:06] paravoid: that's linus trying his best to be diplomatic. [17:51:14] what's the alternative? [17:51:21] what we have now [17:51:29] what we have where? [17:51:38] who's "we" [17:51:40] nevermind. [17:51:44] no, seriously [17:51:57] for pid1, Debian has SysV, Ubuntu has upstart [17:52:14] yes, so i'm asking why you support replacing those things with systemd [17:52:51] no one sane would advocate staying with SysV [17:53:01] I can go on and on about that, but that really wasn't the debate [17:53:18] everyone agreed that we need to ditch it and it was long overdue, and that's why Ubuntu ditched it anyway [17:53:23] years ago [17:53:52] (and no similar fuss was made, I should say) [17:54:04] that's an argument against sysv, not for systemd [17:54:13] so now the debate became upstart vs. systemd, as all the other contenders didn't even come close [17:54:42] and that debate was highly political, as there are corporate interests at play here as well [17:55:00] on the technical side, I think that each had its merits [17:55:33] having one init system across all distributions seems very appealing to me [17:56:05] implementation-wise, systemd gets quite a few things better (like using cgroups to track daemons instead of ptrace tracing) [17:56:13] as well as integrating well with containers [17:56:31] its socket activation is definitely more complete [17:57:53] (e.g. upstart doesn't support SOCK_DGRAM at all) [17:58:13] I tend to agree that systemd has a lot of upsides and is a definite improvement over upstart [17:58:24] gwicke: hey, I see a Parsoid deployment window today, but it's specified in PDT. Did you mean, 14:00 PST?
[18:00:06] agreed [18:00:11] neither upstart nor systemd are very good at this community thing [18:00:20] thanks paravoid, those are the first concrete statements in favor of systemd that i have run across [18:00:35] upstart was mostly designed behind closed doors and contributions required a CLA with Canonical [18:00:50] greg-g: I'm trying to grab a deployment window from 11:00-13:00 PST, for CentralNotice things. Holler if there's a reason not to do that! [18:00:51] systemd at least has a public, active repo and takes pull reqs [18:00:55] yup [18:01:27] it's just that upstream devs are a bit... arrogant [18:01:28] I really need to gather up my remaining anti-systemd thoughts and pare them down better to the parts that are really valid [18:01:38] (and which are the root causes) [18:01:56] today's LWN article was a nice touch I think [18:02:20] "In the end, it comes down to this: it just is not that important. It is just a system initialization utility" [18:02:36] "But even if systemd turns out to be a wrong turn in the end, the current level of anxiety and heat is not called for." [18:02:37] really the only advantages i see to ditching sysv is init config files instead of init scripts, and faster boot (don't care) [18:02:50] nope, that's not just it [18:03:09] off the top of my head, though, I tend to think most of it revolves around an incomplete and poorly-specified new "API" for daemons/software vs the traditional set of POSIX interfaces, and the whole "let's rewrite everything as part of the systemd software collection" thing, which may at least be in part to other softwares' unwillingness/inability to switch to that "spec" [18:03:11] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:03:21] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:03:35] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 71, initializing_shards: 3, number_of_data_nodes: 3 [18:04:31] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server [18:05:39] (on that specific front of the pseudo-API stuff, I tend to think upstart is even worse, but at least allows easy fallback to SysV / plain shellscript. 
But I also think the sysv/shell stuff was fine at that level of the problem, even though it created problems elsewhere in the grand design of an init system) [18:06:27] but this pressure the systemd community seems to feel to rewrite lots of tools within their repo is damning, and I think directly related [18:06:55] it's like they are trying to fuel this debate too [18:06:59] it should be *easy* for the authors of traditional implementations of those tools to patch things up to work brilliantly with or without systemd and do the things systemd needs done [18:07:16] or at least, easy for the systemd community to figure that out and send them patches [18:07:21] they could just as easily call it "daemon friends of systemd" or something [18:07:40] they're not really all that tied together with pid1 [18:07:50] yeah, sure [18:07:55] but can you adopt systemd without all its friends? [18:08:16] depends on which one [18:08:19] they're not pid1, but I think part of the reason for the daemon-friends is that their traditional variants don't play well with systemd's overall design. [18:08:31] Debian didn't switch to networkd for example [18:08:35] that they don't, and that it seems easier to rewrite than send them patches, seems bad. [18:08:38] cool [18:09:02] I have to say, timedated is kinda interesting [18:09:52] oh and binary logs, that's my other separate thing to hate about systemd. fuck binary logs. [18:10:35] that's just for the journal though [18:10:42] well, sort of [18:10:44] which currently isn't even on persistent storage, that's not its point [18:11:00] and can pipe to syslog if you really want to [18:11:07] trying to figure out what problem timedated solves [18:11:21] does this matter? no processes .. swift-object-replicator on ms-be2010, but runs on all others [18:11:26] jgage: well-defined interface to get/set the timezone, for starters [18:11:34] it is intended that the journal supplant the traditional syslog(3) interface. the fact that there are currently transition mechanisms where both the journal and e.g. rsyslogd are running and passing data between them seems transitional. [18:11:52] i guess dpkg-reconfigure tzdata isn't so great [18:11:54] instead of readlink("/etc/localtime") and hope for the best [18:12:09] well-defined for applications as well that is [18:12:14] systemd's design encourages software authors to stop doing things like posix daemonization and syslog(3), and start just writing foreground procs that spew to stderr and let systemd take care of the rest [18:12:41] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -0.0005984306335 secs [18:12:45] my first reaction to that is ew, but maybe it's actually good if it simplifies their code [18:13:09] well it does, if what you're doing is very simple [18:13:29] if you're doing tricky things, then the new interfaces don't cut it because they're incomplete and/or poorly-thought-out. [18:13:47] and by tricky things I assume you mean graceful restarts :) [18:13:51] and then there's the whole "if FreeBSD doesn't run systemd, I can't get rid of the complex code anyways" for portable software. [18:13:59] arr yeah [18:14:14] paravoid: and non-trivial socket stuff that can't use their socket activation. [18:14:25] I'm not a big fan of socket activation in general [18:14:33] I think it's very counter-intuitive to the sysadmin for starters [18:14:59] I actually got used to writing http://cr.yp.to/daemontools.html services at $DAYJOB-1. With enough exposure you can get used to anything.
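To illustrate the model being described here (a foreground process writing to stderr, with systemd handling daemonization, logging and, optionally, the listening socket), a minimal sketch; the unit names, binary path, flag and port are hypothetical:

    # example.service
    [Unit]
    Description=Example foreground daemon

    [Service]
    # no double-fork, no pidfile, no syslog(3): run in the foreground and let
    # stdout/stderr flow to the journal
    ExecStart=/usr/bin/example-daemon --foreground
    StandardOutput=journal
    StandardError=journal

    [Install]
    WantedBy=multi-user.target

    # example.socket -- optional socket activation: systemd owns the listener
    # and starts the service on the first connection
    [Socket]
    ListenStream=8080

    [Install]
    WantedBy=sockets.target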
[18:15:04] Without socket activation, fully converting a piece of software to the simple model where systemd does all the hard stuff is hard to get right. [18:15:25] (because the socket isn't available immediately, and thus one needs systemd-specific notification of readiness for dependent things, etc) [18:15:42] bd808: upstart/systemd killing all those process supervisors is one of the things that make me most happy about this transition [18:16:26] * bd808 nods [18:16:29] sure [18:16:45] I totally agree. and I really hate start-stop-daemon -based initscripts too [18:16:52] those were horrible hacks and always had problems [18:16:57] yup [18:17:53] dunno, I'm hopeful that as all distros switch to systemd, all these missing features will be ironed out [18:18:19] i'd like to adopt it *after* that :) [18:18:23] at the end of the day, the reason I refrain from joining in the public ranting against systemd is that my overall view comes down to "this is cool, but if I were doing it, there's 5% of it I would like to see done better", to which the natural response is "well why didn't you write systemd then?" :p [18:18:35] to which I don't have a good answer. you have to give them props for even trying. [18:19:10] ever used "puppet kick" ? https://docs.puppetlabs.com/references/3.6.2/man/kick.html [18:19:32] is that aliased to rm? [18:19:53] :p [18:20:14] puppet kick needs puppet agent to listen on a port :P [18:20:14] I was trying systemd-nspawn on my build box (which has like 7 chroots) [18:20:17] it was pretty cool [18:20:20] machinectl and all that [18:21:02] nice [18:22:10] legoktm: still no he.wiki in https://meta.wikimedia.org/w/index.php?title=Special%3ACentralAuth&target=Tomer+T [18:22:23] matanya: yeah, they need to go to special:centralauth and merge their account [18:22:34] oh [18:22:39] but it doesn't say not attached. [18:22:45] right [18:23:11] matanya: probably https://bugzilla.wikimedia.org/show_bug.cgi?id=71773 [18:23:51] ok, legoktm, can anything be done? [18:24:03] they can go to special:centralauth and merge their account :P [18:25:01] systemd-nspawn sounds interesting, but this section of the manpage is a bummer: [18:25:05] Note that even though these security precautions are taken systemd-nspawn is not suitable for secure container setups. Many of the security features may be circumvented and are hence primarily useful to avoid accidental changes to the host system from the container. [18:28:34] ori: yt? [18:36:34] (03CR) 10Ottomata: "Ok if we are going to override the init.d script with this, we should probably do some combination of:" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:36:42] paravoid: still there? want to talk about cassandra module change? [18:36:52] are you still in unopinionated mode? [18:37:09] how hard do you want me to change it to use a config hash rather than explicit parameters? [18:38:27] awight: possibly; none of the Parsoid folks are in SF, so it doesn't matter to them ;) /cc cscott subbu [18:39:23] i don't know what we're talking about ;) [18:39:29] me neither. [18:39:58] ^ /cc gwicke awight ;) [18:40:11] cscott: subbu: no big deal--I noticed your deploy window was specified in PDT, that's all. [18:40:15] which is passé... [18:40:32] awight: sorry, next time i'll make sure it's specified in EST. [18:40:33] ;) [18:40:48] "Deployment windows are 'pinned' to the time in San Francisco and thus the UTC time will change due to the United States observance of Daylight Savings Time as appropriate."
[18:40:52] so says the wikipage :) [18:41:13] awight: that calendar entry is a zombie, by the way. parsoid really only schedules monday and wednesday deploys, but a thursday window keeps getting copy-and-pasted into each new week's schedule somehow. [18:41:14] awight, ah you mean PDT vs PST? [18:41:26] well, the confusing part was only in the source: 1400PDT is actually 1300PST. [18:41:30] yep [18:41:40] cscott: oops [18:41:43] it's 4pm EST. [18:41:58] anyway, it was useful for this week, since i really did want a thursday window this week [18:42:08] hehe ok I will not linger, then [18:42:09] but i suspect that the zombie thursday will always be stuck in PDT ;) [18:42:11] deletedededed [18:42:15] oh [18:42:17] damnit [18:42:25] who's on first! [18:42:31] i dunno [18:42:42] undid [18:42:46] i dunno is on third [18:42:51] home plate! [18:43:16] greg-g: anyway, just make a mental note that parsoid hates both thursdays and daylight savings time in general ;) [18:43:17] * awight checks ages of all participants :) [18:43:37] * subbu is confused but it is midnight here .. so will hide behind that excuse [18:43:47] * subbu is in india [18:44:17] subbu: https://www.youtube.com/watch?v=kTcRRaXV-fg [18:44:20] https://www.youtube.com/watch?v=kTcRRaXV-fg [18:44:22] yep, that [18:44:29] subbu: watch and laugh :) [18:45:16] ;) [18:45:52] * subbu finds headphones and clicks [18:47:35] PROBLEM - Disk space on labsdb1006 is CRITICAL: DISK CRITICAL - free space: / 1763 MB (3% inode=97%): [18:48:40] hilarious :) [18:50:35] (03CR) 10Ottomata: [C: 04-1] "Ah, also, I think the ::monitoring class will need to be turned into a define, since it works with the specific varnishakafka.stats.json " (032 comments) [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [18:53:02] i should have written 'third base!' above instead of 'home plate' ;) [18:53:39] while i've got op's attention (greg-g, et al) -- i've got the private half of an ssh key for the npmtravis user i'd like to put on jenkins slaves [18:53:49] i understand that these things are put in the ops-private repo somehow? [18:54:50] hashar said he just copied the key for jenkins-mwext-sync manually to /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa on the jenkins-slaves for the ve sync task, because he didn't have access to ops-private [18:55:04] omg this was out of control .. [18:55:06] how would i go about doing things if I wanted to do it "properly"? [18:57:36] cscott: https://wikitech.wikimedia.org/wiki/Puppet#Private_puppet [18:58:00] basically you would need help from someone in ops to add the keys to that repo [19:00:04] awight, AndyRussG: Dear anthropoid, the time has come. Please deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T1900). [19:00:21] (03PS1) 10QChris: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 [19:01:05] (03CR) 10jenkins-bot: [V: 04-1] Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [19:02:55] jgage: that sounds right. who in ops wants to help me? ;) [19:03:13] (03PS2) 10QChris: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 [19:03:21] jgage: we could maybe clean up/puppetize /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa for hashar at the same time [19:03:27] (03PS1) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. 
[puppet] - 10https://gerrit.wikimedia.org/r/173066 [19:04:12] cscott, is there a ticket for this issue? [19:05:19] i can make one. bugzilla or RT? https://bugzilla.wikimedia.org/show_bug.cgi?id=73334 covered the first part of the process. [19:06:05] if you could make an RT ticket that would be great, thanks! [19:06:13] ok. [19:06:42] (03PS2) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [19:06:49] * jgage looks forward to phab and the end of "which ticket system should i use?" [19:08:06] <^d> jgage: All of them! [19:13:12] RECOVERY - Disk space on labsdb1006 is OK: DISK OK [19:15:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 784 [19:16:00] lutetium alert is me, silencing... [19:16:32] done generating temp table, now enabling keys [19:16:37] err wrong window [19:16:40] Then in will be "in what N projects should I use" [19:23:06] jgage: RT #8866, for your pleasure [19:28:54] (03CR) 10Dzahn: "just fixing 3 lint warnings, described inline, so that puppet-lint now runs perfectly on init.pp" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:29:30] (03PS15) 10Dzahn: Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:34:37] (03CR) 10Dzahn: [C: 031] Initial RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:36:11] Looks like beta-bits is having issues [19:36:11] http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector&* [19:36:20] greg-g, chrismcmalunch ^ [19:36:32] Error: 503, Service Unavailable at Thu, 13 Nov 2014 19:35:27 GMT [19:36:59] marktraceur: https://bugzilla.wikimedia.org/show_bug.cgi?id=73377#c4 [19:37:06] yeah, that [19:37:55] Seems likely [19:38:53] i copied it from -labs [19:38:57] there was a discussion about it earlier [19:39:00] yeah [19:43:35] (03PS1) 10Awight: testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 [19:43:44] (03CR) 10jenkins-bot: [V: 04-1] testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 (owner: 10Awight) [19:43:49] blah [19:44:30] (03PS2) 10Awight: testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 [19:45:05] (03CR) 10Awight: [C: 032] testwiki: Enable new banner choice method [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173076 (owner: 10Awight) [19:52:18] !log starting upgrade to trusty on analytics1023 [19:52:20] Logged the message, Master [19:53:36] * YuviPanda waves [19:54:13] hallo! [19:55:33] (03CR) 10Dzahn: [C: 032] "previous comments from Ori and Alex said it looks good too, minor nitpicks have been addressed, and this doesn't apply the module on anyth" [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [19:55:58] (03PS1) 10JanZerebecki: Change ru.wikinews.org to HTTPS only. [puppet] - 10https://gerrit.wikimedia.org/r/173078 [19:59:27] (03PS1) 10Yuvipanda: shinken: Setup IRC notification for shinken [puppet] - 10https://gerrit.wikimedia.org/r/173080 [20:00:23] (03Abandoned) 10John F. Lewis: add AAAA for uranium [dns] - 10https://gerrit.wikimedia.org/r/172442 (owner: 10John F. Lewis) [20:04:48] RECOVERY - Disk space on vanadium is OK: DISK OK [20:05:55] wait, did we build ircecho ourselves? 
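For the key request being ticketed above, the production-side convention is that someone in ops adds the private half to the private puppet repo and a public puppet resource pulls it from there. A hedged sketch of what that resource might look like; the private-repo path, file name and ownership are all assumptions, not the actual change:

    file { '/var/lib/jenkins/.ssh/npmtravis_id_rsa':
      ensure => present,
      owner  => 'jenkins',
      group  => 'jenkins',
      mode   => '0600',
      # served from the private puppet repo, which only ops can edit
      source => 'puppet:///private/ssh/npmtravis_id_rsa',
    }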
[20:05:58] oh dear [20:08:03] (03CR) 10Ottomata: Move Eventlogging logs underneath /srv, which has more free space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:08:08] (03CR) 10Jforrester: [C: 04-1] "Needs Parsoid item to be deployed first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [20:11:48] (03PS3) 10QChris: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 [20:13:54] (03PS2) 10Dzahn: ssh server: make listening port configurable [puppet] - 10https://gerrit.wikimedia.org/r/172799 [20:14:06] (03PS1) 10JanZerebecki: Change ru.wikinews.org to HTTPS only. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173083 [20:14:16] !log powering down logstash1003 for a few mins to add disks [20:14:21] Logged the message, Master [20:14:57] (03CR) 10Dzahn: "Alex, the reason is to be able to have a setup where gerrit can listen on 22, for that sshd can't listen on 22, the host has 2 IPs" [puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [20:16:12] (03PS3) 10Andrew Bogott: Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 [20:16:21] PROBLEM - Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:51] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 25 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 49, uinitializing_shards: 4, unumber_of_data_nodes: 2} [20:17:08] (03PS2) 10Dzahn: ssh server: make ListenAddress configurable [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) [20:17:24] (03CR) 10Dzahn: ssh server: make ListenAddress configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [20:17:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 25 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 36, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 49, uinitializing_shards: 4, unumber_of_data_nodes: 2} [20:18:06] (03PS3) 10Dzahn: ssh server: make ListenAddress configurable [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) [20:18:09] (03PS4) 10QChris: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 [20:19:09] !log patched bugs 71111 and 71394 in wmf7 and wmf8 [20:19:14] Logged the message, Master [20:20:50] (03CR) 10QChris: Move Eventlogging logs underneath /srv, which has more free space (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:24:52] RECOVERY - Host logstash1003 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [20:25:46] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [20:25:50] Logged the message, Master [20:26:01] (03CR) 10John F. Lewis: [C: 031] "Code looks good. The idea behind it is valid from a perspective." 
[puppet] - 10https://gerrit.wikimedia.org/r/172799 (owner: 10Dzahn) [20:26:21] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 68, initializing_shards: 6, number_of_data_nodes: 3 [20:26:51] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 68, initializing_shards: 6, number_of_data_nodes: 3 [20:28:37] cmjohnson: If you can wait until the logstash1003 goes green before taking logstash1002 down that would be swell. [20:28:44] _joe_: lol, memcached role also is entangled with ganglia [20:28:53] yep [20:29:16] <_joe_> YuviPanda: oh man [20:29:18] YuviPanda: ganglia tentacles are everywhere [20:29:46] (03CR) 10Catrope: "Parsoid change is merged now, but will not be deployed until Monday (Nov 17)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172996 (owner: 10Jforrester) [20:30:28] _joe_: yeaaah. [20:30:29] _joe_: sigh [20:32:33] !log Dropped replica count of all logstash indices except today to 0. Should make rolling restarts faster during hardware upgrade. [20:32:35] Logged the message, Master [20:32:42] (03CR) 10Dzahn: [C: 032] Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [20:33:31] !log awight Synchronized wmf-config: Enabling CentralNotice banner choice on testwiki (duration: 00m 04s) [20:33:34] Logged the message, Master [20:35:18] _joe_: doesn't seem as involved [20:35:19] greg-g: can u confirm that "sync-dir wmf-config" should have deployed globals to testwiki (among others)? [20:35:48] <_joe_> YuviPanda: tomorrow I'll merge the varnish patch if my tests with the cron script are ok [20:35:51] (03CR) 10Dzahn: "Notice: /Stage[main]/Misc::Statistics::Limn::Data::Jobs/Misc::Statistics::Limn::Data::Generate[flow]/Git::Clone[analytics/limn-flow-data]/" [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [20:36:23] (03CR) 10Andrew Bogott: "This works, and doesn't break anything on Precise." [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [20:36:45] !log awight Synchronized php-1.25wmf7/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [20:36:50] Logged the message, Master [20:37:14] (03CR) 10Dzahn: [C: 031] "Ori, better now?" [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:37:45] should, I believe, Reedy confirm me ^^ [20:38:00] (03PS1) 10Ori.livneh: Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 [20:38:11] mutante: just double checking, did someone ask you to merge [20:38:15] the flow cron job? [20:38:26] I'd like to install the CentralNotice "infrastructure mode" schema to testwiki... [20:39:05] cmjohnson: logstash1003 is done copying from logstash1002 so you can take 1002 down whenever you are ready.
[20:39:07] (03CR) 10Ori.livneh: [C: 031] fix up ordering for salt-minion package, config, service [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:39:10] ottomata: gerrit did [20:39:16] bd808...cool [20:39:18] doing so now [20:39:34] !log powering down logstash1002 to add disks [20:39:37] Logged the message, Master [20:39:41] gerrit asked you to merge it? [20:39:50] i'm pretty sure it won't work without some other dependency that flow was doing [20:39:51] PROBLEM - NTP on logstash1003 is CRITICAL: NTP CRITICAL: Offset unknown [20:39:51] ottomata: yea, you said yourself it's fine too [20:39:51] i'm not really sure [20:39:55] dan and S were coordinating [20:40:21] i think either the git clone will fail (because there is a missing repo), or the cron will fail (because the repo doesn't have the proper script?) not sure. [20:40:33] if it shouldn't be merged, please vote it down or add that comment [20:40:41] (03CR) 10Ori.livneh: [C: 031] Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [20:40:42] it was in my review queue [20:40:46] the cronjob got added fine [20:40:46] HMMM [20:40:54] will it run! that is the question [20:40:56] it won't hurt anything [20:41:01] it might just cronspam, i really don't know [20:41:09] was waiting for the go ahead from dan/S on that one [20:41:14] last I heard dan told me to just wait [20:41:35] i will vote down from now on! i've never had this happen before :p [20:42:04] what never happened before? [20:42:21] confused [20:42:31] PROBLEM - Host logstash1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:44] (03PS1) 10Dzahn: Revert "Add cron job that generates flow statistics" [puppet] - 10https://gerrit.wikimedia.org/r/173089 [20:43:15] um, i guess i've never had someone else merge a change [20:43:30] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 31, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 41, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:43:40] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 23 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 21, utimed_out: False, uactive_primary_shards: 31, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 41, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:43:45] usually people will review, but let the folks who wrote the change do the merging [20:43:57] ottomata: it's a change by somebody who cant merge [20:44:09] mmm, originally :/ [20:44:39] (03CR) 10Dzahn: [C: 032] "per Ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/173089 (owner: 10Dzahn) [20:44:41] Where is the testwiki error log? [20:44:46] (03CR) 10John F. Lewis: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/172803 (https://bugzilla.wikimedia.org/35611) (owner: 10Dzahn) [20:44:51] RECOVERY - NTP on logstash1003 is OK: NTP OK: Offset -0.004023194313 secs [20:44:56] mutante: it's ok [20:45:02] not a big deal [20:45:05] heh. I never noticed the spurious 'u' chars from python unicode strings in the icinga alerts for elasticsearch before. [20:45:31] ha, probably we should have just checked with S before reverting, it might have been ok.
i just thought i was waiting for someone to tell me it was time to merge [20:46:40] ottomata: i can only know what it says on gerrit and that said it should now work [20:47:04] ottomata: millimetric cant merge himself.. so i dont know how it would work [20:47:12] ori: is this you? [20:47:13] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass trusted_group to Class[Keyholder] on node i-0000010b.eqiad.wmflabs [20:47:20] ori: deployment-bastion [20:47:24] (03PS2) 10Ori.livneh: Allow multiple instances [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 [20:47:37] YuviPanda: i'll look [20:48:06] YuviPanda: trusted_group is getting passed, not sure why it's barfing [20:48:23] aye, i guess, i didn't realize other opsies would look at it! not sure who added you, its ok, its my fault, i shoulda -1ed [20:48:40] ori: I wonder if it's rebase failure on the master [20:48:41] RECOVERY - Host logstash1002 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [20:48:47] that's how uncommon it is people look at other people's patches :/ [20:48:53] "never happened before" [20:49:33] bd808 coming back...lmk when it's good to get the 1001 [20:49:40] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 61, initializing_shards: 1, number_of_data_nodes: 3 [20:49:50] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 61, initializing_shards: 1, number_of_data_nodes: 3 [20:49:51] (03CR) 10Ori.livneh: "@ottomata: did everything except the monitoring bit" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [20:50:26] mutante: btw, no pressure, but do you have a sense for how long it would take to fulfill the mwdeploy key request in rt? [20:50:27] ori: aha, can't rebase because your change conflicts with the local hacks on betalabs around the mw role [20:50:29] cmjohnson: Good to go any time for taking logstash1001 offline. [20:50:30] (i saw that you took that one) [20:50:41] okay [20:50:48] bd808: your patches are conflicting with ori's new ones :) [20:50:53] fun [20:51:02] !log powering down logstash1001 to add disks [20:51:02] YuviPanda: {{sofixit}} [20:51:05] Logged the message, Master [20:51:30] bd808: ori I'm taking a shot now [20:51:37] thanks! [20:51:49] YuviPanda: which hack patch is conflicting? [20:52:00] ok, I've no idea how any of this works... [20:52:04] or why the local patch exists... [20:52:18] bd808: Applying: [LOCAL HACK] Bug 65591: User['mwdeploy'] shell => /bin/bash [20:52:26] ah. 
[20:53:11] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:53:12] That keeps puppet from creating a local mwdeploy user on beta hsots [20:53:14] *hosts [20:53:18] ah, hmm [20:53:29] If Ori is changing that upstream then it should be ok to remove [20:53:41] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 27 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 25, utimed_out: False, uactive_primary_shards: 26, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 37, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:53:44] bd808: ori can either of you look? [20:54:00] YuviPanda: link to ori's patch? [20:54:08] I have an interview in 5 minutes [20:54:14] I'm not sure which one conflicts [20:54:17] ah [20:54:20] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 27 threshold =0.1% breach: {ustatus: ured, unumber_of_nodes: 2, uunassigned_shards: 25, utimed_out: False, uactive_primary_shards: 26, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 2, uactive_shards: 37, uinitializing_shards: 0, unumber_of_data_nodes: 2} [20:55:01] YuviPanda: https://github.com/wikimedia/operations-puppet/commit/802c7568a627cea53438425a0c450f93fd22e273 [20:55:25] YuviPanda: We need the homedir for mwdeploy changed in labs ldap to match that [20:55:31] Then drop the hack patch [20:55:38] then merge should work [20:56:14] If you don't change the ldap homedir things will get messed up [20:56:34] because puppet will create a local user [20:56:50] and that won't work with the NFS4 shares [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141113T2100). Please do the needful. [21:00:40] ACKNOWLEDGEMENT - puppet last run on analytics1023 is CRITICAL: CRITICAL: Puppet has 1 failures ottomata I accidentally upgraded zookeeper here when I upgraded the OS to Trusty. I want to wait a day before I continue with the ZK upgrade to make sure all is fine, and then I will fix puppet. This should be gone by tomorrow. [21:01:09] jouncebot: ok, let's break things! [21:02:31] RECOVERY - Host logstash1001 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [21:03:31] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 62, initializing_shards: 0, number_of_data_nodes: 3 [21:04:01] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 2, active_shards: 62, initializing_shards: 0, number_of_data_nodes: 3 [21:04:06] ori: what about disabling or removing the instance that the .deb installs? [21:04:10] re. varnishkafka chang3e [21:05:48] (03CR) 10Ottomata: "What about removing the instance that the .deb installs? I.e. the init.d script? 
If we do that, we can just remove the default file then" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [21:06:34] ori: i will do it today [21:08:14] (03PS2) 10Yuvipanda: shinken: Setup IRC notification for shinken [puppet] - 10https://gerrit.wikimedia.org/r/173080 [21:12:00] is there any way to confirm I've deployed a MediaWiki configuration global correctly? [21:12:21] awight: eval.php [21:12:28] legoktm: woot! thanks [21:12:32] (03CR) 10Andrew Bogott: Get betalabs localsettings.js file from deploy repo (just like prod) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [21:13:17] (03Abandoned) 10Andrew Bogott: RT: allow login via LDAP [puppet] - 10https://gerrit.wikimedia.org/r/80577 (owner: 10Faidon Liambotis) [21:14:50] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:31] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:33] (03PS1) 10Awight: fix typo in global variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173100 [21:15:51] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:57] !log updated Parsoid to version dabff010 [21:15:59] Logged the message, Master [21:16:16] cscott: can u ping me when you're finished deploying? I'm trying to sneak in one last wmf-config change. [21:16:30] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:39] awight: sure. [21:16:41] PROBLEM - Parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:31] PROBLEM - Parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:18:34] PROBLEM - Parsoid on wtp1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:19:31] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.008 second response time [21:20:14] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [21:20:21] (03PS1) 10Dzahn: add new mediawiki deployment public key [puppet] - 10https://gerrit.wikimedia.org/r/173103 [21:21:11] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.043 second response time [21:21:55] (03CR) 10Dzahn: [C: 032] add new mediawiki deployment public key [puppet] - 10https://gerrit.wikimedia.org/r/173103 (owner: 10Dzahn) [21:22:50] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [21:22:50] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.062 second response time [21:22:53] (03PS5) 10Ottomata: Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [21:23:07] (03CR) 10Ottomata: [C: 032 V: 032] Move Eventlogging logs underneath /srv, which has more free space [puppet] - 10https://gerrit.wikimedia.org/r/172706 (owner: 10QChris) [21:23:22] (03CR) 10Dzahn: [C: 032] Provision mwdeploy's private key on tin [puppet] - 10https://gerrit.wikimedia.org/r/172919 (owner: 10Ori.livneh) [21:23:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [21:23:46] ottomata: merge conflict on puppetmaster, doing both,k [21:23:49] (03PS3) 10Ottomata: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:23:54] RECOVERY 
- Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.040 second response time [21:23:56] (03PS2) 10Ottomata: Link EventLogging logs into /var/log/eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/172884 (owner: 10QChris) [21:24:02] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.043 second response time [21:24:09] ottomata: don't merge that one! [21:24:21] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.010 second response time [21:24:25] not the link? [21:24:33] not the link. [21:24:35] ok [21:24:44] we need to cleanup the existing directory before. [21:24:57] But the logrotate thing that I screwed up yesterday would be gret. [21:25:04] s/gret/great/ [21:25:05] this one? [21:25:05] https://gerrit.wikimedia.org/r/#/c/172707/3/modules/eventlogging/files/logrotate [21:25:43] that would be ok to merge, but i was refering to [21:25:44] https://gerrit.wikimedia.org/r/#/c/173065/ [21:26:09] ok [21:26:17] (03PS3) 10Ottomata: Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [21:26:49] (03CR) 10QChris: [C: 04-1] "We need to clean up the old /var/log/eventlogging before" [puppet] - 10https://gerrit.wikimedia.org/r/172884 (owner: 10QChris) [21:28:28] (03CR) 10Ottomata: [C: 032] Turn EventLogging's logrotate config into template [puppet] - 10https://gerrit.wikimedia.org/r/173065 (owner: 10QChris) [21:31:32] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures [21:35:20] (03Abandoned) 10Ottomata: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:39:24] (03Restored) 10QChris: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [21:39:38] (03PS4) 10QChris: Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) [21:40:48] awight: ok, parsoid's done [21:41:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:41:21] cscott: thanks! [21:41:32] (03CR) 10Awight: [C: 032] fix typo in global variable name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173100 (owner: 10Awight) [21:41:51] greg-g: /me is sneaking back to tin for one last config change [21:42:01] shneaky shneaky [21:42:19] (03CR) 1020after4: [C: 031] "looks good but I guess we need to wait to merge this until migration is underway?" [puppet] - 10https://gerrit.wikimedia.org/r/172471 (owner: 10Dzahn) [21:42:31] !log awight Synchronized wmf-config: Enabling CentralNotice banner choice on testwiki, take 2 (duration: 00m 06s) [21:42:33] Logged the message, Master [21:49:51] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [21:53:42] greg-g: grr, we broke mobile on mediawiki and labs with this morning's deployment, pushing a fix now... 
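The "Retain 90 days" change above amounts to a logrotate stanza along these lines. A minimal sketch, assuming the /srv/log/eventlogging path from the earlier patches; the real template may use different options:

    /srv/log/eventlogging/*.log {
        daily
        rotate 90
        compress
        delaycompress
        missingok
        notifempty
    }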
[21:54:03] so much breakage today [21:54:12] (mobile broke beta cluster today as well) [21:55:51] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [22:02:25] (03CR) 10Ottomata: [C: 032] Retain 90 days of EventLogging logs [puppet] - 10https://gerrit.wikimedia.org/r/172707 (https://bugzilla.wikimedia.org/69029) (owner: 10QChris) [22:03:52] !log failed over hadoop namenode to analytics1004 [22:03:54] Logged the message, Master [22:04:01] OOO [22:04:03] it's happening! [22:04:03] :) [22:04:13] :D [22:04:36] On purpose for the upgrade? [22:05:16] separate from os upgrade, just upgrading jvm [22:05:52] Jvm upgrade is great too :-D [22:05:57] :) [22:06:36] still getting a lot of read requests on analytics1010 but according to the log it's handling them correctly, not sure if i should wait for those to die off before restarting the service [22:07:21] jgage: I am hitting hive a bit ... that would last several more hours. [22:07:37] but they are not important. So in case something dies, it's ok. [22:08:33] cool, ok. [22:09:26] loadavg on analytics1004 is suspiciously low although i see namenode log doing the right thing, i guess the load remaining on 1010 is because of resourcemanager which is not HA in this version of hadoop [22:10:14] PROBLEM - Hadoop NameNode Primary Is Active on analytics1010 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby [22:11:15] oops, didn't schedule downtime fast enough [22:11:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:12:24] !log restarted EventLogging jobs that write to disk, to pick up config changes [22:12:28] Logged the message, Master [22:12:35] hm, yeah, jgage, the namenode doesn't really do that much [22:12:44] unless it needs to access lots of files or file metadata [22:12:45] in hdfs [22:13:15] back in a bit, working intermittently for a few more hours, but i'm kinda half here [22:13:16] byeee [22:13:32] ciao [22:13:38] !log awight Synchronized php-1.25wmf7/extensions/CentralNotice: push CentralNotice updates (duration: 00m 05s) [22:14:13] (03PS1) 10EBernhardson: Share parsoid cookie forwarding config for VE/Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 [22:14:16] !log awight Synchronized php-1.25wmf8/extensions/CentralNotice: push CentralNotice updates (duration: 00m 04s) [22:14:19] Logged the message, Master [22:15:06] (03CR) 10EBernhardson: "looking for a +1 before putting this up for SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 (owner: 10EBernhardson) [22:16:22] qchris: i'm going to restart resourcemanager now. not sure whether that will interrupt or just stall your hive queries. [22:16:36] Sure. Let's find out. [22:16:39] ok! [22:16:56] done [22:17:05] I did not notice anything. [22:17:20] Now I did! [22:17:27] heh [22:17:28] Everything crashed hard :-D [22:17:29] shinken-wm: hi? [22:17:32] aw :( [22:17:39] java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1409078537822_77062' doesn't exist in RM. [22:17:40] well, now we know what happens when i do that!
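The NameNode failover !logged above is normally driven with hdfs haadmin. A hedged sketch; the NameNode service IDs here are made up, the real ones come from dfs.ha.namenodes.* in hdfs-site.xml:

    # make the standby (analytics1004) active, demoting the current active NameNode
    sudo -u hdfs hdfs haadmin -failover analytics1010-nn analytics1004-nn
    # confirm which NameNode is now active
    sudo -u hdfs hdfs haadmin -getServiceState analytics1004-nn

As the chat notes, the ResourceManager has no such standby in this Hadoop version, so restarting it kills running YARN applications outright, which is exactly the ApplicationNotFoundException seen above.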
[22:17:45] Right :-) [22:18:02] hadoop 2.5 / cdh 5.2 has HA resourcemanager [22:19:11] !log hadoop: analytics1010 is again active namenode [22:19:15] Logged the message, Master [22:19:20] RECOVERY - Hadoop NameNode Primary Is Active on analytics1010 is OK: Hadoop.NameNode.FSNamesystem.tag_HAState OKAY: active [22:25:14] (03PS1) 10Ori.livneh: keyholder: fix `keyholder` script [puppet] - 10https://gerrit.wikimedia.org/r/173177 [22:26:10] (03CR) 10Dzahn: "20after4: thanks for the review, is the script going to handle also bugs.wikimedia.org ?" [puppet] - 10https://gerrit.wikimedia.org/r/172471 (owner: 10Dzahn) [22:26:24] greg-g: ok... done unbreaking :~$ [22:26:54] (03CR) 10Ori.livneh: [C: 032] keyholder: fix `keyholder` script [puppet] - 10https://gerrit.wikimedia.org/r/173177 (owner: 10Ori.livneh) [22:27:21] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:29:07] (03CR) 10Ori.livneh: "Ottomata: I don't remove it, but in init.pp I ensure the service is stopped and disabled." [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh) [22:31:06] awight: for now.... [22:31:13] (03CR) 10Rush: [C: 031] "I reviewed this (did not run the tests) but from what I understand there is no harm here to overall pybal and ori's explaination makes sen" [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:31:29] chasemp, mutante: thanks! [22:31:46] (03CR) 10Ori.livneh: [C: 032] Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:32:02] (03Merged) 10jenkins-bot: Move tests to pybal.test; use Twisted's test runner [debs/pybal] - 10https://gerrit.wikimedia.org/r/173086 (owner: 10Ori.livneh) [22:32:51] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:00] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:33:36] hmm [22:33:44] _joe_: i have an interview in 10 mins, not sure how much i'd be able to look at that [22:33:57] <_joe_> mmmh [22:34:06] <_joe_> I'll take a look, then go to bed [22:35:08] <_joe_> very high load, it seems [22:35:46] TimStarling: any chance you can look? I have an interview in ten minutes [22:35:54] and it's very late for _joe_ [22:36:01] this is re: mw1114 alerts above [22:36:14] ok [22:36:16] not the nicest way to say "good morning", but.. :/ [22:36:17] thanks [22:37:15] (03Abandoned) 10Dzahn: nickel: remove ganglia, re-add in eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/172862 (owner: 10Dzahn) [22:38:42] <_joe_> ori: all of API is going down [22:40:35] TimStarling: just in case you didn't see that last message from _joe_ re the API cluster going down :/ [22:40:49] (03PS1) 10Dzahn: remove nickel from puppet [puppet] - 10https://gerrit.wikimedia.org/r/173186 [22:41:12] PROBLEM - HHVM busy threads on mw1114 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [90.0] [22:41:31] (03CR) 10Jforrester: "Looks fine from a VE perspective. One minor suggestion about comments." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 (owner: 10EBernhardson) [22:44:07] (03PS2) 10EBernhardson: Share parsoid cookie forwarding config for VE/Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/173175 [22:50:01] bd808: i guess i'm going to make an ext4 raid0 mounted on /var/lib/elasticsearch to copy elastic10xx hosts, unless you feel something different would be appropriate? 
(for logstash) [22:51:51] jgage: Sounds good to me, but we need to make sure we copy over the data that is currently in /var/lib/elasticsearch. I dropped the replica count to 0 for most indices so we can't afford to lose any data [22:52:14] yep, will do [22:52:22] \o/ [22:52:31] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [22:52:54] i'll make the volume, mount it on /mnt, shut down elasticsearch, copy the data, remount in the proper place, then start elastic again [22:54:38] lots of search stuff being depooled on lvs1003 seems you guys know? search1020.eqiad.wmnet [22:55:07] (03CR) 10Chad: Add read only configuration for ElasticSearchTTMServer (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172534 (owner: 10Nikerabbit) [22:55:27] <^d> chasemp: search1020 doesn't matter. [22:55:47] 19 and 16 too but good enough for me :) [22:56:13] <^d> search1001-1006 matter because they serve enwiki. [22:56:24] <^d> 1017, 1018 matter because they serve prefix searching. [22:56:35] <^d> Rest can flap until they die :p [22:57:25] <^d> 19, 16 and 20 are all in pool 4. [22:58:31] (03CR) 10Ottomata: "I think we should remove it, no?" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 (owner: 10Ori.livneh)
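Written out, the disk swap described above looks roughly like this. A hedged sketch; the member devices and md number are assumptions, and the new array also needs an fstab (or mdadm.conf) entry to survive a reboot:

    # build the raid0 array and filesystem
    mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
    mkfs.ext4 /dev/md2

    # copy the existing data with elasticsearch stopped, then swap the mount
    mount /dev/md2 /mnt
    service elasticsearch stop
    rsync -a /var/lib/elasticsearch/ /mnt/
    umount /mnt
    mount /dev/md2 /var/lib/elasticsearch
    service elasticsearch start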
go ahead and remove public ip along with site.pp" [dns] - 10https://gerrit.wikimedia.org/r/172819 (owner: 10Dzahn) [23:36:42] (03CR) 10Dzahn: [C: 032] remove nickel from puppet [puppet] - 10https://gerrit.wikimedia.org/r/173186 (owner: 10Dzahn) [23:39:03] !log kaldari Synchronized wmf-config/mobile.php: Adding WikiGrok A/B test start and end times (duration: 00m 03s) [23:39:07] Logged the message, Master [23:39:50] (03PS1) 10Dzahn: remove nickel from ganglia server aliases [puppet] - 10https://gerrit.wikimedia.org/r/173197 [23:40:34] (03CR) 10Dzahn: [C: 032] remove nickel from ganglia server aliases [puppet] - 10https://gerrit.wikimedia.org/r/173197 (owner: 10Dzahn) [23:49:20] <^d> RoanKattouw: Were you taking swat today because you have half the patches? [23:49:42] <^d> (if not I can totally do it) [23:50:46] ^d: I was going to initially, but I think it'll be hard [23:50:53] I'm in the middle of some conversations that will take >10 mins [23:51:11] <^d> I can do it, no worries :) [23:51:20] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:51:28] <^d> bleh, stupid gitblit. [23:51:32] (03PS1) 10Ori.livneh: Correct Travis CI URL in README.md [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 [23:51:48] (03CR) 10Ori.livneh: [C: 032] "README-only" [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 (owner: 10Ori.livneh) [23:52:04] (03Merged) 10jenkins-bot: Correct Travis CI URL in README.md [debs/pybal] - 10https://gerrit.wikimedia.org/r/173200 (owner: 10Ori.livneh) [23:52:34] <^d> !log restarted gitblit on antimony [23:52:37] Logged the message, Master [23:54:40] <^d> ebernhardson: ping for swat in ~5m [23:55:04] !log nickel - remove from puppet,salt,icinga,stop services... [23:55:07] Logged the message, Master [23:55:11] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57893 bytes in 0.497 second response time [23:55:15] ^d: kk [23:55:30] <^d> Sweet, now we just need an S and we'll be set :) [23:57:35] <^d> ebernhardson: merged your two patches, prepping submodule updates now. [23:58:53] <^d> RoanKattouw: I'm doing you last since I don't have links for yours yet. [23:59:14] Oh crap! [23:59:18] I promised I'd build those and totally forgot [23:59:20] Will build them now [23:59:22] <^d> kk