[00:04:10] Morning guys, awjr suggested a message here - bugzilla seems to be having issues. Viewing pages is fine, cannot submit bugs at the moment, the connection keeps getting reset during submission [00:04:38] I'm trying to submit in the category Wikimedia Mobile, under devices [00:07:05] never mind, submission finally went through on the 10th attempt - may want to keep an eye on it though :) [00:11:38] PROBLEM - Lucene on searchidx1001 is CRITICAL: Connection refused [00:18:03] nagios-wm: why is that check back again ? argg [00:28:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.171 seconds [01:09:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:57] New review: Dzahn; "ok, going to do this (expect NRPE breakage, will stop nagios-wm temp.)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3144 [01:11:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3144 [01:12:01] !log stopping nagios-wm temp. while changing nrpe config (will watch it manually until it's back) [01:12:29] Logged the message, Master [01:17:16] hmm.. not consistent. when i expect it to break of course it doesnt.. or not an all hosts [01:35:36] you want more flood?! [01:37:44] heh, no;) i just want a reproducable error [01:38:16] computers and "sometimes" = sigh :p [01:50:03] New patchset: Dzahn; "add the nagios-nrpe-server init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233 [01:50:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3233 [01:52:44] New patchset: Dzahn; "add the nagios-nrpe-server init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233 [01:52:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3233 [01:54:09] mutante: maybe you want mcollective? ;) [01:54:31] (not that i know enough about it to know if it's the right choice) [01:56:15] eh for what? to restart the service on all hosts at once if it should break again? [01:56:23] to replace dsh? [01:57:48] aah, there we go with the expected breakage.. bbiaw:) restarting a bit [02:11:26] mutante: well idk enough about what dsh can do. mcollective could set services or whole boxes to be in downtime, make it's change, switch back [02:11:35] (in nagios) [02:17:59] ah, i see what you mean, yeah, we could set scheduled downtimes for all on the nagios host itself though , while dsh (and mcollective afaik) is to execute stuff on a lot of servers at a time [02:20:49] mutante: mcollective could do the nagios changes for you though [02:21:22] mutante: and it could do say up to X at a time, make sure they're working right and do some more [02:21:39] it is just an action on a single host though [02:22:00] it didn't require puppet runs to propagate? [02:22:19] not setting downtimes, no [02:22:29] oh [02:22:40] yeah, of course. i thought you meant the nrpe change [02:24:20] mcollective could make the change in batches or at least limit concurrency and have a verification step as it goes before moving on to other boxes. and it could do the nagios change itself so that the window when the downtime is in effect (and maybe suppressing alerts) is smaller [02:24:53] anyway, i really don't know enough about dsh. and i know there's others i haven't tried yet either like clusterssh [02:28:16] jeremyb: the main issue with dsh is currently that you have "group" files with the hostnames in it, but these are not yet being auto-created, so they are outdated. we would like those to be created from puppet data [02:28:48] so i run my own little loop to get hosts from nagios config .. [02:29:40] but then there are also still some hosts that need to be manually removed, like pseudo hosts that just exist for monitoring purposes, routers and sswitches..etc [02:29:58] well then what about people purposefully keeping hosts out of dsh? then they have to blacklist them in puppet conf? [02:30:49] shrug, some way those that are "real hosts" (servers) must be marked ..yea [02:36:42] RECOVERY - Disk space on mw1139 is OK: DISK OK [02:36:42] RECOVERY - Disk space on mw1074 is OK: DISK OK [02:36:42] RECOVERY - RAID on cp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [02:36:42] RECOVERY - DPKG on cp1015 is OK: All packages OK [02:36:42] RECOVERY - DPKG on virt3 is OK: All packages OK [02:36:43] RECOVERY - RAID on cp1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [02:36:51] RECOVERY - Disk space on mw1127 is OK: DISK OK [02:36:51] RECOVERY - RAID on ms5 is OK: OK: Active: 50, Working: 50, Failed: 0, Spare: 0 [02:36:51] RECOVERY - DPKG on mw1131 is OK: All packages OK [02:36:51] RECOVERY - DPKG on mw1112 is OK: All packages OK [02:36:51] RECOVERY - DPKG on virt4 is OK: All packages OK [02:36:52] RECOVERY - Disk space on mw1015 is OK: DISK OK [02:36:52] RECOVERY - RAID on mw1032 is OK: OK: no RAID installed [02:36:53] RECOVERY - Disk space on mw1106 is OK: DISK OK [02:36:53] RECOVERY - Disk space on mw1112 is OK: DISK OK [02:36:54] RECOVERY - Disk space on mw1100 is OK: DISK OK [02:36:54] RECOVERY - Disk space on cp1035 is OK: DISK OK [02:36:55] RECOVERY - DPKG on mw1067 is OK: All packages OK [02:36:55] RECOVERY - RAID on mw1109 is OK: OK: no RAID installed [02:37:20] at least that tells us when the puppet run is done on spence :) [02:45:24] !log killing nrpe on several hosts where it was running as the wrong user again (somehow through the use of dsh) [02:45:28] Logged the message, Master [02:45:44] as you? ;) [02:46:22] yes, but not on all [02:46:37] nagios-wm issues? [02:46:59] Krinkle: i stopped it to prevent flooding [02:47:20] k [02:47:26] before generating said flood [02:47:27] I'd quiet it but this works too [02:48:26] Krinkle: not a real outage, just an issue everytime we change the nrpe config.. will be back soon [02:48:37] I understand [02:49:14] just saying it might be easier (dunno if it's a simple command to restart the bot), to quiet it using irc flags [02:49:54] alright, i figure it's not -v because we are not moderated [02:50:03] indeed [02:50:06] hm, yeah, easy enough "ircecho stop|start" [02:50:08] +q (or -q) [02:50:11] alright [02:51:14] do i have the flags to do that ? [02:52:16] you have to op yerself to change flags on other users i would think [02:56:03] yeah, but i am not an op [02:56:34] it's ok though, easy enough to start the bot [02:58:23] bots don't necessarily behave well after unquieting [03:03:59] should be ok again [03:04:44] jeremyb: ? [03:04:52] revenge? [03:04:53] (even though i did not get my new service i wanted :( [03:05:20] Krinkle: huh? [03:05:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:05:34] jeremyb: "bots don't necessarily behave well after unquieting" [03:05:39] Krinkle: so? [03:05:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.148 seconds [03:05:45] jeremyb: what do you mean [03:06:09] i mean that they have an error and either die or just stop functioning but don't die [03:06:19] and you have to notice and boot them [03:06:48] I've never had that problem but I guess it could go wrong if the bot doesn't support it [03:07:12] i've seen it. can't remember which bots though [03:07:13] although I don't know if irc protocol has a problem with quieted users just keep sending messages [03:07:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.020 seconds [03:07:20] afaik it's no problem, just ignored until unquieted [03:07:34] well it's not just ignored [03:07:44] it's also told it couldn't post to the channel [03:07:48] right [03:08:18] so it'll get a lot of notices from one of the services [03:08:32] I can imagine it being a problem, just never encountered it. [03:09:01] cvn ([[m:CVN]] / #cvn-bots) has a lot of channels with different bots, regularly muted and unmuted [03:09:14] !log stafford - - /var/lib/puppet/reports is getting quite large (18G), and we got the first disk space warning, do we want to keep those? [03:09:18] Logged the message, Master [03:11:51] mutante: how far back do they go? [03:13:21] at quick glance, just 3 days or so [03:14:35] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3233 [03:14:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233 [03:14:44] then why would it be getting large? ;) [03:15:12] i guess if you nuke a machine and never make a new one with that name then the reports stick around [03:15:15] ? [03:15:35] because it writes one on every puppet run, and there are quite a few hosts? shrug [03:16:01] i didnt really check with find yet [03:16:05] maybe some are older [03:17:48] indeed there are older ones for some osts [03:23:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [03:23:54] no, theose are just directories, no file older than 3d [03:24:13] but if one .yaml file can sometimes be MBs, it just adds up [03:25:09] anyways, looks like they are rotated anyways and its just the very first warning [03:28:21] New patchset: Dzahn; "sleep 10 before cleaning up PID file, it seemed like sometimes it deletes the pid before the service has started and that might cause the failures" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234 [03:28:30] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [03:28:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3234 [03:30:56] New patchset: Dzahn; "sleep 10 before cleaning up PID file, it seemed like sometimes it deletes the pid before the service has stopped and that might cause the failures" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234 [03:31:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3234 [03:32:24] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3234 [03:32:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234 [03:34:24] PROBLEM - Puppet freshness on cp1021 is CRITICAL: Puppet has not run in the last 10 hours [03:35:27] RECOVERY - RAID on aluminium is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [03:35:36] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [03:36:39] PROBLEM - Puppet freshness on cp1041 is CRITICAL: Puppet has not run in the last 10 hours [03:36:48] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host [03:36:48] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host [03:36:57] ah :) [03:36:57] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:36:57] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:36:57] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host [03:36:57] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:36:57] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:36:57] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host [03:36:58] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined [03:36:58] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [03:37:15] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:37:15] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:37:15] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:37:15] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined [03:37:15] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [03:37:15] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [03:37:15] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host [03:37:16] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host [03:37:16] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host [03:38:03] this doesnt look like it i know, but its good news :) we got the swift process monitorign coming up [03:45:24] New patchset: Dzahn; "limit swift process monitoring to ms-be hosts because the testing machinges do not have nrpe installed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3235 [03:45:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3235 [03:46:38] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3235 [03:46:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3235 [04:19:49] !log on stafford, deleting spence's puppet report files to free some disk space (they are like the largest report files of all) [04:19:52] Logged the message, Master [04:21:07] makes sense [04:23:08] jeremyb: oh, remember these: < jeremyb> hrmmmm... i see no monitoring at all for some of the swift services < jeremyb> i.e. to make sure the container server and account server &c are actually running where they should be [04:23:37] yeah, you made an RT [04:23:44] and now you did the RT? ;) [04:23:46] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?servicegroup=swift&style=overview [04:23:56] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=ms-be1&style=detail [04:24:23] i forgot how many damned services there are [04:24:24] jeremyb: thats why i had to do the whole "nrpe" / kill bot / flooding .. everything is related :) [04:24:39] now i'm gonna eat and bbl [04:24:44] PROBLEM - SSH on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:24:45] *click* [04:25:47] PROBLEM - Backend Squid HTTP on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:05] PROBLEM - Frontend Squid HTTP on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:26:22] looks like that just died, anyways, bbl :) [04:48:17] PROBLEM - Host sq40 is DOWN: PING CRITICAL - Packet loss = 100% [04:49:42] nagios-wm: that took long enough [05:01:17] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host [05:01:17] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host [05:01:26] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:26] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:26] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host [05:01:26] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host [05:01:26] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:26] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [05:01:27] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:27] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined [05:01:32] here we go ;) [05:01:35] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host [05:01:35] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host [05:01:44] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:44] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:44] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:44] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:01:44] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [05:01:45] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined [05:01:53] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host [05:01:53] PROBLEM - swift-object-replicator on zinc is CRITICAL: Connection refused by host [05:02:02] PROBLEM - swift-account-server on magnesium is CRITICAL: Connection refused by host [05:02:02] PROBLEM - swift-account-auditor on copper is CRITICAL: Connection refused by host [05:02:02] PROBLEM - swift-container-server on copper is CRITICAL: Connection refused by host [05:02:02] PROBLEM - swift-object-server on zinc is CRITICAL: Connection refused by host [05:02:02] PROBLEM - swift-container-auditor on zinc is CRITICAL: Connection refused by host [05:02:11] PROBLEM - swift-container-replicator on ms3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [05:02:11] PROBLEM - swift-object-updater on ms3 is CRITICAL: NRPE: Command check_swift_object_updater not defined [05:02:11] PROBLEM - swift-object-server on magnesium is CRITICAL: Connection refused by host [05:02:11] PROBLEM - swift-account-reaper on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:11] PROBLEM - swift-account-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:12] PROBLEM - swift-object-replicator on magnesium is CRITICAL: Connection refused by host [05:02:20] PROBLEM - swift-container-updater on copper is CRITICAL: Connection refused by host [05:02:20] PROBLEM - swift-account-reaper on copper is CRITICAL: Connection refused by host [05:02:20] PROBLEM - swift-container-auditor on magnesium is CRITICAL: Connection refused by host [05:02:20] PROBLEM - swift-container-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:20] PROBLEM - swift-object-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:29] PROBLEM - swift-object-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:29] PROBLEM - swift-object-updater on magnesium is CRITICAL: Connection refused by host [05:02:29] PROBLEM - swift-container-replicator on magnesium is CRITICAL: Connection refused by host [05:02:29] PROBLEM - swift-account-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:29] PROBLEM - swift-account-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:29] PROBLEM - swift-object-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:02:29] PROBLEM - swift-account-auditor on ms3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [05:02:30] PROBLEM - swift-container-server on ms3 is CRITICAL: NRPE: Command check_swift_container_server not defined [05:02:38] PROBLEM - swift-account-auditor on zinc is CRITICAL: Connection refused by host [05:02:38] PROBLEM - swift-container-server on zinc is CRITICAL: Connection refused by host [05:02:38] PROBLEM - swift-object-updater on zinc is CRITICAL: Connection refused by host [05:02:38] PROBLEM - swift-container-replicator on zinc is CRITICAL: Connection refused by host [05:02:47] PROBLEM - swift-container-server on magnesium is CRITICAL: Connection refused by host [05:02:47] PROBLEM - swift-account-auditor on magnesium is CRITICAL: Connection refused by host [05:02:47] PROBLEM - swift-account-replicator on copper is CRITICAL: Connection refused by host [05:02:56] PROBLEM - swift-object-auditor on copper is CRITICAL: Connection refused by host [05:02:56] PROBLEM - swift-container-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:05] PROBLEM - swift-account-server on copper is CRITICAL: Connection refused by host [05:03:05] PROBLEM - swift-object-replicator on copper is CRITICAL: Connection refused by host [05:03:05] PROBLEM - swift-account-reaper on zinc is CRITICAL: Connection refused by host [05:03:05] PROBLEM - swift-container-updater on ms3 is CRITICAL: NRPE: Command check_swift_container_updater not defined [05:03:05] PROBLEM - swift-container-updater on zinc is CRITICAL: Connection refused by host [05:03:05] PROBLEM - swift-account-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:06] PROBLEM - swift-account-reaper on magnesium is CRITICAL: Connection refused by host [05:03:06] PROBLEM - swift-container-updater on magnesium is CRITICAL: Connection refused by host [05:03:07] PROBLEM - swift-object-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:14] PROBLEM - swift-object-auditor on ms3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [05:03:14] PROBLEM - swift-object-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:14] PROBLEM - swift-object-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:23] PROBLEM - swift-object-auditor on zinc is CRITICAL: Connection refused by host [05:03:23] PROBLEM - swift-account-replicator on zinc is CRITICAL: Connection refused by host [05:03:23] PROBLEM - swift-account-reaper on ms3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [05:03:23] PROBLEM - swift-container-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:23] PROBLEM - swift-container-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:23] PROBLEM - swift-object-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [05:03:24] PROBLEM - swift-account-replicator on ms3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [05:43:02] sigh, i had purged everything from db before... [05:44:27] was the food good at least? ;) [05:44:31] no :P [05:45:18] why not install nrpe on all hosts..hmm [05:45:37] but one by one,,why did it re-create those [06:13:37] New review: Dzahn; "looks good, i would just have called public-services-2 "208.80.153.192/26" = labs => public, but sam..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3115 [06:52:48] New patchset: Dzahn; "decommission dataset1, as it's dead per RT-1345" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3236 [06:53:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3236 [06:53:19] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host [06:53:19] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host [06:53:28] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host [06:53:28] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host [06:53:28] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:28] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:28] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:29] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:29] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [06:53:30] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined [06:53:37] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host [06:53:37] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host [06:53:46] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:46] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:46] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:46] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:53:46] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined [06:53:47] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [06:53:47] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host [06:53:55] PROBLEM - swift-object-replicator on zinc is CRITICAL: Connection refused by host [06:54:04] PROBLEM - swift-account-auditor on copper is CRITICAL: Connection refused by host [06:54:04] PROBLEM - swift-container-server on copper is CRITICAL: Connection refused by host [06:54:04] PROBLEM - swift-container-auditor on zinc is CRITICAL: Connection refused by host [06:54:04] PROBLEM - swift-object-server on zinc is CRITICAL: Connection refused by host [06:54:04] PROBLEM - swift-object-replicator on magnesium is CRITICAL: Connection refused by host [06:54:13] PROBLEM - swift-container-replicator on ms3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [06:54:13] PROBLEM - swift-object-updater on ms3 is CRITICAL: NRPE: Command check_swift_object_updater not defined [06:54:13] PROBLEM - swift-account-server on magnesium is CRITICAL: Connection refused by host [06:54:13] PROBLEM - swift-object-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:13] PROBLEM - swift-object-server on magnesium is CRITICAL: Connection refused by host [06:54:13] PROBLEM - swift-account-reaper on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:14] PROBLEM - swift-container-auditor on magnesium is CRITICAL: Connection refused by host [06:54:22] PROBLEM - swift-account-reaper on copper is CRITICAL: Connection refused by host [06:54:22] PROBLEM - swift-container-updater on copper is CRITICAL: Connection refused by host [06:54:22] PROBLEM - swift-container-replicator on zinc is CRITICAL: Connection refused by host [06:54:22] PROBLEM - swift-object-updater on zinc is CRITICAL: Connection refused by host [06:54:22] PROBLEM - swift-container-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:22] PROBLEM - swift-account-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:31] PROBLEM - swift-object-updater on magnesium is CRITICAL: Connection refused by host [06:54:31] PROBLEM - swift-account-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:31] PROBLEM - swift-container-replicator on magnesium is CRITICAL: Connection refused by host [06:54:31] PROBLEM - swift-object-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:31] PROBLEM - swift-account-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:31] PROBLEM - swift-object-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:31] PROBLEM - swift-account-auditor on ms3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [06:54:32] PROBLEM - swift-container-server on ms3 is CRITICAL: NRPE: Command check_swift_container_server not defined [06:54:40] PROBLEM - swift-container-server on zinc is CRITICAL: Connection refused by host [06:54:40] PROBLEM - swift-account-auditor on zinc is CRITICAL: Connection refused by host [06:54:49] PROBLEM - swift-object-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:49] PROBLEM - swift-object-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:49] PROBLEM - swift-account-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:54:49] PROBLEM - swift-account-reaper on ms3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [06:54:49] PROBLEM - swift-account-replicator on copper is CRITICAL: Connection refused by host [06:54:50] PROBLEM - swift-container-updater on ms3 is CRITICAL: NRPE: Command check_swift_container_updater not defined [06:54:58] PROBLEM - swift-object-auditor on copper is CRITICAL: Connection refused by host [06:55:07] PROBLEM - swift-container-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:55:07] PROBLEM - swift-account-auditor on magnesium is CRITICAL: Connection refused by host [06:55:07] PROBLEM - swift-object-replicator on copper is CRITICAL: Connection refused by host [06:55:07] PROBLEM - swift-account-server on copper is CRITICAL: Connection refused by host [06:55:07] PROBLEM - swift-container-updater on zinc is CRITICAL: Connection refused by host [06:55:07] PROBLEM - swift-account-reaper on zinc is CRITICAL: Connection refused by host [06:55:11] maan [06:55:16] PROBLEM - swift-object-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:55:16] PROBLEM - swift-object-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:55:16] PROBLEM - swift-container-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:55:16] PROBLEM - swift-container-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [06:55:16] PROBLEM - swift-account-replicator on ms3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [06:55:16] PROBLEM - swift-object-auditor on ms3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [06:55:16] PROBLEM - swift-account-reaper on magnesium is CRITICAL: Connection refused by host [06:55:25] PROBLEM - swift-container-server on magnesium is CRITICAL: Connection refused by host [06:55:25] PROBLEM - swift-object-auditor on zinc is CRITICAL: Connection refused by host [06:55:25] PROBLEM - swift-account-replicator on zinc is CRITICAL: Connection refused by host [06:55:25] PROBLEM - swift-container-updater on magnesium is CRITICAL: Connection refused by host [06:55:56] wth? [06:56:06] yea, indeed wth [06:56:15] purged from db9, removed from configs [06:56:19] but coming back [06:57:00] the other solution would be to just install nrpe and base on all hosts [06:57:08] but meh [06:58:56] why would be it still be coming back.. when it is NOT in the db anymore, not in the configs, and the puppet clearly says to do it only if $hostname =~ /^ms-be[19]$/ { [07:01:36] it does? [07:01:41] yes [07:01:53] thats what just happened above [07:02:19] no I mean it says do it only f the hostname matches thatpattern? [07:03:06] https://gerrit.wikimedia.org/r/#patch,sidebyside,3235,1,manifests/swift.pp [07:05:03] i checked db9 twice, and deleted the complete files for these hosts from puppet_checks.d and it happened twice already since that ..hhh [07:06:08] mutante: what do the new puppet_checks.d's say? [07:06:55] they have the services again [07:09:02] * jeremyb waits for browser [07:09:40] in db they just exist for ms-be1, they should for ms-be[1-5] ..blah [07:09:50] not even either/or there [07:38:14] New patchset: Dzahn; "decommission project2 per RT-2637" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3237 [07:38:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3237 [07:39:09] New patchset: Dzahn; "decommission project2 per RT-2637" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3237 [07:39:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3237 [07:51:16] !log replaced self-signed cert on wikitech with the star cert [07:51:20] Logged the message, Master [07:51:25] I thought [07:51:28] <--somebody could adjust the topic link now :) thx [07:51:36] we don't want the star cert there, it's an exernal site [07:51:51] arg, really? [07:51:53] yes [07:52:01] can you please undo that? sorry.... [07:52:07] but it is an ooold ticket and it was just mentioned again yesterday here [07:52:11] yeah but [07:52:11] hrmm.sure [07:52:51] there was "this will be moved to labs so we won't bother to buy a cert for it" [07:52:56] then there was some delay or other [07:53:22] now there is the discussion about a complete merge [07:54:07] but anyways that's why a cert has not yet been bought, sorry about that. someone should have written in the ticket that we don't want ths start cert out there I guess [07:55:13] i see the ticket you are referring to now [07:55:31] "pending discussion on migrating wikitech to labs wiki and such. While this is playing out, there is no point in purchasing a certificate " [07:55:37] yeah [07:55:53] but I don't know if it says explicitly that we don't want the start cert there [07:56:01] if not, it should [07:56:41] * apergos did not get anywhere looking at the nrpe configs and the puppet state and resources list on magnesium [07:56:41] "a 5 year non wildcard cert is 196 USD." [07:56:43] there's also a list thread [07:56:57] so I am giving it iup on the swift issue [07:57:04] i think gandi.net is relatively reasonable [07:57:08] cert that is [07:57:24] at least if godaddy is not considered [07:57:27] now of course they are talking about renamig it :_/ [07:57:31] no godaddy [07:57:39] i agree ;) [07:57:44] apergos: oh, i wasnt aware you were checking. thanks! yeah, i think it just needs like 3 puppet runs or so to forget about them.. hmmm [07:57:47] apergos: swift issue is nagios or what? [07:58:07] it's puppet/nagios/nrpe [07:58:09] mutante: then there's a missing dependancy [07:58:10] a nice little combo [07:58:32] apergos: yeah, just wanted to make sure it wasn't some *other* swift thing [07:58:33] jeremyb: missing dependecies usually break puppet runs [07:58:52] and tell you about them [07:59:27] not necessarily... [08:00:07] the official party line is that puppet should never need to be rerun to get a machine into the final state [08:00:25] and that if it does need multiple runs then there'd missing relationships [08:00:33] there's* [08:01:33] there must be some place it remembers the nagios_Service definitions from (at least for a little while), and that place cant be the files in /puppet_Checks.d/ and also not the mysql puppet db [08:01:52] or there is just some kind of delay, and that place is RAM [08:02:20] we will see if it happens again [08:05:47] ohhh, let me check something [08:06:19] the official party line is full of crap [08:06:20] :-P [08:06:39]