[00:04:10] Morning guys, awjr suggested a message here - bugzilla seems to be having issues. Viewing pages is fine, cannot submit bugs at the moment, the connection keeps getting reset during submission
[00:04:38] I'm trying to submit in the category Wikimedia Mobile, under devices
[00:07:05] never mind, submission finally went through on the 10th attempt - may want to keep an eye on it though :)
[00:11:38] PROBLEM - Lucene on searchidx1001 is CRITICAL: Connection refused
[00:18:03] nagios-wm: why is that check back again? argg
[00:28:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:34:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.171 seconds
[01:09:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:10:57] New review: Dzahn; "ok, going to do this (expect NRPE breakage, will stop nagios-wm temp.)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3144
[01:11:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3144
[01:12:01] !log stopping nagios-wm temp. while changing nrpe config (will watch it manually until it's back)
[01:12:29] Logged the message, Master
[01:17:16] hmm.. not consistent. when i expect it to break of course it doesn't.. or not on all hosts
[01:35:36] you want more flood?!
[01:37:44] heh, no ;) i just want a reproducible error
[01:38:16] computers and "sometimes" = sigh :p
[01:50:03] New patchset: Dzahn; "add the nagios-nrpe-server init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233
[01:50:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3233
[01:52:44] New patchset: Dzahn; "add the nagios-nrpe-server init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233
[01:52:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3233
[01:54:09] mutante: maybe you want mcollective? ;)
[01:54:31] (not that i know enough about it to know if it's the right choice)
[01:56:15] eh, for what? to restart the service on all hosts at once if it should break again?
[01:56:23] to replace dsh?
[01:57:48] aah, there we go with the expected breakage.. bbiaw :) restarting a bit
[02:11:26] mutante: well idk enough about what dsh can do. mcollective could set services or whole boxes to be in downtime, make its change, switch back
[02:11:35] (in nagios)
[02:17:59] ah, i see what you mean, yeah, we could set scheduled downtimes for all on the nagios host itself though, while dsh (and mcollective afaik) is to execute stuff on a lot of servers at a time
[02:20:49] mutante: mcollective could do the nagios changes for you though
[02:21:22] mutante: and it could do say up to X at a time, make sure they're working right and do some more
[02:21:39] it is just an action on a single host though
[02:22:00] it didn't require puppet runs to propagate?
[02:22:19] not setting downtimes, no
[02:22:29] oh
[02:22:40] yeah, of course. i thought you meant the nrpe change
[02:24:20] mcollective could make the change in batches or at least limit concurrency and have a verification step as it goes before moving on to other boxes. and it could do the nagios change itself so that the window when the downtime is in effect (and maybe suppressing alerts) is smaller
[02:24:53] anyway, i really don't know enough about dsh. and i know there's others i haven't tried yet either, like clusterssh
[02:28:16] jeremyb: the main issue with dsh is currently that you have "group" files with the hostnames in it, but these are not yet being auto-created, so they are outdated. we would like those to be created from puppet data
[02:28:48] so i run my own little loop to get hosts from nagios config ..
[02:29:40] but then there are also still some hosts that need to be manually removed, like pseudo hosts that just exist for monitoring purposes, routers and switches.. etc
[02:29:58] well then what about people purposefully keeping hosts out of dsh? then they have to blacklist them in puppet conf?
[02:30:49] shrug, some way those that are "real hosts" (servers) must be marked.. yea
[02:36:42] RECOVERY - Disk space on mw1139 is OK: DISK OK
[02:36:42] RECOVERY - Disk space on mw1074 is OK: DISK OK
[02:36:42] RECOVERY - RAID on cp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[02:36:42] RECOVERY - DPKG on cp1015 is OK: All packages OK
[02:36:42] RECOVERY - DPKG on virt3 is OK: All packages OK
[02:36:43] RECOVERY - RAID on cp1013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[02:36:51] RECOVERY - Disk space on mw1127 is OK: DISK OK
[02:36:51] RECOVERY - RAID on ms5 is OK: OK: Active: 50, Working: 50, Failed: 0, Spare: 0
[02:36:51] RECOVERY - DPKG on mw1131 is OK: All packages OK
[02:36:51] RECOVERY - DPKG on mw1112 is OK: All packages OK
[02:36:51] RECOVERY - DPKG on virt4 is OK: All packages OK
[02:36:52] RECOVERY - Disk space on mw1015 is OK: DISK OK
[02:36:52] RECOVERY - RAID on mw1032 is OK: OK: no RAID installed
[02:36:53] RECOVERY - Disk space on mw1106 is OK: DISK OK
[02:36:53] RECOVERY - Disk space on mw1112 is OK: DISK OK
[02:36:54] RECOVERY - Disk space on mw1100 is OK: DISK OK
[02:36:54] RECOVERY - Disk space on cp1035 is OK: DISK OK
[02:36:55] RECOVERY - DPKG on mw1067 is OK: All packages OK
[02:36:55] RECOVERY - RAID on mw1109 is OK: OK: no RAID installed
[02:37:20] at least that tells us when the puppet run is done on spence :)
[02:45:24] !log killing nrpe on several hosts where it was running as the wrong user again (somehow through the use of dsh)
[02:45:28] Logged the message, Master
[02:45:44] as you? ;)
[02:46:22] yes, but not on all
[02:46:37] nagios-wm issues?
[02:46:59] Krinkle: i stopped it to prevent flooding
[02:47:20] k
[02:47:26] before generating said flood
[02:47:27] I'd quiet it but this works too
[02:48:26] Krinkle: not a real outage, just an issue every time we change the nrpe config.. will be back soon
[02:48:37] I understand
[02:49:14] just saying it might be easier (dunno if it's a simple command to restart the bot), to quiet it using irc flags
[02:49:54] alright, i figure it's not -v because we are not moderated
[02:50:03] indeed
[02:50:06] hm, yeah, easy enough "ircecho stop|start"
[02:50:08] +q (or -q)
[02:50:11] alright
[02:51:14] do i have the flags to do that?
[02:52:16] you have to op yerself to change flags on other users i would think
[02:56:03] yeah, but i am not an op
[02:56:34] it's ok though, easy enough to start the bot
[02:58:23] bots don't necessarily behave well after unquieting
[03:03:59] should be ok again
[03:04:44] jeremyb: ?
[03:04:52] revenge?
[03:04:53] (even though i did not get my new service i wanted :(
[03:05:20] Krinkle: huh?
[03:05:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:05:34] jeremyb: "bots don't necessarily behave well after unquieting"
[03:05:39] Krinkle: so?
[03:05:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.148 seconds
[03:05:45] jeremyb: what do you mean
[03:06:09] i mean that they have an error and either die or just stop functioning but don't die
[03:06:19] and you have to notice and boot them
[03:06:48] I've never had that problem but I guess it could go wrong if the bot doesn't support it
[03:07:12] i've seen it. can't remember which bots though
[03:07:13] although I don't know if the irc protocol has a problem with quieted users just keep sending messages
[03:07:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.020 seconds
[03:07:20] afaik it's no problem, just ignored until unquieted
[03:07:34] well it's not just ignored
[03:07:44] it's also told it couldn't post to the channel
[03:07:48] right
[03:08:18] so it'll get a lot of notices from one of the services
[03:08:32] I can imagine it being a problem, just never encountered it.
[03:09:01] cvn ([[m:CVN]] / #cvn-bots) has a lot of channels with different bots, regularly muted and unmuted
[03:09:14] !log stafford - /var/lib/puppet/reports is getting quite large (18G), and we got the first disk space warning, do we want to keep those?
[03:09:18] Logged the message, Master
[03:11:51] mutante: how far back do they go?
[03:13:21] at quick glance, just 3 days or so
[03:14:35] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3233
[03:14:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3233
[03:14:44] then why would it be getting large? ;)
[03:15:12] i guess if you nuke a machine and never make a new one with that name then the reports stick around
[03:15:15] ?
[03:15:35] because it writes one on every puppet run, and there are quite a few hosts? shrug
[03:16:01] i didn't really check with find yet
[03:16:05] maybe some are older
[03:17:48] indeed there are older ones for some hosts
[03:23:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[03:23:54] no, those are just directories, no file older than 3d
[03:24:13] but if one .yaml file can sometimes be MBs, it just adds up
[03:25:09] anyways, looks like they are rotated anyways and it's just the very first warning
[03:28:21] New patchset: Dzahn; "sleep 10 before cleaning up PID file, it seemed like sometimes it deletes the pid before the service has started and that might cause the failures" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234
[03:28:30] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours
[03:28:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3234
[03:30:56] New patchset: Dzahn; "sleep 10 before cleaning up PID file, it seemed like sometimes it deletes the pid before the service has stopped and that might cause the failures" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234
[03:31:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3234
[03:32:24] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3234
[03:32:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3234
[03:34:24] PROBLEM - Puppet freshness on cp1021 is CRITICAL: Puppet has not run in the last 10 hours
[03:35:27] RECOVERY - RAID on aluminium is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:35:36] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours
[03:36:39] PROBLEM - Puppet freshness on cp1041 is CRITICAL: Puppet has not run in the last 10 hours
[03:36:48] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host
[03:36:48] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host
[03:36:57] ah :)
[03:36:57] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:36:57] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:36:57] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host
[03:36:57] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:36:57] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:36:57] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host
[03:36:58] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined
[03:36:58] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[03:37:15] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:15] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:15] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:15] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined
[03:37:15] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[03:37:15] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[03:37:15] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host
[03:37:16] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host
[03:37:16] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host
[03:38:03] this doesn't look like it i know, but it's good news :) we got the swift process monitoring coming up
[03:45:24] New patchset: Dzahn; "limit swift process monitoring to ms-be hosts because the testing machines do not have nrpe installed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3235
[03:45:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3235
[03:46:38] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3235
[03:46:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3235
[04:19:49] !log on stafford, deleting spence's puppet report files to free some disk space (they are like the largest report files of all)
[04:19:52] Logged the message, Master
[04:21:07] makes sense
[04:23:08] jeremyb: oh, remember these: < jeremyb> hrmmmm... i see no monitoring at all for some of the swift services < jeremyb> i.e. to make sure the container server and account server &c are actually running where they should be
[04:23:37] yeah, you made an RT
[04:23:44] and now you did the RT? ;)
[04:23:46] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?servicegroup=swift&style=overview
[04:23:56] http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=ms-be1&style=detail
[04:24:23] i forgot how many damned services there are
[04:24:24] jeremyb: that's why i had to do the whole "nrpe" / kill bot / flooding.. everything is related :)
[04:24:39] now i'm gonna eat and bbl
[04:24:44] PROBLEM - SSH on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:24:45] *click*
[04:25:47] PROBLEM - Backend Squid HTTP on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:05] PROBLEM - Frontend Squid HTTP on sq40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:26:22] looks like that just died, anyways, bbl :)
[04:48:17] PROBLEM - Host sq40 is DOWN: PING CRITICAL - Packet loss = 100%
[04:49:42] nagios-wm: that took long enough
[05:01:17] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host
[05:01:17] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host
[05:01:26] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:26] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:26] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host
[05:01:26] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host
[05:01:26] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:26] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[05:01:27] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:27] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined
[05:01:32] here we go ;)
[05:01:35] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host
[05:01:35] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host
[05:01:44] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:44] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:44] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:44] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:01:44] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[05:01:45] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined
[05:01:53] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host
[05:01:53] PROBLEM - swift-object-replicator on zinc is CRITICAL: Connection refused by host
[05:02:02] PROBLEM - swift-account-server on magnesium is CRITICAL: Connection refused by host
[05:02:02] PROBLEM - swift-account-auditor on copper is CRITICAL: Connection refused by host
[05:02:02] PROBLEM - swift-container-server on copper is CRITICAL: Connection refused by host
[05:02:02] PROBLEM - swift-object-server on zinc is CRITICAL: Connection refused by host
[05:02:02] PROBLEM - swift-container-auditor on zinc is CRITICAL: Connection refused by host
[05:02:11] PROBLEM - swift-container-replicator on ms3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[05:02:11] PROBLEM - swift-object-updater on ms3 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[05:02:11] PROBLEM - swift-object-server on magnesium is CRITICAL: Connection refused by host
[05:02:11] PROBLEM - swift-account-reaper on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:11] PROBLEM - swift-account-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:12] PROBLEM - swift-object-replicator on magnesium is CRITICAL: Connection refused by host
[05:02:20] PROBLEM - swift-container-updater on copper is CRITICAL: Connection refused by host
[05:02:20] PROBLEM - swift-account-reaper on copper is CRITICAL: Connection refused by host
[05:02:20] PROBLEM - swift-container-auditor on magnesium is CRITICAL: Connection refused by host
[05:02:20] PROBLEM - swift-container-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:20] PROBLEM - swift-object-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:29] PROBLEM - swift-object-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:29] PROBLEM - swift-object-updater on magnesium is CRITICAL: Connection refused by host
[05:02:29] PROBLEM - swift-container-replicator on magnesium is CRITICAL: Connection refused by host
[05:02:29] PROBLEM - swift-account-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:29] PROBLEM - swift-account-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:29] PROBLEM - swift-object-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:02:29] PROBLEM - swift-account-auditor on ms3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[05:02:30] PROBLEM - swift-container-server on ms3 is CRITICAL: NRPE: Command check_swift_container_server not defined
[05:02:38] PROBLEM - swift-account-auditor on zinc is CRITICAL: Connection refused by host
[05:02:38] PROBLEM - swift-container-server on zinc is CRITICAL: Connection refused by host
[05:02:38] PROBLEM - swift-object-updater on zinc is CRITICAL: Connection refused by host
[05:02:38] PROBLEM - swift-container-replicator on zinc is CRITICAL: Connection refused by host
[05:02:47] PROBLEM - swift-container-server on magnesium is CRITICAL: Connection refused by host
[05:02:47] PROBLEM - swift-account-auditor on magnesium is CRITICAL: Connection refused by host
[05:02:47] PROBLEM - swift-account-replicator on copper is CRITICAL: Connection refused by host
[05:02:56] PROBLEM - swift-object-auditor on copper is CRITICAL: Connection refused by host
[05:02:56] PROBLEM - swift-container-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:05] PROBLEM - swift-account-server on copper is CRITICAL: Connection refused by host
[05:03:05] PROBLEM - swift-object-replicator on copper is CRITICAL: Connection refused by host
[05:03:05] PROBLEM - swift-account-reaper on zinc is CRITICAL: Connection refused by host
[05:03:05] PROBLEM - swift-container-updater on ms3 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[05:03:05] PROBLEM - swift-container-updater on zinc is CRITICAL: Connection refused by host
[05:03:05] PROBLEM - swift-account-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:06] PROBLEM - swift-account-reaper on magnesium is CRITICAL: Connection refused by host
[05:03:06] PROBLEM - swift-container-updater on magnesium is CRITICAL: Connection refused by host
[05:03:07] PROBLEM - swift-object-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:14] PROBLEM - swift-object-auditor on ms3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[05:03:14] PROBLEM - swift-object-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:14] PROBLEM - swift-object-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:23] PROBLEM - swift-object-auditor on zinc is CRITICAL: Connection refused by host
[05:03:23] PROBLEM - swift-account-replicator on zinc is CRITICAL: Connection refused by host
[05:03:23] PROBLEM - swift-account-reaper on ms3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[05:03:23] PROBLEM - swift-container-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:23] PROBLEM - swift-container-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:23] PROBLEM - swift-object-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[05:03:24] PROBLEM - swift-account-replicator on ms3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[05:43:02] sigh, i had purged everything from db before...
[05:44:27] was the food good at least? ;)
[05:44:31] no :P
[05:45:18] why not install nrpe on all hosts.. hmm
[05:45:37] but one by one.. why did it re-create those
[06:13:37] New review: Dzahn; "looks good, i would just have called public-services-2 "208.80.153.192/26" = labs => public, but sam..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3115
[06:52:48] New patchset: Dzahn; "decommission dataset1, as it's dead per RT-1345" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3236
[06:53:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3236
[06:53:19] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host
[06:53:19] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host
[06:53:28] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host
[06:53:28] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host
[06:53:28] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:28] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:28] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:29] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:29] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined
[06:53:30] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined
[06:53:37] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host
[06:53:37] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host
[06:53:46] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:46] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:46] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:46] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:53:46] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined
[06:53:47] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined
[06:53:47] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host
[06:53:55] PROBLEM - swift-object-replicator on zinc is CRITICAL: Connection refused by host
[06:54:04] PROBLEM - swift-account-auditor on copper is CRITICAL: Connection refused by host
[06:54:04] PROBLEM - swift-container-server on copper is CRITICAL: Connection refused by host
[06:54:04] PROBLEM - swift-container-auditor on zinc is CRITICAL: Connection refused by host
[06:54:04] PROBLEM - swift-object-server on zinc is CRITICAL: Connection refused by host
[06:54:04] PROBLEM - swift-object-replicator on magnesium is CRITICAL: Connection refused by host
[06:54:13] PROBLEM - swift-container-replicator on ms3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined
[06:54:13] PROBLEM - swift-object-updater on ms3 is CRITICAL: NRPE: Command check_swift_object_updater not defined
[06:54:13] PROBLEM - swift-account-server on magnesium is CRITICAL: Connection refused by host
[06:54:13] PROBLEM - swift-object-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:13] PROBLEM - swift-object-server on magnesium is CRITICAL: Connection refused by host
[06:54:13] PROBLEM - swift-account-reaper on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:14] PROBLEM - swift-container-auditor on magnesium is CRITICAL: Connection refused by host
[06:54:22] PROBLEM - swift-account-reaper on copper is CRITICAL: Connection refused by host
[06:54:22] PROBLEM - swift-container-updater on copper is CRITICAL: Connection refused by host
[06:54:22] PROBLEM - swift-container-replicator on zinc is CRITICAL: Connection refused by host
[06:54:22] PROBLEM - swift-object-updater on zinc is CRITICAL: Connection refused by host
[06:54:22] PROBLEM - swift-container-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:22] PROBLEM - swift-account-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:31] PROBLEM - swift-object-updater on magnesium is CRITICAL: Connection refused by host
[06:54:31] PROBLEM - swift-account-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:31] PROBLEM - swift-container-replicator on magnesium is CRITICAL: Connection refused by host
[06:54:31] PROBLEM - swift-object-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:31] PROBLEM - swift-account-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:31] PROBLEM - swift-object-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:31] PROBLEM - swift-account-auditor on ms3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined
[06:54:32] PROBLEM - swift-container-server on ms3 is CRITICAL: NRPE: Command check_swift_container_server not defined
[06:54:40] PROBLEM - swift-container-server on zinc is CRITICAL: Connection refused by host
[06:54:40] PROBLEM - swift-account-auditor on zinc is CRITICAL: Connection refused by host
[06:54:49] PROBLEM - swift-object-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:49] PROBLEM - swift-object-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:49] PROBLEM - swift-account-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:54:49] PROBLEM - swift-account-reaper on ms3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined
[06:54:49] PROBLEM - swift-account-replicator on copper is CRITICAL: Connection refused by host
[06:54:50] PROBLEM - swift-container-updater on ms3 is CRITICAL: NRPE: Command check_swift_container_updater not defined
[06:54:58] PROBLEM - swift-object-auditor on copper is CRITICAL: Connection refused by host
[06:55:07] PROBLEM - swift-container-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:55:07] PROBLEM - swift-account-auditor on magnesium is CRITICAL: Connection refused by host
[06:55:07] PROBLEM - swift-object-replicator on copper is CRITICAL: Connection refused by host
[06:55:07] PROBLEM - swift-account-server on copper is CRITICAL: Connection refused by host
[06:55:07] PROBLEM - swift-container-updater on zinc is CRITICAL: Connection refused by host
[06:55:07] PROBLEM - swift-account-reaper on zinc is CRITICAL: Connection refused by host
[06:55:11] maan
[06:55:16] PROBLEM - swift-object-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:55:16] PROBLEM - swift-object-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:55:16] PROBLEM - swift-container-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:55:16] PROBLEM - swift-container-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:55:16] PROBLEM - swift-account-replicator on ms3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined
[06:55:16] PROBLEM - swift-object-auditor on ms3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined
[06:55:16] PROBLEM - swift-account-reaper on magnesium is CRITICAL: Connection refused by host
[06:55:25] PROBLEM - swift-container-server on magnesium is CRITICAL: Connection refused by host
[06:55:25] PROBLEM - swift-object-auditor on zinc is CRITICAL: Connection refused by host
[06:55:25] PROBLEM - swift-account-replicator on zinc is CRITICAL: Connection refused by host
[06:55:25] PROBLEM - swift-container-updater on magnesium is CRITICAL: Connection refused by host
[06:55:56] wth?
[06:56:06] yea, indeed wth
[06:56:15] purged from db9, removed from configs
[06:56:19] but coming back
[06:57:00] the other solution would be to just install nrpe and base on all hosts
[06:57:08] but meh
[06:58:56] why would it still be coming back.. when it is NOT in the db anymore, not in the configs, and the puppet clearly says to do it only if $hostname =~ /^ms-be[19]$/ {
[07:01:36] it does?
[07:01:41] yes
[07:01:53] that's what just happened above
[07:02:19] no I mean it says do it only if the hostname matches that pattern?
[07:03:06] https://gerrit.wikimedia.org/r/#patch,sidebyside,3235,1,manifests/swift.pp
[07:05:03] i checked db9 twice, and deleted the complete files for these hosts from puppet_checks.d and it happened twice already since that.. hhh
[07:06:08] mutante: what do the new puppet_checks.d's say?
[07:06:55] they have the services again [07:09:02] * jeremyb waits for browser [07:09:40] in db they just exist for ms-be1, they should for ms-be[1-5] ..blah [07:09:50] not even either/or there [07:38:14] New patchset: Dzahn; "decommission project2 per RT-2637" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3237 [07:38:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3237 [07:39:09] New patchset: Dzahn; "decommission project2 per RT-2637" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3237 [07:39:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3237 [07:51:16] !log replaced self-signed cert on wikitech with the star cert [07:51:20] Logged the message, Master [07:51:25] I thought [07:51:28] <--somebody could adjust the topic link now :) thx [07:51:36] we don't want the star cert there, it's an exernal site [07:51:51] arg, really? [07:51:53] yes [07:52:01] can you please undo that? sorry.... [07:52:07] but it is an ooold ticket and it was just mentioned again yesterday here [07:52:11] yeah but [07:52:11] hrmm.sure [07:52:51] there was "this will be moved to labs so we won't bother to buy a cert for it" [07:52:56] then there was some delay or other [07:53:22] now there is the discussion about a complete merge [07:54:07] but anyways that's why a cert has not yet been bought, sorry about that. someone should have written in the ticket that we don't want ths start cert out there I guess [07:55:13] i see the ticket you are referring to now [07:55:31] "pending discussion on migrating wikitech to labs wiki and such. 
While this is playing out, there is no point in purchasing a certificate " [07:55:37] yeah [07:55:53] but I don't know if it says explicitly that we don't want the star cert there [07:56:01] if not, it should [07:56:41] * apergos did not get anywhere looking at the nrpe configs and the puppet state and resources list on magnesium [07:56:41] "a 5 year non wildcard cert is 196 USD." [07:56:43] there's also a list thread [07:56:57] so I am giving it up on the swift issue [07:57:04] i think gandi.net is relatively reasonable [07:57:08] cert that is [07:57:24] at least if godaddy is not considered [07:57:27] now of course they are talking about renaming it :_/ [07:57:31] no godaddy [07:57:39] i agree ;) [07:57:44] apergos: oh, i wasnt aware you were checking. thanks! yeah, i think it just needs like 3 puppet runs or so to forget about them.. hmmm [07:57:47] apergos: swift issue is nagios or what? [07:58:07] it's puppet/nagios/nrpe [07:58:09] mutante: then there's a missing dependency [07:58:10] a nice little combo [07:58:32] apergos: yeah, just wanted to make sure it wasn't some *other* swift thing [07:58:33] jeremyb: missing dependencies usually break puppet runs [07:58:52] and tell you about them [07:59:27] not necessarily...
[08:00:07] the official party line is that puppet should never need to be rerun to get a machine into the final state [08:00:25] and that if it does need multiple runs then there'd missing relationships [08:00:33] there's* [08:01:33] there must be some place it remembers the nagios_Service definitions from (at least for a little while), and that place cant be the files in /puppet_Checks.d/ and also not the mysql puppet db [08:01:52] or there is just some kind of delay, and that place is RAM [08:02:20] we will see if it happens again [08:05:47] ohhh, let me check something [08:06:19] the official party line is full of crap [08:06:20] :-P [08:06:39] hah [08:07:25] !log i reverted that (star cert for wikitech), no worries i "shred"ded the files [08:07:28] Logged the message, Master [08:07:58] mutante: what exactly did you do to db9? [08:08:42] delete from resources where title like "magne%swift%"; etc [08:09:01] which i checked before to be all type "Nagios_service" [08:09:09] "restype" that is [08:09:24] db puppet of course [08:09:39] deleted 12 per host [08:10:02] this is what table? [08:10:14] resources [08:10:41] select * from resources where title like "%swift%"; [08:23:10] New patchset: Dzahn; "replace all occurences of "*.wikimedia.org" with "star.wikimedia.org" per RT-2512" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [08:23:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3238 [08:24:33] mutante: lemme know when you have a minute to look at a dns change, if you don't mind [08:25:15] apergos: sure, on dobson i suppose? [08:25:25] sockpuppet [08:25:38] eh, yea, of course [08:25:39] /tmp/dnsdiff-atg.txt [08:25:51] ms1001 to public subnet. 
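[Editor's note: on the "missing relationships" point discussed above — a minimal, hypothetical Puppet sketch (resource names invented for illustration, not taken from swift.pp or the actual manifests) of the explicit ordering edges that let a manifest converge in a single run instead of needing several:]

```puppet
# Hypothetical sketch: without the require/subscribe edges below, puppet
# could try to configure nrpe before its package exists, leaving the run
# to converge only on a second pass.
package { 'nagios-nrpe-server':
  ensure => present,
}

file { '/etc/nagios/nrpe.d/swift.cfg':
  ensure  => present,
  source  => 'puppet:///files/nagios/swift.cfg',  # illustrative path
  require => Package['nagios-nrpe-server'],
}

service { 'nagios-nrpe-server':
  ensure    => running,
  subscribe => File['/etc/nagios/nrpe.d/swift.cfg'],
}
```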
[08:26:01] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 316 seconds [08:26:28] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 343 seconds [08:28:07] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 27 seconds [08:30:40] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [08:31:02] apergos: looks good to me [08:33:21] ok thanks [08:34:03] mutante: puppet version? [08:35:02] jeremyb: 2.7.7 [08:43:25] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.138 seconds response time. www.wikipedia.org returns 208.80.154.225 [08:43:58] wait... how long was dns out?? [08:44:07] I restarted it thinking that it was busted from my push [08:44:07] was ns2 really broken for 16 hrs? [08:44:16] I guess someone did not check it [08:44:23] 15 20:50:13 <+nagios-wm> PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [08:44:28] ugh [08:44:43] it should have nagged since then though [08:44:49] ok, gotta remind folks to do that test (assuming a push broke it) [08:44:55] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 182 seconds [08:45:08] what's the nag interval? 2 hrs? 
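[Editor's note: the "16 hrs" question above can be sanity-checked with arithmetic. Nagios's weekly availability figure for ns2, reported a few minutes later in this log as 99.942%, implies only about six minutes of recorded downtime for the week — consistent with the outage not having lasted anywhere near 16 hours:]

```python
# Convert a weekly availability percentage into implied downtime minutes.
WEEK_MINUTES = 7 * 24 * 60  # 10080 minutes in a week

def downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime implied by a weekly uptime percentage."""
    return (1 - uptime_percent / 100) * WEEK_MINUTES

# 99.942% weekly uptime implies roughly 5.8 minutes of downtime.
assert round(downtime_minutes(99.942), 1) == 5.8

# A genuine 16-hour outage would instead show about 90.5% weekly uptime.
assert round(100 * (1 - 16 * 60 / WEEK_MINUTES), 1) == 90.5
```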
[08:45:14] (that was UTC btw) [08:45:22] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 188 seconds [08:45:55] (err 12 hrs i mean) [08:46:05] it was out for [08:47:01] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 4 seconds [08:47:28] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [08:50:36] nah, that wasnt down for that long, nagios says "99.942% uptime" in the last week [08:50:42] (ns2) [08:50:55] yeah, well the log would say more [08:51:02] but my firefox is busted atm [08:55:26] hmm, actually it cant tell, undetermined, and just logged the restarts [08:57:17] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 208 seconds [08:58:56] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 271 seconds [09:00:44] New patchset: Dzahn; "do not permit root logins on bastion hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3239 [09:00:53] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3239 [09:02:36] New patchset: Dzahn; "do not permit root logins on bastion hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3239 [09:02:45] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3239 [09:05:14] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [09:05:31] New patchset: Dzahn; "do not permit root logins on bastion hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3239 [09:05:41] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [09:05:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3239 [09:07:12] mutante: see lib/puppet/face/node/clean.rb:113 [09:08:34] jeremyb: where? on any puppetmaster? dont see that path yet [09:08:45] mutante: yeah, sure [09:09:18] mutante: or just look at 2.7.7 on github [09:11:16] eh, "[ "exported=? AND host_id=?", true," ? [09:11:31] you say this is related to the nagios services? [09:11:39] or why [09:11:45] yes, nagios [09:15:47] so you say it is not exported? not sure i follow [09:16:32] well i don't *really* know how it works [09:16:46] but that function does do something besides just removing them [09:16:56] and presumably you didn't do that thing [09:18:18] changes ensure to absent afaict, well.. maybe.. but on the other hand we have been purging from db like this all the time [09:27:35] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:30:44] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [09:31:47] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:32:50] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [09:39:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [09:39:44] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [09:48:26] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [09:50:59] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [09:50:59] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [09:50:59]
PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 157 MB (2% inode=61%): /var/lib/ureadahead/debugfs 157 MB (2% inode=61%): [09:50:59] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 176 MB (2% inode=61%): /var/lib/ureadahead/debugfs 176 MB (2% inode=61%): [09:50:59] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [09:56:55] oouuuch [09:57:08] RECOVERY - Disk space on srv221 is OK: DISK OK [10:01:20] RECOVERY - Disk space on srv222 is OK: DISK OK [10:01:20] RECOVERY - Disk space on srv219 is OK: DISK OK [10:01:20] RECOVERY - Disk space on srv223 is OK: DISK OK [10:01:20] RECOVERY - Disk space on srv220 is OK: DISK OK [10:01:20] RECOVERY - Disk space on srv224 is OK: DISK OK [10:26:33] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:30:45] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:43:52] apergos, jeremyb: see, and now after another run its OK, finally did not create those services anymore. without further changes. (besides that the resources in db9 still dont seem to match ;) out.. laters [10:44:02] yay [10:55:44] New patchset: ArielGlenn; "vanilla stanza for ms1001 for a start" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3240 [10:55:57] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3240 [10:56:40] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3240 [10:56:42] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3240 [11:26:37] PROBLEM - swift-container-auditor on copper is CRITICAL: Connection refused by host [11:26:37] PROBLEM - swift-object-server on copper is CRITICAL: Connection refused by host [11:26:46] PROBLEM - swift-object-auditor on magnesium is CRITICAL: Connection refused by host [11:26:46] PROBLEM - swift-account-replicator on magnesium is CRITICAL: Connection refused by host [11:26:46] PROBLEM - swift-container-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:26:46] PROBLEM - swift-account-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:26:46] PROBLEM - swift-container-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:26:47] PROBLEM - swift-account-server on ms3 is CRITICAL: NRPE: Command check_swift_account_server not defined [11:26:47] PROBLEM - swift-object-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:26:48] PROBLEM - swift-object-replicator on ms3 is CRITICAL: NRPE: Command check_swift_object_replicator not defined [11:27:04] PROBLEM - swift-container-replicator on copper is CRITICAL: Connection refused by host [11:27:04] PROBLEM - swift-object-updater on copper is CRITICAL: Connection refused by host [11:27:04] PROBLEM - swift-object-replicator on zinc is CRITICAL: Connection refused by host [11:27:04] PROBLEM - swift-account-server on zinc is CRITICAL: Connection refused by host [11:27:04] PROBLEM - swift-container-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:27:04] PROBLEM - swift-account-reaper on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:05] PROBLEM - swift-account-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:05] PROBLEM - swift-object-server on ms3 is CRITICAL: NRPE: Command check_swift_object_server not defined [11:27:06] PROBLEM - swift-container-auditor on ms3 is CRITICAL: NRPE: Command check_swift_container_auditor not defined [11:27:06] PROBLEM - swift-container-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:22] PROBLEM - swift-account-server on magnesium is CRITICAL: Connection refused by host [11:27:22] PROBLEM - swift-container-server on copper is CRITICAL: Connection refused by host [11:27:22] PROBLEM - swift-account-auditor on copper is CRITICAL: Connection refused by host [11:27:22] PROBLEM - swift-container-auditor on zinc is CRITICAL: Connection refused by host [11:27:22] PROBLEM - swift-object-server on zinc is CRITICAL: Connection refused by host [11:27:31] PROBLEM - swift-object-server on magnesium is CRITICAL: Connection refused by host [11:27:31] PROBLEM - swift-object-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:31] PROBLEM - swift-account-reaper on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:31] PROBLEM - swift-container-auditor on magnesium is CRITICAL: Connection refused by host [11:27:31] PROBLEM - swift-container-updater on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:31] PROBLEM - swift-account-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:27:31] PROBLEM - swift-container-replicator on ms3 is CRITICAL: NRPE: Command check_swift_container_replicator not defined [11:27:32] PROBLEM - swift-object-updater on ms3 is CRITICAL: NRPE: Command check_swift_object_updater not defined [11:27:32] PROBLEM - swift-object-replicator on magnesium is CRITICAL: Connection refused by host [11:27:40] PROBLEM - swift-account-reaper on copper is CRITICAL: Connection refused by host [11:27:40] PROBLEM - swift-container-updater on copper is CRITICAL: Connection refused by host [11:27:49] PROBLEM - swift-object-updater on magnesium is CRITICAL: Connection refused by host [11:27:49] PROBLEM - swift-account-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:49] PROBLEM - swift-container-replicator on magnesium is CRITICAL: Connection refused by host [11:27:49] PROBLEM - swift-object-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:49] PROBLEM - swift-account-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:27:50] PROBLEM - swift-container-server on ms3 is CRITICAL: NRPE: Command check_swift_container_server not defined [11:27:50] PROBLEM - swift-object-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:27:51] PROBLEM - swift-account-auditor on ms3 is CRITICAL: NRPE: Command check_swift_account_auditor not defined [11:27:58] PROBLEM - swift-container-replicator on zinc is CRITICAL: Connection refused by host [11:28:07] PROBLEM - swift-object-auditor on copper is CRITICAL: Connection refused by host [11:28:07] PROBLEM - swift-account-replicator on copper is CRITICAL: Connection refused by host [11:28:07] PROBLEM - swift-container-server on zinc is CRITICAL: Connection refused by host [11:28:07] PROBLEM - swift-account-auditor on zinc is CRITICAL: Connection refused by host [11:28:16] PROBLEM - swift-account-auditor on magnesium is CRITICAL: Connection refused by host [11:28:16] PROBLEM - swift-container-server on magnesium is CRITICAL: Connection refused by host [11:28:16] PROBLEM - swift-container-auditor on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:16] PROBLEM - swift-object-server on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:16] PROBLEM - swift-object-replicator on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:17] PROBLEM - swift-container-updater on ms3 is CRITICAL: NRPE: Command check_swift_container_updater not defined [11:28:17] PROBLEM - swift-account-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:28:18] PROBLEM - swift-object-updater on zinc is CRITICAL: Connection refused by host [11:28:18] PROBLEM - swift-account-reaper on ms3 is CRITICAL: NRPE: Command check_swift_account_reaper not defined [11:28:25] PROBLEM - swift-account-server on copper is CRITICAL: Connection refused by host [11:28:25] PROBLEM - swift-object-replicator on copper is CRITICAL: Connection refused by host [11:28:25] PROBLEM - swift-account-reaper on zinc is CRITICAL: Connection refused by host [11:28:25] PROBLEM - swift-container-updater on zinc is CRITICAL: Connection refused by host [11:28:34] PROBLEM - swift-container-updater on magnesium is CRITICAL: Connection refused by host [11:28:34] PROBLEM - swift-account-reaper on magnesium is CRITICAL: Connection refused by host [11:28:34] PROBLEM - swift-object-updater on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:34] PROBLEM - swift-container-replicator on ms1 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:34] PROBLEM - swift-object-server on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:28:35] PROBLEM - swift-container-auditor on ms2 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:28:35] PROBLEM - swift-object-auditor on ms3 is CRITICAL: NRPE: Command check_swift_object_auditor not defined [11:28:36] PROBLEM - swift-account-replicator on ms3 is CRITICAL: NRPE: Command check_swift_account_replicator not defined [11:28:43] PROBLEM - swift-account-replicator on zinc is CRITICAL: Connection refused by host [11:28:43] PROBLEM - swift-object-auditor on zinc is CRITICAL: Connection refused by host [12:39:10] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [13:25:13] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [13:30:10] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [13:36:11] PROBLEM - Puppet freshness on cp1021 is CRITICAL: Puppet has not run in the last 10 hours [13:37:23] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [13:38:26] PROBLEM - Puppet freshness on cp1041 is CRITICAL: Puppet has not run in the last 10 hours [13:39:29] PROBLEM - Puppet freshness on cp1027 is CRITICAL: Puppet has not run in the last 10 hours [13:39:29] PROBLEM - Puppet freshness on cp1025 is CRITICAL: Puppet has not run in the last 10 hours [13:44:26] PROBLEM - Puppet freshness on cp1024 is CRITICAL: Puppet has not run in the last 10 hours [13:46:23] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours [13:46:41] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:23] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [13:50:26] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [13:54:11] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor 
[13:57:29] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours [13:57:29] PROBLEM - Puppet freshness on cp1028 is CRITICAL: Puppet has not run in the last 10 hours [14:01:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.440 seconds [14:01:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.578 seconds [14:04:41] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:06:11] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:08:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:08:08] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [14:10:23] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:32] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.622 seconds [14:22:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:39:49] New patchset: Mark Bergsma; "Fix dependencies of varnish::logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3242 [14:40:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3242 [14:44:38] New review: Mark Bergsma; "First of all, why are root logins being disallowed? Was there a discussion about this somewhere?" 
[operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/3239 [14:45:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3242 [14:45:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3242 [14:47:04] RECOVERY - Puppet freshness on cp1021 is OK: puppet ran at Fri Mar 16 14:46:59 UTC 2012 [14:50:58] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Fri Mar 16 14:50:41 UTC 2012 [14:52:01] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Fri Mar 16 14:51:46 UTC 2012 [14:53:22] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [14:54:52] PROBLEM - Varnish HTTP upload-backend on cp1021 is CRITICAL: Connection refused [14:55:19] PROBLEM - DPKG on cp1021 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:57:34] RECOVERY - Puppet freshness on cp1027 is OK: puppet ran at Fri Mar 16 14:57:02 UTC 2012 [14:58:01] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Fri Mar 16 14:57:56 UTC 2012 [14:59:04] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Fri Mar 16 14:58:53 UTC 2012 [14:59:04] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.055 seconds [15:04:55] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 183 seconds [15:06:34] RECOVERY - Puppet freshness on cp1022 is OK: puppet ran at Fri Mar 16 15:06:14 UTC 2012 [15:06:52] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 202 seconds [15:07:17] notpeter: busy? want to work on paging problem? 
[15:07:28] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [15:08:40] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [15:09:07] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [15:10:37] PROBLEM - Host cp1023 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:46] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [15:11:22] RECOVERY - Host cp1023 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [15:13:46] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 182 seconds [15:14:13] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused [15:15:07] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:15:25] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 203 seconds [15:15:52] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:16:01] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:16:37] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:16:37] PROBLEM - Varnish HTTP upload-frontend on cp1028 is CRITICAL: Connection refused [15:16:55] PROBLEM - Varnish HTTP upload-frontend on cp1024 is CRITICAL: Connection refused [15:16:55] PROBLEM - Varnish HTTP upload-frontend on cp1025 is CRITICAL: Connection refused [15:16:55] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: Connection refused [15:17:04] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:17:13] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:17:31] PROBLEM - Varnish HTTP upload-frontend on cp1026 is CRITICAL: Connection refused [15:17:40] 
PROBLEM - Varnish HTTP upload-frontend on cp1023 is CRITICAL: Connection refused [15:18:34] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.285 seconds [15:19:01] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [15:19:19] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa [15:20:31] RECOVERY - Puppet freshness on cp1024 is OK: puppet ran at Fri Mar 16 15:20:26 UTC 2012 [15:21:16] RECOVERY - Varnish HTTP upload-frontend on cp1024 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [15:21:43] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [15:22:10] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [15:22:37] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [15:22:55] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 2 processes with command name varnishncsa [15:23:13] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 2 processes with command name varnishncsa [15:24:34] RECOVERY - Puppet freshness on cp1023 is OK: puppet ran at Fri Mar 16 15:24:05 UTC 2012 [15:24:34] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 2 processes with command name varnishncsa [15:24:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:46] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [15:25:55] RECOVERY - Varnish HTTP upload-frontend on cp1023 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [15:26:04] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [15:28:28] RECOVERY - Puppet freshness on cp1028 is OK: puppet ran at Fri Mar 16 15:28:03 UTC 2012 [15:29:04] RECOVERY - Varnish HTTP upload-frontend on cp1028 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds 
[15:29:40] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 2 processes with command name varnishncsa [15:33:07] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Fri Mar 16 15:32:45 UTC 2012 [15:33:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.736 seconds [15:34:19] RECOVERY - Varnish HTTP upload-frontend on cp1026 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [15:34:55] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 2 processes with command name varnishncsa [15:37:55] RECOVERY - Puppet freshness on cp1025 is OK: puppet ran at Fri Mar 16 15:37:34 UTC 2012 [15:38:22] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 2 processes with command name varnishncsa [15:39:34] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.592 seconds [15:39:52] RECOVERY - Varnish HTTP upload-frontend on cp1025 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [15:39:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:13] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 237 MB (3% inode=61%): /var/lib/ureadahead/debugfs 237 MB (3% inode=61%): [15:48:49] !log cp1040 down for memory replacement [15:48:52] Logged the message, RobH [15:50:04] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.259 seconds [15:50:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.104 seconds [15:57:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:47] !log cp1040 repaired per rt 2611 [15:58:50] Logged the message, RobH [16:00:55] RECOVERY - Disk space on srv221 is OK: DISK OK [16:08:55] !log cp1019 coming down for 
memory error troubleshooting
[16:08:58] Logged the message, RobH
[16:09:38] !log Migrated all varnish3 packages to newer varnish packages from git
[16:09:41] Logged the message, Master
[16:14:59] New patchset: Mark Bergsma; "Package varnish is now installed everywhere instead of varnish3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3247
[16:15:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3247
[16:15:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3247
[16:15:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3247
[16:18:03] !log cp1019 memory error cleared after reseating, notes on rt 2651
[16:18:06] Logged the message, RobH
[16:19:04] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.44 ms
[16:20:34] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.279 seconds
[16:20:34] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.272 seconds
[16:20:48] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3237
[16:21:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3236
[16:21:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3237
[16:21:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3236
[16:22:20] !log cp1017 memory error, coming down for troubleshooting.
[16:22:24] Logged the message, RobH
[16:23:25] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused
[16:24:46] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused
[16:26:52] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:26:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:36:37] PROBLEM - DPKG on professor is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[16:39:28] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.162 seconds
[16:43:10] !log cp1019 back in full service
[16:43:13] Logged the message, RobH
[16:43:20] New patchset: Mark Bergsma; "Rename role/seach.pp to role/search.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3249
[16:43:34] New patchset: Mark Bergsma; "Retab search.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3250
[16:43:48] New patchset: Mark Bergsma; "The role classes are actually called role::lucene instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3251
[16:44:01] New patchset: Mark Bergsma; "Add review comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3252
[16:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3249
[16:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3250
[16:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3251
[16:44:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3252
[16:47:52] RECOVERY - Host cp1017 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms
[16:48:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2682
[16:48:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2682
[16:49:29] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3024
[16:49:36] notpeter: the tmobile device for paging, not working. let me know when you want to work on it
[16:50:48] cmjohnson1: yep. give me, like 45 minutes, if that works for you
[16:50:55] sure thing
[16:52:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3115
[16:52:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3115
[16:52:22] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: Connection refused
[16:53:07] !log cp1017 back in service pool
[16:53:10] Logged the message, RobH
[16:53:50] notpeter: i can bring down search1017 and search1018 right? (i see they have the os on them)
[16:53:58] RobH: yup
[16:54:04] cool
[16:54:11] New patchset: Mark Bergsma; "Move labs hosts subnet to private" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3253
[16:54:13] !log search1017 and search1018 coming down for hdd swap
[16:54:16] Logged the message, RobH
[16:54:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3253
[16:54:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3253
[16:54:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3253
[16:55:49] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:42] notpeter: question for you
[16:56:46] these servers can handle 4 disks
[16:56:48] sup
[16:56:51] did you want these to replace the OS disks
[16:56:58] or just add to them? i assumed replace
[16:56:59] yes
[16:57:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2788
[16:57:01] replace
[16:57:02] ok
[16:57:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2788
[16:57:54] thanks for fixing that up
[16:58:06] I read it like a dozen times but my eyes were going batty by the end
[16:58:13] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100%
[16:58:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3131
[16:58:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3131
[17:00:07] New review: Mark Bergsma; "Set -1 until git switchover" [operations/puppet] (production); V: -1 C: 0; - https://gerrit.wikimedia.org/r/2786
[17:11:13] notpeter: ok, hard disks installed and ready for the OS, all yers buddy
[17:11:33] !log hdd in search1017/1018 replaced per rt 2583
[17:11:36] Logged the message, RobH
[17:11:43] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.160 seconds
[17:12:34] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.063 seconds
[17:15:14] New review: Mark Bergsma; "That hostname check is not appropriate in swift.pp! Give the class a parameter, and call it as appro..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072
[17:17:06] RobH: woo woooo! awesome
[17:17:07] thank you!
[17:17:19] quite welcome
[17:20:34] New review: Mark Bergsma; "Please fix the indentation in iron's node entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3121
[17:23:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.209 seconds
[17:25:27] New review: Mark Bergsma; "Rob:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3122
[17:26:58] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds
[17:27:26] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3249
[17:27:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3249
[17:28:19] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[17:28:20] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3250
[17:28:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3250
[17:29:13] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3251
[17:29:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3251
[17:29:22] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[17:30:44] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3252
[17:32:26] New patchset: Ryan Lane; "Fixing labs/production split in gerrit reporting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3254
[17:32:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3254
[17:33:10] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3254
[17:33:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3254
[17:33:16] PROBLEM - RAID on search1018 is CRITICAL: Connection refused by host
[17:33:34] PROBLEM - SSH on search1018 is CRITICAL: Connection refused
[17:34:37] PROBLEM - DPKG on search1018 is CRITICAL: Connection refused by host
[17:35:13] PROBLEM - DPKG on search1017 is CRITICAL: Connection refused by host
[17:35:22] PROBLEM - RAID on search1017 is CRITICAL: Connection refused by host
[17:35:40] PROBLEM - SSH on search1017 is CRITICAL: Connection refused
[17:40:55] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused
[17:40:55] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused
[17:48:16] RECOVERY - SSH on search1017 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[17:48:34] RECOVERY - SSH on search1018 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[17:58:13] New review: Mark Bergsma; "Why is this done through a gazillion checkcommands instead of one which takes a parameter?!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3144
[18:00:12] New patchset: Mark Bergsma; "Revert "add the nagios-nrpe-server init file"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3259
[18:00:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3259
[18:00:25] PROBLEM - NTP on search1017 is CRITICAL: NTP CRITICAL: No response from NTP server
[18:00:50] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3252
[18:00:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3252
[18:03:07] New patchset: Mark Bergsma; "Revert "limit swift process monitoring to ms-be hosts because the testing machinges do not have nrpe installed"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3260
[18:03:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3260
[18:04:46] RECOVERY - DPKG on search1017 is OK: All packages OK
[18:05:04] RECOVERY - RAID on search1017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:05:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:06:25] RECOVERY - Disk space on search1017 is OK: DISK OK
[18:06:43] RECOVERY - NTP on search1017 is OK: NTP OK: Offset 0.1131283045 secs
[18:07:19] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:07:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.750 seconds
[18:07:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3259
[18:10:28] RECOVERY - Lucene on search1017 is OK: TCP OK - 0.027 second response time on port 8123
[18:15:07] RECOVERY - DPKG on search1018 is OK: All packages OK
[18:15:07] RECOVERY - RAID on search1018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:16:46] RECOVERY - Disk space on search1018 is OK: DISK OK
[18:18:53] New patchset: Mark Bergsma; "Revert "add the nagios-nrpe-server init file"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3261
[18:19:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3261
[18:19:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3261
[18:19:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3261
[18:19:40] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3259
[18:19:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3260
[18:19:55] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds
[18:19:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3260
[18:20:58] RECOVERY - Lucene on search1018 is OK: TCP OK - 0.027 second response time on port 8123
[18:37:48] New patchset: Anonymous Coward; "First stab at making the openstack version configurable." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3262
[18:38:00] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3262
[18:38:23] New review: Anonymous Coward; "(Just submitting for comments, please don't merge.)" [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/3262
[18:52:25] New patchset: Andrew Bogott; "First stab at making the openstack version configurable." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3262
[18:52:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3262
[19:05:19] New patchset: Lcarr; "fixing temp file permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3263
[19:05:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3263
[19:05:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3263
[19:06:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3263
[19:09:22] New patchset: Andrew Bogott; "First stab at making the openstack version configurable." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3262
[19:09:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3262
[19:10:19] Ryan_Lane, Does 3262 look like it'll do what I expect it to do? (That is: purge obsolete repos and apply the desired one generally?)
[19:10:49] lemme see
[19:11:23] oh
[19:11:26] hm
[19:11:40] yeah, it should
[19:11:48] I've got those 'obsolete' definitions in there, which aren't ever referred to elsewhere.
[19:11:54] that's a good way of handling it
[19:12:21] The next question is: How do I define the variable openstack_version?
[19:12:22] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3262
[19:12:34] can do it in the node entry
[19:12:48] ma rk will likely hate that
[19:13:02] Is that something I can hand-edit on a particular host?
[19:13:04] we need to restructure how these manifests work, though
[19:13:24] in labs, you add the variable as a project-specific variable
[19:13:26] who has the magic puppet touch? "err: /Stage[main]/Misc::Mwlib::Users/User[pp]/ensure: change from absent to present failed: Could not create user pp: Execution of '/usr/sbin/useradd -d /opt/pp -G pp -s /bin/bash -r pp' returned 9: useradd: group pp exists - if you want to add this user to that group, use -g."
[19:13:28] using "Manage puppet groups"
[19:13:53] Jeff_Green: did you use user or systemuser?
[19:14:06] user
[19:14:18] is this a system account, or a user account?
[19:14:34] those terms are not meaningful to me?
[19:14:48] but I suppose it's a system account by some definition
[19:15:01] system accounts use a different uid
[19:15:20] also, I don't know how ma rk feels, but I hate shared user accounts
[19:15:21] ok, so puppet declares the system range
[19:15:33] sec
[19:15:36] also, our puppet management of user accounts sucks
[19:15:45] no, the distro declares the system range
[19:15:45] it often runs into issues like this
[19:15:50] system => true,
[19:15:56] o.O
[19:16:18] i will hereby point out that that git+puppet+labs has cost me a day so far
[19:16:26] Jeff_Green: see: gerrit::account
[19:16:29] in gerrit.pp
[19:16:33] thx
[19:16:34] systemuser { gerrit2: name => "gerrit2", home => "/var/lib/gerrit2", shell => "/bin/bash" }
[19:16:45] system users should not use /home, btw :)
[19:16:56] i set it to use /opt/pp
[19:17:00] ah. ok
[19:17:03] user { "pp":
[19:17:03] name => "pp",
[19:17:03] home => "/opt/pp",
[19:17:03] shell => "/bin/bash",
[19:17:03] ensure => "present",
[19:17:03] groups => "pp",
[19:17:04] allowdupe => false,
[19:17:04] system => true,
[19:17:05] }
[19:17:05] sorry for the blat
[19:17:15] no worries
[19:17:19] on top of that: class misc::mwlib::users inherits misc::mwlib::groups{
[19:17:29] yeah, use gerrit::account as an example
[19:17:32] I know it works properly :)
[19:17:47] ok looking. thanks
[19:17:51] I think it does user/group automatically
[19:18:38] ok. food.
[19:19:02] Ryan_Lane: Before you vanish... can I go ahead and approve my own patches when I'm still developing?
[19:19:09] I don't know if I have privs to do that, actually....
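[Editor's note: the useradd failure Jeff_Green pastes above occurs because his manifest sets only `groups => "pp"` (supplementary, useradd `-G`) with no primary group, so the provider asks useradd to create a primary group named after the user, and Group["pp"] already exists. Besides the `systemuser`/`gerrit::account` route Ryan suggests, one fix in plain Puppet is to point the user's primary group at the existing group via `gid`, which maps to useradd's `-g`. A minimal standalone sketch, not the actual misc::mwlib manifests:]

```puppet
# Hypothetical sketch: reuse the pre-existing "pp" group as the
# user's primary group instead of letting useradd try to create one.
group { "pp":
    ensure => present,
    system => true,
}

user { "pp":
    ensure  => present,
    home    => "/opt/pp",   # system users stay out of /home, as noted above
    shell   => "/bin/bash",
    gid     => "pp",        # primary group -> useradd -g, avoids "group pp exists"
    system  => true,
    require => Group["pp"], # ensure the group is managed before the user
}
```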
[19:20:43] New review: Andrew Bogott; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3262
[19:22:09] New review: Andrew Bogott; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3262
[19:23:12] Ryan_Lane: gerrit::account doesn't appear to set a specific group
[19:26:31] New review: Andrew Bogott; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3262
[19:26:34] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3262
[19:32:39] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[19:34:36] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[19:41:30] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[19:41:30] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[20:41:19] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:49:34] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[20:52:18] New patchset: Lcarr; "Fixing icinga init file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3265
[20:52:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3265
[20:53:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3265
[20:53:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3265
[20:53:57] andrewbogott: hey
[20:54:12] i've got some apt::pparepo changes in the waiting to be merged
[20:54:26] are they good to merge?
[20:55:32] LeslieCarr: Fine with me -- I'm not very deep into anything so happy to resolve by hand if it conflicts.
[20:56:41] LeslieCarr: Or, did you want me to look at something in particular? (Looks like 3265 is already merged.)
[20:57:54] nope, just was on sockpuppet merging
[20:58:01] wanted to make sure it was intended to be merged
[20:58:16] New patchset: Lcarr; "fixing up icinga-init more" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3266
[20:58:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3266
[20:58:32] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3266
[20:58:35] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3266
[21:00:00] LeslieCarr: Um... ok, hang on, I think I don't know what you're talking about :/
[21:00:46] Did I push a change to production rather than test?
[21:01:30] I did. Crap.
[21:01:55] LeslieCarr: change 3262 is probably harmless but I did not mean it to go into production.
[21:02:43] * andrewbogott is getting all his dumb mistakes taken care of in one go.
[21:05:47] oh
[21:05:59] well well it will only break labs ;)
[21:06:17] so do a revert change and i'll push the reverting through
[21:09:26] New patchset: Andrew Bogott; "Revert "First stab at making the openstack version configurable."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3267
[21:09:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3267
[21:10:11] New review: Andrew Bogott; "(Reverting -- I really didn't intend for this to go into production)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3267
[21:10:13] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3267
[21:10:35] LeslieCarr: OK, reverted. Sorry for the mixup.
[21:20:52] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 310 seconds
[21:21:19] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 337 seconds
[21:21:22] New patchset: Lcarr; "more icinga tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3269
[21:21:24] !log running enwiki.revision sha1 schema migrations on eqiad side
[21:21:27] Logged the message, Master
[21:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3269
[21:21:55] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 373 seconds
[21:22:04] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 382 seconds
[21:22:13] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 390 seconds
[21:22:31] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 408 seconds
[21:23:30] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3269
[21:23:33] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3269
[21:43:45] New patchset: Lcarr; "fixing up apache config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3270
[21:43:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3270
[21:44:54] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3270
[21:44:57] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3270
[22:04:40] New patchset: Lcarr; "Fixing icinga files again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3272
[22:04:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3272
[22:40:40] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours
[22:42:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:44:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.387 seconds
[23:15:24] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused
[23:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:25:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.205 seconds
[23:27:06] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours