[00:22:02] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.35:11000 (timeout) [00:23:23] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [00:29:05] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [00:31:56] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [00:56:25] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [01:42:37] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 234 seconds [01:48:10] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 30 seconds [01:50:34] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [01:53:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [02:24:24] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [02:25:19] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [02:27:15] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [02:29:30] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [02:57:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [03:15:15] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [03:16:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [03:20:57] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [03:22:27] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [04:03:21] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [04:13:24] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [04:15:53] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [04:49:56] PROBLEM - Apache HTTP on srv245 is CRITICAL: Connection refused [04:53:32] PROBLEM - Host srv243 is DOWN: PING CRITICAL - Packet loss = 100% [04:55:38] RECOVERY - Host srv243 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [05:02:14] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.208:11000 (Connection refused) [05:04:17] PROBLEM - NTP on srv208 is CRITICAL: NTP CRITICAL: Offset unknown [05:04:53] PROBLEM - Apache HTTP on srv208 is CRITICAL: Connection refused [05:04:53] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [05:07:17] RECOVERY - NTP on srv208 is OK: NTP OK: Offset -0.0167388916 secs [05:10:53] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (timeout) [05:12:23] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [05:14:11] PROBLEM - Apache HTTP on srv244 is CRITICAL: Connection refused [05:14:47] RECOVERY - Apache HTTP on srv208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [05:15:32] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [05:17:47] PROBLEM - Host srv231 is DOWN: PING CRITICAL - Packet loss = 100% 
[05:18:59] RECOVERY - Host srv231 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [05:22:35] PROBLEM - Apache HTTP on srv231 is CRITICAL: Connection refused [05:25:17] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [05:28:53] PROBLEM - Apache HTTP on srv200 is CRITICAL: Connection refused [05:29:38] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [05:33:23] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [05:33:50] PROBLEM - Host srv210 is DOWN: PING CRITICAL - Packet loss = 100% [05:36:05] RECOVERY - Host srv210 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [05:39:05] PROBLEM - Apache HTTP on srv210 is CRITICAL: Connection refused [05:41:11] PROBLEM - Host srv236 is DOWN: PING CRITICAL - Packet loss = 100% [05:42:14] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.236:11000 (Connection refused) [05:43:26] RECOVERY - Host srv236 is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [05:46:26] PROBLEM - Apache HTTP on srv236 is CRITICAL: Connection refused [05:46:26] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [05:51:05] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [05:53:41] PROBLEM - Apache HTTP on srv213 is CRITICAL: Connection refused [05:53:41] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [05:58:20] PROBLEM - Host srv264 is DOWN: PING CRITICAL - Packet loss = 100% [05:59:23] RECOVERY - Apache HTTP on srv213 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [05:59:41] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [06:00:08] RECOVERY - Host srv264 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [06:00:15] * jeremyb wonders if apergos is up yet? [06:00:20] nagios is pretty chatty [06:01:02] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [06:03:35] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [06:04:20] PROBLEM - Apache HTTP on srv264 is CRITICAL: Connection refused [06:06:17] PROBLEM - Host srv258 is DOWN: PING CRITICAL - Packet loss = 100% [06:08:14] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.27:11000 (timeout) 10.0.8.23:11000 (timeout) [06:08:41] RECOVERY - Host srv258 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [06:11:05] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [06:11:50] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [06:12:44] PROBLEM - Apache HTTP on srv258 is CRITICAL: Connection refused [06:13:38] yes. is there an issue? [06:13:59] idk. just nagios seems to be extra flappy [06:14:04] it's not flappy [06:14:30] there's a job running that is rebooting some hosts that have been up a long time, doing them one at a time with a few minutes in between [06:15:23] how does that account for hosts that have alerted several times? [06:15:53] you'll see it be unavailable, then be up but no http, then http back [06:16:30] hrmm.
that is what happened for the particular host i'm looking back at now (236) [06:16:54] I did some manual reboots yesterday so I'm quite familiar with the pattern :-P [06:17:05] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [06:17:06] right, i noticed you did ;) [06:17:14] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused [06:17:40] but there was ~25 mins between HTTP conn refused (which was after ping came back) and HTTP recovery. sounds like a long time to me [06:18:05] that's because of how the recovery works [06:18:07] because puppet is what starts http back up [06:18:15] if someone reboots manually, [06:18:22] it ensures the mediawiki codebase is up to date before it starts apache [06:18:23] they (me) will hop on the host and force a puppet run [06:18:27] so that we rsync mw [06:18:44] and then it will start apache; otherwise you are waiting for the run to happen for up to 30 mins [06:19:02] aha [06:19:11] i didn't realize puppet was involved [06:19:12] in reality, it's not much of a problem [06:19:18] we don't have hosts try to force a puppet run immediately when they come back up, [06:19:33] the reboots are spaced 8 minutes apart [06:19:36] because imagine that several of them drop off and are rebooted simultaneously... [06:19:39] so, at most 3 hosts are down at a time [06:20:00] so there's about 6 more that are down cause they are down (and not the script) [06:20:01] and that's assuming all puppet runs take a full 30 minutes [06:20:03] which isn't true [06:20:07] I'll have a look at those now [06:20:23] yeah, it's possible puppet fails on some [06:20:41] these were down yesterday [06:20:47] I think they died before your script got to em [06:20:59] ssh -l root srv162.mgmt [06:20:59] ssh: connect to host srv162.mgmt port 22: Connection timed out [06:21:06] well that's another one I won't look at then [06:21:16] apergos: there's your good friend srv206 ;) [06:21:33] yeah, my good friend 206 is going to stay like it is [06:21:42] I'm talking about hosts actually down, not ssh-able [06:25:10] !powercycled srv284 (unresponsive at mgmt console) [06:25:20] PROBLEM - Apache HTTP on srv247 is CRITICAL: Connection refused [06:28:20] hmm looks like it doesn't want to come up. not getting out of the bios phase [06:29:23] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.11:11000 (Connection timed out) [06:30:08] PROBLEM - Host srv261 is DOWN: PING CRITICAL - Packet loss = 100% [06:31:02] * jeremyb wonders if there's some graceful way to cycle an entire memcached cluster [06:31:43] e.g. if you could pass the whole contents of the memcached to the new node for a given slot when it takes over (right before booting the old node for that slot)
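A hypothetical sketch of that idea, nothing that exists in the puppet repo: it assumes the key names are piped in from something like libmemcached's memdump (which comes up a few lines below) and that the python-memcached client is available. Flags and TTLs are not carried over, so warming the replacement node this way is only approximate.

    #!/usr/bin/env python
    # Hypothetical warm-up for a replacement memcached slot: read key names on
    # stdin (e.g. from `memdump --servers=OLD`) and copy each key's value from
    # the old node to the new one. TTLs and flags are not preserved.
    import sys

    import memcache  # python-memcached client, assumed to be installed

    def copy_keys(old_server, new_server, keys):
        src = memcache.Client([old_server])
        dst = memcache.Client([new_server])
        copied = 0
        for key in keys:
            key = key.strip()
            if not key:
                continue
            value = src.get(key)
            if value is None:
                continue  # expired or evicted since the key dump was taken
            dst.set(key, value)
            copied += 1
        return copied

    if __name__ == '__main__':
        # e.g.: memdump --servers=10.0.8.15:11000 | warm_new_node.py 10.0.8.15:11000 NEWHOST:11000
        old, new = sys.argv[1], sys.argv[2]
        print('copied %d keys from %s to %s' % (copy_keys(old, new, sys.stdin), old, new))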
[06:32:18] hmm it wants to reinstall [06:32:37] that's pretty odd [06:32:41] RECOVERY - Host srv261 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [06:33:44] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [06:34:20] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [06:35:26] hrmm, there's http://docs.libmemcached.org/bin/memdump.html and several other relevant google hits [06:35:50] PROBLEM - Apache HTTP on srv261 is CRITICAL: Connection refused [06:35:55] i guess with the new approach of trusting memcached less it doesn't matter so much [06:36:35] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [06:36:44] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [06:37:56] PROBLEM - Host srv267 is DOWN: PING CRITICAL - Packet loss = 100% [06:38:05] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.17:11000 (Connection timed out) [06:40:02] RECOVERY - Host srv267 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [06:40:56] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [06:44:32] PROBLEM - Apache HTTP on srv267 is CRITICAL: Connection refused [06:45:08] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.37:11000 (timeout) [06:48:08] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [06:48:44] RECOVERY - Apache HTTP on srv267 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [06:48:44] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [06:54:43] PROBLEM - Apache HTTP on srv246 is CRITICAL: Connection refused [06:55:10] PROBLEM - NTP on srv246 is CRITICAL: NTP CRITICAL: Offset unknown [06:59:40] RECOVERY - NTP on srv246 is OK: NTP OK: Offset -0.003497123718 secs [07:06:07] PROBLEM - Apache HTTP on srv232 is CRITICAL: Connection refused [07:08:22] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.207:11000 (Connection timed out) [07:13:46] PROBLEM - Apache HTTP on srv207 is CRITICAL: Connection refused [07:16:01] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [07:17:49] !log powercycled mw8 [07:17:55] Logged the message, Master [07:18:34] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [07:19:29] !log reinstalled srv284, seems to be up now [07:19:32] Logged the message, Master [07:21:43] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [07:22:10] PROBLEM - Apache HTTP on srv212 is CRITICAL: Connection refused [07:22:19] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [07:22:35] !log installing upgrades on srv212 [07:22:38] Logged the message, Master [07:25:46] PROBLEM - Host mw35 is DOWN: PING CRITICAL - Packet loss = 100% [07:26:04] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [07:27:16] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:27:52]
RECOVERY - Host mw35 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [07:28:54] so mutante if you care I have a summary of the boxes that are still down (not being handled by the reboot script) [07:29:13] RECOVERY - Apache HTTP on srv207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [07:29:52] apergos: ok, sure, where is it [07:30:07] srv206 has an old rt ticket for it, puppet refuses to run on it, there's an odd error, tossing the yaml file didn't help any, see the logs. srv174 and 188 aren't reachable even mgmt port. srv281 has an open ticket for hardware... [07:31:11] ssl3004 isn't reachable even via mgmt port, it *was* until * rebooted it, so that's worrying [07:31:37] PROBLEM - Apache HTTP on mw35 is CRITICAL: Connection refused [07:31:55] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.27:11000 (timeout) 10.0.8.13:11000 (timeout) [07:33:35] srv162 not reachable even on mgmt port. [07:34:05] there's only 266 I haven't looked at yet [07:34:25] apergos: srv162, srv174 - they are gone, decommissioned [07:34:37] this is only for hosts in mediawiki-installation or whatever it's called [07:34:40] if they were in a list, that was outdated i guess [07:34:47] hmm might wanna remove them from that list :-D [07:35:53] i'm installing upgrades on some anyways.. seeing that samba upgrade just doesn't feel good [07:36:04] srv266 says: Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted from the mgmt console [07:36:43] I'll power cycle it anyways and see what happens [07:37:04] why don't i see that one on nagios..hmm [07:37:14] PROBLEM - Memcached on srv284 is CRITICAL: Connection refused [07:37:21] !log powercycling srv266, had this message on mgmt console: Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted [07:37:24] Logged the message, Master [07:37:41] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:39:20] PROBLEM - Apache HTTP on srv270 is CRITICAL: Connection refused [07:39:47] PROBLEM - Apache HTTP on srv265 is CRITICAL: Connection refused [07:40:23] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [07:40:59] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [07:41:44] PROBLEM - Host mw36 is DOWN: PING CRITICAL - Packet loss = 100% [07:43:50] RECOVERY - Host mw36 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:44:03] err: Could not run Puppet configuration client: Invalid parameter system at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:50 [07:44:08] this on srv266 [07:44:11] awesome [07:44:16] apergos: srv206 - yes, weird puppet error and had hardware errors in last july :p hmm [07:44:18] same as on 206 [07:44:29] both the same puppet error? [07:44:36] yes [07:45:17] I'm saying, isn't it a bit odd that both have the same puppet error?
[07:45:48] if they each have some sort of hardware error I would expect different types of failures [07:45:59] the code it refers to is in define systemuser [07:46:13] there is a user created with "system => true" [07:46:20] it does not like that parameter..uhm [07:47:08] PROBLEM - Apache HTTP on mw36 is CRITICAL: Connection refused [07:50:35] !log upgrading mw36 [07:50:39] Logged the message, Master [07:51:18] apergos: taking srv206 for reinstall [07:52:41] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [07:52:52] ok [07:53:17] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [07:53:53] PROBLEM - Apache HTTP on mw11 is CRITICAL: Connection refused [07:57:29] PROBLEM - Host srv206 is DOWN: PING CRITICAL - Packet loss = 100% [07:58:05] PROBLEM - Host srv265 is DOWN: PING CRITICAL - Packet loss = 100% [07:59:44] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.15:11000 (Connection refused) 10.0.11.34:11000 (timeout) [07:59:45] !log reinstalling srv206 [07:59:48] Logged the message, Master [08:00:38] RECOVERY - Host srv265 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [08:02:35] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:03:02] RECOVERY - Host srv206 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [08:05:35] PROBLEM - Host srv274 is DOWN: PING CRITICAL - Packet loss = 100% [08:07:05] PROBLEM - Memcached on srv206 is CRITICAL: Connection refused [08:07:14] PROBLEM - SSH on srv206 is CRITICAL: Connection refused [08:07:59] RECOVERY - Host srv274 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:08:53] !log upgraded mw1,mw2,mw35 [08:08:55] Logged the message, Master [08:10:59] PROBLEM - Apache HTTP on mw1 is CRITICAL: Connection refused [08:12:56] RECOVERY - SSH on srv206 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:13:32] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.318 second response time [08:13:41] PROBLEM - Apache HTTP on mw2 is CRITICAL: Connection refused [08:15:20] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [08:17:53] PROBLEM - Apache HTTP on mw33 is CRITICAL: Connection refused [08:18:11] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:18:20] RECOVERY - Apache HTTP on srv265 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [08:19:44] apergos: anyways, the puppet problem is gone after reinstall.
i can also take 266 then [08:19:50] great [08:22:14] PROBLEM - Host srv276 is DOWN: PING CRITICAL - Packet loss = 100% [08:22:46] !reinstalling srv266 [08:23:44] RECOVERY - Host srv276 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms [08:23:53] PROBLEM - NTP on srv206 is CRITICAL: NTP CRITICAL: No response from NTP server [08:24:11] RECOVERY - Apache HTTP on srv206 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [08:28:22] PROBLEM - Apache HTTP on srv276 is CRITICAL: Connection refused [08:29:07] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [08:29:34] PROBLEM - SSH on srv266 is CRITICAL: Connection refused [08:29:43] PROBLEM - Apache HTTP on srv266 is CRITICAL: Connection refused [08:30:01] PROBLEM - Memcached on srv266 is CRITICAL: Connection refused [08:31:58] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.23:11000 (timeout) [08:32:46] mutante: you probably want to put !log there [08:33:00] !reinstalling :P [08:33:37] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:33:55] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.373 second response time [08:34:04] PROBLEM - Apache HTTP on srv287 is CRITICAL: Connection refused [08:34:13] !log reinstalling srv266 [08:34:15] Logged the message, Master [08:34:15] thx petan|wk [08:34:51] btw when ur not busy poke me [08:35:07] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [08:35:16] RECOVERY - SSH on srv266 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:35:16] RECOVERY - NTP on srv206 is OK: NTP OK: Offset 0.007565736771 secs [08:35:34] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [08:38:16] PROBLEM - Host srv283 is DOWN: PING CRITICAL - Packet loss = 100% [08:39:46] RECOVERY - Host srv283 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [08:40:49] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.074 second response time [08:42:37] RECOVERY - Apache HTTP on srv266 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.007 seconds [08:44:07] PROBLEM - Apache HTTP on srv283 is CRITICAL: Connection refused [08:45:37] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.241:11000 (Connection timed out) [08:47:07] PROBLEM - NTP on srv266 is CRITICAL: NTP CRITICAL: Offset unknown [08:48:46] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [08:49:58] PROBLEM - Apache HTTP on srv241 is CRITICAL: Connection refused [08:54:19] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [08:54:28] PROBLEM - Host srv263 is DOWN: PING CRITICAL - Packet loss = 100% [08:56:34] RECOVERY - Host srv263 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [08:57:01] RECOVERY - NTP on srv266 is OK: NTP OK: Offset -0.02310967445 secs [08:59:43] PROBLEM - Apache HTTP on srv263 is CRITICAL: Connection refused [09:00:55] RECOVERY - Apache HTTP on srv283 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [09:02:16] PROBLEM - Host srv277 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:34] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.27:11000 (Connection refused) [09:03:55] RECOVERY - 
Host srv277 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [09:04:04] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [09:06:46] RECOVERY - Apache HTTP on srv263 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [09:08:16] PROBLEM - Apache HTTP on srv277 is CRITICAL: Connection refused [09:09:28] PROBLEM - Host srv271 is DOWN: PING CRITICAL - Packet loss = 100% [09:11:43] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.21:11000 (Connection refused) [09:12:10] RECOVERY - Host srv271 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:13:05] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [09:19:49] PROBLEM - NTP on mw5 is CRITICAL: NTP CRITICAL: Offset unknown [09:20:07] PROBLEM - Apache HTTP on mw5 is CRITICAL: Connection refused [09:21:30] what apache? it's supposed to be nginx over there anyways [09:21:37] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.523 second response time [09:21:49] thanks [09:21:55] thanks a whole lot nagios [09:24:01] RECOVERY - NTP on mw5 is OK: NTP OK: Offset 0.04755532742 secs [09:29:34] PROBLEM - Apache HTTP on mw62 is CRITICAL: Connection refused [09:32:52] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [09:38:34] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [09:42:19] PROBLEM - Host srv282 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:46] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [09:44:16] RECOVERY - Host srv282 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [09:45:01] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [09:47:25] PROBLEM - Apache HTTP on srv282 is CRITICAL: Connection refused [09:49:58] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [09:50:16] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [09:52:13] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [09:55:58] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [10:01:58] PROBLEM - Apache HTTP on mw7 is CRITICAL: Connection refused [10:03:19] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [10:05:52] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [10:09:28] PROBLEM - Apache HTTP on mw61 is CRITICAL: Connection refused [10:14:09] PROBLEM - Host mw70 is DOWN: PING CRITICAL - Packet loss = 100% [10:15:21] RECOVERY - Host mw70 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:18:31] PROBLEM - Apache HTTP on mw70 is CRITICAL: Connection refused [10:20:00] RECOVERY - Apache HTTP on mw70 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.106 second response time [10:21:39] PROBLEM - Host mw10 is DOWN: PING CRITICAL - Packet loss = 100% [10:23:00] RECOVERY - Host mw10 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:25:51] RECOVERY - Apache HTTP on mw61 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [10:26:00] PROBLEM - Apache HTTP on mw10 is CRITICAL: Connection refused [10:34:51] PROBLEM - Apache HTTP on mw69 is CRITICAL: Connection refused [10:41:54] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently 
- 0.043 second response time [10:42:48] PROBLEM - Apache HTTP on mw71 is CRITICAL: Connection refused [10:50:36] PROBLEM - Apache HTTP on mw47 is CRITICAL: Connection refused [10:57:12] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [10:58:06] PROBLEM - Apache HTTP on mw53 is CRITICAL: Connection refused [10:59:09] RECOVERY - Apache HTTP on mw69 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.118 second response time [11:01:33] PROBLEM - Host mw37 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:03] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.24:11000 (timeout) 10.0.11.37:11000 (Connection refused) [11:03:57] RECOVERY - Apache HTTP on mw71 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.114 second response time [11:04:24] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [11:04:24] RECOVERY - Host mw37 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [11:07:33] PROBLEM - Apache HTTP on mw37 is CRITICAL: Connection refused [11:07:33] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [11:09:57] PROBLEM - Host srv286 is DOWN: PING CRITICAL - Packet loss = 100% [11:11:09] RECOVERY - Host srv286 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [11:11:27] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.36:11000 (Connection refused) 10.0.8.23:11000 (timeout) [11:12:57] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [11:15:21] PROBLEM - Apache HTTP on srv286 is CRITICAL: Connection refused [11:18:30] PROBLEM - Host mw55 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:15] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [11:20:27] RECOVERY - Host mw55 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [11:21:48] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [11:30:48] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [11:33:48] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time [11:42:21] PROBLEM - Host srv269 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:45] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.22:11000 (timeout) [11:44:54] RECOVERY - Host srv269 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:46:15] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [11:47:54] PROBLEM - Apache HTTP on srv269 is CRITICAL: Connection refused [11:54:48] PROBLEM - Apache HTTP on mw67 is CRITICAL: Connection refused [11:57:48] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [12:02:27] PROBLEM - Apache HTTP on mw19 is CRITICAL: Connection refused [12:08:50] RECOVERY - Apache HTTP on mw67 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [12:11:50] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [12:12:08] PROBLEM - Apache HTTP on mw9 is CRITICAL: Connection refused [12:14:59] RECOVERY - Apache HTTP on mw9 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [12:14:59] PROBLEM - Host srv272 is DOWN: PING CRITICAL - Packet loss = 100% [12:16:47] 
RECOVERY - Host srv272 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [12:20:05] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused [12:25:56] PROBLEM - Apache HTTP on mw41 is CRITICAL: Connection refused [12:28:56] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [12:34:56] PROBLEM - Apache HTTP on mw26 is CRITICAL: Connection refused [12:35:41] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [12:37:38] PROBLEM - Host mw74 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:11] RECOVERY - Host mw74 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [12:41:59] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [12:44:05] PROBLEM - Apache HTTP on mw74 is CRITICAL: Connection refused [12:51:17] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [12:51:26] PROBLEM - Apache HTTP on mw73 is CRITICAL: Connection refused [12:54:17] PROBLEM - Host mw68 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:35] hello [12:54:48] hi nosy [12:54:53] does anyone of you know if ms6 in haarlem is still in production? [12:55:04] hello mutante [12:55:07] do you know this? [12:55:49] it looks like it is, not decommissioned [12:56:02] mutante: thanks [12:56:13] don't see open ticket with the name either [12:56:15] k,np [12:56:23] RECOVERY - Host mw68 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [12:57:26] mutante: i just asked because oracle asked wmde to renew a support contract for it [12:57:44] if this host was no longer needed that would be good to know [12:58:18] nosy: ok, well i don't know about the contract but the host is up and running [12:59:50] PROBLEM - Apache HTTP on mw68 is CRITICAL: Connection refused [13:02:41] RECOVERY - Apache HTTP on mw74 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [13:06:49] PROBLEM - Apache HTTP on mw72 is CRITICAL: Connection refused [13:08:10] RECOVERY - Apache HTTP on mw72 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [13:14:46] PROBLEM - Apache HTTP on mw66 is CRITICAL: Connection refused [13:17:46] RECOVERY - Apache HTTP on mw73 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [13:18:49] PROBLEM - Host mw64 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:28] RECOVERY - Apache HTTP on mw66 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [13:20:46] RECOVERY - Host mw64 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [13:22:43] RECOVERY - Apache HTTP on mw68 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time [13:24:04] PROBLEM - Apache HTTP on mw64 is CRITICAL: Connection refused [13:38:10] PROBLEM - Apache HTTP on mw22 is CRITICAL: Connection refused [13:44:19] RECOVERY - Apache HTTP on mw64 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [13:46:16] PROBLEM - Apache HTTP on mw65 is CRITICAL: Connection refused [13:52:16] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [13:55:16] PROBLEM - Apache HTTP on mw34 is CRITICAL: Connection refused [14:00:26] PROBLEM - Host mw45 is DOWN: PING CRITICAL - Packet loss = 100% [14:01:53] RECOVERY - Apache HTTP on mw65 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [14:02:01] PROBLEM - check_all_memcacheds on spence is 
CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.11.45:11000 (Connection timed out) [14:02:37] PROBLEM - Apache HTTP on mw2 is CRITICAL: Connection refused [14:04:48] PROBLEM - Puppet freshness on blondel is CRITICAL: Puppet has not run in the last 10 hours [14:10:39] PROBLEM - Apache HTTP on mw52 is CRITICAL: Connection refused [14:12:42] hi ops room [14:12:49] who's around to approve my tickets? [14:13:05] Have you brought bribes? [14:13:12] hmmm [14:13:12] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [14:13:16] i have two cookies sitting next to me [14:13:20] you can come and get them? [14:13:28] they are coconut oatmeal raisin! [14:13:31] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [14:13:33] wait i am not myself [14:13:37] ottomata1? brb [14:13:58] thar we go [14:14:15] https://gerrit.wikimedia.org/r/#q,status:open+owner:ottomata,n,z [14:14:26] mark might need to check the git::clone one [14:14:26] um [14:14:38] i guess i really need this one right now [14:14:38] https://gerrit.wikimedia.org/r/#change,6038 [14:19:12] PROBLEM - Apache HTTP on mw14 is CRITICAL: Connection refused [14:22:57] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.567 second response time [14:23:08] Reedy, how can I bribe you? [14:26:24] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [14:35:06] PROBLEM - Apache HTTP on mw12 is CRITICAL: Connection refused [14:37:58] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [14:38:33] apergos? hrrmmmm? [14:38:47] https://gerrit.wikimedia.org/r/#change,6038 [14:39:06] no bribes now [14:39:37] you didn't notice I had committed a change for that shortly after we talked about it eh? :-P [14:39:51] ack! [14:39:52] nope [14:40:03] is that change waiting to be approved too? [14:40:07] yes [14:40:09] :-D [14:40:24] I left it for review figuring that you and/or mark might want to look at it [14:40:40] instead you just committed a new one probably identical :-D [14:40:53] ha, yeah, wanna look at yours [14:40:59] time zone probs! [14:41:08] anyways one comment, in order for this to work the [14:41:14] (you were still in the channel :-P) [14:41:39] the files and dirs will all have to be chowned to backup user. erik as a user isn't defined over there [14:42:00] so for testing you'll want to just have a subdir that is user/group backup [14:42:11] and make sure that your test script can read/write into it ok [14:42:11] ok [14:42:16] yours does not have read only = false [14:42:19] that will be a prob, no? [14:42:34] otherwise we did pretty much the same thing [14:42:35] it doesn't have read only = true [14:42:57] right, hm, what's the default? when I've set them in the past i've had to set it to false to be able to write [14:43:10] but maybe the defaults here are set differently [14:43:23] anyways, once the script is tested then at the last minute when he is ready to switch we can do a chown on the whole dir [14:43:31] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.072 second response time [14:43:39] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [14:43:50] anyways it is now late enough in the day that I'm not bribeable :-P [14:44:03] rats!
[14:44:06] PROBLEM - Apache HTTP on mw42 is CRITICAL: Connection refused [14:44:13] sorry ;-) [14:44:22] so'k [14:44:32] (also I was on here some of sunday late, so that means I get strict about today's hours) [14:45:18] that is good. [14:50:51] PROBLEM - Apache HTTP on mw54 is CRITICAL: Connection refused [14:58:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [14:59:51] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2225 [15:02:24] PROBLEM - Host mw51 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:31] RECOVERY - Host mw51 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:04:50] PROBLEM - Apache HTTP on mw51 is CRITICAL: Connection refused [15:07:31] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [15:10:22] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [15:14:35] sooo mayyyybe mutante [15:14:37] will review for us [15:14:52] or maybe notpeter [15:15:01] PROBLEM - Apache HTTP on mw17 is CRITICAL: Connection refused [15:15:30] there are two competing changes [15:15:35] mine and apergos' [15:15:37] either one is fine [15:15:42] whichever you think is better [15:15:44] https://gerrit.wikimedia.org/r/#change,5887 [15:15:48] https://gerrit.wikimedia.org/r/#change,6038 [15:17:52] PROBLEM - Host mw3 is DOWN: PING CRITICAL - Packet loss = 100% [15:20:43] RECOVERY - Host mw3 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:26:16] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.041 second response time [15:27:37] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [15:32:07] PROBLEM - Apache HTTP on mw24 is CRITICAL: Connection refused [15:33:37] is there an IRC command to play crickets? [15:34:31] PROBLEM - Host mw13 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:37] RECOVERY - Host mw13 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:40:22] PROBLEM - Apache HTTP on mw13 is CRITICAL: Connection refused [15:44:34] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.666 second response time [15:44:43] PROBLEM - Host srv266 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:43] New patchset: Dzahn; "fixing link creation in HTML tables, part 1" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6121 [15:45:44] New patchset: Dzahn; "fix links pt.2, add lang column to mw, cut off long wiki names, fix version links, error handling, fix sorting by version,.." 
[operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6122 [15:45:44] New patchset: Dzahn; "fix other broken links pt.3 - use base,articlepath,server from API to build correct URLs, do not even try to guess from stats URL" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6123 [15:45:45] New patchset: Dzahn; "add global "license" column in stats tables" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6124 [15:46:13] PROBLEM - Apache HTTP on mw20 is CRITICAL: Connection refused [15:47:08] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6121 [15:47:10] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6121 [15:48:36] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6122 [15:48:39] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6122 [15:49:11] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6123 [15:49:13] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6123 [15:50:43] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [15:51:37] PROBLEM - Host srv273 is DOWN: PING CRITICAL - Packet loss = 100% [15:53:16] RECOVERY - Host srv273 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:54:01] New patchset: Dzahn; "add global "license" column in stats tables. remove whitespace" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6124 [15:55:18] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6124 [15:55:20] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6124 [15:55:49] ah ha, Dzahn is committing and merging so obviously not busy at all [15:56:01] and wants to pick one of these two changes to approve [15:56:02] https://gerrit.wikimedia.org/r/#change,6038 [15:56:06] https://gerrit.wikimedia.org/r/#change,5887 [15:56:25] PROBLEM - Apache HTTP on srv273 is CRITICAL: Connection refused [15:59:45] New review: Dzahn; "duplicate of !change 6038" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5887 [16:01:49] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [16:03:10] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [16:08:28] New review: Dzahn; "looks just like Ariel's change 5887 but i just see myself in a "+1 position" right now. wasnt part o..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/6038 [16:09:27] I don't care what gets approved, just do it and make it work [16:09:48] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [16:10:21] ja Dzahn [16:10:22] either one is fine [16:10:27] whichever one you want [16:10:34] the only diff is mine has read-only = false [16:10:38] which may or may not be needed? [16:10:52] maybe just approve mine instead just in case it is needed [16:11:01] so we don't have to go through the approval process again if it is [16:11:09] PROBLEM - Apache HTTP on mw16 is CRITICAL: Connection refused [16:15:50] New review: Dzahn; "ok, taking this one because it has the ticket link and info." 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6038 [16:15:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6038 [16:16:13] thank you! [16:18:30] PROBLEM - Apache HTTP on mw43 is CRITICAL: Connection refused [16:23:05] New patchset: Ottomata; "misc/statistcs.pp - changing paths to correct htpasswd protected directories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6127 [16:23:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6127 [16:25:33] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [16:26:04] New patchset: Ottomata; "misc/statistcs.pp - changing paths to correct htpasswd protected directories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6127 [16:26:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6127 [16:27:27] New patchset: Ottomata; "misc/statistcs.pp - changing paths to correct htpasswd protected directories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6127 [16:27:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6127 [16:28:24] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.020 second response time [16:28:33] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:31:06] New review: Dzahn; "is there a " in front of RewriteCond on purpose or typo? see inline" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/6127 [16:31:42] PROBLEM - Host srv289 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:39] RECOVERY - Host srv289 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:37:16] New review: Dzahn; "well you are not changing it here and the new pathes make sense." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6127 [16:37:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6127 [16:40:36] New review: Ottomata; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6127 [16:41:44] ottomata: ah, ok, thanks for info. ok for now? i'll be heading out soon [16:42:04] yup, looking good [16:42:05] thank you! [16:42:41] yw, ok. handing over to US timezone then:) [17:10:17] New patchset: Pyoungmeister; "adding per instance logging for udp2log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6129 [17:10:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6129 [17:14:48] New patchset: Pyoungmeister; "adding per instance logging for udp2log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6129 [17:15:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6129 [17:15:45] PROBLEM - jenkins_service_running on aluminium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:16:28] ottomata diederik: I'm changing some nagios stuffs for udp2log. 
if you see anything weird, take it with a grain of salt and let me know :) [17:16:38] ok [17:16:49] ok cool [17:17:25] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6129 [17:17:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6129 [17:23:41] PROBLEM - SSH on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:00] New patchset: Pyoungmeister; "forgot to add per-instance naming for nrpe checks in last commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6132 [17:24:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6132 [17:24:41] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6132 [17:24:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6132 [17:25:02] RECOVERY - SSH on aluminium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:39:18] PROBLEM - SSH on aluminium is CRITICAL: Server answer: [17:46:49] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:35] drdee_: monitoring actually works again now :) [17:49:49] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [17:50:34] RECOVERY - jenkins_service_running on aluminium is OK: PROCS OK: 3 processes with args jenkins [17:50:34] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/teahouse.log, have not been written to in 6 hours [17:51:28] RECOVERY - SSH on aluminium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:56:52] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [17:59:32] notpeter, just checking, all is well? [18:02:30] New patchset: Ryan Lane; "Only set the member attribute map before precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6136 [18:02:48] New patchset: Ryan Lane; "Change the automount timeout to 2 hours" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5927 [18:03:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6136 [18:03:05] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5927 [18:03:05] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5927 [18:03:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5927 [18:03:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5927 [18:03:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6136 [18:03:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6136 [18:03:45] ottomata: it looks like one of the filters on locke is crashing [18:03:55] /a/squid/urjc.awk [18:04:13] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [18:04:23] or... is having some kinda issues [18:04:40] ottomata: the monitoring before was reading an old conf file [18:04:49] so the alerts that exist now are actually correct [18:05:05] this one?
[18:05:06] # Universidad Rey Juan Carlos [18:05:06] # Contact: Antonio José Reinoso Peinado [18:05:06] # Backup contact: Jesus M. Gonzalez-Barahona [18:05:06] pipe 100 awk -f /a/squid/urjc.awk | log2udp -h wikilogs.libresoft.es -p 10514 [18:05:22] yes [18:05:25] it isn't messing with the machine or other filters though [18:05:29] the nagios check is just for pattern matching [18:05:35] how does the monitoring work? [18:05:40] is it checking that file? [18:06:00] hmm, no wait, that is sending out to udp again [18:06:08] look at puppet:files/nagios/ check_udp2log_procs [18:06:23] it's just some pattern matching against the config file [18:06:49] the /etc/udp2log/squid file? [18:07:01] ah i see [18:07:04] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [18:07:21] yeah, conf files are in /etc/udp2log [18:07:26] and named after the instance name [18:08:27] ahhhhggg, why don't people comment code! shell scripts even [18:08:37] why do I have to decipher a regexp and gerp myself? [18:08:38] agghh [18:08:54] gerp gerp gerp [18:08:58] because we love you :-D [18:09:01] hah [18:09:16] it just looks at the name of the filter app [18:09:20] in a non-commented line [18:09:20] because it builds character? [18:09:30] and looks at the output of ps for it [18:10:27] oh to see if it is running? [18:10:38] yeah [18:15:10] PROBLEM - jenkins_service_running on aluminium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:15:19] PROBLEM - Exim SMTP on aluminium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:21] !log depooled & rebooting ssl1 [18:16:24] Logged the message, Master [18:16:40] RECOVERY - Exim SMTP on aluminium is OK: SMTP OK - 0.213 sec. response time [18:18:37] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [18:20:25] has anyone looked at mw45? [18:21:20] no. is ryan's script done and it's still down? [18:21:27] yes [18:21:28] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [18:21:34] or, his script looks done [18:21:37] PROBLEM - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:37] PROBLEM - LVS HTTPS on wikisource-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:37] PROBLEM - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:37] PROBLEM - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:44] eehhhh [18:21:46] PROBLEM - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:46] PROBLEM - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:21:52] * apergos twitches [18:21:53] !log rebuilding db57 again, this time with more correct raid level!
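Going back to the check_udp2log_procs exchange above (18:05-18:10), here is a rough Python rendering of the logic as it is described there. The real check is a shell script under puppet:files/nagios/, and treating every absolute path on a non-comment "pipe" line as a filter name is an assumption of this sketch, not necessarily how that script matches.

    #!/usr/bin/env python
    # Rough sketch of the idea behind check_udp2log_procs as described above:
    # pull filter program paths out of non-comment "pipe" lines in the
    # instance's config under /etc/udp2log/ and make sure each one appears in
    # `ps` output. The path-based matching here is an assumption.
    import subprocess
    import sys

    def filters_in_config(path):
        filters = set()
        with open(path) as conf:
            for line in conf:
                line = line.strip()
                if not line or line.startswith('#') or not line.startswith('pipe'):
                    continue
                for token in line.split():
                    if token.startswith('/'):  # e.g. /a/squid/urjc.awk
                        filters.add(token)
        return filters

    def main(instance='squid'):
        # conf files live in /etc/udp2log and are named after the instance
        wanted = filters_in_config('/etc/udp2log/%s' % instance)
        ps_args = subprocess.check_output(['ps', '-eo', 'args']).decode()
        missing = sorted(f for f in wanted if f not in ps_args)
        if missing:
            print('CRITICAL: filters absent: ' + ', '.join(missing))
            return 2
        print('OK: all filters present')
        return 0

    if __name__ == '__main__':
        sys.exit(main(*sys.argv[1:]))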
[18:21:53] mw45 is actually dead I think [18:21:55] Logged the message, notpeter [18:22:00] !log rebooting mw45 [18:22:02] Logged the message, Master [18:22:04] we can just switch it in mc.php [18:22:13] PROBLEM - LVS HTTPS on upload.esams.wikimedia.org is CRITICAL: Connection refused [18:22:28] let's see if it comes back [18:22:31] PROBLEM - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:22:31] PROBLEM - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:22:38] argh [18:22:40] PROBLEM - LVS HTTPS on wikinews-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:22:40] PROBLEM - SSH on aluminium is CRITICAL: Server answer: [18:22:49] PROBLEM - LVS HTTPS on bits.esams.wikimedia.org is CRITICAL: Connection refused [18:22:49] PROBLEM - LVS HTTPS on foundation-lb.esams.wikimedia.org is CRITICAL: Connection refused [18:22:49] PROBLEM - HTTPS on ssl3001 is CRITICAL: Connection refused [18:23:04] I didn't do anything in esams yet, but Ryan is speculating that it might be ensure => latest in puppet [18:23:43] RECOVERY - LVS HTTPS on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 638 bytes in 0.441 seconds [18:23:43] that sounds plausible [18:23:49] argh [18:23:52] RECOVERY - LVS HTTPS on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.446 second response time [18:23:52] RECOVERY - LVS HTTPS on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 80114 bytes in 0.876 seconds [18:23:55] nginx wasn't running on ssl3001 [18:24:01] RECOVERY - LVS HTTPS on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 70920 bytes in 0.884 seconds [18:24:06] I *hate* ensure => latest [18:24:10] RECOVERY - LVS HTTPS on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3972 bytes in 0.440 seconds [18:24:10] RECOVERY - LVS HTTPS on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39069 bytes in 0.705 seconds [18:24:12] heh [18:24:14] I thought ryan got that back in shape earlier today [18:24:14] me too [18:24:17] ssl3001 I mean [18:24:17] well, feel free to change it [18:24:19] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14. [18:24:23] especially with packages that restart things on install [18:24:28] RECOVERY - LVS HTTPS on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43439 bytes in 0.778 seconds [18:24:28] RECOVERY - LVS HTTPS on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43375 bytes in 0.781 seconds [18:24:28] RECOVERY - LVS HTTPS on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 59962 bytes in 0.816 seconds [18:24:28] RECOVERY - LVS HTTPS on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 64065 bytes in 0.887 seconds [18:24:34] apergos: it's likely due to the package update [18:24:37] RECOVERY - LVS HTTPS on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 50128 bytes in 0.775 seconds [18:24:37] RECOVERY - LVS HTTPS on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 52107 bytes in 0.778 seconds [18:24:39] ah that would be it [18:24:41] grrrr [18:24:43] yeah, things that will auto-dos our site should probably be done away with...
[18:24:43] binasher: which is in Debian policy, so most of the packages… [18:24:46] so, https wasn't likely down for everyone [18:25:10] the health checks failed because https uses the sh scheduler [18:25:13] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [18:25:18] ssl3001 being out is enough to bring the site down for some people [18:25:22] RECOVERY - Host mw45 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [18:25:23] yes [18:25:24] it is [18:25:34] ssl3003 wasn't restarted yet [18:25:35] which is suboptimal... [18:25:44] until lvs depools and their connections to ssl3001 die [18:25:50] still has "old" nginx that runs since yesterday [18:25:51] then they'll be connected to another server [18:25:58] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [18:26:26] mw45 came back up but doesn't have the current kernel [18:26:45] we only rebooted hosts [18:26:47] we didn't patch them [18:26:52] PROBLEM - Puppet freshness on srv206 is CRITICAL: Puppet has not run in the last 10 hours [18:26:52] PROBLEM - Host db57 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:54] I didn't want to do that on a friday [18:27:50] paravoid: is it generally something you can disable with debconf? [18:27:53] !log power cycling aluminium which faceplanted [18:27:56] Logged the message, Master [18:28:40] PROBLEM - Apache HTTP on mw45 is CRITICAL: Connection refused [18:28:56] jeremyb: in Debian you can do it with having a /usr/local/bin/policy-rc.d that has #!/bin/sh\nexit 101 [18:29:05] jeremyb: that invoke-rc.d calls first to check [18:29:15] !log rebooting mw45 for kernel upgrade [18:29:17] no idea what's the interaction with upstart though [18:29:17] Logged the message, Master [18:29:25] errr... upstart [18:29:52] RECOVERY - SSH on aluminium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:30:10] RECOVERY - jenkins_service_running on aluminium is OK: PROCS OK: 3 processes with args jenkins [18:31:33] so, we really have to do an upgrade/reboot cycle on all of our machines [18:34:28] yeah [18:34:37] !log pooled back ssl1; depooling ssl3 and rebooting [18:34:40] Logged the message, Master [18:35:06] !log aluminium gets kernel update, yayyyyyyy! 
[18:35:09] Logged the message, Master [18:35:20] hahaha [18:35:52] PROBLEM - Host ssl3 is DOWN: PING CRITICAL - Packet loss = 100% [18:38:01] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [18:38:28] RECOVERY - Host ssl3 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [18:38:37] PROBLEM - NTP on ssl3 is CRITICAL: NTP CRITICAL: Offset unknown [18:38:46] RECOVERY - Host db57 is UP: PING OK - Packet loss = 16%, RTA = 0.20 ms [18:40:16] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:41:55] PROBLEM - MySQL Slave Running on db57 is CRITICAL: Connection refused by host [18:41:55] PROBLEM - mysqld processes on db57 is CRITICAL: Connection refused by host [18:41:55] PROBLEM - MySQL Idle Transactions on db57 is CRITICAL: Connection refused by host [18:42:22] PROBLEM - MySQL Recent Restart on db57 is CRITICAL: Connection refused by host [18:42:22] PROBLEM - MySQL disk space on db57 is CRITICAL: Connection refused by host [18:42:22] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [18:42:25] !log grosley gets new kernel + reboot [18:42:28] Logged the message, Master [18:42:31] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: Connection refused by host [18:42:40] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: Connection refused by host [18:42:58] RECOVERY - NTP on ssl3 is OK: NTP OK: Offset 0.04797685146 secs [18:43:16] PROBLEM - SSH on db57 is CRITICAL: Connection refused [18:43:16] PROBLEM - Full LVS Snapshot on db57 is CRITICAL: Connection refused by host [18:46:07] PROBLEM - Host db57 is DOWN: PING CRITICAL - Packet loss = 100% [18:46:16] PROBLEM - SSH on grosley is CRITICAL: Connection refused [18:46:52] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [18:47:19] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:47:28] RECOVERY - SSH on db57 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:47:37] RECOVERY - Host db57 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [18:48:04] PROBLEM - Exim SMTP on grosley is CRITICAL: Connection refused [18:49:07] RECOVERY - SSH on grosley is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:49:25] RECOVERY - Exim SMTP on grosley is OK: SMTP OK - 0.011 sec. response time [18:50:03] !log rebooting ssl1001 [18:50:06] Logged the message, Master [18:51:49] RECOVERY - Full LVS Snapshot on db57 is OK: OK no full LVM snapshot volumes [18:51:58] RECOVERY - MySQL Slave Running on db57 is OK: OK replication [18:51:58] RECOVERY - MySQL Idle Transactions on db57 is OK: OK longest blocking idle transaction sleeps for seconds [18:52:16] RECOVERY - MySQL disk space on db57 is OK: DISK OK [18:52:16] RECOVERY - MySQL Recent Restart on db57 is OK: OK seconds since restart [18:52:16] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [18:52:34] RECOVERY - MySQL Slave Delay on db57 is OK: OK replication delay seconds [18:52:43] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay seconds [18:56:38] New patchset: Ryan Lane; "Try another way to call the function" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6140 [18:56:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6140 [18:57:01] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6140 [18:57:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6140 [18:59:18] !log starting innobackupex from db1034 to db57 for new s2 slave [18:59:21] Logged the message, notpeter [19:00:07] !log rebooting ssl1002 [19:00:10] Logged the message, Master [19:00:49] PROBLEM - MySQL Replication Heartbeat on db1002 is CRITICAL: CRIT replication delay 205 seconds [19:00:49] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 205 seconds [19:01:34] PROBLEM - MySQL Replication Heartbeat on db1034 is CRITICAL: CRIT replication delay 251 seconds [19:01:52] PROBLEM - MySQL Slave Delay on db1034 is CRITICAL: CRIT replication delay 269 seconds [19:02:28] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:02:55] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 0 seconds [19:03:13] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [19:03:40] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [19:03:40] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [19:06:36] !log rebooting ssl1003 [19:06:39] Logged the message, Master [19:08:10] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [19:14:07] !log rebooting ssl1004 [19:14:10] Logged the message, Master [19:16:34] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:17:32] New patchset: Ryan Lane; "nslcd will fail to start with improper permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6146 [19:17:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6146 [19:17:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6146 [19:17:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6146 [19:19:25] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [19:21:22] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [19:22:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [19:22:41] paravoid: re: the db heartbeat alerts, all of the above dbs are in s2 and there were some expensive watchlist / link queries. that alert is definitely something to look into if it doesn't clear up in a few minutes. 
when it's all of the slaves for a shard in eqiad, it can be a sign that there's something wrong with the secondary master [19:23:18] which is db1034 in this case, the one that was lagged the most [19:24:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196, [19:25:07] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0 [19:25:07] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 25, down: 0, shutdown: 1 [19:25:25] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:25:43] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [19:26:46] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [19:27:22] binasher: when you have some spare time, I'd really like to hear a few things about our db setups [19:27:30] binasher: esp. since apparently I'll be involved into db replication for labs [19:27:48] FYI, i'm trying to figure out the udp2log process alert right now [19:28:01] that filter is flapping, not really sure why, having a hard time finding output as to why [19:28:39] if a udp2log process is flapping, what is restarting it? [19:28:43] udp2log itself? [19:29:01] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:01] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:10] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:28] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:37] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:30:04] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:30:05] ottomata: that usually means to process is segfaulting [19:30:22] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [19:31:16] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:31:19] hm, ok [19:31:30] udp2log is the parent proc? [19:31:50] or hm, epoll [19:31:52] (reading source) [19:32:29] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:29] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:12] ah no, i see it [19:33:12] paravoid: can't the labs DBs just be an extra slave off the eqiad intermediaries? [19:33:13] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:33:13] RECOVERY - Host ssl3001 is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms [19:33:14] fork() [19:33:22] paravoid: seen noc.wm.o/dbtree/ i assume? [19:33:39] jeremyb: I have and no, the current plan is to use something like tungsten [19:33:49] so that we can filter out sensitive things [19:33:55] huh, i'll have to google [19:34:05] oh, instead of views? 
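Aside on the heartbeat alerts binasher explains above: when every s2 slave in eqiad lags at once, the suspect is the box they all replicate from (db1034 here) rather than the slaves themselves. A rough way to confirm that from the shell, assuming direct access to the hosts (host list taken from the alerts above, credentials omitted):

    # Sketch: compare lag across the s2 eqiad slaves; if they all show roughly
    # the same Seconds_Behind_Master, the delay is upstream of them.
    for h in db1002 db1018 db1034; do
        printf '%s: ' "$h"
        mysql -h "$h" -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}'
    done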
[19:34:06] and lower the barrier for access to volunteers [19:34:20] i wonder how that compares to trainwreck [19:34:52] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [19:38:10] PROBLEM - Host payments3 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:31] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:39:37] ottomata: udp2log is the parent, it should automatically respawn filters after they crash [19:39:58] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [19:40:52] RECOVERY - Host payments3 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [19:41:28] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [19:45:13] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:48:47] !log upgraded & rebooted ssl3001, ssl3002, ssl3003 [19:48:49] Logged the message, Master [19:49:16] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [19:49:25] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [19:49:34] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [19:57:15] !log payments cluster gets kernel updates and reboots [19:57:17] Logged the message, Master [19:58:07] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [19:59:11] !log restarting nagios to get rid of some old checks [19:59:13] Logged the message, notpeter [19:59:29] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [20:01:26] growl [20:01:34] i can't reproduce on my local [20:01:41] really not sure what is wrong with that filter [20:01:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197, [20:01:50] its just an awk pipe [20:01:54] pretty simple [20:02:00] ottomata: yeah. it's werid that it would be fialing [20:02:01] gr [20:02:08] go fenari go "load average: 30.70, 14.26, 6.74" [20:02:11] PROBLEM - BGP status on cr1-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.196, [20:02:35] Jeff_Green: you can do better than 30... [20:02:38] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [20:02:41] i can see udp2log contantly respawning the proc [20:02:47] notpeter: i never knew vi could party so hard [20:02:57] ottomata: is that a new filter or one that's been around a while? 
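For readers following the flapping-filter thread: udp2log is the parent process and respawns each configured filter child whenever it exits, which is why /a/squid/urjc.awk keeps vanishing from the process table and coming back. A sketch of the sort of filter line involved and a crude way to watch the churn; the config format shown is an assumption, only the filter path comes from the alerts themselves:

    # Hypothetical udp2log filter entry (format assumed: pipe <sampling factor> <command>):
    #   pipe 10 /usr/bin/awk -f /a/squid/urjc.awk

    # Watch the filter's PID change to confirm udp2log keeps respawning a dying child:
    while true; do date; pgrep -fl 'urjc.awk'; sleep 5; done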
[20:02:58] no information as to why it is dying though [20:03:13] robla: I believe been around for a while [20:04:17] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:36] PROBLEM - BGP status on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196, [20:04:44] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:44] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:44] PROBLEM - LVS HTTP on ms-fe.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:06] robla: i think its been around a while, but notpeter just fixed the nagios monitoring [20:05:11] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197, [20:05:21] apparently nagios was reading the wrong config file [20:05:24] and wasn't checking for this proc [20:05:27] so its probably been broken for a while [20:05:30] don't know how long [20:05:47] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [20:05:52] ottomata: two weeks or so [20:06:19] I fowled up monitoring slightly when I switched the the instance-based udp2log setup [20:06:26] ah, right [20:06:26] cool [20:06:41] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [20:07:12] just got pages, what's breaking ? [20:07:35] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [20:08:20] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [20:08:37] LeslieCarr: it would seem we're pushing so much data through your routers that we're breaking them? xD [20:09:19] 3rd time! [20:09:34] LeslieCarr: IIRC m ark mentioned it to you before, uplink of the rack with the NFS box is too small [20:09:44] notpeter, robla, since this filter isn't working…should I disable it? [20:09:46] yes it is [20:09:50] nfs1 has a narrow pipe, I think nfs2 has a wider pipe [20:09:53] so it saturates stuff [20:09:59] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: No response from remote host 10.1.2.3 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:10:01] So maybe we should switch NFS masters or whatever the right terminology is? [20:10:07] I think it's the upload for teh rack, not the box [20:10:09] speaking of that, woosters get the procurement ticket going? :) [20:10:12] it's upload for the rack [20:10:12] So that nfs-home (10.0.5.8) will point to nfs2 rather than 1 [20:10:17] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [20:10:24] Reedy: nfs1 and nfs2 are in different racks AFAIK [20:10:26] PROBLEM - SSH on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:27] PROBLEM - MySQL Idle Transactions on db31 is CRITICAL: CRIT longest blocking idle transaction sleeps for 605 seconds [20:10:27] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [20:10:43] That doesn't fix the issue if the uplink is the same [20:10:44] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [20:10:53] Do we all need to smile sweetly at woosters? 
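On the monitoring gap notpeter mentions above (nagios reading the wrong config, so the filter went unchecked for a couple of weeks): the varnishncsa and swift alerts in this log are ordinary process-count checks, and the same stock plugin can cover a udp2log filter. A sketch only; the threshold and regex are illustrative, not the production check:

    # Alert unless exactly one process has the filter path in its argument list.
    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array='/a/squid/urjc\.awk'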
[20:11:06] just say pls ..;-) [20:11:14] ottomata: no idea [20:11:16] You might be interested to hear that pretty much the entire ops dept is out to lunch [20:11:22] (reedy) [20:11:22] excpt me [20:11:31] but I am really really afk [20:11:32] Except for woosters plus the remote people [20:11:36] i'm am, but only metaphorically speaking [20:11:37] ottomata: I guess so [20:11:40] unless I have to be here (11 pm etc) [20:11:47] RECOVERY - SSH on mw1106 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:11:47] PROBLEM - SSH on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:56] PROBLEM - MySQL Idle Transactions on db35 is CRITICAL: CRIT longest blocking idle transaction sleeps for 624 seconds [20:12:04] robla, notpeter, i will disable and email the contact for the filter [20:12:06] i'm here now [20:12:08] Not as if any of them can do much about it [20:12:10] apergos: Nah I think we just need LeslieCarr to charm the network into obedience [20:12:14] woosters: pleeeeeaaaasssseeee [20:12:14] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 10.65.0.1 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [20:12:15] haha [20:12:18] i can't do magic [20:12:20] ottomata: sounds liek a reasonable plan [20:12:24] woosters: I'm bringing more sugar next week ;) [20:12:34] You're a network person, anything you do is magic to us :) [20:12:58] All you have to do to give LeslieCarr a heart attack is go around unplugging cables :D [20:13:05] eeep!!! [20:13:09] RECOVERY - SSH on magnesium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:13:11] i have nightmares of that [20:13:42] Do you have nightmares of your peers having a row too? [20:14:19] ok, but srsly, what's killing nfs? [20:14:29] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: host 10.1.2.3, interfaces up: 32, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/1: down - csw5-pmtpa:8/23:BR [20:14:29] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [20:14:47] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.109 seconds [20:14:49] notpeter: Well all app servers are pulling a new MW version from it, and the network is acting up at the same time [20:14:56] RECOVERY - MySQL Idle Transactions on db35 is OK: OK longest blocking idle transaction sleeps for 0 seconds [20:14:56] RECOVERY - MySQL Idle Transactions on db31 is OK: OK longest blocking idle transaction sleeps for 0 seconds [20:14:56] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0 [20:14:56] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [20:14:58] ah [20:15:00] it's not acting up [20:15:01] Yay fenari is back [20:15:05] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0 [20:15:07] it just can't physically put any more throughput [20:15:15] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.010 seconds [20:15:15] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.213 seconds [20:15:15] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 32, down: 0, dormant: 0, excluded: 0, unused: 0 [20:15:15] RECOVERY - LVS HTTP on ms-fe.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 3.448 seconds [20:15:15] RECOVERY - BGP status on 
cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 25, down: 0, shutdown: 1 [20:15:20] all the app servers are simultaneously pulling??? [20:15:20] the scap was the last straw I guess [20:15:23] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [20:15:32] RECOVERY - BGP status on cr1-eqiad is OK: OK: host 208.80.154.196, sessions up: 10, down: 0, shutdown: 0 [20:15:36] I don't think it's quite that bad [20:15:38] poor thing [20:15:59] PROBLEM - Host srv206 is DOWN: PING CRITICAL - Packet loss = 100% [20:16:16] Lol [20:16:17] New patchset: Ottomata; "Disabling Universidad Rey Juan Carlos urjc filter until further notice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6155 [20:16:22] nfs2 has a better uplink [20:16:25] notpeter, if you approve that will disable it [20:16:31] notpeter that rack only has a 1g uplink [20:16:32] Yeah, no one has put a fan option on sync-common-file [20:16:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6155 [20:16:45] LeslieCarr: gotcha [20:16:49] ottomata: ok [20:17:06] send an email please to the enduser [20:17:13] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6155 [20:17:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6155 [20:17:35] enduser being engineering apergos ? [20:17:39] no [20:17:49] our contact there [20:17:58] apergos: oh crossed lines [20:18:07] New patchset: Reedy; "Make sync-common-file use -F30" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6156 [20:18:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6156 [20:18:30] hurray for that breaking a db dump I was almost done with ;? [20:18:32] ;? [20:18:37] :/ [20:18:50] do I have anything relyingon nfs? NO (at least, not /home) [20:18:54] thank god [20:19:05] Reedy: Were you running sync-common-file / sync-dir on the entire tree? Is that what took stuff down? [20:19:13] No [20:19:18] on php-1.20wmf2/cache [20:19:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.5429046923 (gt 8.0) [20:19:50] We've been saying that maybe a sync of an entirely new tree needs to be like -F1 [20:20:00] It would literally take 30x longer of course [20:20:05] Sure, but that didn't cause any problems earlier [20:20:06] But it wouldn't saturate the uplink [20:20:08] Oh [20:20:15] Is it the cache dir sync that caused the problems? [20:20:19] It was *just* pushing the l10n cache out [20:20:26] many non small files [20:20:26] right [20:20:32] torrent :-) [20:20:35] is that *just* or "just" :) [20:20:36] Ah, OK [20:20:46] So now that that has a forklimit, we should be fine in the future [20:20:51] reedy@fenari:/home/wikipedia/common/php-1.20wmf2/cache/l10n$ du --si [20:20:52] 600M . [20:21:02] 600M * numberofapaches [20:21:18] which makes lots [20:21:49] robla: [20:21:51] robla: https://gerrit.wikimedia.org/r/6156 [20:24:20] How many boxes are in the mediawiki-installation group? 
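Reedy's -F30 change above is about capping the fan-out so the whole app server fleet doesn't pull php-1.20wmf2 from NFS at once. A sketch of the idea, assuming the wrapper fans out with dsh and that -F is its fork limit; the command path is an assumption, and the group name is the one asked about just above:

    # Sketch, not the real sync-common-file: only ~30 hosts rsync from NFS at a time.
    dsh -g mediawiki-installation -F 30 -- /usr/bin/sync-common

    # Back-of-envelope for why the unthrottled sync hurt: ~600 MB of l10n cache
    # times ~200 app servers is ~120 GB through one 1 Gbit/s rack uplink (~125 MB/s),
    # i.e. the link stays saturated for roughly a quarter of an hour:
    echo $(( 600 * 200 / 125 / 60 )) minutes   # ≈ 16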
[20:25:01] * robla shrugs [20:25:07] 207 [20:25:21] 200 [20:25:22] wc -l is taking a lot longer on fenari right about now [20:25:25] If you discount comments [20:25:36] Lol [20:25:37] fine, wise guy :-P [20:25:43] roan, always a step ahead [20:25:43] So over 120GB of files from the NFS box [20:25:44] Either way the # of cycles you're looking at is like 7 [20:25:49] * Reedy strokes his chin [20:26:14] with the lag I have, yer lucky I even got the path to the file to take [20:26:56] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [20:28:26] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [20:28:26] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.29603193798 [20:28:53] !log restarting, once again, innobackupex from db1034 to db57 for new s2 slave after fenari crash killed my screen [20:28:56] Logged the message, notpeter [20:31:16] RoanKattouw Reedy fyi, scap failed on searchidx1001, in case that affects your math [20:31:26] PROBLEM - udp2log processes for locke on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [20:31:26] (although the actual failure itself doesn't matter too much) [20:31:28] mutante: around [20:31:28] I was after rough figures [20:31:30] ? [20:31:56] hexmode: he's off for walspurgistnacht and may day [20:32:03] ah [20:32:38] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [20:33:15] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2275 [20:34:35] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 16%, RTA = 0.86 ms [20:35:26] backup [20:36:01] oop [20:36:06] not meant to type in irc room [20:37:08] RECOVERY - udp2log processes for locke on locke is OK: OK: all filters present [20:38:02] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds) [20:38:20] PROBLEM - mysqld processes on storage3 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:40:51] How do we make bits/varnish flush its cached for some 404'd files? [20:41:39] Reedy: purge-varnish 'partial path' [20:41:46] That'll purge /.*$1.*/ [20:41:56] tim said the documented command didn't work [20:42:03] but idk which command that was [20:42:07] It does now [20:42:09] reedy@fenari:/home/wikipedia/common$ purge-varnish 'https://bits.wikimedia.org/static-1.20wmf2' [20:42:09] root@sq67's password: [20:42:14] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:42:14] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 5s [20:42:14] You may or may not need to .... yeah [20:42:16] be root [20:42:23] Mind running it for me? 
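purge-varnish is an in-house wrapper, so only its usage is visible here; as a rough sketch of what invalidating the stale static-1.20wmf2 objects looks like on a single Varnish 3 frontend (older 2.x installs call the command purge rather than ban, and the admin address and secret file below are illustrative):

    # Sketch: ban anything in this instance's cache whose URL mentions the new
    # static path; a wrapper like purge-varnish would loop this over the pool.
    varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret \
        'ban req.url ~ static-1.20wmf2'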
[20:42:24] Let me run that [20:42:33] Ok done [20:42:34] cached stuff before symlinks were pushed [20:42:41] RECOVERY - mysqld processes on storage3 is OK: PROCS OK: 1 process with command name mysqld [20:43:03] Doesn't look to have done it [20:43:53] !log removing ssl4 from pool, stopping puppet on ssl4, adding 3rd udp2log host for testing, restarting nginx [20:43:56] Logged the message, notpeter [20:44:41] notpeter: Ryan was just telling me & CT that we ignore nginx's log output anyway [20:44:55] because of missing sequence numbers (I think) [20:45:00] hurray [20:45:10] (didn't know about that…) [20:45:24] so it's good to have the third line there for consistency, but it won't do any good re: statistics [20:45:27] well.. then... that's... frustrating [20:45:43] unless we fix the seq number problem... [20:46:30] well, then I'm not going to push this out atm [20:46:49] and.. hey look! oxygen's ready to go! surprise! [20:47:11] like, a week ago! [20:47:59] weeeeeeee [20:49:24] oh yeah? [20:49:24] yay! [20:50:21] !log db1025 and storage3 get new kernels and reboot [20:50:24] Logged the message, Master [20:50:27] I still need to figure out why the packet loss monitoring isn't working [20:50:30] but yeah [20:50:32] go to town [20:50:47] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:51:14] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [20:51:34] cool, looks like you are busy with other things [20:51:48] but i am having trouble with an rsync module we just set up on dataset2 [20:51:55] it seems to be up, but I can't write [20:52:02] got a min to check it out? [20:53:56] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:59:20] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [21:02:56] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:35] RECOVERY - Host db1008 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [21:07:02] !log db1008 gets kernel update and reboot [21:07:05] Logged the message, Master [21:10:29] ottomata: I can take a look [21:10:57] ottomata: also, I will have packetloss monitoring up and running on oxygen after I ask ben a couple of questions [21:12:43] ok cool [21:13:44] ottomata: where is this rsync code? [21:14:33] New patchset: Pyoungmeister; "adding packet loss filter to oxygen udp2log filter file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6202 [21:14:46] the module is [21:14:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6202 [21:15:04] [pagecounts-ez] in rsync.conf.downloadprimary [21:15:04] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6202 [21:15:07] and from stat1 [21:15:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6202 [21:15:08] i am trying [21:15:17] rsync -rvv /tmp/a/test_file dataset2::pagecounts-ez/test_file [21:15:48] and getting [21:15:48] rsync: mkstemp "/.test_file.j1FGPw" (in pagecounts-ez) failed: Permission denied (13) [21:17:11] who are you running it as? 
[21:19:11] i've tried as me, as root, and as backup [21:19:11] same results [21:20:08] maybe try dataset2:/data/xmldatadumps/public/other/pagecounts-ez [21:20:26] er [21:21:04] maybe make a test dir in there [21:21:04] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:21:38] Can someone run a purge command for bits varnish for me as Roan isn't now online... purge-varnish 'static-1.20wmf2' [21:22:12] well, hm it is an rsync module, which is usually accessed using :: double colon syntax, trying... [21:22:26] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:23:08] yeah, notpeter that asks for my pw [21:23:56] it is trying to go over ssh [21:23:56] no key relation. gotcha [21:25:04] ottomata: the ownership on that dir on dataset2 are wackadooo [21:25:20] hmm [21:25:25] the rsync module is set to use backup:backup [21:25:49] drwxr-xr-x 5 523 root 73 2012-04-24 20:28 . [21:25:50] who owns the dir? what are perms [21:26:02] agh [21:26:02] someone told me to use backup [21:26:02] what are other files in there? [21:26:02] ezachte and/or some number uids? [21:26:20] there is no 523 user any more [21:26:23] aye that is erik [21:26:25] hmmm [21:26:57] can we chgrp to backup [21:26:57] and make it 775? [21:27:18] or mabye [21:27:18] even better would be chgrp to wikidev [21:27:18] and change the rsync module to use that gid [21:28:08] yeah, we can do backup:wikidev [21:28:12] most of it is 523:wikidev currently [21:28:31] shall I chown -R backup:wikidev * ? [21:31:42] yeah, do it [21:32:17] +chmod g+w [21:32:17] -R [21:32:27] done [21:37:35] hmm [21:37:35] same [21:37:35] rsync: mkstemp "/.test_file.3NwtCv" (in pagecounts-ez) failed: Permission denied (13) [21:40:01] weird [21:42:30] if you change the module to uid = root and restart rsyncd [21:42:46] does it work?
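A likely culprit for the mkstemp error persisting: the test file is being created at the module root, and chown -R backup:wikidev * changes the directory's contents but not the directory itself, which the earlier listing showed as 523:root and mode 755. The daemon writes as the module's configured uid/gid, not as whoever runs the client, so the module identity and the ownership of that top-level directory have to agree. A sketch of the fix plus the sort of [pagecounts-ez] stanza being discussed; only the module name and path appear in the log, the rest of the stanza is assumed:

    # Chown the directory itself, not just its contents, so the module's
    # uid/gid can create files at the module root:
    chown -R backup:wikidev /data/xmldatadumps/public/other/pagecounts-ez
    chmod g+w /data/xmldatadumps/public/other/pagecounts-ez

    # Hypothetical rsyncd.conf stanza on dataset2 (daemon modules are read-only
    # unless "read only = no" is set):
    #   [pagecounts-ez]
    #       path = /data/xmldatadumps/public/other/pagecounts-ez
    #       uid = backup
    #       gid = wikidev
    #       read only = no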