[00:21:31] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5350 [00:21:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5350 [01:05:57] PROBLEM - Puppet freshness on cp1041 is CRITICAL: Puppet has not run in the last 10 hours [01:09:06] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours [01:16:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [01:22:09] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [01:23:09] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/5595 [01:25:08] Change abandoned: MarkAHershberger; "should have used the test branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5595 [01:33:06] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [01:39:22] New patchset: MarkAHershberger; "lint warnings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4734 [01:39:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4734 [01:42:42] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 206 seconds [01:46:54] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [01:55:09] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [02:48:57] PROBLEM - Apache HTTP on srv222 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [02:49:15] PROBLEM - Apache HTTP on srv221 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [02:53:09] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Apr 24 02:52:59 UTC 2012 [02:56:09] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Apr 24 02:56:03 UTC 2012 [03:00:39] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Apr 24 03:00:35 UTC 2012 [03:02:45] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Apr 24 03:02:34 UTC 2012 [03:22:05] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [03:30:27] ganglia hasn't heard from ssl1 either for nearly 15 mins [03:35:08] PROBLEM - LVS HTTP on rendering.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [03:36:38] RECOVERY - LVS HTTP on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59750 bytes in 0.148 seconds [03:55:54] * jeremyb wonders who wants to take a look at ssl1? i assume it's not an emergency but it does look like some people are active ;) [03:56:38] jeremyb: eh? [03:56:42] ssl1? [03:56:49] oh? [03:57:01] Ryan_Lane: nagios backlog says ping critical [03:57:07] Ryan_Lane: no ping per nagios or ganglia [03:57:11] I put money on the same thing as last time ;) [03:57:19] too bad we won't be able to tell [03:57:19] can't be [03:57:28] what's last time? 
[03:57:33] no, I mean the issue with apt [03:57:38] hm [03:57:39] no [03:57:42] it's not [03:57:49] at some point it just died [03:58:38] going into the console [03:58:46] oh, I was trying to [03:58:49] eh [03:58:50] err [03:58:50] heh [03:59:03] trying as in, fcking 30% packet loss [03:59:04] nothing but garbage on the console [03:59:14] box crashed [03:59:19] what kind of garbage? [03:59:25] * jeremyb wonders if paravoid has had the tty in use forever, go racadm reset initiation ;P [03:59:31] question marks in diamonds [03:59:53] that's usually a mismatching baud rate [04:00:01] !log powercycling ssl1 [04:00:04] Logged the message, Master [04:00:06] it also happens when a system crashes [04:00:19] which is amazingly unuseful [04:00:34] do we have the same baud rate for all of bios serial redirection, grub, kernel, gettys? [04:00:38] yes [04:00:45] when I reboot it, it'll work fine [04:01:01] I'm not disputing that :-) [04:01:16] ah, you are thinking one of them isn't, and that one is spitting out crap [04:01:19] possible. [04:01:22] yes [04:01:31] I'd imagine the kernel, in this case [04:01:34] the kernel f.e. [04:01:37] right [04:01:38] since this was likely a fault [04:03:00] would have been helpful to see the fault :( [04:03:22] error: bad unit number. [04:03:23] error: unknown terminal 'serial' [04:03:28] on boot, before kernel [04:03:36] that's grub [04:03:40] well, beginning of kernel [04:03:41] I think [04:04:04] hm [04:04:05] that reminds me to check if lilo's still in debian ;P [04:04:13] I think the raid set shit itself [04:04:23] [ 26.590253] raid1: raid set md1 active with 2 out of 2 mirrors [04:04:24] [ 26.596092] md1: detected capacity change from 0 to 490106454016 [04:04:24] [ 26.603171] md1: unknown partition table [04:05:00] ** WARNING: There appears to be one or more degraded RAID device[ 98.362615] md: md0 stopped. [04:05:00] yes?
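(Editor's note: the four layers that must agree on the baud rate — BIOS serial redirection, grub, kernel, getty — can all be pinned in config. A sketch with assumed values (115200 on ttyS1), not the actual production settings:)

```shell
# /etc/default/grub fragment: GRUB's own serial terminal (assumed 115200 8N1 on unit 1)
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200 --word=8 --parity=no --stop=1"
# ...and the kernel console on the same port and speed
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"

# getty on the same port (upstart job fragment; path and job name assumed):
# exec /sbin/getty -L 115200 ttyS1 vt102
```

The BIOS/DRAC serial redirection speed is set in firmware and has to match as well; a mismatch in any one layer produces the "question marks in diamonds" garbage described above, but only for that boot phase.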
[04:05:11] yay [04:05:16] yeah [04:05:19] dead raid set [04:05:41] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [04:05:45] well, kind of [04:05:47] system still bots [04:05:48] why would a degraded md result in a box crash? :-) [04:05:49] boots [04:06:05] that error is definitely more than just a degraded raid [04:07:27] yet both raidsets appear fine [04:08:05] no disk errors [04:08:07] * Ryan_Lane sighs [04:08:43] yeah [04:08:44] * jeremyb runs away. good night && luck! [04:08:51] definitely would have been good to see that fault [04:09:04] jeremyb: later [04:13:54] Ryan_Lane: I'm thinking of forcing md to check [04:14:04] go for it [04:17:47] md0 finished nicely,  [04:23:59] PROBLEM - LVS HTTP on rendering.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:24:25] nagios-wm: you're killing me smalls [04:24:43] * Ryan_Lane mumbles [04:24:47] gah [04:25:01] I'm really tired of working all fucking day [04:25:51] nothing out of the ordinary in ganglia. though it's a 500 error, which is strange [04:26:02] it must be one of the image scalers [04:26:05] I wonder which one [04:26:41] hi ryan_lane - what's the problem with LVS? [04:26:48] no clue.
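(Editor's note: the "forcing md to check" step run on ssl1 above goes through sysfs. A hedged sketch — needs root on a real host; "md0" is the array name from the log:)

```shell
#!/bin/sh
# Trigger an md consistency check via sysfs and show progress.
check_md() {
    md=$1
    action="/sys/block/$md/md/sync_action"
    if [ ! -w "$action" ]; then
        echo "no md array $md here (or not root)"
        return 0
    fi
    echo check > "$action"          # read all mirrors, repair nothing
    grep -A 2 "^$md" /proc/mdstat   # [=>...] progress shows up here
    cat "/sys/block/$md/md/mismatch_cnt"
}

check_md md0
```

When the check finishes, a nonzero `mismatch_cnt` means the mirrors disagree somewhere; `echo repair > sync_action` is the corresponding fix-up action.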
just saw it [04:26:50] RECOVERY - LVS HTTP on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59750 bytes in 0.148 seconds [04:27:38] lvs3 and lvs4 have that service [04:27:54] (walking through my thought process for paravoid's benefit) [04:28:28] looking at ipvsadm -l, I see that lvs4 is active [04:29:25] grr [04:29:37] srv222 is logging Apr 24 04:28:32 srv222 apache2[30998]: PHP Catchable fatal error: Argument 1 passed to ExtMobileFrontend::__construct() must implement interface IContextSource, none given, called in /usr/local/apache/common-local/php-1.20wmf1/extensions/MobileFrontend/MobileFrontend.php on line 158 and defined in /usr/local/apache/common-local/php-1.20wmf1/extensions/MobileFrontend/MobileFrontend.body.php on line 52 [04:29:45] there it is :) [04:29:49] it's in the rendering vip, it shouldn't be serving those [04:30:04] no? [04:30:24] that wouldn't cause this issue, though [04:30:34] no, just something wrong that i'm noticing [04:30:37] heh [04:30:40] well, two things wrong :) [04:30:40] right [04:30:43] indeed [04:30:50] it was in ./pmtpa/old/lvs3/pybal/apaches~ [04:30:54] I'm tracking which one is failing from the lvs level [04:31:17] so that's also a pybal fail [04:31:27] yes, it is [04:31:36] what's the monitor check actually do? [04:32:12] o.O [04:32:23] it just checks to see if mediawiki responds? [04:32:33] I wonder if one of them didn't get a scap properly [04:32:55] php error log isn't showing anything worthwhile?
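(Editor's note: the `ipvsadm -l` step above lists each virtual service and which realservers pybal currently has pooled; numeric output is usually easier to work with. Command fragment for the active LVS host — the VIP:port shown is a placeholder:)

```shell
# list virtual services and pooled realservers, numerically
# (-n avoids slow reverse DNS on every realserver)
ipvsadm -L -n

# narrow to a single service
ipvsadm -L -n -t 10.2.1.21:80
```

The `Weight` column shows which backends pybal has depooled (weight 0 or absent), which is how "tracking which one is failing from the lvs level" works.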
[04:33:11] I wonder if the monitoring check is somehow hitting the mobile code [04:34:24] it flapped for a minute an hour ago too [04:34:29] yeah [04:34:34] I was eating at the time [04:34:48] I came on to check about that and dealt with ssl1 instead [04:34:52] hm [04:35:12] trying to trigger it on a specific host [04:35:39] * paravoid checks what the nagios check does [04:35:57] 221 [04:36:07] srv221 just threw one on me [04:36:29] 05:48 <+nagios-wm> PROBLEM - Apache HTTP on srv222 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:36:32] 05:49 <+nagios-wm> PROBLEM - Apache HTTP on srv221 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:36:37] that's ~2h ago [04:37:10] it's throwing 7 500's a sec all on mobilefrontend [04:37:46] oh, in __construct.. it's not getting mobile requests, mw just has to load that anyways [04:38:01] ah [04:38:05] bad deploy? [04:38:08] i'm running sync-common on it [04:38:13] * Ryan_Lane nods [04:38:32] oh, I thought to say that but thought that you knew better :-) [04:38:42] what's sync-bcommon? [04:39:11] srv222 is in the same state [04:39:52] and what exactly is broken? [04:39:58] paravoid: sync-common-all, aka scap [04:40:00] fatal.log is quiet now [04:40:12] it's our deployment system [04:40:29] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.086 second response time [04:40:29] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time [04:40:31] it does a dsh that tells the systems to pull the mediawiki code via rsync from fenari [04:41:23] so you patched it in fenari and then run scap? [04:41:24] obviously pybal isn't doing an appropriate check here [04:41:28] no [04:41:35] or was the version in 221/222 outdated? 
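(Editor's note: one way to catch a host that "didn't get a scap properly" is to compare deployed versions against the intended one. A hedged sketch — the harvesting loop, hostnames and paths are illustrative, and it assumes the deployed tree exposes a commit id, which the rsync-based deploys of this era may not have:)

```shell
# Flag hosts whose deployed commit differs from the intended one.
# Reads "host commit" pairs on stdin; expected commit is $1.
check_deploy_versions() {
    want=$1
    while read -r host have; do
        [ "$have" = "$want" ] || echo "STALE $host $have"
    done
}

# Harvesting the pairs might look like (hosts/paths are assumptions):
#   for h in srv221 srv222; do
#     echo "$h $(ssh "$h" git -C /usr/local/apache/common-local rev-parse HEAD)"
#   done | check_deploy_versions "$(git -C /home/wikipedia/common rev-parse HEAD)"
```

For example, `printf 'srv221 abc\nsrv222 def\n' | check_deploy_versions abc` prints `STALE srv222 def`.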
[04:41:36] some systems didn't get the scap [04:41:48] this is why I hate our deployment system [04:41:58] this should *never* happen with a proper deployment system [04:42:05] heh [04:42:07] what he said :) [04:42:10] have you seen puppi? [04:42:27] a deployment system is different from what puppet does, though ;) [04:42:45] we should say "I want to deploy this version", and the clients should get that version, no matter what [04:42:51] or they should get pulled from rotation [04:45:20] I put all this work in, and to think, wikipedia is just a cheese byproduct: http://www.smbc-comics.com/index.php?db=comics&id=2590 [04:52:41] we should monitor versions on deployed servers too [04:52:53] yeah, we should [04:52:54] but... [04:53:06] we deploy individual files [04:53:08] maybe we could serve a file with a git changeid or something and write a monitor around it [04:53:15] heh [04:53:18] sigh [04:53:33] we should always deploy with commits and never individual files [04:53:45] then we can just monitor the current commit [04:53:52] for core, and for each extension [05:38:10] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [07:31:28] New patchset: Hashar; "testswarm: log slow queries (bug 35028)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4400 [07:31:46] New patchset: Hashar; "testwarm: set innodb buffer pool size to 256M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4395 [07:32:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4400 [07:32:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4395 [07:32:43] New review: Hashar; "Patchset 2 removes the Facebook-only MySQL configuration line." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4400 [07:33:55] New review: Hashar; "Patchset 4 is a rebase triggered by git-review :/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4395 [07:56:38] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:05:51] !log temporarily disabled automatic zfs replication from ms7 -> ms8, cleared out space on ms8, catching up by hand [08:05:54] Logged the message, Master [08:21:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:56:41] !log updated blog theme per guillaume (April commits) [08:56:44] Logged the message, Master [09:24:11] New review: Dzahn; "Originally BZ 1450 but it does not block deployment of ShortURL anymore and now a new bug has been c..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5433 [09:25:38] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [09:27:15] New review: Dzahn; "thanks Lcarr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3513 [10:03:31] New review: Dzahn; "yes, this has been renamed to otto and stat1 did not remove the user yet. re: 5350" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5353 [10:03:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5353 [10:17:41] New patchset: Dzahn; "remove aotto from stat1 now that there is otto" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:17:51] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5707 [10:20:35] New review: Dzahn; "otto account exists on stat1" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5707 [10:20:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:28:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5436 [10:28:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5436 [10:37:09] New review: Dzahn; "Andrew, it's ok now, you have your new user and the old one is gone. It just didn't work in that ord..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:45:12] New patchset: Dzahn; "add FIXME to mounts on stat1 and there is no group "ezachte" to own /a (and shouldn't)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5709 [10:45:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5709 [10:49:49] New patchset: Mark Bergsma; "Spaces are forbidden" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5710 [10:50:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5710 [10:50:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5710 [10:50:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5710 [11:02:13] New patchset: Dzahn; "remove ezachte as file owner, files should not be owned by real people users. he is in wikidev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5712 [11:02:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5712 [11:09:14] New review: Dzahn; "comments on fixing the NFS issue?" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5709 [11:09:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5709 [11:09:25] New review: Dzahn; "write access per inline comments. give it to wikidev group" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5712 [11:09:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5712 [11:09:28] New review: Dzahn; "re: inline comment: done" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [11:33:57] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [11:44:52] !log Sending all non-european upload traffic back to pmtpa to prepare for eqiad varnish storage rework [11:44:54] Logged the message, Master [11:54:30] !log after much cursing and kicking zfs, a manual snapshot replication is running in screen as root on ms7 to ms8, expect it to take at least a day [11:54:32] Logged the message, Master [11:55:16] hold on [11:55:25] swift originals are around the corner ;) [11:55:51] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [11:59:06] so one reason the replication takes much longer than it used to is your rsyncs, I'm pretty sure [11:59:43] New patchset: Mark Bergsma; "The Varnish persistent storage backend doesn't handle eviction when out of space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5715 [11:59:46] anyways I hope I can get this patched up and going again reliably enough to last us til we're on swift a couple months without problems [12:00:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5715 [12:00:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5715 [12:00:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5715 [12:00:40] I can forget about it for the rest of the day in any case [12:01:16] lunch [12:02:28] there's an inotify replacement for linux [12:02:40] which would allow for a hacky replication with rsync [12:04:02] I know that inotify-ish thing has been around. I should investigate that in general [12:04:15] actually I'll add that to my todo list. [12:04:29] the new kernel support allows for a notified program to block the operation [12:04:34] so you could use that to sync the file to another host [12:04:42] I know there's some code out there already using it for something like parallel rsync [12:04:47] vaguely like [12:05:25] * apergos still pines for btrfs  [12:06:51] awesome, the btrfs wiki is utterly broken. recent changes leads to a nonexistent page, the history tab for the main page shows the main page again... [12:07:13] btrfs is doing well on ms6 ;) [12:08:45] " If you've used ZFS then Btrfs feels like a clunky copy, because administering ZFS is faster and easier." 
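(Editor's note: a minimal version of the "hacky replication with rsync" idea above, using inotifywait from inotify-tools — an assumption, not what was deployed. Paths and the peer name are placeholders. Note that plain inotify, unlike the fanotify permission API being described, cannot block the write, so this is eventually-consistent at best:)

```shell
# Watch a tree and push files to the peer as they finish being written.
inotifywait -m -r -e close_write --format '%w%f' /export/upload |
while read -r f; do
    rsync -a --relative "$f" ms8:/export/upload/
done
```

The "few million files" concern above is real for this approach: a recursive watch must install one inotify watch per directory at startup, which is slow and bounded by `fs.inotify.max_user_watches`.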
[12:09:14] omg I will throw myself off a cliff now if that's true (well if it's true when it becomes the default fs on my distro) [12:10:18] apergos: re, broken wiki: https://btrfs.wiki.kernel.org/articles/d/o/c/Category~Documentation_c24a.html [12:10:35] yes but I don't know which are recently updated that way [12:10:43] the wiki has more or less been stagnant for a while [12:11:15] wonders why they are doing the static .html links in mediawiki [12:12:05] probably something to do with the state of wiki.kernel.org which was in pretty bad shape when they were trying to put everything back together [12:32:03] I wonder how well this holds up when we watch directories with a few million files in them [12:33:12] I don't see why it would make any difference [12:42:42] depends how the monitoring works [12:43:27] yes, of course [12:43:34] if they implemented it in a completely braindead way, it would matter [12:43:41] but they probably didn't [12:45:32] I guess that dnotify was implemented in a brain dead way [13:00:32] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [13:00:50] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 203 seconds [13:51:48] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [13:52:24] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [13:53:42] New patchset: Dzahn; "add misc::statistics::geoip to statistics role class. RT-2164" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5716 [13:53:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5716 [13:56:29] New review: Dzahn; "misc::statistics::geoip and generic::geoip should not conflict currently, but maybe they should be m..."
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5716 [13:56:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5716 [14:04:28] New review: Dzahn; "done now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5625 [14:05:59] all this irc client bouncing is making it hard to recall what channels Im supposed to idle in =P [14:06:12] ^demon|away: giving limechat a go today [14:12:11] New patchset: Ottomata; "statistics.pp - adding r-base, xorg, and wdm packages to misc::statistics::plotting class as per rt2163." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5717 [14:12:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5717 [14:14:16] New review: Ottomata; "See http://rt.wikimedia.org/Ticket/Display.html?id=2163 For more info. This builds upon change http..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5717 [14:23:22] New review: Dzahn; "does it really need wdm and X? afaik this was just mentioned in the ticket for compiling itcan we ge..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/5717 [15:00:21] New patchset: Pyoungmeister; "adding a time period that I'd like to sleep, 2300-0700 EST (I'm an early riser)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5720 [15:00:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5720 [15:09:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5720 [15:09:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5720 [15:26:21] New patchset: Pyoungmeister; "adding in some monitoring for varnishncsa procs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5721 [15:26:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5721 [15:31:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5721 [15:31:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5721 [15:37:43] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: Connection refused by host [15:39:31] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [15:40:52] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [15:50:10] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [15:51:21] notpeter: ^ do you have anything to do with that? [15:51:40] PROBLEM - Varnish HTCP daemon on cp1033 is CRITICAL: Connection refused by host [15:52:19] yes [15:52:32] I will look at it [15:53:05] mark: it's probably puppet running on spence before the hosts that nrpe is to run on [15:53:15] ok [15:53:33] I just added monitoring, not a functionality change :) [15:54:04] PROBLEM - Varnish HTCP daemon on cp1036 is CRITICAL: Connection refused by host [15:54:58] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [15:56:34] exit [15:56:42] er, wrong window [15:56:55] PROBLEM - Varnish HTCP daemon on cp1032 is CRITICAL: Connection refused by host [15:57:58] weird. 
I added a new nrpe check and the nrpe daemon stopped, instead of restarting, but once puppet runs, it starts the daemon back up [15:59:20] RECOVERY - Varnish HTCP daemon on cp1036 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:02:02] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [16:03:05] PROBLEM - Varnish HTCP daemon on cp1029 is CRITICAL: Connection refused by host [16:05:47] PROBLEM - Varnish HTCP daemon on cp1035 is CRITICAL: Connection refused by host [16:06:23] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [16:06:32] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:17:38] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:19:17] RECOVERY - Varnish HTCP daemon on cp1033 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:20:18] New patchset: Pyoungmeister; "de-borking puppet on nfs1/2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5723 [16:20:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5723 [16:23:11] RECOVERY - Varnish HTCP daemon on cp1032 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:23:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5723 [16:23:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5723 [16:25:53] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Tue Apr 24 16:25:47 UTC 2012 [16:26:56] RECOVERY - Puppet freshness on nfs2 is OK: puppet ran at Tue Apr 24 16:26:35 UTC 2012 [16:27:32] New patchset: Pyoungmeister; "arg. 
cp paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5724 [16:27:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5724 [16:28:06] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5724 [16:28:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5724 [16:31:08] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:35:11] RECOVERY - Varnish HTCP daemon on cp1035 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:35:32] hi mark, you around? I have some varnish + udp2log questions [16:35:42] yes [16:36:00] so i'm trying to set up varnish and udp2log on a local VM so I can test changes [16:36:22] (i need to do squid and nginx too, but varnish seemed to be the most completely configured by puppet, so I'm starting with that) [16:36:24] i'm pretty close [16:36:42] i've got varnish, varnishncsa, and udp2log all running [16:36:49] i *think* with proper configs for testing [16:37:00] but I can't seem to get any logs out of varnish/varnishnsca [16:37:15] varnishncsa without arguments doesn't log/print any requests? [16:37:58] ah, yes it does! 
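(Editor's note: the varnishhtcpd/varnishncsa checks that were flapping above look like NRPE check_procs definitions; reconstructed from the OK output — e.g. "PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker" and "3 processes with command name varnishncsa" — with the file path being an assumption:)

```shell
# nrpe.cfg fragment (path assumed)
command[check_varnishhtcpd]=/usr/lib/nagios/plugins/check_procs -c 1:1 -u 998 -a 'varnishhtcpd worker'
command[check_varnishncsa]=/usr/lib/nagios/plugins/check_procs -c 3:3 -C varnishncsa
```

The "Connection refused by host" spam above is the usual ordering hazard: the nrpe daemon must be restarted (not just reloaded) after new commands land, and spence's check definitions went live before puppet had run on the target hosts.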
[16:38:02] ok that is good [16:38:29] so just use -w : [16:38:35] and, varnishncsa has a socket open to port 8420 (udplog) [16:38:36] http://pastebin.com/m6is9q2U [16:38:37] then check with tcpdump if it's actually sending that [16:38:47] i don't see any traffic on port 8420 with tcpdump [16:38:54] # tcpdump port 8420 [16:39:18] keep in mind that varnishncsa batches up requests which fit in a 1450 byte packet [16:39:24] ohhm [16:39:25] so you may need 3-4 requests or so before it reaches that [16:40:30] hmm, don't think so, i just hit it in a big loop [16:40:33] nothing yet [16:40:45] strange [16:40:47] strace it perhaps? [16:40:53] which? [16:40:55] varnishncsa? [16:41:00] yes [16:41:55] write(5, "wmvm.localdomain 468 2012-04-24T"..., 1293) = 1293 [16:42:16] what is wmvm.localdomain? [16:42:21] my local vm name [16:42:23] is that what you specified in -w ? [16:42:29] no i have 127.0.0.1 [16:42:29] use ip addresses instead [16:42:32] ah [16:42:41] are you tcpdump'ing -i lo then? [16:42:57] oh no [16:43:06] ah! [16:43:15] i see it now :) [16:43:16] so that is good [16:43:18] good [16:43:33] needs to go to udp2log still though [16:43:52] udp2log is probably not listening on port 8420 [16:43:58] since you're already occupying that with varnishncsa [16:44:05] ahhhhh noono [16:44:06] it is working [16:44:19] yeah it just takes lots of requests [16:44:27] i thought i was looking at the output of varnishncsa [16:44:32] but tailing my udp2log file is working now too [16:44:35] cooooooooool [16:44:47] hurray! [16:44:56] awesome awesome [16:45:11] ok, now to figure out how to mess with log format :) [16:45:15] thanks mark!
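(Editor's note: two gotchas from the debugging above, as commands — tcpdump's default interface misses traffic addressed to 127.0.0.1, and varnishncsa only flushes once enough log lines have accumulated to fill a ~1450-byte UDP packet. The URL and port are those of the local test VM:)

```shell
# watch the UDP log stream on loopback (note -i lo)
tcpdump -n -i lo -A udp port 8420

# in another terminal, enough requests to fill at least one packet
for i in $(seq 1 20); do curl -s -o /dev/null http://127.0.0.1/; done
```

The strace line above (`write(5, "wmvm.localdomain ...", 1293)`) shows the batching directly: one write carrying several log lines, just under the packet-size threshold.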
i think i get it now too [16:45:16] yeehaw [16:45:30] enjoy ;) [17:33:50] btw, http://www.openssl.org/news/secadv_20120419.txt [17:33:53] lcarr: ^^^ [17:34:19] ah, there we go [17:34:57] Ryan_Lane: So apparently videos.wm.o is still broken in Europe and my volunteer is getting annoyed [17:35:03] yeah [17:35:05] seeing that [17:35:08] working on it [17:35:12] OK thanks a lot [17:43:09] RobH: any reason why lists.wikimedia.org is hanging while I try to discard held messages? [17:43:13] I've been seeing this for two days now [17:43:18] so it's not going away on its own :( [17:44:21] robh probably won't look at that problem while he's building out row C in the data center [17:44:36] but if you create an RT ticket perhaps someone else will look at it [17:48:28] !log deploying new squid conf to cp1001 frontend. is just a udp2log port change. [17:48:31] Logged the message, notpeter [17:51:56] mark: happy to do that [17:53:54] rt ticket filed [17:54:37] New patchset: Demon; "Rewrite gerrit hooks to all subclass HookHelper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5727 [17:54:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5727 [17:58:17] New patchset: Demon; "Rewrite gerrit hooks to all subclass HookHelper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5727 [17:58:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5727 [17:59:15] Change abandoned: Demon; "Went ahead and squashed this into Ic4d67fe8908db405f10ec840859a33eac5e98eff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5629 [18:16:21] should we worry about those humidity alarms? [18:18:43] !log deploying to frontend [18:18:45] Logged the message, Mistress of the network gear.
[18:18:51] paravoid: ask cmjohnson1 they are often false [18:26:59] tfinc [18:27:03] chrome is the issue [18:27:11] woosters: let me try in FF [18:27:12] does not play well with mailman [18:28:05] paravoid: you guys dont need to worry about them [18:28:10] i do, chris does, mark does [18:28:24] i have them as a followup to handle with chris next week, as he is not on site this week [18:28:44] you can also check out observium and see if the single sensors are high/low and how the rest look [18:29:11] okay [18:29:21] bleh, my observium pass is fubar [18:29:29] LeslieCarr: you have observium access right? [18:29:40] can you reset my password to the mgmt pass for me pls? [18:31:09] RobH: ok [18:32:51] time to go to lunch, back shortly [18:48:33] New patchset: Pyoungmeister; "giving otto root on locke, emery, oxygen, and stat1 per rt 2856. ct gave approval." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5731 [18:48:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5731 [18:50:34] New patchset: Lcarr; "allowing halfak access to emery, RT2707" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5732 [18:50:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5732 [18:51:14] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5732 [18:51:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5732 [18:52:35] New patchset: Ottomata; "admins.pp - removing aotto user. This has been replaced by 'otto' and is no longer needed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5733 [18:52:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5733 [19:16:22] PROBLEM - Host db58 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:31] New review: Pyoungmeister; "approved by ct" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5731 [19:16:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5731 [19:16:57] New review: Pyoungmeister; "rt ticket mentioned by roan has all signoffs" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5215 [19:17:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5215 [19:17:41] ottomata: if you want a review, you need to use the gerrit request review feature [19:17:51] someone told me you just asked for a review via the gerrit emails [19:17:55] I filter those and don't read themn [19:18:10] RECOVERY - Host db58 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [19:18:17] whats the key combo to get into a perc raid controller via the ipmi console at boot? [19:18:55] ottomata: so, now it doesn't show up in my gerrit queue, and it'll look like I'm just ignoring you [19:19:44] wait... [19:19:55] hahaha [19:19:57] you did [19:20:00] but someone else merged it [19:20:15] set me on /ignore [19:20:29] RobH: whats the key combo to get into a perc raid controller via the ipmi console at boot? 
[19:21:07] usually ctrl-c
[19:21:14] or maybe ctrl-r
[19:21:33] ctril-rage
[19:26:43] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:28:30] !log depooling ssl3001
[19:28:33] Logged the message, Master
[19:29:19] binasher, Ryan_Lane: there's http://hwraid.le-vert.net/wiki/DebianPackages
[19:29:32] and there's also http://www.inquisitor.ru/doc/einarc/ but I haven't tried it
[19:31:12] New patchset: Ryan Lane; "Add support for sending to an https backend by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5736
[19:31:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5736
[19:31:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5736
[19:31:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5736
[19:32:14] (ma rk probably won't like that change :D )
[19:33:35] New patchset: Pyoungmeister; "adding db57 to s2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5738
[19:33:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5738
[19:34:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5738
[19:34:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5738
[19:36:37] hey mark, got another question about varnish logs if you are still around
[19:36:49] pretty sure he isn't
[19:36:56] hmmmmmk
[19:37:10] preilly wrote the logging support, though
[19:37:17] so, maybe you can ask him
[19:37:27] you around preilly?
[19:39:01] ottomata: I am around but in the middle of something
[19:39:06] ottomata: can I get back to you?
[19:39:44] yup, i will PM my question to you and you can answer it at your leisure :)
[19:39:58] Apr 24 19:36:02 brewster dhcpd: DHCPDISCOVER from 78:2b:cb:66:aa:3e via 10.0.0.202: network 10.0/16: no free leases
[19:42:21] thanks preilly for the help
[19:42:36] according to preilly, the default varnish log format is modified in our custom varnish package
[19:42:40] Q for ops
[19:42:48] would it be better to modify the log format in the package
[19:42:57] or to change the init scripts in puppet to pass a -F option to change the format?
[19:43:09] also, where can I find the source code for our varnish package?
[19:43:13] do the latter
[19:43:19] k, I agree
[19:43:25] i need to find the source to see what the current format is
[19:49:53] found it: http://apt.wikimedia.org/wikimedia/pool/main/v/varnish/
[19:50:10] !log repooling ssl3001
[19:50:14] Logged the message, Master
[19:51:18] !log starting innobackupex from db1034 to db57 for new s2 slave
[19:51:21] Logged the message, notpeter
[19:52:49] ok, something isn't quite right though
[19:53:02] I've got an example of a log line from one of the logging servers
[19:53:16] but it does not match the format that varnish is set to use
[19:53:20] so
[19:53:26] that either means that I am looking in the wrong place
[19:53:29] for the format
[19:53:40] or, the formats are different from different sources
[19:53:51] and the log line that I have is not from varnish (which it isn't)
[19:53:59] welllll, i tmight be
[19:54:26] if I the request I am looking at is a GET for upload.wikimedia.org
[19:54:31] that's probably varnish, right?
[19:54:44] logged from sq83.wikimedia.org
[19:54:46] ?
[19:56:55] ottomata: looking at the same package version # in both places?
[19:57:14] ottomata: that's from squid
[19:57:55] yes, but i think i am onto something
[19:58:02] i think maybe udp2log messes with log lines
[19:59:37] https://gerrit.wikimedia.org/r/#q,project:operations/debs/varnish,n,z
[19:59:47] ottomata
[20:01:19] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.39:11000 (timeout)
[20:02:40] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[20:03:34] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100%
[20:06:53] Ryan_Lane was saying earlier that you have to explicitly request review if you want it. can you request from the wind? or you have to choose an arbitrary person?
[20:08:32] or i guess s/arbitrary //. but if you can't think of someone that would be relevant then it's arbitrary
[20:09:54] domas: are you around?
[20:10:02] a bit
[20:10:04] boarding soon
[20:10:19] whatsup
[20:10:23] !log reimaged db58 with fixed raid setup, imaging db59
[20:10:26] Logged the message, Master
[20:10:41] okay, very quick: about that cache setting in webstatscollector, are we actually using that code?
[20:10:47] yes
[20:10:56] it reduced cpu usage quite a bit
[20:11:08] because it seems that the binary of collector is older than tthe source code version
[20:11:12] hm
[20:11:13] odd
[20:11:25] sorry, boarding call
[20:11:31] :D
[20:11:35] good flight!
[20:11:43] * jeremyb hands domas some chewing gum
[20:11:47] oh, too late
[20:13:46] !log re-enabled replication via cron on ms7, it should catch up within an hour or so
[20:13:48] Logged the message, Master
[20:14:49] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:18:16] PROBLEM - MySQL disk space on db59 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
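An aside on the log-format mismatch discussed above: a quick way to tell whether a sampled line matches a given layout is to try parsing it. This is only a sketch; it assumes an NCSA-combined-style layout (stock upstream varnishncsa, which may well differ from the custom Wikimedia varnish package, the squid format, and whatever udp2log emits), and the sample line is synthetic, for illustration only.

```python
import re

# NCSA-style access-log pattern (an assumption; the actual formats in
# play here -- custom varnish package, squid, udp2log -- may differ).
NCSA_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def matches_format(line: str) -> bool:
    """Return True if the line parses under the assumed NCSA layout."""
    return NCSA_PATTERN.match(line) is not None

# Synthetic sample line, not taken from any of the servers mentioned.
sample = ('10.0.0.1 - - [24/Apr/2012:19:53:02 +0000] '
          '"GET /wikipedia/commons/x.png HTTP/1.1" 200 59750')
```

Running lines that fail to parse through a check like this is one way to separate "wrong format documentation" from "relay (e.g. udp2log) rewriting the lines".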
[20:18:16] PROBLEM - SSH on db59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:19:46] RECOVERY - SSH on db59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[20:20:38] New patchset: Ottomata; "site.pp - adding otto user on bayes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5768
[20:20:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5768
[20:27:08] New patchset: Pyoungmeister; "adding db6[0-9] to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5769
[20:27:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5769
[20:31:42] RobH: you around?
[20:32:01] what's the magic to get mac address from drac?
[20:34:52] PROBLEM - mysqld processes on db57 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[20:40:46] New patchset: Pyoungmeister; "adding some stuffs for db59,60" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5770
[20:41:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5770
[20:42:41] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5769
[20:42:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5769
[20:43:29] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5770
[20:43:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5770
[20:44:18] New patchset: Ottomata; "Amending. Don't need xorg or wdm. Do need libcairo (Diederik checked this on bayes)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5717
[20:44:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5717
[21:03:04] apergos: what was the reason to use bzip2 for the xml full dumps? just better compression vs gzip?
[21:11:40] notpeter: racadm getsysinfo
[21:12:01] cool, thanks!
[21:19:43] PROBLEM - NTP on db58 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:20:01] PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host
[21:29:37] Does anyone know at what URL length Squid will return ERR_TOO_BIG?
[21:29:49] Someone in -dev triggered it with a long API request and he's wondering what the limit is
[21:32:28] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 185 seconds
[21:32:37] PROBLEM - MySQL Replication Heartbeat on db16 is CRITICAL: CRIT replication delay 187 seconds
[21:45:50] RoanKattouw: I think it's 8k but might only be 4k
[22:14:56] RECOVERY - MySQL Replication Heartbeat on db16 is OK: OK replication delay seconds
[22:15:02] aude: I'm kind of floored by the fact my wikimedia labs talk was rejected for wikimania
[22:15:14] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay seconds
[22:16:23] Ryan_Lane: really?
[22:16:28] ues
[22:16:29] *yes
[22:17:02] did you get an email or does the submission page say so?
[22:17:06] email
[22:17:52] huh, any change on the submission page?
[22:17:58] cuz im wondering about the op panel
[22:18:26] I think the labs presentation is more important than reviewing our infrastructure for another year.
[22:18:42] since you can touch on infrastructure and who to talk to about it during the labs presentation.
[22:18:49] meh. fuck it
[22:18:50] !log rebooting db16 with updated kernel. it's probably still hopeless (dimm errors)
[22:18:52] Logged the message, Master
[22:18:53] I just won't go to wikimania
[22:19:19] arent you already approved regardless of presentation?
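An aside on the ERR_TOO_BIG question above: Squid's URL length cap is a compile-time constant (MAX_URL), and builds commonly use 4096 or 8192 bytes, consistent with the "8k but might only be 4k" answer. A minimal sketch for constructing a padded API-style URL to probe such a limit; the limit values, the endpoint, and the `titles` parameter are illustrative assumptions, not a reconstruction of the actual request that triggered the error.

```python
# Candidate caps to probe (assumptions: Squid's MAX_URL is fixed at
# compile time and varies between builds).
CANDIDATE_LIMITS = (4096, 8192)

def build_long_url(base: str, target_len: int) -> str:
    """Pad a URL with a dummy query parameter until it reaches target_len."""
    padding = target_len - len(base) - len("?titles=")
    return base + "?titles=" + "A" * max(padding, 0)

# One byte past the larger candidate cap, so a request with this URL
# should trip whichever limit the proxy was compiled with.
url = build_long_url("http://en.wikipedia.org/w/api.php", 8193)
```

Issuing requests at lengths just under and over each candidate value (bisecting, in effect) would pin down the compiled-in limit without reading the build's source.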
[22:19:37] I really don't care to go now, honestly
[22:19:45] I'm not entirely sure how that works, IIRC that was somewhat vague
[22:20:01] I'm probably on a panek
[22:20:02] *panel
[22:20:07] (i.e. whether an approved person whose talk didn't get accepted still gets to go on WMF's dime)
[22:20:31] we are both on the ask the operators panel
[22:20:38] PROBLEM - Host db16 is DOWN: PING CRITICAL - Packet loss = 100%
[22:20:41] which is a panel that is co-opting my normal presentation
[22:20:50] but if the community doesn't care about a project that is meant to be run by the community, I don't care to go to the community conference
[22:20:51] which im fine with since im going either way ;]
[22:21:09] I think the community cares, its just the organizers are mistaken.
[22:21:13] * Ryan_Lane shrugs
[22:21:26] I would email and question the decision, cuz I think it is a bad one.
[22:21:29] Maybe they figure Berlin is already focusing on it and they want to focus on other stuff
[22:21:48] there are tons of folks who cannot attend both
[22:21:57] and would attend wikimania over a smaller hackathon.
[22:22:08] I know, I'm not saying it's necessarily valid
[22:22:18] But one could see how they might think that way
[22:22:23] Labs is changing the entire process on how volunteers can influence the entire infrastructure of the projects
[22:22:26] RECOVERY - Host db16 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[22:22:33] I think its more important than the ask the ops panel.
[22:22:40] of course, they may have denied the ops panel too
[22:22:44] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:48] but only emailed leslie who originally created the page.
[22:22:52] well, I just emailed saying I should likely be removed from the list.
[22:23:01] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4652
[22:23:21] we have other people on the waiting list who haven't been to wikimania
[22:23:29] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms
[22:23:46] if I don't have a talk, it's more fair for them to go
[22:24:11] (I put my name on the panel because I was also going to be there, I don't think I'm really needed on it, we have other ops there)
[22:26:11] PROBLEM - MySQL Replication Heartbeat on db16 is CRITICAL: CRIT replication delay 1323 seconds
[22:26:14] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4652
[22:26:29] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 1333 seconds
[22:26:47] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[22:26:48] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/5680
[22:27:03] I guess I'll see if my talk will make it off the wait list before I say I'm not going
[22:27:23] Did you already e-mail them saying never mind?
[22:27:32] I should do that for mine, because I'm now giving a longer version of it in Berlin
[22:29:50] no
[22:29:57] good
[22:30:03] I just did for mine
[22:31:13] Ryan_Lane: if you dont come, no founding farmers
[22:31:21] that is the only nice thing i have to say about DC
[22:31:36] oh and killer eithiopean food
[22:31:59] but SF overall has better food choices, so meh.
[22:32:18] * RobH wants a pork belly sandwich
[22:32:31] that sounds so good right now.
[22:32:52] heh
[22:35:41] RobH: Enjoy the Ethiopian food while you can, Annie tells me it's not as good anywhere else
[22:37:03] its the only good thing about DC
[22:37:12] there are about 100 awesome ethiopian places
[22:46:18] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[22:51:28] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4652
[22:51:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4652
[22:51:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5680
[22:51:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5680
[23:07:30] robla, Ryan_Lane RoanKattouw Leslie's "Ask the Operators" submission has been accepted
[23:07:53] oops that was mean for RobH
[23:07:57] :)
[23:08:11] and according to the organisers, there's already an entire session on "operations" so the labs one could go in there Ryan_Lane
[23:11:32] Ryan_Lane: would you be willing to do a workshop on labs during the wikimania hackathon?
[23:11:47] * aude can't speak for sumanah but think we want and need that
[23:13:27] New patchset: Asher; "adding db58 to s7, pulling db16" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5774
[23:13:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5774
[23:14:04] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5774
[23:14:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5774
[23:23:12] RECOVERY - MySQL disk space on db58 is OK: DISK OK
[23:29:39] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 0 processes with command name carbon-cache.py
[23:29:39] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon
[23:31:09] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon
[23:32:03] RECOVERY - NTP on db58 is OK: NTP OK: Offset 0.05184221268 secs
[23:44:08] binasher: generic::apt::pin-package
[23:52:01] aude: the point of labs is that it's meant to be run by the community. part of that is making the community aware it exists
[23:52:20] Ryan_Lane: ok
[23:52:25] aude: the people going to the hacking sessions very likely already know it exists, and many already have accounts
[23:52:52] Ryan_Lane: there will be new people i'm sure who are lost
[23:52:58] at the hackathon
[23:53:12] I agree
[23:53:24] but there's other people at the foundation waiting on a list to be able to come
[23:53:26] we'll do the best to fit labs in the main program.
[23:53:27] and I'm taking up a slot
[23:53:41] the idea was that we would only come if we were speaking
[23:53:48] so, it's likely best that I give up my slot
[23:53:53] i agree it's important and sad that some of the other reviewers (less technical) didn't rate yours as highly
[23:54:00] Ryan_Lane: disagree!
[23:54:21] and your a key part of the "operators" session, since labs is key here
[23:55:38] !log streaming hot backup of db1041 to db58 (building a new s7 slave)
[23:55:40] Logged the message, Master
[23:58:11] I only happened to add myself to the ops panel at the last minute
[23:58:28] I don't think it's a strong enough reason for me to go, over someone in tech that hasn't been to a wikimania
[23:58:49] Well were you riding on the wikimania budget, or the ops budget?
[23:58:59] because ops has had to use its budget in the past.
[23:59:07] Ryan_Lane: i'm seeing what we can do
[23:59:10] (its a woosters question)
[23:59:22] !log powering off db16
[23:59:24] Logged the message, Master
[23:59:52] what's the question?