[00:21:31] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5350 [00:21:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5350 [01:05:57] PROBLEM - Puppet freshness on cp1041 is CRITICAL: Puppet has not run in the last 10 hours [01:09:06] PROBLEM - Puppet freshness on cp1042 is CRITICAL: Puppet has not run in the last 10 hours [01:16:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [01:22:09] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [01:23:09] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/5595 [01:25:08] Change abandoned: MarkAHershberger; "should have used the test branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5595 [01:33:06] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [01:39:22] New patchset: MarkAHershberger; "lint warnings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4734 [01:39:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4734 [01:42:42] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 206 seconds [01:46:54] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 15 seconds [01:55:09] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [02:48:57] PROBLEM - Apache HTTP on srv222 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [02:49:15] PROBLEM - Apache HTTP on srv221 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [02:53:09] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Tue Apr 24 02:52:59 UTC 2012 [02:56:09] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Tue Apr 24 02:56:03 UTC 2012 [03:00:39] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Tue Apr 24 03:00:35 UTC 2012 [03:02:45] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Tue Apr 24 03:02:34 UTC 2012 [03:22:05] PROBLEM - Host ssl1 is DOWN: PING CRITICAL - Packet loss = 100% [03:30:27] ganglia hasn't heard from ssl1 either for nearly 15 mins [03:35:08] PROBLEM - LVS HTTP on rendering.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [03:36:38] RECOVERY - LVS HTTP on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59750 bytes in 0.148 seconds [03:55:54] * jeremyb wonders who wants to take a look at ssl1? i assume it's not an emergency but it does look like some people are active ;) [03:56:38] jeremyb: eh? [03:56:42] ssl1? [03:56:49] oh? [03:57:01] Ryan_Lane: nagios backlog says ping critical [03:57:07] Ryan_Lane: no ping per nagios or ganglia [03:57:11] I put money on the same thing as last time ;) [03:57:19] too bad we won't be able to tell [03:57:19] can't be [03:57:28] what's last time? 
[03:57:33] no, I mean the issue with apt [03:57:38] hm [03:57:39] no [03:57:42] it's not [03:57:49] at some point it just died [03:58:38] going into the console [03:58:46] oh, I was trying to [03:58:49] eh [03:58:50] err [03:58:50] heh [03:59:03] trying as in, fcking 30% packet loss [03:59:04] nothing but garbage on the console [03:59:14] box crashed [03:59:19] what kind of garbage? [03:59:25] * jeremyb wonders if paravoid has had the tty in use forever, go racadm reset initiation ;P [03:59:31] question marks in diamonds [03:59:53] that's usually a mismatching baud rate [04:00:01] !log powercycling ssl1 [04:00:04] Logged the message, Master [04:00:06] it also happens when a system crashes [04:00:19] which is amazingly unuseful [04:00:34] do we have the same baud rate for all of bios serial redirection, grub, kernel, gettys? [04:00:38] yes [04:00:45] when I reboot it, it'll work fine [04:01:01] I'm not disputing that :-) [04:01:16] ah, you are thinking one of them isn't, and that one is spitting out crap [04:01:19] possible. [04:01:22] yes [04:01:31] I'd imagine the kernel, in this case [04:01:34] the kernel f.e. [04:01:37] right [04:01:38] since this was likely a fault [04:03:00] would have been helpful to see the fault :( [04:03:22] error: bad unit number. [04:03:23] error: unknown terminal 'serial' [04:03:28] on boot, before kernel [04:03:36] that's grub [04:03:40] well, beginning of kernel [04:03:41] I think [04:04:04] hm [04:04:05] that reminds me to check if lilo's still in debian ;P [04:04:13] I think the raid set shit itself [04:04:23] [ 26.590253] raid1: raid set md1 active with 2 out of 2 mirrors [04:04:24] [ 26.596092] md1: detected capacity change from 0 to 490106454016 [04:04:24] [ 26.603171] md1: unknown partition table [04:05:00] ** WARNING: There appears to be one or more degraded RAID device[ 98.362615] md: md0 stopped. [04:05:00] yes?
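(Editor's note: the four layers that must agree on the baud rate — BIOS serial redirection, grub, kernel, getty — can all be pinned in config. A sketch with assumed values (115200 on ttyS1), not the actual production settings:)

```shell
# /etc/default/grub fragment: GRUB's own serial terminal (assumed 115200 8N1 on unit 1)
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --unit=1 --speed=115200 --word=8 --parity=no --stop=1"
# ...and the kernel console on the same port and speed
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"

# getty on the same port (upstart job fragment; path and job name assumed):
# exec /sbin/getty -L 115200 ttyS1 vt102
```

The BIOS/DRAC serial redirection speed is set in firmware and has to match as well; a mismatch in any one layer produces the "question marks in diamonds" garbage described above, but only for that boot phase.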
[04:05:11] yay [04:05:16] yeah [04:05:19] dead raid set [04:05:41] RECOVERY - Host ssl1 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [04:05:45] well, kind of [04:05:47] system still bots [04:05:48] why would a degraded md result in a box crash? :-) [04:05:49] boots [04:06:05] that error is definitely more than just a degraded raid [04:07:27] yet both raidsets appear fine [04:08:05] no disk errors [04:08:07] * Ryan_Lane sighs [04:08:43] yeah [04:08:44] * jeremyb runs away. good night && luck! [04:08:51] definitely would have been good to see that fault [04:09:04] jeremyb: later [04:13:54] Ryan_Lane: I'm thinking of forcing md to check [04:14:04] go for it [04:17:47] md0 finished nicely,  [04:23:59] PROBLEM - LVS HTTP on rendering.svc.pmtpa.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:24:25] nagios-wm: you're killing me smalls [04:24:43] * Ryan_Lane mumbles [04:24:47] gah [04:25:01] I'm really tired of working all fucking day [04:25:51] nothing out of the ordinary in ganglia. though it's a 500 error, which is strange [04:26:02] it must be one of the image scalers [04:26:05] I wonder which one [04:26:41] hi ryan_lane - what's the problem with LVS? [04:26:48] no clue.
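(Editor's note: the "forcing md to check" step run on ssl1 above goes through sysfs. A hedged sketch — needs root on a real host; "md0" is the array name from the log:)

```shell
#!/bin/sh
# Trigger an md consistency check via sysfs and show progress.
check_md() {
    md=$1
    action="/sys/block/$md/md/sync_action"
    if [ ! -w "$action" ]; then
        echo "no md array $md here (or not root)"
        return 0
    fi
    echo check > "$action"          # read all mirrors, repair nothing
    grep -A 2 "^$md" /proc/mdstat   # [=>...] progress shows up here
    cat "/sys/block/$md/md/mismatch_cnt"
}

check_md md0
```

When the check finishes, a nonzero `mismatch_cnt` means the mirrors disagree somewhere; `echo repair > sync_action` is the corresponding fix-up action.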
just saw it [04:26:50] RECOVERY - LVS HTTP on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59750 bytes in 0.148 seconds [04:27:38] lvs3 and lvs4 have that service [04:27:54] (walking through my thought process for paravoid's benefit) [04:28:28] looking at ipvsadm -l, I see that lvs4 is active [04:29:25] grr [04:29:37] srv222 is logging Apr 24 04:28:32 srv222 apache2[30998]: PHP Catchable fatal error: Argument 1 passed to ExtMobileFrontend::__construct() must implement interface IContextSource, none given, called in /usr/local/apache/common-local/php-1.20wmf1/extensions/MobileFrontend/MobileFrontend.php on line 158 and defined in /usr/local/apache/common-local/php-1.20wmf1/extensions/MobileFrontend/MobileFrontend.body.php on line 52 [04:29:45] there it is :) [04:29:49] it's in the rendering vip, it shouldn't be serving those [04:30:04] no? [04:30:24] that wouldn't cause this issue, though [04:30:34] no, just something wrong that i'm noticing [04:30:37] heh [04:30:40] well, two things wrong :) [04:30:40] right [04:30:43] indeed [04:30:50] it was in ./pmtpa/old/lvs3/pybal/apaches~ [04:30:54] I'm tracking which one is failing from the lvs level [04:31:17] so that's also a pybal fail [04:31:27] yes, it is [04:31:36] what's the monitor check actually do? [04:32:12] o.O [04:32:23] it just checks to see if mediawiki responds? [04:32:33] I wonder if one of them didn't get a scap properly [04:32:55] php error log isn't showing anything worthwhile?
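(Editor's note: the `ipvsadm -l` step above lists each virtual service and which realservers pybal currently has pooled; numeric output is usually easier to work with. Command fragment for the active LVS host — the VIP:port shown is a placeholder:)

```shell
# list virtual services and pooled realservers, numerically
# (-n avoids slow reverse DNS on every realserver)
ipvsadm -L -n

# narrow to a single service
ipvsadm -L -n -t 10.2.1.21:80
```

The `Weight` column shows which backends pybal has depooled (weight 0 or absent), which is how "tracking which one is failing from the lvs level" works.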
[04:33:11] I wonder if the monitoring check is somehow hitting the mobile code [04:34:24] it flapped for a minute an hour ago too [04:34:29] yeah [04:34:34] I was eating at the time [04:34:48] I came on to check about that and dealt with ssl1 instead [04:34:52] hm [04:35:12] trying to trigger it on a specific host [04:35:39] * paravoid checks what the nagios check does [04:35:57] 221 [04:36:07] srv221 just threw one on me [04:36:29] 05:48 <+nagios-wm> PROBLEM - Apache HTTP on srv222 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:36:32] 05:49 <+nagios-wm> PROBLEM - Apache HTTP on srv221 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [04:36:37] that's ~2h ago [04:37:10] it's throwing 7 500's a sec all on mobilefrontend [04:37:46] oh, in __construct.. it's not getting mobile requests, mw just has to load that anyways [04:38:01] ah [04:38:05] bad deploy? [04:38:08] i'm running sync-common on it [04:38:13] * Ryan_Lane nods [04:38:32] oh, I thought to say that but thought that you knew better :-) [04:38:42] what's sync-bcommon? [04:39:11] srv222 is in the same state [04:39:52] and what exactly is broken? [04:39:58] paravoid: sync-common-all, aka scap [04:40:00] fatal.log is quiet now [04:40:12] it's our deployment system [04:40:29] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.086 second response time [04:40:29] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time [04:40:31] it does a dsh that tells the systems to pull the mediawiki code via rsync from fenari [04:41:23] so you patched it in fenari and then run scap? [04:41:24] obviously pybal isn't doing an appropriate check here [04:41:28] no [04:41:35] or was the version in 221/222 outdated? 
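(Editor's note: one way to catch a host that "didn't get a scap properly" is to compare deployed versions against the intended one. A hedged sketch — the harvesting loop, hostnames and paths are illustrative, and it assumes the deployed tree exposes a commit id, which the rsync-based deploys of this era may not have:)

```shell
# Flag hosts whose deployed commit differs from the intended one.
# Reads "host commit" pairs on stdin; expected commit is $1.
check_deploy_versions() {
    want=$1
    while read -r host have; do
        [ "$have" = "$want" ] || echo "STALE $host $have"
    done
}

# Harvesting the pairs might look like (hosts/paths are assumptions):
#   for h in srv221 srv222; do
#     echo "$h $(ssh "$h" git -C /usr/local/apache/common-local rev-parse HEAD)"
#   done | check_deploy_versions "$(git -C /home/wikipedia/common rev-parse HEAD)"
```

For example, `printf 'srv221 abc\nsrv222 def\n' | check_deploy_versions abc` prints `STALE srv222 def`.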
[04:41:36] some systems didn't get the scap [04:41:48] this is why I hate our deployment system [04:41:58] this should *never* happen with a proper deployment system [04:42:05] heh [04:42:07] what he said :) [04:42:10] have you seen puppi? [04:42:27] a deployment system is different from what puppet does, though ;) [04:42:45] we should say "I want to deploy this version", and the clients should get that version, no matter what [04:42:51] or they should get pulled from rotation [04:45:20] I put all this work in, and to think, wikipedia is just a cheese byproduct: http://www.smbc-comics.com/index.php?db=comics&id=2590 [04:52:41] we should monitor versions on deployed servers too [04:52:53] yeah, we should [04:52:54] but... [04:53:06] we deploy individual files [04:53:08] maybe we could serve a file with a git changeid or something and write a monitor around it [04:53:15] heh [04:53:18] sigh [04:53:33] we should always deploy with commits and never individual files [04:53:45] then we can just monitor the current commit [04:53:52] for core, and for each extension [05:38:10] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [07:31:28] New patchset: Hashar; "testswarm: log slow queries (bug 35028)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4400 [07:31:46] New patchset: Hashar; "testwarm: set innodb buffer pool size to 256M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4395 [07:32:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4400 [07:32:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4395 [07:32:43] New review: Hashar; "Patchset 2 removes the Facebook-only MySQL configuration line." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4400 [07:33:55] New review: Hashar; "Patchset 4 is a rebase triggered by git-review :/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4395 [07:56:38] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:05:51] !log temporarily disabled automatic zfs replication from ms7 -> ms8, cleared out space on ms8, catching up by hand [08:05:54] Logged the message, Master [08:21:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:56:41] !log updated blog theme per guillaume (April commits) [08:56:44] Logged the message, Master [09:24:11] New review: Dzahn; "Originally BZ 1450 but it does not block deployment of ShortURL anymore and now a new bug has been c..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5433 [09:25:38] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [09:27:15] New review: Dzahn; "thanks Lcarr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3513 [10:03:31] New review: Dzahn; "yes, this has been renamed to otto and stat1 did not remove the user yet. re: 5350" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5353 [10:03:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5353 [10:17:41] New patchset: Dzahn; "remove aotto from stat1 now that there is otto" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:17:51] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5707 [10:20:35] New review: Dzahn; "otto account exists on stat1" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5707 [10:20:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:28:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5436 [10:28:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5436 [10:37:09] New review: Dzahn; "Andrew, it's ok now, you have your new user and the old one is gone. It just didn't work in that ord..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5707 [10:45:12] New patchset: Dzahn; "add FIXME to mounts on stat1 and there is no group "ezachte" to own /a (and shouldn't)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5709 [10:45:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5709 [10:49:49] New patchset: Mark Bergsma; "Spaces are forbidden" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5710 [10:50:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5710 [10:50:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5710 [10:50:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5710 [11:02:13] New patchset: Dzahn; "remove ezachte as file owner, files should not be owned by real people users. he is in wikidev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5712 [11:02:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5712 [11:09:14] New review: Dzahn; "comments on fixing the NFS issue?" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5709 [11:09:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5709 [11:09:25] New review: Dzahn; "write access per inline comments. give it to wikidev group" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5712 [11:09:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5712 [11:09:28] New review: Dzahn; "re: inline comment: done" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2393 [11:33:57] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [11:44:52] !log Sending all non-european upload traffic back to pmtpa to prepare for eqiad varnish storage rework [11:44:54] Logged the message, Master [11:54:30] !log after much cursing and kicking zfs, a manual snapshot replication is running in screen as root on ms7 to ms8, expect it to take at least a day [11:54:32] Logged the message, Master [11:55:16] hold on [11:55:25] swift originals are around the corner ;) [11:55:51] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [11:59:06] so one reason the replication takes much longer than it used to is your rsyncs, I'm pretty sure [11:59:43] New patchset: Mark Bergsma; "The Varnish persistent storage backend doesn't handle eviction when out of space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5715 [11:59:46] anyways I hope I can get this patched up and going again reliably enough to last us til we're on swift a couple months without problems [12:00:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5715 [12:00:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5715 [12:00:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5715 [12:00:40] I can forget about it for the rest of the day in any case [12:01:16] lunch [12:02:28] there's an inotify replacement for linux [12:02:40] which would allow for a hacky replication with rsync [12:04:02] I know that inotify-ish thing has been around. I should investigate that in general [12:04:15] actually I'll add that to my todo list. [12:04:29] the new kernel support allows for a notified program to block the operation [12:04:34] so you could use that to sync the file to another host [12:04:42] I know there's some code out there already using it for something like parallel rsync [12:04:47] vaguely like [12:05:25] * apergos still pines for btrfs  [12:06:51] awesome, the btrfs wiki is utterly broken. recent changes leads to a nonexistent page, the history tab for the main page shows the main page again... [12:07:13] btrfs is doing well on ms6 ;) [12:08:45] " If you've used ZFS then Btrfs feels like a clunky copy, because administering ZFS is faster and easier." 
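(Editor's note: a minimal version of the "hacky replication with rsync" idea above, using inotifywait from inotify-tools — an assumption, not what was deployed. Paths and the peer name are placeholders. Note that plain inotify, unlike the fanotify permission API being described, cannot block the write, so this is eventually-consistent at best:)

```shell
# Watch a tree and push files to the peer as they finish being written.
inotifywait -m -r -e close_write --format '%w%f' /export/upload |
while read -r f; do
    rsync -a --relative "$f" ms8:/export/upload/
done
```

The "few million files" concern above is real for this approach: a recursive watch must install one inotify watch per directory at startup, which is slow and bounded by `fs.inotify.max_user_watches`.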
[12:09:14] omg I will throw myself off a cliff now if that's true (well if it's true when it becomes the default fs on my distro) [12:10:18] apergos: re, broken wiki: https://btrfs.wiki.kernel.org/articles/d/o/c/Category~Documentation_c24a.html [12:10:35] yes but I don't know which are recently updated that way [12:10:43] the wiki has more or less been stagnant for a while [12:11:15] wonders why they are doing the static .html links in mediawiki [12:12:05] probably something to do with the state of wiki.kernel.org which was in pretty bad shape when they were trying to put everything back together [12:32:03] I wonder how well this holds up when we watch directories with a few million files in them [12:33:12] I don't see why it would make any difference [12:42:42] depends how the monitoring works [12:43:27] yes, of course [12:43:34] if they implemented it in a completely braindead way, it would matter [12:43:41] but they probably didn't [12:45:32] I guess that dnotify was implemented in a brain dead way [13:00:32] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds [13:00:50] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 203 seconds [13:51:48] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [13:52:24] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [13:53:42] New patchset: Dzahn; "add misc::statistics::geoip to statistics role class. RT-2164" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5716 [13:53:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5716 [13:56:29] New review: Dzahn; "misc::statistics::geoip and generic::geoip should not conflict currently, but maybe they should be m..."
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5716 [13:56:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5716 [14:04:28] New review: Dzahn; "done now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5625 [14:05:59] all this irc client bouncing is making it hard to recall what channels Im supposed to idle in =P [14:06:12] ^demon|away: giving limechat a go today [14:12:11] New patchset: Ottomata; "statistics.pp - adding r-base, xorg, and wdm packages to misc::statistics::plotting class as per rt2163." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5717 [14:12:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5717 [14:14:16] New review: Ottomata; "See http://rt.wikimedia.org/Ticket/Display.html?id=2163 For more info. This builds upon change http..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5717 [14:23:22] New review: Dzahn; "does it really need wdm and X? afaik this was just mentioned in the ticket for compiling itcan we ge..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/5717 [15:00:21] New patchset: Pyoungmeister; "adding a time period that I'd like to sleep, 2300-0700 EST (I'm an early riser)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5720 [15:00:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5720 [15:09:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5720 [15:09:21] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5720 [15:26:21] New patchset: Pyoungmeister; "adding in some monitoring for varnishncsa procs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5721 [15:26:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5721 [15:31:28] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5721 [15:31:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5721 [15:37:43] PROBLEM - Varnish HTCP daemon on cp1034 is CRITICAL: Connection refused by host [15:39:31] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [15:40:52] PROBLEM - Varnish HTCP daemon on cp1031 is CRITICAL: Connection refused by host [15:50:10] PROBLEM - Varnish HTCP daemon on cp1030 is CRITICAL: Connection refused by host [15:51:21] notpeter: ^ do you have anything to do with that? [15:51:40] PROBLEM - Varnish HTCP daemon on cp1033 is CRITICAL: Connection refused by host [15:52:19] yes [15:52:32] I will look at it [15:53:05] mark: it's probably puppet running on spence before the hosts that nrpe is to run on [15:53:15] ok [15:53:33] I just added monitoring, not a functionality change :) [15:54:04] PROBLEM - Varnish HTCP daemon on cp1036 is CRITICAL: Connection refused by host [15:54:58] RECOVERY - Varnish HTCP daemon on cp1031 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [15:56:34] exit [15:56:42] er, wrong window [15:56:55] PROBLEM - Varnish HTCP daemon on cp1032 is CRITICAL: Connection refused by host [15:57:58] weird. 
I added a new nrpe check and the nrpe daemon stopped, instead of restarting, but once puppet runs, it starts the daemon back up [15:59:20] RECOVERY - Varnish HTCP daemon on cp1036 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:02:02] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: Connection refused by host [16:03:05] PROBLEM - Varnish HTCP daemon on cp1029 is CRITICAL: Connection refused by host [16:05:47] PROBLEM - Varnish HTCP daemon on cp1035 is CRITICAL: Connection refused by host [16:06:23] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [16:06:32] RECOVERY - Varnish HTCP daemon on cp1034 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:17:38] RECOVERY - Varnish HTCP daemon on cp1030 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:19:17] RECOVERY - Varnish HTCP daemon on cp1033 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:20:18] New patchset: Pyoungmeister; "de-borking puppet on nfs1/2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5723 [16:20:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5723 [16:23:11] RECOVERY - Varnish HTCP daemon on cp1032 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:23:51] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5723 [16:23:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5723 [16:25:53] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Tue Apr 24 16:25:47 UTC 2012 [16:26:56] RECOVERY - Puppet freshness on nfs2 is OK: puppet ran at Tue Apr 24 16:26:35 UTC 2012 [16:27:32] New patchset: Pyoungmeister; "arg. 
cp paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5724 [16:27:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5724 [16:28:06] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5724 [16:28:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5724 [16:31:08] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:35:11] RECOVERY - Varnish HTCP daemon on cp1035 is OK: PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker [16:35:32] hi mark, you around? I have some varnish + udp2log questions [16:35:42] yes [16:36:00] so i'm trying to set up varnish and udp2log on a local VM so I can test changes [16:36:22] (i need to do squid and nginx too, but varnish seemed to be the most completely configured by puppet, so I'm starting with that) [16:36:24] i'm pretty close [16:36:42] i've got varnish, varnishncsa, and udp2log all running [16:36:49] i *think* with proper configs for testing [16:37:00] but I can't seem to get any logs out of varnish/varnishnsca [16:37:15] varnishncsa without arguments doesn't log/print any requests? [16:37:58] ah, yes it does! 
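(Editor's note: the varnishhtcpd/varnishncsa checks that were flapping above look like NRPE check_procs definitions; reconstructed from the OK output — e.g. "PROCS OK: 1 process with UID = 998 (varnishhtcpd), args varnishhtcpd worker" and "3 processes with command name varnishncsa" — with the file path being an assumption:)

```shell
# nrpe.cfg fragment (path assumed)
command[check_varnishhtcpd]=/usr/lib/nagios/plugins/check_procs -c 1:1 -u 998 -a 'varnishhtcpd worker'
command[check_varnishncsa]=/usr/lib/nagios/plugins/check_procs -c 3:3 -C varnishncsa
```

The "Connection refused by host" spam above is the usual ordering hazard: the nrpe daemon must be restarted (not just reloaded) after new commands land, and spence's check definitions went live before puppet had run on the target hosts.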
[16:38:02] ok that is good [16:38:29] so just use -w : [16:38:35] and, varnishncsa has a socket open to port 8420 (udplog) [16:38:36] http://pastebin.com/m6is9q2U [16:38:37] then check with tcpdump if it's actually sending that [16:38:47] i don't see any traffic on port 8420 with tcpdump [16:38:54] # tcpdump port 8420 [16:39:18] keep in mind that varnishncsa batches up requests which fit in a 1450 byte packet [16:39:24] ohhm [16:39:25] so you may need 3-4 requests or so before it reaches that [16:40:30] hmm, don't think so, i just hit it in a big loop [16:40:33] nothing yet [16:40:45] strange [16:40:47] strace it perhaps? [16:40:53] which? [16:40:55] varnishncsa? [16:41:00] yes [16:41:55] write(5, "wmvm.localdomain 468 2012-04-24T"..., 1293) = 1293 [16:42:16] what is wmvm.localdomain? [16:42:21] my local vm name [16:42:23] is that what you specified in -w ? [16:42:29] no i have 127.0.0.1 [16:42:29] use ip addresses instead [16:42:32] ah [16:42:41] are you tcpdump'ing -i lo then? [16:42:57] oh no [16:43:06] ah! [16:43:15] i see it now :) [16:43:16] so that is good [16:43:18] good [16:43:33] needs to go to udp2log still though [16:43:52] udp2log is probably not listening on port 8420 [16:43:58] since you're already occupying that with varnishncsa [16:44:05] ahhhhh noono [16:44:06] it is working [16:44:19] yeah it just takes lots of requests [16:44:27] i thought i was looking at the output of varnishncsa [16:44:32] but tailing my udp2log file is working now too [16:44:35] cooooooooool [16:44:47] hurray! [16:44:56] awesome awesome [16:45:11] ok, now to figure out how to mess with log format :) [16:45:15] thanks mark!
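(Editor's note: two gotchas from the debugging above, as commands — tcpdump's default interface misses traffic addressed to 127.0.0.1, and varnishncsa only flushes once enough log lines have accumulated to fill a ~1450-byte UDP packet. The URL and port are those of the local test VM:)

```shell
# watch the UDP log stream on loopback (note -i lo)
tcpdump -n -i lo -A udp port 8420

# in another terminal, enough requests to fill at least one packet
for i in $(seq 1 20); do curl -s -o /dev/null http://127.0.0.1/; done
```

The strace line above (`write(5, "wmvm.localdomain ...", 1293)`) shows the batching directly: one write carrying several log lines, just under the packet-size threshold.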
i think i get it now too [16:45:16] yeehaw [16:45:30] enjoy ;) [17:33:50] btw, http://www.openssl.org/news/secadv_20120419.txt [17:33:53] lcarr: ^^^ [17:34:19] ah, there we go [17:34:57] Ryan_Lane: So apparently videos.wm.o is still broken in Europe and my volunteer is getting annoyed [17:35:03] yeah [17:35:05] seeing that [17:35:08] working on it [17:35:12] OK thanks a lot [17:43:09] RobH: any reason why lists.wikimedia.org is hanging while I try to discard held messages? [17:43:13] I've been seeing this for two days now [17:43:18] so it's not going away on its own :( [17:44:21] robh probably won't look at that problem while he's building out row C in the data center [17:44:36] but if you create an RT ticket perhaps someone else will look at it [17:48:28] !log deploying new squid conf to cp1001 frontend. is just a udp2log port change. [17:48:31] Logged the message, notpeter [17:51:56] mark: happy to do that [17:53:54] rt ticket filed [17:54:37] New patchset: Demon; "Rewrite gerrit hooks to all subclass HookHelper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5727 [17:54:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5727 [17:58:17] New patchset: Demon; "Rewrite gerrit hooks to all subclass HookHelper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5727 [17:58:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5727 [17:59:15] Change abandoned: Demon; "Went ahead and squashed this into Ic4d67fe8908db405f10ec840859a33eac5e98eff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5629 [18:16:21] should we worry about those humidity alarms? [18:18:43] !log deploying to frontend [18:18:45] Logged the message, Mistress of the network gear.
[18:18:51] paravoid: ask cmjohnson1 they are often false [18:26:59] tfinc [18:27:03] chrome is the issue [18:27:11] woosters: let me try in FF [18:27:12] does not play well with mailman [18:28:05] paravoid: you guys dont need to worry about them [18:28:10] i do, chris does, mark does [18:28:24] i have them as a followup to handle with chris next week, as he is not on site this week [18:28:44] you can also check out observium and see if the single sensors are high/low and how the rest look [18:29:11] okay [18:29:21] bleh, my observium pass is fubar [18:29:29] LeslieCarr: you have observium access right? [18:29:40] can you reset my password to the mgmt pass for me pls? [18:31:09] RobH: ok [18:32:51] time to go to lunch, back shortly [18:48:33] New patchset: Pyoungmeister; "giving otto root on locke, emery, oxygen, and stat1 per rt 2856. ct gave approval." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5731 [18:48:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5731 [18:50:34] New patchset: Lcarr; "allowing halfak access to emery, RT2707" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5732 [18:50:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5732 [18:51:14] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5732 [18:51:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5732 [18:52:35] New patchset: Ottomata; "admins.pp - removing aotto user. This has been replaced by 'otto' and is no longer needed." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5733 [18:52:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5733 [19:16:22] PROBLEM - Host db58 is DOWN: PING CRITICAL - Packet loss = 100% [19:16:31] New review: Pyoungmeister; "approved by ct" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5731 [19:16:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5731 [19:16:57] New review: Pyoungmeister; "rt ticket mentioned by roan has all signoffs" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5215 [19:17:00] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5215 [19:17:41] ottomata: if you want a review, you need to use the gerrit request review feature [19:17:51] someone told me you just asked for a review via the gerrit emails [19:17:55] I filter those and don't read themn [19:18:10] RECOVERY - Host db58 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [19:18:17] whats the key combo to get into a perc raid controller via the ipmi console at boot? [19:18:55] ottomata: so, now it doesn't show up in my gerrit queue, and it'll look like I'm just ignoring you [19:19:44] wait... [19:19:55] hahaha [19:19:57] you did [19:20:00] but someone else merged it [19:20:15] set me on /ignore [19:20:29] RobH: whats the key combo to get into a perc raid controller via the ipmi console at boot? 
[19:21:07] usually ctrl-c
[19:21:14] or maybe ctrl-r
[19:21:33] ctril-rage
[19:26:43] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:28:30] !log depooling ssl3001
[19:28:33] Logged the message, Master
[19:29:19] binasher, Ryan_Lane: there's http://hwraid.le-vert.net/wiki/DebianPackages
[19:29:32] and there's also http://www.inquisitor.ru/doc/einarc/ but I haven't tried it
[19:31:12] New patchset: Ryan Lane; "Add support for sending to an https backend by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5736
[19:31:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5736
[19:31:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5736
[19:31:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5736
[19:32:14] (ma rk probably won't like that change :D )
[19:33:35] New patchset: Pyoungmeister; "adding db57 to s2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5738
[19:33:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5738
[19:34:27] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5738
[19:34:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5738
[19:36:37] hey mark, got another question about varnish logs if you are still around
[19:36:49] pretty sure he isn't
[19:36:56] hmmmmmk
[19:37:10] preilly wrote the logging support, though
[19:37:17] so, maybe you can ask him
[19:37:27] you around preilly?
[19:39:01] ottomata: I am around but in the middle of something
[19:39:06] ottomata: can I get back to you?
[19:39:44] yup, i will PM my question to you and you can answer it at your leisure :)
[19:39:58] Apr 24 19:36:02 brewster dhcpd: DHCPDISCOVER from 78:2b:cb:66:aa:3e via 10.0.0.202: network 10.0/16: no free leases
[19:42:21] thanks preilly for the help
[19:42:36] according to preilly, the default varnish log format is modified in our custom varnish package
[19:42:40] Q for ops
[19:42:48] would it be better to modify the log format in the package
[19:42:57] or to change the init scripts in puppet to pass a -F option to change the format?
[19:43:09] also, where can I find the source code for our varnish package?
[19:43:13] do the latter
[19:43:19] k, I agree
[19:43:25] i need to find the source to see what the current format is
[19:49:53] found it: http://apt.wikimedia.org/wikimedia/pool/main/v/varnish/
[19:50:10] !log repooling ssl3001
[19:50:14] Logged the message, Master
[19:51:18] !log starting innobackupex from db1034 to db57 for new s2 slave
[19:51:21] Logged the message, notpeter
[19:52:49] ok, something isn't quite right though
[19:53:02] I've got an example of a log line from one of the logging servers
[19:53:16] but it does not match the format that varnish is set to use
[19:53:20] so
[19:53:26] that either means that I am looking in the wrong place
[19:53:29] for the format
[19:53:40] or, the formats are different from different sources
[19:53:51] and the log line that I have is not from varnish (which it isn't)
[19:53:59] welllll, i tmight be
[19:54:26] if I the request I am looking at is a GET for upload.wikimedia.org
[19:54:31] that's probably varnish, right?
[19:54:44] logged from sq83.wikimedia.org
[19:54:46] ?
[19:56:55] ottomata: looking at the same package version # in both places?
[19:57:14] ottomata: that's from squid
[19:57:55] yes, but i think i am onto something
[19:58:02] i think maybe udp2log messes with log lines
[19:59:37] https://gerrit.wikimedia.org/r/#q,project:operations/debs/varnish,n,z
[19:59:47] ottomata
[20:01:19] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.39:11000 (timeout)
[20:02:40] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[20:03:34] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100%
[20:06:53] Ryan_Lane was saying earlier that you have to explicitly request review if you want it. can you request from the wind? or you have to choose an arbitrary person?
[20:08:32] or i guess s/arbitrary //. but if you can't think of someone that would be relevant then it's arbitrary
[20:09:54] domas: are you around?
[20:10:02] a bit
[20:10:04] boarding soon
[20:10:19] whatsup
[20:10:23] !log reimaged db58 with fixed raid setup, imaging db59
[20:10:26] Logged the message, Master
[20:10:41] okay, very quick: about that cache setting in webstatscollector, are we actually using that code?
[20:10:47] yes
[20:10:56] it reduced cpu usage quite a bit
[20:11:08] because it seems that the binary of collector is older than tthe source code version
[20:11:12] hm
[20:11:13] odd
[20:11:25] sorry, boarding call
[20:11:31] :D
[20:11:35] good flight!
[20:11:43] * jeremyb hands domas some chewing gum
[20:11:47] oh, too late
[20:13:46] !log re-enabled replication via cron on ms7, it should catch up within an hour or so
[20:13:48] Logged the message, Master
[20:14:49] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[20:18:16] PROBLEM - MySQL disk space on db59 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
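An aside on the log-format mismatch discussed above: a quick way to tell whether a sampled line matches a given layout is to try parsing it. This is only a sketch; it assumes an NCSA-combined-style layout (stock upstream varnishncsa, which may well differ from the custom Wikimedia varnish package, the squid format, and whatever udp2log emits), and the sample line is synthetic, for illustration only.

```python
import re

# NCSA-style access-log pattern (an assumption; the actual formats in
# play here -- custom varnish package, squid, udp2log -- may differ).
NCSA_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def matches_format(line: str) -> bool:
    """Return True if the line parses under the assumed NCSA layout."""
    return NCSA_PATTERN.match(line) is not None

# Synthetic sample line, not taken from any of the servers mentioned.
sample = ('10.0.0.1 - - [24/Apr/2012:19:53:02 +0000] '
          '"GET /wikipedia/commons/x.png HTTP/1.1" 200 59750')
```

Running lines that fail to parse through a check like this is one way to separate "wrong format documentation" from "relay (e.g. udp2log) rewriting the lines".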
[20:18:16] PROBLEM - SSH on db59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:19:46] RECOVERY - SSH on db59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[20:20:38] New patchset: Ottomata; "site.pp - adding otto user on bayes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5768
[20:20:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5768
[20:27:08] New patchset: Pyoungmeister; "adding db6[0-9] to site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5769
[20:27:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5769
[20:31:42] RobH: you around?
[20:32:01] what's the magic to get mac address from drac?
[20:34:52] PROBLEM - mysqld processes on db57 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[20:40:46] New patchset: Pyoungmeister; "adding some stuffs for db59,60" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5770
[20:41:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5770
[20:42:41] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5769
[20:42:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5769
[20:43:29] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5770
[20:43:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5770
[20:44:18] New patchset: Ottomata; "Amending. Don't need xorg or wdm. Do need libcairo (Diederik checked this on bayes)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5717
[20:44:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5717
[21:03:04] apergos: what was the reason to use bzip2 for the xml full dumps? just better compression vs gzip?
[21:11:40] notpeter: racadm getsysinfo
[21:12:01] cool, thanks!
[21:19:43] PROBLEM - NTP on db58 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:20:01] PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host
[21:29:37] Does anyone know at what URL length Squid will return ERR_TOO_BIG?
[21:29:49] Someone in -dev triggered it with a long API request and he's wondering what the limit is
[21:32:28] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 185 seconds
[21:32:37] PROBLEM - MySQL Replication Heartbeat on db16 is CRITICAL: CRIT replication delay 187 seconds
[21:45:50] RoanKattouw: I think it's 8k but might only be 4k
[22:14:56] RECOVERY - MySQL Replication Heartbeat on db16 is OK: OK replication delay seconds
[22:15:02] aude: I'm kind of floored by the fact my wikimedia labs talk was rejected for wikimania
[22:15:14] RECOVERY - MySQL Slave Delay on db16 is OK: OK replication delay seconds
[22:16:23] Ryan_Lane: really?
[22:16:28] ues
[22:16:29] *yes
[22:17:02] did you get an email or does the submission page say so?
[22:17:06] email
[22:17:52] huh, any change on the submission page?
[22:17:58] cuz im wondering about the op panel
[22:18:26] I think the labs presentation is more important than reviewing our infrastructure for another year.
[22:18:42] since you can touch on infrastructure and who to talk to about it during the labs presentation.
[22:18:49] meh. fuck it
[22:18:50] !log rebooting db16 with updated kernel. it's probably still hopeless (dimm errors)
[22:18:52] Logged the message, Master
[22:18:53] I just won't go to wikimania
[22:19:19] arent you already approved regardless of presentation?
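An aside on the ERR_TOO_BIG question above: Squid's URL length cap is a compile-time constant (MAX_URL), and builds commonly use 4096 or 8192 bytes, consistent with the "8k but might only be 4k" answer. A minimal sketch for constructing a padded API-style URL to probe such a limit; the limit values, the endpoint, and the `titles` parameter are illustrative assumptions, not a reconstruction of the actual request that triggered the error.

```python
# Candidate caps to probe (assumptions: Squid's MAX_URL is fixed at
# compile time and varies between builds).
CANDIDATE_LIMITS = (4096, 8192)

def build_long_url(base: str, target_len: int) -> str:
    """Pad a URL with a dummy query parameter until it reaches target_len."""
    padding = target_len - len(base) - len("?titles=")
    return base + "?titles=" + "A" * max(padding, 0)

# One byte past the larger candidate cap, so a request with this URL
# should trip whichever limit the proxy was compiled with.
url = build_long_url("http://en.wikipedia.org/w/api.php", 8193)
```

Issuing requests at lengths just under and over each candidate value (bisecting, in effect) would pin down the compiled-in limit without reading the build's source.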
[22:19:37] I really don't care to go now, honestly
[22:19:45] I'm not entirely sure how that works, IIRC that was somewhat vague
[22:20:01] I'm probably on a panek
[22:20:02] *panel
[22:20:07] (i.e. whether an approved person whose talk didn't get accepted still gets to go on WMF's dime)
[22:20:31] we are both on the ask the operators panel
[22:20:38] PROBLEM - Host db16 is DOWN: PING CRITICAL - Packet loss = 100%
[22:20:41] which is a panel that is co-opting my normal presentation
[22:20:50] but if the community doesn't care about a project that is meant to be run by the community, I don't care to go to the community conference
[22:20:51] which im fine with since im going either way ;]
[22:21:09] I think the community cares, its just the organizers are mistaken.
[22:21:13] * Ryan_Lane shrugs
[22:21:26] I would email and question the decision, cuz I think it is a bad one.
[22:21:29] Maybe they figure Berlin is already focusing on it and they want to focus on other stuff
[22:21:48] there are tons of folks who cannot attend both
[22:21:57] and would attend wikimania over a smaller hackathon.
[22:22:08] I know, I'm not saying it's necessarily valid
[22:22:18] But one could see how they might think that way
[22:22:23] Labs is changing the entire process on how volunteers can influence the entire infrastructure of the projects
[22:22:26] RECOVERY - Host db16 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[22:22:33] I think its more important than the ask the ops panel.
[22:22:40] of course, they may have denied the ops panel too
[22:22:44] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:48] but only emailed leslie who originally created the page.
[22:22:52] well, I just emailed saying I should likely be removed from the list.
[22:23:01] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4652
[22:23:21] we have other people on the waiting list who haven't been to wikimania
[22:23:29] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms
[22:23:46] if I don't have a talk, it's more fair for them to go
[22:24:11] (I put my name on the panel because I was also going to be there, I don't think I'm really needed on it, we have other ops there)
[22:26:11] PROBLEM - MySQL Replication Heartbeat on db16 is CRITICAL: CRIT replication delay 1323 seconds
[22:26:14] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4652
[22:26:29] PROBLEM - MySQL Slave Delay on db16 is CRITICAL: CRIT replication delay 1333 seconds
[22:26:47] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[22:26:48] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/5680
[22:27:03] I guess I'll see if my talk will make it off the wait list before I say I'm not going
[22:27:23] Did you already e-mail them saying never mind?
[22:27:32] I should do that for mine, because I'm now giving a longer version of it in Berlin
[22:29:50] no
[22:29:57] good
[22:30:03] I just did for mine
[22:31:13] Ryan_Lane: if you dont come, no founding farmers
[22:31:21] that is the only nice thing i have to say about DC
[22:31:36] oh and killer eithiopean food
[22:31:59] but SF overall has better food choices, so meh.
[22:32:18] * RobH wants a pork belly sandwich
[22:32:31] that sounds so good right now.
[22:32:52] heh
[22:35:41] RobH: Enjoy the Ethiopian food while you can, Annie tells me it's not as good anywhere else
[22:37:03] its the only good thing about DC
[22:37:12] there are about 100 awesome ethiopian places
[22:46:18] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[22:51:28] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4652
[22:51:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4652
[22:51:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5680
[22:51:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5680
[23:07:30] robla, Ryan_Lane RoanKattouw Leslie's "Ask the Operators" submission has been accepted
[23:07:53] oops that was mean for RobH
[23:07:57] :)
[23:08:11] and according to the organisers, there's already an entire session on "operations" so the labs one could go in there Ryan_Lane
[23:11:32] Ryan_Lane: would you be willing to do a workshop on labs during the wikimania hackathon?
[23:11:47] * aude can't speak for sumanah but think we want and need that
[23:13:27] New patchset: Asher; "adding db58 to s7, pulling db16" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5774
[23:13:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5774
[23:14:04] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5774
[23:14:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5774
[23:23:12] RECOVERY - MySQL disk space on db58 is OK: DISK OK
[23:29:39] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 0 processes with command name carbon-cache.py
[23:29:39] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon
[23:31:09] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon
[23:32:03] RECOVERY - NTP on db58 is OK: NTP OK: Offset 0.05184221268 secs
[23:44:08] binasher: generic::apt::pin-package
[23:52:01] aude: the point of labs is that it's meant to be run by the community. part of that is making the community aware it exists
[23:52:20] Ryan_Lane: ok
[23:52:25] aude: the people going to the hacking sessions very likely already know it exists, and many already have accounts
[23:52:52] Ryan_Lane: there will be new people i'm sure who are lost
[23:52:58] at the hackathon
[23:53:12] I agree
[23:53:24] but there's other people at the foundation waiting on a list to be able to come
[23:53:26] we'll do the best to fit labs in the main program.
[23:53:27] and I'm taking up a slot
[23:53:41] the idea was that we would only come if we were speaking
[23:53:48] so, it's likely best that I give up my slot
[23:53:53] i agree it's important and sad that some of the other reviewers (less technical) didn't rate yours as highly
[23:54:00] Ryan_Lane: disagree!
[23:54:21] and your a key part of the "operators" session, since labs is key here
[23:55:38] !log streaming hot backup of db1041 to db58 (building a new s7 slave)
[23:55:40] Logged the message, Master
[23:58:11] I only happened to add myself to the ops panel at the last minute
[23:58:28] I don't think it's a strong enough reason for me to go, over someone in tech that hasn't been to a wikimania
[23:58:49] Well were you riding on the wikimania budget, or the ops budget?
[23:58:59] because ops has had to use its budget in the past.
[23:59:07] Ryan_Lane: i'm seeing what we can do
[23:59:10] (its a woosters question)
[23:59:22] !log powering off db16
[23:59:24] Logged the message, Master
[23:59:52] what's the question?