[00:06:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:07:21] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[00:07:51] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 23337 MB (2% inode=99%):
[00:08:01] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:07:59 UTC 2013
[00:08:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:09:21] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:09:11 UTC 2013
[00:10:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:10:21] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:10:16 UTC 2013
[00:11:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:11:21] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:11:13 UTC 2013
[00:12:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:12:11] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:12:04 UTC 2013
[00:13:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:13:31] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:13:24 UTC 2013
[00:14:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:14:31] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 00:14:29 UTC 2013
[00:15:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:26:11] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[00:49:37] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:02:37] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
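The db11 flapping above alternates between a staleness alert and a recovery within minutes; the check itself is just an age comparison against a 10-hour threshold. A minimal sketch of that logic, assuming a hypothetical helper (this is not the actual Nagios plugin):

```python
from datetime import datetime, timedelta

def puppet_freshness(last_run: datetime, now: datetime,
                     max_age: timedelta = timedelta(hours=10)) -> str:
    """Return a Nagios-style state string for a Puppet freshness check.

    Sketch only: the real check is a Nagios plugin reading the agent's
    last-run timestamp; here both times are passed in explicitly.
    """
    if now - last_run > max_age:
        return "CRITICAL: Puppet has not run in the last 10 hours"
    return "OK"
```

With timestamps from the log above, a run at 00:07:59 checked at 00:08:11 is OK, while a last run more than ten hours back trips CRITICAL.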
[01:03:31] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[01:05:11] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 22757 MB (2% inode=99%):
[01:05:41] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[01:21:01] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:47:16] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:53:06] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:03:40] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[02:05:50] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[02:06:20] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 23162 MB (2% inode=99%):
[02:06:31] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:08:50] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:12:25] !log LocalisationUpdate completed (1.21wmf12) at Mon Apr 1 02:12:24 UTC 2013
[02:12:32] Logged the message, Master
[02:14:50] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:29:00] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:32:31] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:38:40] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:44:40] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:53:00] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:04:30] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[03:06:40] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[03:06:40] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[03:07:10] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 22656 MB (2% inode=99%):
[03:07:40] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:22:54] !log reinstated max_user_connections = 80 for wikiadmin on s1
[03:31:40] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[03:44:43] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1051'
[03:49:30] PROBLEM - mysqld processes on db1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:50:31] RECOVERY - mysqld processes on db1051 is OK: PROCS OK: 1 process with command name mysqld
[03:50:59] morebots is missing.
[03:51:00] Sigh.
[03:51:11] Lost in the netsplit.
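The varnishncsa and mysqld alerts above are process-count checks: CRITICAL when fewer processes than expected match a command name. A sketch of that counting logic, assuming a hypothetical helper and a per-check threshold (the real checks use the Nagios `check_procs` plugin with configured ranges):

```python
def check_procs(proc_names: list[str], command: str, critical_below: int) -> str:
    """Nagios check_procs-style sketch: count processes by command name.

    proc_names is a snapshot of running command names; critical_below is
    the minimum healthy count (3 for the varnishncsa checks in this log,
    1 for mysqld).
    """
    count = sum(1 for name in proc_names if name == command)
    noun = "process" if count == 1 else "processes"
    state = "PROCS CRITICAL" if count < critical_below else "PROCS OK"
    return f"{state}: {count} {noun} with command name {command}"
```

Fed two varnishncsa processes it reproduces the CRITICAL line seen above; with three it reports OK.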
[03:58:05] !log asher synchronized wmf-config/db-eqiad.php 'returning db1051'
[04:04:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:06:14] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[04:06:44] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 22196 MB (2% inode=99%):
[04:07:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:07:49 UTC 2013
[04:08:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:09:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:08:54 UTC 2013
[04:09:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:09:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:09:53 UTC 2013
[04:10:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:10:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:10:46 UTC 2013
[04:11:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:11:34] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:11:32 UTC 2013
[04:12:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:12:15] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:12:13 UTC 2013
[04:13:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:54] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 04:14:47 UTC 2013
[04:15:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[05:04:16] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[05:14:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 05:14:31 UTC 2013
[05:15:06] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:05:07] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:07:17] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[06:07:47] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 21899 MB (2% inode=99%):
[06:21:17] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100%
[06:22:17] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[06:24:27] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused
[06:29:57] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 06:29:55 UTC 2013
[06:30:07] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:30:30] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 06:30:24 UTC 2013
[06:31:07] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:32:27] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.152 second response time
[07:04:21] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[07:06:01] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 21473 MB (2% inode=99%):
[07:06:31] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[07:14:51] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 07:14:42 UTC 2013
[07:15:21] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[07:45:24] !log depooling ms-fe1 for staging
[07:50:27] New review: Faidon; "I think it's okay, but as a general comment, I'd rather change the logic of this altogether. I think..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:04:06] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:04:29] paravoid: want to boot morebots? :)
[08:04:38] * jeremyb_ will re!log everything if it's soon
[08:06:16] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[08:06:46] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 20996 MB (2% inode=99%):
[08:07:46] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 08:07:42 UTC 2013
[08:08:06] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:08:25] New review: Yurik; "Faidon, I spoke with dfoy regarding this, and according to him, the reason for this logic is that is..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:12:35] New review: Faidon; "a) this isn't just about zero, zero is just one part of this (and traffic-wise, probably small part)..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:12:53] New review: MaxSem; "Still, can redirection be done in PHP instead of Varnish?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:13:32] yurik_, where are you and why you're not asleep?:)
[08:14:02] i forget what borough yurik_ lives in
[08:14:05] * jeremyb_ is in kings
[08:14:13] i'm in a hotel in SF
[08:14:19] flying back tomorrow night
[08:14:36] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 08:14:28 UTC 2013
[08:14:36] and it takes forever for the laundry to dry :(
[08:14:55] where is "back"?
[08:15:06] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:16:49] paravoid: my city!
[08:16:55] i assume
[08:19:05] New patchset: Yurik; "Unified default lang redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:23:20] New review: Yurik; "You are talking about a double-redirect - not very good for a mobile, often no-pictures device that ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:23:42] New review: Yurik; "patch 10 is a rebase" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[08:24:07] paravoid, NYC
[08:24:20] jeremyb_, yep :)
[08:24:24] :)
[08:27:51] speaking of varnish...
[08:28:07] paravoid, can you review https://gerrit.wikimedia.org/r/#/c/56502/ please?
[08:29:07] paravoid, maxsem, rebased and answered. Maybe one day we will have a better solution for a global Zero home page. Or start using browser's lang settings. Maybe.
[08:30:35] I don't understand what double redirect you're talking about
[08:31:06] basically, go to zero.wp.o -> get redirected by php instead of varnish
[08:31:33] slightly more load on backend apaches, but helluva more flexible
[08:32:37] if you properly do caching, it shouldn't be much load
[08:32:49] anyway, wait for mark's review too
[08:34:31] another plus is that you will need no ops review then:)
[08:38:41] MaxSem, no ops review is a big plus, totally agree there :)
[08:39:39] doing it in PHP is, of course, an option, but that would (if agreed) be another patch
[08:40:04] here it was a simple consolidation of logic so that new telcos can be launched quickly
[08:41:11] MaxSem, re-redirects - if i understand correctly, we don't have a server at zero.wikipedia.org - all requests were redirected to en.zero.wikipedia.org -- hence the double redirect zero->en.zero->xx.zero
[08:41:56] should be fixable
[08:42:00] i could be mistaken re nonexistance of zero. server of course - but at least that's the logic we had in varnish before
[08:43:19] MaxSem, are you suggesting we have a server just to do zero redirects? Or to point zero.wikipedia.org to en.zero?
[08:44:14] every apache has docroots for every WMF wiki
[08:44:23] I'm not sure what you're implying about ops review
[08:44:44] it was a compliment to ops :)
[08:45:07] general infrastructure still needs ops review, yes
[08:45:32] but just like with MediaWiki, not every php revision will ned to be dragged to you for review
[08:45:50] btw, it isn't nice that zero.wikipedia.org exposes XVO headers
[08:46:14] that's all of .m. I think
[08:46:26] not just XVO, all of our internal X- headers
[08:46:29] the reviews are very thorough, hard to sneak stuff past you :)
[08:46:52] but to get back to the subject
[08:46:58] oooooph
[08:47:16] I don't understand why when I type "m.wikipedia.org" on my phone I get redirected to english wikipedia
[08:47:35] why did you guess english and not greek?
[08:47:45] why are you even in the business of guessing?
[08:47:58] we don't do that for www. on any of our projects
[08:48:02] I don't see why we do it for mobile.
[08:48:29] also note that there's not an m.$project.org for the other projects, just wikipedia
[08:48:42] which is very inconsistent I think
[08:48:55] I think all these need to be dealt with and I think varnish (and ops) is the wrong place to deal with them
[08:50:38] huh. there's an en.m.wiktionary.org but not an m.wiktionary.org
[08:50:56] paravoid, i agree that we should reexamine it, but in the case of ZERO (note, this is not m. vs zero., but rather - zero cost arranged with carriers), we have somewhat different requirements - we don't guess the language, we show the one that we aggreed upon when signing legal paperwork
[08:51:21] I don't think you understand
[08:51:27] damn lawyers
[08:51:31] I don't want to be involved in all that :)
[08:51:41] heh, neither do i to be honest
[08:51:47] but it's your job :)
[08:52:03] nah, my job is to get you to approve my changes ;)
[08:52:44] NOT APPROVED:P
[08:52:59] * yurik_ throws snowball at MaxSem
[08:53:06] yeah, I think you're telling me that the only way to move this forward is to just -2 all of your changes :)
[08:53:19] lol
[08:53:32] :-)
[08:53:34] yurik_, is there snow in SF?
[08:53:40] or even NYC?
[08:53:49] NYC had some last monday
[08:54:08] and i'm sure i can scrape enough in the freezer for one snoball
[08:54:55] no, we had very light rain off and on today
[08:54:58] no snow
[08:55:49] paravoid, so what about https://gerrit.wikimedia.org/r/#/c/56502/ ?:)
[08:56:52] aanyway, this patch was not meant to change our philosophical approach to zero & m domains, only to consolidate current logic without substantial changes. If you have strong feelings about it, lets add it to the http://www.mediawiki.org/wiki/Requests_for_comment/Zero_Architecture#Varnish
[08:58:07] we all try to "make the world a better place" (tm), lets try to do it constructivelly..
[08:58:13] yurik_, I don't think anyone was blocking your changes to make you change the architecture
[08:58:31] notice how my comment didn't come with a -1/-2.
[08:58:48] but still you've been explained that there's a better way
[08:58:49] paravoid, i did appreciate that ;)
[09:00:33] MaxSem, i would love to reexamine this, assuming we can demonstrate a php-based solution which is just as fast for the clients and satisfy "business requirements". (I don't want to write two varnish plugin functions when i can do just 1, or better yet - zero!)
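The Zero requirement yurik_ describes (no language guessing; each carrier gets the language fixed by contract) amounts to a lookup keyed on the carrier id. A sketch under assumed carrier ids and languages — the real mapping lives in the Zero partner configuration, and these values are purely illustrative:

```python
# Hypothetical X-CS carrier-id -> contractually agreed language map.
CARRIER_LANG = {"250-99": "ru", "404-01": "hi"}

def zero_redirect(x_cs: str, default: str = "en") -> str:
    """Pick the agreed language subdomain for a Zero landing request.

    No guessing from Accept-Language or geography: the carrier id alone
    selects the target, falling back to a default for unknown carriers.
    """
    lang = CARRIER_LANG.get(x_cs, default)
    return f"https://{lang}.zero.wikipedia.org/"
```

Done this way there is at most one redirect per request, rather than the zero -> en.zero -> xx.zero double hop discussed above.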
[09:01:50] it seems the only true need ATM is to determine X-CS (carrier) id at the varnish level
[09:02:14] otherwise we don't have any data for analytics
[09:02:44] or a way to customize banners
[09:03:41] i got to catch some sleep, MaxSem & paravoid ,would love to pick your brains tomorrow re optimal way to redirect
[09:04:39] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[09:04:44] X-CS has nothing to do with this
[09:06:19] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 20494 MB (2% inode=99%):
[09:06:49] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[09:07:04] paravoid, X-CS is the carrier ID, and that determines which redirect target to use
[09:07:27] you think that after all these reviews I don't know what X-CS is for? :)
[09:07:42] hehe
[09:07:46] I never said that we shouldn't do X-CS at the Varnish layer
[09:07:57] I only talked about the m.wikipedia.org & zero.wikipedia.org redirects
[09:07:58] if we will say that carrier no longer determines which default language to show, than yes
[09:08:13] carrier might just as well determines that
[09:08:23] X-CS gets passed to mediawiki and mediawiki can Vary
[09:09:01] if php returns 302 redirect, will it be cached?
[09:09:12] (per X-CS header)
[09:09:45] it can be cached, yes.
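The caching behavior being discussed works because a `Vary: X-CS` response header folds the carrier id into the cache key, so the cache can hold one 302 per carrier while the PHP backend is hit only on a miss. A minimal sketch of that mechanism, with hypothetical data structures (not Varnish's implementation, and the carrier/language values are illustrative):

```python
# Cache keyed by (url, X-CS value), as "Vary: X-CS" implies.
cache: dict[tuple[str, str], tuple[int, str]] = {}

def backend_redirect(url: str, x_cs: str) -> tuple[int, str]:
    """Stand-in for the PHP backend issuing a per-carrier 302."""
    lang = {"250-99": "ru"}.get(x_cs, "en")
    return (302, f"https://{lang}.zero.wikipedia.org/")

def fetch(url: str, x_cs: str) -> tuple[int, str]:
    """Serve from cache when possible; ask the backend on a miss."""
    key = (url, x_cs)
    if key not in cache:
        cache[key] = backend_redirect(url, x_cs)
    return cache[key]
```

Repeated requests from the same carrier reuse the cached 302, so the extra load on the backend apaches is one request per carrier per TTL, not per client.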
[09:10:19] depends on MW headers
[09:10:26] I don't think we override them for 3xx
[09:11:05] i would love to change zero config in that case
[09:11:50] sigh
[09:11:54] it's not just about zero
[09:11:56] could you help with configuring varnish to cache redirects from apache, and to direct all m.* and zero.* to en
[09:12:03] I'm not sure how many times I have to say the same thing
[09:12:16] sorry, must have missed something
[09:12:37] this is also about "m.wikipedia.org"
[09:12:59] anyway, let's wait for mark or maybe asher too to comment on that topic
[09:13:07] m.wiki if coming from the Zero carrier gets the same rules IIRC
[09:13:20] if they agree too that this is a good way forward, I think we should coordinate with the core mobile team about making that happen
[09:13:36] and then you can discuss internally between core mobile/partners about specific zero needs
[09:13:39] how's that for a plan?
[09:13:42] MaxSem: ^
[09:13:52] sounds good to me - i will make changes to that RFC to specify redirect ideas
[09:14:25] trust me, i would much rather do the bouncing in Zero extension than varnish
[09:14:45] feel free to change that page btw
[09:14:58] varnish is flexible enough to put as little or as much logic as you want to it
[09:15:10] there are people doing memcached or redis queries from varnish and whatnot
[09:15:23] in our case I don't think we should complicate things too much
[09:15:45] totally agree
[09:17:19] off to bed, thanks for great suggestions and getting me acclimatized with zero stuff & mobile varnish stuff in general
[09:17:30] :)
[09:17:35] no worries :)
[09:17:49] api was easier ;)
[09:19:02] New review: Faidon; "If you say so :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/56502
[09:42:00] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:42:00] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[09:42:00] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[10:04:41] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[10:06:21] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 19969 MB (2% inode=99%):
[10:06:51] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[10:26:51] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[11:03:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[11:06:03] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[11:06:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:06:33] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 21345 MB (2% inode=99%):
[11:07:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[11:18:42] New review: Nemo bis; "WONTFIX, please abandon." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56101
[11:37:19] New patchset: Faidon; "Swift: add a document root container" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56911
[11:38:56] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56911
[11:54:15] !log repooling ms-fe1; staggered depool/restart/pool of ms-fe2-4
[12:04:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:06:20] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[12:06:50] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 20650 MB (2% inode=99%):
[12:08:01] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:07:58 UTC 2013
[12:08:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:09:11] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:09:05 UTC 2013
[12:10:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:10:11] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:10:06 UTC 2013
[12:11:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:13] Change abandoned: Odder; "Abandoning per Bugzilla bug." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56101
[12:11:50] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:11:46 UTC 2013
[12:12:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:12:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:12:27 UTC 2013
[12:13:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:14:30] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 12:14:27 UTC 2013
[12:15:10] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:19:20] New patchset: Faidon; "Swift: add rewrites for legacy math URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56912
[13:02:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:03:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[13:05:28] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[13:07:08] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 20050 MB (2% inode=99%):
[13:07:38] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[13:55:28] !log added webstatscollector-0.1-2 to apt repo
[14:06:12] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[14:08:22] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[14:08:52] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 20526 MB (2% inode=99%):
[14:25:08] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56912
[14:39:19] !log restarted glusterfs-server on labstore1 and labstore2, because of read errors in public/datasets/public in bots project
[14:39:49] morebots died
[14:42:27] dang
[15:05:18] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[15:05:58] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[15:07:34] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[15:07:58] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 19900 MB (2% inode=99%):
[16:02:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:03:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[16:05:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:46] New patchset: Ottomata; "Adding define ganglia::view for abstracting ganglia web views." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56921
[16:06:21] morning! Ryan_Lane would love a review of that whenever you get a sec
[16:06:37] sure
[16:07:02] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 19289 MB (2% inode=99%):
[16:07:10] (ergh I see some trailing whitespace, gonna fix)
[16:07:32] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[16:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:07:58 UTC 2013
[16:08:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:09:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:09:05 UTC 2013
[16:09:15] New patchset: Ottomata; "Adding define ganglia::view for abstracting ganglia web views." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56921
[16:09:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:10:12] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:10:05 UTC 2013
[16:10:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:11:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:10:58 UTC 2013
[16:11:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:11:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:11:46 UTC 2013
[16:12:01] looking
[16:12:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:12:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:12:27 UTC 2013
[16:13:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:14:32] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 16:14:28 UTC 2013
[16:15:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:18:31] ottomata: why does labs and production have different configs?
[16:18:56] there's not a really great reason for that
[16:19:22] that's the way they are? it took me a long time to figure out where prod ganglia's conf dir was
[16:19:23] we'll eventually want that manifest to be in a module
[16:19:34] yeah totally
[16:19:44] i figured we'd change that when the time came
[16:19:45] and putting realm specific things there will make it hardr
[16:19:47] but here is a better q
[16:19:47] *hardr
[16:19:51] damn it
[16:20:14] ganglia.wikimedia.org apache vhost has
[16:20:14] Alias /latest /srv/org/wikimedia/ganglia-web-3.5.4+security
[16:20:31] and
[16:20:51] /srv/org/wikimedia/ganglia-web-3.5.4+security/conf_default.php has:
[16:20:51] $conf['ganglia_web_root']="/srv/org/ganglia_storage/3.5.1";
[16:26:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:27:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[16:36:49] New patchset: Ottomata; "Adding define ganglia::view for abstracting ganglia web views." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56921
[16:41:36] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56921
[16:42:04] ottomata: that's a nice class
[16:42:22] er, resource
[16:42:23] yeah, i was wondering how those few views that are in ganglia got there
[16:42:27] define!
[16:42:27] hehe
[16:42:39] i wanted a udp2log (and other) ones
[16:42:53] for analytics nodes I had once made a stupid php file taht just gathered the graphs from ganglia manually and put them on a page
[16:42:55] this is much better
[16:43:27] I shoudl convert the two existing views (swift frontend and backend) to use this in puppet too
[16:44:00] btw about the udp graphs.. i think there's something crappy like if you change the metric type you have to kill & restart the ganglia aggregator for it to matter, otherwise the previous value sticks
[16:44:04] something like that
[16:46:19] hmm, ay, will try that
[16:46:33] i still see the counter graph for the ones we changed
[16:46:58] no, i mean: kill the receiver end :-/
[16:47:09] or change the metric names slightly so that they go into a new metric
[16:48:22] New patchset: Ottomata; "Ensuring ruby-json is installed on puppetmasters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56927
[16:49:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56927
[16:54:48] COOool, thanks ori-l + Ryan_Lane!:
[16:54:49] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=udp2log
[16:55:05] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=locke%7Cemery%7Coxygen%7Cgadolinium&mreg[]=pkts_in&z=large&gtype=stack&title=pkts_in&aggregate=1&r=hour
[16:56:00] sweet
[16:56:17] i just restarted ganglia-monitor on the two aggregators for misc eqiad
[16:56:30] hopefully that will fix the few other udp stats I want to add
[17:05:01] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[17:06:11] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[17:06:41] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 18741 MB (2% inode=99%):
[17:13:55] New patchset: Cmjohnson; "Adding rdb1/2 dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56931
[17:14:31] !log moved aside williams:/opt/otrs-home/.spamassassin/auto-whitelist and restarted spamd to test whether the bloated AWL is resulting in poor filtering
[17:16:30] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56931
[17:23:30] New patchset: Cmjohnson; "Setting absent to a public id for chris johnson" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56936
[17:24:57] robh: can you merge that ^
[17:27:12] !log reedy synchronized wmf-config/ExtensionMessages-1.22wmf1.php
[17:29:53] 1.22wmf1!
[17:30:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56710
[17:34:34] cmjohnson1: i take it you no longer use that key then yes
[17:34:34] ?
[17:34:55] correct
[17:35:22] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56936
[17:35:25] cmjohnson1: ok, i leave sockpuppet merge to you
[17:35:42] cool thx
[17:44:56] cmjohnson1: are you at DC physically ?
[17:45:43] not at the moment...i have someone coming to the house today...will be in later though. whats up?
[17:46:03] okay
[17:46:08] need to test out the sfp-t's
[17:46:22] oh..okay..where do you want them?
[17:46:58] i am thinking a5 and moving a normal machine to the 4500
[17:47:04] lemme see if anything is available
[17:47:08] kk
[17:47:33] notpeter: are any of the search machines in eqiad (search1001-1024 at least) ok to disconnect for a minute and then reconnect ?
[17:47:34] mutante-away, are you still away? ;)
[17:48:14] jgonera: still looking for a verification scheme?
[17:48:23] jgonera: did you update gerrit/wikitech?
[17:48:28] I did
[17:48:30] I mean
[17:48:33] gerrit yes
[17:48:36] what's the other one?
[17:48:42] wikitech.wikimedia.org
[17:48:46] same creds as gerrit
[17:49:03] LeslieCarr: nope, they are serving traffic
[17:49:16] we can switch traffic to pmtpa, though
[17:49:23] although I might take down the site in doing so
[17:50:04] ^ hrmmm let's think about that for a minute
[17:50:20] notpeter: speaking of search... seen 4844?
[17:50:25] jeremyb_, ok, wikitech updated too [17:50:31] cmjohnson1: sorry, i was joking [17:50:34] I can definitely do that [17:50:34] hehehe [17:50:58] ok, if we can do that (or just take down one), whichever is easier [17:51:07] yeah...i know...sarcasm is never clear on irc [17:51:08] or just take the site down ;) nobody really uses it [17:51:31] jeremyb_, mutante-away said he needs to do a hangout or a phone call with me to verify that I'm myself, so I'm still waiting [17:51:48] jgonera: right, i remember [17:52:09] (doesn't have to be him... but he seems active...) [17:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:51] who else can do that? he replied to my e-mail 10 minutes ago so I guess he should be around soon [17:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:55:04] !log reedy synchronized php-1.22wmf1 'Initial sync of php-1.22wmf1' [17:55:45] !log reedy synchronized docroot [17:56:29] !log reedy synchronized w [17:56:30] jgonera: well i guess it should ideally be someone who knows you already. idk who that would be though. and probably easiest if it's a root [17:57:37] New review: Nikerabbit; "Maxing is short for making sure?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56493 [17:59:10] so, uh, https://wikitech.wikimedia.org/wiki/Server_admin_log doesn't seem to be updating.... [17:59:26] oh, there's no echo bot in here [17:59:30] ohhhh, i almost forgot it's bblack's first day [17:59:46] welcome! 
:) [17:59:47] me too, good thing I remembered :) [17:59:50] hah [17:59:53] "echo bot" being the bot that echo's every log message because we have a change of bots for loggin ;) [18:00:03] morebots died [18:00:09] greg-g: i think it's been gone almost 16 hours [18:00:10] iirc [18:00:12] that's it, I forgot the name [18:00:17] 04.49 -!- Netsplit *.net <-> *.split quits: +morebots, godog, paravoid, mark, Vito, andrewbogott_afk, AaronSchulz [18:00:23] greg-g: more generic name is ircecho [18:00:27] oh, well then [18:00:30] re split [18:02:03] i have the stuff that was missed all ready to go to re!log once the bot is back [18:06:02] jeremyb_: awesome, thanks much [18:08:40] brion, paravoid & MaxSem suggested last night that we could set DNS entry for zero.wikipedia.org to en.zero (without redirects), and also do the language-specific redirect from PHP, since 302 can be cached per X-CS header. Can you think of any issues with this approach? [18:09:10] no, we discussed something else [18:09:13] this would cause at-most one language redirect [18:09:36] MaxSem, please clarify, that was my understanding [18:09:54] hmm. Looks like there is a high number of segfaults currently in the apache syslogs... [18:10:01] ruh roh [18:10:08] what exactly is X-CS? [18:10:15] we talked about creating specialised a zero portal (that also could redirect people directy to a wiki if needed) accessible precisely at zero.wp.o [18:10:16] carrier code [18:11:45] ah, numeric X-Carrier. but what does CS mean? [18:11:59] i was thinking it was an abbreviation like XVO [18:14:02] jeremyb_: 'carrier short' is the rumor [18:14:21] aha [18:15:22] !log restarted morebots [18:15:28] Logged the message, Master [18:15:29] danke [18:15:32] brion, thanks, i had no idea myself :) [18:15:32] yw [18:15:52] Reedy: we ok with the apaches on mediawiki.org/test? 
[18:15:57] !log 03:22:54 !log reinstated max_user_connections = 80 for wikiadmin on s1 [18:16:00] !log 03:44:43 !log asher synchronized wmf-config/db-eqiad.php 'pulling db1051' [18:16:01] ok, so is there anyone who could change (or actually add since the old one is gone) my SSH key to stat1? [18:16:02] What do you mean? [18:16:03] Logged the message, Master [18:16:03] !log 03:58:05 !log asher synchronized wmf-config/db-eqiad.php 'returning db1051' [18:16:07] !log 07:45:24 !log depooling ms-fe1 for staging [18:16:08] Logged the message, Master [18:16:13] Logged the message, Master [18:16:19] Logged the message, Master [18:16:23] jgonera: You can make the change to the puppet repo yourself. You just need someone to deploy it ;) [18:16:37] Reedy: you indicated there were apache segphaults, just making sure it wasn't related to today's deploy [18:16:37] MaxSem, i over-concentrated on the 302 redirect point. Having a dedicated portal vs a page in m.zero.wikipedia.org could be discussed too [18:17:01] greg-g: Can't be, the code isn't in use anywhere yet... [18:17:10] !log restarted morebots. wikitech-static had 100% / partition. Have also fixed that and its cause. [18:17:16] Logged the message, Master [18:17:18] Reedy: oh, wasn't sure, thanks. :) [18:17:23] yurik, so the whole point of discussion was "do it in PHP" [18:17:33] Reedy, mutante-away insisted to have a hangout or a phonecall with me before that, to verify my identity... [18:17:43] IIRC segfaults occur with other things like OOMs.. [18:17:49] MaxSem, which i support 100% [18:17:51] Reedy: /me blames morebot not logging to server admin log so I couldn't easily see what's been done ;) [18:18:18] Use the saj instead ;) [18:18:30] saj? 
[18:18:33] !log 11:54:15 !log repooling ms-fe1; staggered depool/restart/pool of ms-fe2-4 [18:18:33] MaxSem, #wikimedia-mobile is quieter [18:18:36] !log 13:55:28 !log added webstatscollector-0.1-2 to apt repo [18:18:38] Logged the message, Master [18:18:39] !log 14:39:19 !log restarted glusterfs-server on labstore1 and labstore2, because of read errors in public/datasets/public in bots project [18:18:42] !log 17:14:31 !log moved aside williams:/opt/otrs-home/.spamassassin/auto-whitelist and restarted spamd to test whether the bloated AWL is resulting in poor filtering [18:18:44] Logged the message, Master [18:18:45] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.22wmf1 and rebuild l10n cache [18:18:46] !log 17:27:13 !log reedy synchronized wmf-config/ExtensionMessages-1.22wmf1.php [18:18:50] Logged the message, Master [18:18:55] Ryan_Lane: so, about that better pipe-line for logging messages......... ;) [18:18:56] Logged the message, Master [18:19:01] Logged the message, Master [18:19:05] greg-g: heh [18:19:06] Logged the message, Master [18:19:07] jeremyb_: you know you can just edit the wikipage, right? :) [18:19:09] you relogging for us jeremyb_? [18:19:16] I was discussion that the other day [18:19:16] *discusing [18:19:16] ottomata: yeah [18:19:25] man, my brain to hand skills are rough lately [18:19:43] paravoid: no one's complained about this way in the past. 
maybe it was at a less active time of day though [18:19:46] pebkab [18:19:48] !log 17:55:04 !log reedy synchronized php-1.22wmf1 'Initial sync of php-1.22wmf1' [18:19:51] !log 17:55:45 !log reedy synchronized docroot [18:19:54] !log 17:56:29 !log reedy synchronized w [18:19:54] Logged the message, Master [18:19:56] !log re!logged from earlier today while morebots was gone (all times UTC) [18:20:00] Logged the message, Master [18:20:05] jeremyb_: so, every individual !log is an edit, which is a revision [18:20:05] Logged the message, Master [18:20:10] lolol [18:20:11] Logged the message, Master [18:20:14] Ryan_Lane: Compression! [18:20:18] Reedy: :) [18:20:21] !log Ran scap-recompile to build texvc [18:20:26] Ryan_Lane: right? but that's how it would be anyway [18:20:26] Logged the message, Master [18:20:43] mw10.pmtpa.wmnet: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [18:20:43] mw40.pmtpa.wmnet: rsync: failed to set times on "/usr/local/apache/common-local/live-1.5": Operation not permitted (1) [18:20:45] jeremyb_: not if you edited the page and added all of the entries in one edit :) [18:20:50] :D [18:20:54] Why is live-1.5 still showing up.. [18:20:58] Ryan_Lane: no, but if the bot was alive to begin with... [18:21:06] morebots just does the wiki edit, then? [18:21:06] I am a logbot running on wikitech-static. [18:21:06] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:21:06] To log a message, type !log . [18:21:16] gj morebots [18:21:19] morebots [18:21:19] I am a logbot running on wikitech-static. [18:21:19] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:21:19] To log a message, type !log . [18:21:23] hah [18:21:30] moarbots? 
[18:21:36] no no, moarboats [18:21:49] lessbots [18:22:08] catbots [18:22:17] grepbots [18:22:27] that could be actually useful :D [18:22:28] deadjokebots [18:24:13] we really need to change how the bot works [18:24:29] and dump all the old logs to a static file, then purge all the revisions [18:25:08] oh right, wiki migration problem [18:25:15] and replication [18:25:20] yeah [18:25:52] integrate into status.wikimedia.org (or similar) (that static log file)? [18:26:16] nah, just link to it [18:26:24] the static page would be the old logs [18:26:42] ideally we'd import all the individual log messages too, but that's more difficult [18:27:02] it would be better to have the log messages in a database table [18:27:58] an idea was to use EventLogging [18:30:50] Ryan_Lane, RobH, would you have a few minutes to help me with RT-Ticket: wikimedia #4854 ? [18:32:00] (new SSH key) [18:33:09] jgonera: hint: they're in the middle of a meeting now and so they can't do the phone/hangout with you while also in the meeting [18:33:15] :-) [18:33:28] New patchset: Cmjohnson; "fixing spacing error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56945 [18:33:53] oh, ok, since I saw them talking on IRC I thought they weren't busy [18:34:00] do you know when this meeting ends? [18:34:09] no [18:34:46] if i had to guess i'd say it's a meeting that generally just ends naturally depending on how big the agenda is, etc. [18:35:05] uh... [18:36:21] jgonera, it officially ends in 1h [18:36:25] but could end before that [18:36:32] thanks [18:41:15] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56945 [18:48:54] !log testing logbot [18:49:01] Logged the message, Mistress of the network gear. [18:50:11] * jeremyb_ thought it was well tested! 
[18:50:56] Not sure if someone brought this up already, but per request dropping https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Replag as an FYI [18:51:00] binasher ^ [18:51:50] StevenW: yep, read it a couple days ago [18:51:58] k [18:52:08] https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= [18:52:12] There isn't any replag currently :p [18:52:51] Reedy, I bet it's TS replag;) [18:53:07] binasher: Just stop them editing. That'll fix it [18:54:34] binasher: FYI, there seems to be an increased number of apache segfaults atm [18:55:50] !log reedy Finished syncing Wikimedia installation... : test2wiki to 1.22wmf1 and rebuild l10n cache [18:55:56] Logged the message, Master [18:57:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [18:59:31] most of the segfaults are coming from just mw1103 [18:59:48] binasher: out of date code? [19:00:16] I see numerous apaches.. [19:00:25] Reedy: where are you looking? [19:00:28] Apr 1 18:45:03 10.64.0.72 apache2[15745]: [notice] child pid 13612 exit signal Segmentation fault (11) [19:00:28] Apr 1 18:45:04 10.64.16.171 apache2[27367]: [notice] child pid 2676 exit signal Segmentation fault (11) [19:00:28] Apr 1 18:45:04 10.64.16.68 apache2[28216]: [notice] child pid 16887 exit signal Segmentation fault (11) [19:00:28] Apr 1 18:45:04 10.64.16.74 apache2[11437]: [notice] child pid 13179 exit signal Segmentation fault (11) [19:00:28] Apr 1 18:58:52 10.64.0.61 apache2[10530]: [notice] child pid 26372 exit signal Segmentation fault (11) [19:00:40] tail -n 1000 /home/wikipedia/syslog/apache.log | grep Segmentation [19:00:46] ah. 
segfaults, not 500s [19:00:52] asher@fenari:/h/w/log/syslog$ grep "Segmentation fault" apache.log | awk '{print $4}' | sort | uniq -c | sort -rn | head [19:00:52] 21785 10.64.16.83 [19:00:54] 104 10.64.0.49 [19:00:54] 99 10.64.16.173 [19:00:55] Reedy: ^^ [19:01:06] Haha [19:01:22] * Reedy points at the woods [19:01:59] i stopped apache there and am running sync-common, then will start [19:03:12] PROBLEM - Apache HTTP on mw1103 is CRITICAL: Connection refused [19:03:40] !log reedy synchronized php-1.22wmf1/includes/ [19:03:45] Logged the message, Master [19:04:14] Could someone run dsh -F25 -cM -g mediawiki-installation -o -oSetupTimeout=10 'chown mwdeploy:mwdeploy /usr/local/apache/common-local/live-1.5' for me please? [19:04:14] seems the live-1.5 folder is owned by root:root, causing noise during scap etc. Thanks [19:06:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:08:02] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 19322 MB (2% inode=99%): [19:08:35] <^demon> Reedy: I thought live-1.5 was removed in favor of w/? 
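The grep/awk/uniq pipeline binasher ran above, to find which apache host was producing the segfaults, can be sketched in Python as well. The sample lines below are hypothetical (same syslog format as quoted above); the host IP is the 4th whitespace-separated field, matching `awk '{print $4}'`.

```python
from collections import Counter

# Hypothetical sample lines in the syslog format quoted above.
log_lines = [
    "Apr 1 18:45:03 10.64.16.83 apache2[15745]: [notice] child pid 13612 exit signal Segmentation fault (11)",
    "Apr 1 18:45:04 10.64.16.83 apache2[27367]: [notice] child pid 2676 exit signal Segmentation fault (11)",
    "Apr 1 18:45:04 10.64.0.49 apache2[28216]: [notice] child pid 16887 exit signal Segmentation fault (11)",
]

# Equivalent of: grep "Segmentation fault" | awk '{print $4}' | sort | uniq -c | sort -rn
counts = Counter(
    line.split()[3] for line in log_lines if "Segmentation fault" in line
)
for host, n in counts.most_common():
    print(n, host)
```

With real data this makes an outlier like mw1103 (21785 faults vs. ~100 elsewhere) obvious at the top of the output.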
[19:08:35] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:08:40] ^demon: it is/was [19:08:51] I think it's just been left around as a symlink to prevent breaking anything [19:08:58] ^demon, to remove it, you have to unprotect it:) [19:09:12] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [19:10:51] !log reedy synchronized php-1.22wmf1/extensions/ 'Wikidata extensions to same as 1.21wmf12' [19:10:57] Logged the message, Master [19:12:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki and testwiki to 1.22wmf1 [19:13:03] Logged the message, Master [19:14:04] New patchset: Reedy; "1.22wmf1 deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56952 [19:15:10] New patchset: Reedy; "1.22wmf1 deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56952 [19:15:55] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki [19:16:01] Logged the message, Master [19:16:36] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56952 [19:16:58] Heya, Jeff_Green! [19:17:02] poke on locke -> gadolinium! [19:20:42] well now we're fundraising again dammit [19:27:47] haha [19:27:55] ori-l: can you take a look rt 4844? [19:28:56] * jeremyb_ is pretty confident... [19:30:34] aren't we always fundraising!~ [19:30:36] ? [19:30:36] hehe [19:30:56] Jeff_Green: can you puppetize your stuff on gadolinium, while still leaving locke in place doing its thang? [19:30:58] OR [19:31:01] we could give locke over to FR [19:31:11] and you can use it as your dedicated FR udp2log server :p [19:31:15] imo we should deprecate locke, so I hate that idea [19:31:19] hehe [19:31:49] puppetize--there's not that much to puppetize, once I have all-clear to be disruptive it shouldn't take very long. 
it's just a script, a mount, and a cron job [19:37:35] ottomata, is this meeting over? [19:39:10] jgonera: you trust the current machine completely? [19:39:26] i don't understand the generation of a new key now [19:39:34] comments are easy to change [19:39:48] I do, it's my old desktop, now used by my mother, I'll wipe everything before flying back to SF [19:39:59] I didn't know if they were tied to the key itself or not [19:40:06] so I just quickly created a new one [19:40:13] they are not [19:40:21] it's not because I suspect that someone stole my key or something ;) [19:40:26] it's just so you know which is wish [19:40:30] which is which* [19:40:32] ok, I'll know for the next time [19:41:09] jeremyb_, are you able to add it? [19:41:27] jgonera: i'm not able to do much of anything [19:41:44] is everyone else still in this meeting? [19:41:44] i suppose i could verify that it's the same as the one in wikitech [19:42:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [19:42:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [19:42:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:42:04] it is the same one [19:42:07] and gerrit [19:42:25] but the important things are 1) verifying by phone or hangout or teleportation and 2) deploying it [19:42:32] i can't do either of those things. AFAIK [19:43:03] where are the people who can? :( [19:43:28] possibly getting lunch [19:43:33] idk [19:43:40] i'll be back in a few mins [19:44:51] I'm leaving now [19:44:59] it's kind of late where I am now... [19:45:10] well, I guess it'll have to wait till tomorrow [19:46:12] jgonera: you're CEST, no? [19:46:37] is CEST UTC+1? 
[19:46:55] those named time zones always confuse me [19:47:39] no [19:47:42] you're UTC+2 [19:47:48] cest is +2 [19:47:53] cet is +1 [19:48:10] uh, maybe it's the daylight savings that changes something [19:48:15] it's 9 hours later than SF [19:48:38] and 2 days ago it was 8 hours? [19:48:45] yes [19:48:50] right [19:49:06] RobH, are you there? [19:54:21] !log LocalisationUpdate completed (1.21wmf12) at Mon Apr 1 19:54:21 UTC 2013 [19:54:27] Logged the message, Master [19:59:44] !log LocalisationUpdate completed (1.21wmf12) at Mon Apr 1 19:59:44 UTC 2013 [19:59:50] Logged the message, Master [20:01:27] ? [20:02:10] <^demon> Yes? [20:03:13] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:03:28] * Reedy kicks icinga-wm [20:03:56] ^demon: why did LU change time? [20:04:10] Run manually [20:04:16] <^demon> That ^ [20:04:23] ah ok :) [20:04:41] <^demon> Trying to force a message to show up that's not. [20:05:13] RobH: OK, so, RT duty… That just means that I watch the RT front page and triage new tickets? Anything else? [20:05:23] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [20:05:44] <^demon> andrewbogott: And when someone says "Hey, can someone look at rt-foobar," you get to do that too :) [20:05:53] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 18510 MB (2% inode=99%): [20:06:08] Or other simple requests in IRC [20:06:46] <^demon> Such as "Please fetch sandwiches for everyone." :) [20:06:49] <^demon> See, nice and simple. [20:06:54] andrewbogott: I'm still waiting for someone to run this as root for me. Thanks! dsh -F25 -cM -g mediawiki-installation -o -oSetupTimeout=10 'chown mwdeploy:mwdeploy /usr/local/apache/common-local/live-1.5' [20:07:02] andrewbogott: ham and swiss, plz [20:07:34] andrewbogott: i think it's also going through queues and cleaning up old stuff [20:07:38] Reedy, is there a ticket for that? :) [20:07:43] andrewbogott: there's a manual... 
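The offset confusion sorted out in the exchange above (CET is UTC+1, CEST is UTC+2, and the SF gap jumped from 8 to 9 hours because the EU switched to daylight saving on 31 March 2013 while the US had switched on 10 March) can be double-checked with `zoneinfo`. Europe/Paris here is just an illustrative CEST zone, not a claim about anyone's actual location.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+, needs system tz data

# 1 April 2013, evening: Europe is already on CEST (UTC+2).
local = datetime(2013, 4, 1, 21, 45, tzinfo=ZoneInfo("Europe/Paris"))
sf = local.astimezone(ZoneInfo("America/Los_Angeles"))

print(local.utcoffset())     # CEST offset from UTC: +2 hours
print(local.hour - sf.hour)  # hours ahead of San Francisco (PDT)
```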
[20:07:48] New patchset: RobH; "Fixing Juliusz's ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56957 [20:07:54] New patchset: Ori.livneh; "(RT ????) Prune accounts from vanadium's manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56958 [20:07:59] jeremyb_: really? [20:08:03] Nah. For this sort of thing, it's usually easier to just ask someone to do it ;) [20:08:13] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:08:02 UTC 2013 [20:08:13] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:08:16] andrewbogott: yes [20:08:22] link? [20:08:27] i'm looking! [20:09:01] New patchset: Ori.livneh; "(RT 4863) Prune accounts from vanadium's manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56958 [20:09:19] * andrewbogott already doesn't know what to do about the very first ticket [20:09:23] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:09:17 UTC 2013 [20:09:45] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56957 [20:09:50] andrewbogott: which? [20:10:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:10:23] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:10:21 UTC 2013 [20:10:34] vanadium i guess [20:10:54] hrm? [20:11:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:11:23] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:11:20 UTC 2013 [20:11:29] which: https://rt.wikimedia.org/Ticket/Display.html?id=4862 [20:11:43] Oh, I suppose that isn't a public link. [20:12:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:12:14] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:12:11 UTC 2013 [20:12:16] andrewbogott: so, that one is for DBA. 
i guess either notpeter or binasher probably [20:12:26] but could be almost anyway [20:12:29] anyone* [20:12:42] andrewbogott: https://wikitech.wikimedia.org/wiki/Manual_for_ops_on_duty [20:12:52] jeremyb: Thanks. [20:13:03] * andrewbogott should've waited until after lunch to think about this [20:13:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:13:33] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:13:28 UTC 2013 [20:14:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:14:20] ori-l: did you see my ping earlier? [20:14:43] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Mon Apr 1 20:14:34 UTC 2013 [20:15:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:15:23] 01 19:27:55 < jeremyb_> ori-l: can you take a look rt 4844? [20:16:59] jeremyb_: oh, i missed it. looking now. [20:17:17] ori-l: danke. you can just comment there i guess [20:17:21] back in a bit [20:17:54] * jeremyb_ is pretty sure he's right though [20:18:22] jeremyb_: mm? i'm not sure what i'm supposed to say -- how am i relevant? [20:18:32] oh, the hebrew stuff [20:18:36] yes :) [20:18:54] ori-l, have you already gotten feedback from all the people you're disusering on vanadium? Is that patch ready to be merged? [20:19:08] the only reason i even investigated to begin with was because a local user on uk complained. and you can cover the he part :) [20:20:09] andrewbogott: yes. drdee objected on philosophical grounds, saying the question of access revocation in general needs to be discussed and examined, but he did not have a use-case and so surrendered his objection [20:20:28] ok. Merge coming up... 
[20:20:29] !log LocalisationUpdate completed (1.22wmf1) at Mon Apr 1 20:20:29 UTC 2013 [20:20:35] Logged the message, Master [20:21:29] !log demon synchronized php-1.21wmf12/cache/l10n 'Manually syncing 1.21wmf12 l10ncache' [20:21:29] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/56958 [20:21:31] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56958 [20:21:34] Logged the message, Master [20:21:43] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:03] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:33] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [20:22:43] andrewbogott: thanks very much, that was fast [20:22:53] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time [20:24:23] jeremyb_: yes, confirmed [20:25:17] ori-l: are you commenting or should I? [20:26:00] jeremyb_: i commented [20:26:23] i see. toda [20:26:53] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [20:30:34] andrewbogott: do you know what the policy is with respect to home directories? i'd like to at least prune the ones that don't contain anything other than the skeleton .profile & co files, but i'm not sure if it's permissible to peek in a user's directory. [20:30:59] hah [20:31:04] I don't know what the policy is (or if there is a policy :( ) [20:31:07] i'll just leave them be, i guess [20:31:26] You could certainly prune the homedirs of users you're in contact with, right? [20:31:40] i think you can look in anyone's homedir, no? 
if they are doing stuff on WMF servers [20:31:45] if you want to prune [20:31:52] i'd just tarball what's there, back it up somewhere just in case [20:31:55] and remove what you want [20:32:07] That'd be my guess as well, but it's certainly polite to ask [20:33:53] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [20:34:16] LeslieCarr: your favorite alert! [20:35:23] bblack: you may want to invest in a cloak [20:35:29] working on it :) [20:35:38] http://freenode.net/faq.shtml#nicksetup [20:36:08] and then... where is that link [20:36:17] I have the basic nickserv stuff going [20:36:36] someone changed the /topic and it's not there anymore... :( [20:36:43] i think i have the short url memorized [20:37:12] it's apparently https://bit.ly/cloakrequest [20:40:27] * jeremyb_ has fixed the #wikimedia-ops /topic to have cloak requests again :) [20:40:55] is it so frequent to nned one?:P [20:41:42] jeremyb_: doh [20:42:37] MaxSem: err, i think so [20:42:46] maybe i'm wrong [20:44:06] !log restarting lvs1001 [20:44:14] Logged the message, Mistress of the network gear. [20:44:16] !log CORRECTION restarting sshd on lvs1001 [20:44:21] Logged the message, Mistress of the network gear. [20:47:25] ok, i am doing a lvs failover from lvs1001 to lvs1002 [20:48:12] correction lvs1004 [20:48:36] !log failing over lvs from lvs1001 to lvs1004 [20:48:42] Logged the message, Mistress of the network gear. [20:51:00] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:52:10] PROBLEM - SSH on lvs1004 is CRITICAL: Server answer: [20:53:10] RECOVERY - SSH on lvs1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:55:45] !log dist-upgrading lvs1001 [20:55:51] Logged the message, Mistress of the network gear. 
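Ryan_Lane's tarball-then-remove suggestion above can be sketched like this. For safety the sketch runs on a scratch directory rather than a real `/home/<user>`, and deletes nothing until the archive has been written and verified.

```python
import os
import shutil
import tarfile
import tempfile

# Scratch directory standing in for a stale home directory.
home = tempfile.mkdtemp(prefix="olduser-")
with open(os.path.join(home, ".profile"), "w") as f:
    f.write("# skeleton file\n")

# Archive first: back the whole directory up to a compressed tarball.
backup = home + ".tar.gz"
with tarfile.open(backup, "w:gz") as tar:
    tar.add(home, arcname=os.path.basename(home))

# Sanity-check the archive contents before removing anything.
with tarfile.open(backup) as tar:
    members = tar.getnames()
assert any(name.endswith(".profile") for name in members)

shutil.rmtree(home)
```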
[21:03:44] !log demon synchronized php-1.22wmf1/cache/l10n 'Manually syncing 1.22wmf1 l10ncache' [21:03:51] Logged the message, Master [21:05:34] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [21:07:14] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 18806 MB (2% inode=99%): [21:07:44] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [21:09:13] !log rebooting lvs1001 [21:09:20] Logged the message, Mistress of the network gear. [21:09:52] on the good side, the lvs switch was seamless (as it should be but you never know...) [21:10:44] PROBLEM - Host lvs1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.55) [21:12:37] cmjohnson1: in the dc ? [21:13:01] Reedy, did you manage to create that wiki? [21:13:14] not yet lesliecarr: i have a repairman at the house atm...going to be at least another hour [21:13:21] okay [21:13:25] Thehelpfulone: No [21:13:28] I didn't even try [21:13:43] mutante-away wanted to do it [21:17:31] the yet-another-private-wiki-with-overlapping-purposes-that-really-should-be-on-internal? [21:18:06] Nemo_bis, it's for the transition team, so not really [21:18:38] * Nemo_bis reads "yes" [21:20:34] RECOVERY - Host lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [21:24:13] lol [21:24:18] A wiki still seems extreme [21:26:06] <^demon> Creating wikis is serious business. [21:27:59] What's ironic is that one of its objectives is knowledge transfer and they start by fractioning information on yet another location. :) [21:28:17] I'm wondering what they'll do it on it that they can't do on officewiki - candidate discussion? [21:28:33] Nemo_bis, isn't internal being killed? [21:29:16] Nemo_bis: done. 
http://yet-another-private-wiki-with-overlapping-purposes-that-really-should-be-on-internal.wikimedia.org/ [21:29:34] Thehelpfulone: I doubt that "kill internal" is the 11th commandment [21:30:02] hashar: "malformed URL" [21:30:06] :) [21:30:16] hashar: Yay, no double subdomains that need extra SSL certs [21:39:29] heya, notpeter, have you any idea how to make gmetad (or whatever) take update after a .pyconf change? [21:39:43] this: [21:39:43] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=oxygen.wikimedia.org&v=1997694&m=RcvbufErrors&r=hour&z=default&jr=&js=&st=1364847743&vl=packets&ti=UDP%20Receive%20Buffer%20Errors&z=large [21:40:55] wait, hmm, before you answer... [21:42:10] ori-l, i think I still have the udp_stats as 'both', which is wrong [21:42:13] no wonder it hasn't changed [21:42:22] ahhh no no [21:42:24] hm [21:42:32] no its positive [21:42:47] just hadn't updated my local recently, ok so [21:42:48] yeah [21:42:54] notpeter, now if you know you can answer :p [21:44:42] wait [21:44:46] what are you trying to do? [21:44:50] notpeter: so any thoughts on rt 4844? [21:45:08] Change abandoned: Hashar; "solved differently in labs by creating a jenkins-deploy user. This change is no more needed (the oth..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53736 [21:45:09] so, ori-l and I had defined a new udp metric, files/ganglia/udp_stats.{py,pyconf} [21:45:14] the only reason i even investigated to begin with was a local complained [21:45:16] we set it up with a slope of 'both' [21:45:20] but then realized that was wrong [21:45:27] since each of the metrics there is an increasing counter [21:45:34] so, we changed the slope to 'positive' [21:45:41] but, the metrics in ganglia have not changed [21:45:56] heh, otto [21:46:28] I'd say restart the ganglia daemon on each of the nodes that's sending data [21:46:31] did you do so already?
yes [21:46:43] have also one on the aggregators for misc eqiad [21:46:45] done* [21:47:16] hhhhmmmm, not sure, tbh [21:47:41] New review: Hashar; "That is still a bit messy. I need another level of iterator, handling the dist by parsing the title ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56382 [21:47:43] maybe I could remove rrd files? [21:48:22] perhaps? [21:48:29] but that sounds like it has potential to fuck up a lot [21:49:31] yup [21:49:33] hm [21:51:36] ok, i just restarted ganglia-monitor on all machines that I know of that would touch this, except for nickel [21:55:34] kraigparkinson: grooming session is filling up nicely: https://mingle.corp.wikimedia.org/projects/analytics/cards?favorite_id=758&view=%3EWIP+-+Feature+Analysis [21:55:47] some cards are still in analysis [21:56:19] wow, that's a lot of change since I looked at it 8 minutes ago! :p [22:04:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [22:06:04] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 18101 MB (2% inode=99%): [22:06:34] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [22:14:16] notpeter, is this something you could take on? https://rt.wikimedia.org/Ticket/Display.html?id=4862 [22:16:23] Reedy: ChangeNotificationJobs are OOMing [22:16:26] is that already filed? [22:16:28] orly [22:16:35] I don't think so [22:21:15] Reedy: I'm not seeing a bug for 'User::addToDatabase: hit a key conflict attempting to insert a user row, but then it doesn't exist when we select it!' [22:22:45] https://bugzilla.wikimedia.org/show_bug.cgi?id=41609 ?
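For reference on the slope problem discussed above: in a ganglia python module, the slope lives in the metric descriptors returned by `metric_init()`, not in the .pyconf. The sketch below is a minimal hypothetical descriptor, not the actual files/ganglia/udp_stats.py; per the earlier comment in this log, the aggregator keeps the old definition until it is restarted (or the metric is renamed).

```python
def metric_init(params):
    # Minimal sketch of a ganglia python-module metric descriptor.
    # For an ever-increasing kernel counter such as RcvbufErrors,
    # slope should be 'positive'; 'both' is for gauges that can
    # move up or down.
    descriptors = [{
        'name': 'RcvbufErrors',        # metric name as shown in ganglia
        'call_back': lambda name: 0,   # stub; a real module would read /proc counters
        'time_max': 60,
        'value_type': 'uint',
        'units': 'packets',
        'slope': 'positive',           # was 'both'
        'format': '%u',
        'description': 'UDP receive buffer errors',
        'groups': 'udp',
    }]
    return descriptors
```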
[22:25:48] !log root synchronized docroot [22:25:51] !log recreating transitionteam wiki docroot from skel-1.5 and syncing docroot [22:25:52] Logged the message, Master [22:25:58] Logged the message, Master [22:25:59] !log aaron synchronized php-1.22wmf1/includes/job 'deployed c7832c8956d79892be93452e5f4d6b52df1d3bf0' [22:26:04] Logged the message, Master [22:26:24] !log aaron synchronized php-1.22wmf1/maintenance 'deployed c7832c8956d79892be93452e5f4d6b52df1d3bf0' [22:26:29] Logged the message, Master [22:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:29:41] binasher, can you review https://gerrit.wikimedia.org/r/#/c/56502/ please? [22:30:29] binasher: did you get a chance to start https://gerrit.wikimedia.org/r/#/c/35139/ ? [22:31:17] AaronSchulz: not yet, will still do today [22:31:17] MaxSem: sure [22:43:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56502 [22:48:40] binasher, thanks [22:49:58] New patchset: Mattflaschen; "Make SSH banner more sympathetic and rhythmic." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/57004 [22:51:42] !log aaron synchronized php-1.22wmf1/includes/job 'deployed ec8f42daf19f386d9af2eece6dfc5f8fa6fc42f3' [22:51:48] Logged the message, Master [22:57:30] !log beginning image.img_media_mime schema migrations, starting with s1 [22:57:36] Logged the message, Master [23:05:18] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [23:06:13] !log creating upload dir on upload7 for transitionteamwiki [23:06:21] Logged the message, Master [23:06:58] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 17412 MB (2% inode=99%): [23:07:28] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [23:07:32] TimStarling: so is https://gerrit.wikimedia.org/r/#/c/55777/ ready to go? [23:09:26] yes, assuming the relevant core/MWSearch patches are live now [23:09:46] mutante: Don't think you should have needed to do that.. Maybe https://wikitech.wikimedia.org/wiki/Add_a_wiki#Swift [23:09:57] I thought it would make sense to deploy it separately, which implies waiting for the MWSearch stuff to go live [23:12:27] New patchset: Pyoungmeister; "correct usage of $title var for multi-instance mysql" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57006 [23:13:18] New patchset: Dzahn; "add transitionteamwiki settings (RT-4850)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57007 [23:14:02] !log swapping disk1 (slot1) db1001 [23:14:04] Reedy: ok, it told me to at least use those 4: Update with wgUploadDirectory, wgSitename, wgMetaNamespace and wgServer, wgCanonicalServer.
[23:14:08] Logged the message, Master [23:14:21] Reedy: and it looked to me like if i don't do it, the UploadDir will _not_ be below ./private [23:14:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57006 [23:14:34] Maybe [23:14:35] * Reedy shrugs [23:14:39] Not a big deal [23:14:46] https://gerrit.wikimedia.org/r/#/c/57007/1/wmf-config/InitialiseSettings.php [23:14:54] looks reasonable? [23:15:00] notpeter: let's switch search from eqiad to tampa ? [23:15:18] LeslieCarr: ok [23:16:38] any particular ports for the sfps? [23:17:03] first available - which should be 0/0/16 if the config is accurate ? [23:17:31] New patchset: Dzahn; "add transitionteamwiki settings (RT-4850)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57007 [23:17:54] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57007 [23:18:20] and let's move search1024 , which should be in 0/23 on a5 [23:18:50] first available is 18...16/17 are mc1017/18 [23:19:48] good to know [23:19:56] i will update the descriptions and vlans [23:20:36] lesliecarr notpeter...okay to move search1024 now? 
[23:20:44] cmjohnson1: no [23:20:53] we need to fail over traffic first [23:20:58] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [23:21:04] Logged the message, Master [23:21:07] okay...lmk [23:21:15] mutante, I think on private wikis enabling blockdisableslogin is also a good idea [23:21:57] not sure what the guidelines are for using 'wmgUseDisableAccount' => array( [23:21:57] vs 'wgBlockDisablesLogin' => array( [23:21:58] though [23:23:50] It's usually wg if it needs no transforming [23:24:12] RoanKattouw_away: ping [23:24:15] if it's $wmg, it needs to be set after the extensions have been loaded so they don't get overridden, or WMF related configuration [23:24:46] RoanKattouw_away: http://www.meetup.com/Geeklist-San-Francisco-Meetup-Series/events/110490112/ [23:25:09] wg seems to be most popular, let's go with that [23:25:26] !log running addWiki.php for transitionteamwiki - Access denied for user ... [23:25:32] Logged the message, Master [23:26:06] doesn't work for me [23:26:15] access denied for user what? [23:26:33] Error: 1044 Access denied for user 'wikiadmin'@'208.80.152.%' to database 'transitionteam' (10.64.16.153) [23:27:03] oh wait [23:27:12] wrong db name:) [23:27:39] job 76 at Mon Apr 1 23:42:00 2013 [23:28:25] ok, next step says to check if all *.dblist files now contain it.. none of them do [23:28:39] was that what you mentioned about dblist issues? [23:28:46] yeah, you should likely have seen a spam of errors [23:28:47] yeah [23:29:16] all, private, s3 should be enough [23:29:30] yes, i got those errors. ok thanks [23:29:33] and also wikiversions for the same reason [23:30:26] W12: Warning: File "s3.dblist" has changed and the buffer was changed in Vim [23:30:29] edit war?
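[editor's note] For readers following the wg-vs-wmg thread above: a plain wg* setting goes straight into the per-wiki arrays of wmf-config/InitialiseSettings.php. A minimal sketch of what the wgBlockDisablesLogin setting under discussion could look like there; the exact keys and values are illustrative assumptions, not the content of the actual change:

```php
// Illustrative fragment only, mirroring the per-wiki array pattern
// quoted in the chat ('wgBlockDisablesLogin' => array( ... )).
'wgBlockDisablesLogin' => array(
	'default' => false,
	// A dblist tag key like 'private' applies the value to every
	// wiki listed in private.dblist (assumed here for illustration).
	'private' => true,
),
```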
[23:30:55] listing it in all.dblist then using ./refresh-dblist should work [23:31:39] ok, doing that [23:31:46] New patchset: Pyoungmeister; "lucene-production.php: moving search traffic to pmtpa temporarily" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57010 [23:31:56] 870 wikis listed in all.dblist... [23:31:57] 51 special wikis to be exempted from wikipedia group... [23:32:22] it added it to wikipedia.dblist [23:32:25] it's not a wikipedia though [23:32:26] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57010 [23:33:33] !log adding transitionteamwiki to private.dblist [23:33:39] Logged the message, Master [23:33:42] sync-dblist [23:33:57] Guess it needs to be in special.dblist too then [23:34:37] New patchset: Pyoungmeister; "correction of mis-comment-out" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57011 [23:35:03] adding to special.dblist, syncing [23:35:06] mutante, this is a good exercise for doc updating too ;) [23:35:29] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/57011 [23:36:25] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic to pmtpa temporarily' [23:36:31] Logged the message, Master [23:36:44] !log dumpInterwiki.php to update interwiki cache [23:36:49] Logged the message, Master [23:38:07] Thehelpfulone: docs WFM ;) [23:38:19] they still reference SVN Reedy :p [23:38:20] cmjohnson1: still getting it switched.. [23:38:31] okay [23:38:39] Only on 2 lines [23:38:50] syncing interwiki.cdb [23:38:55] !log dzahn synchronized php/cache/interwiki.cdb 'Updating interwiki cache' [23:39:00] Logged the message, Master [23:40:14] !log creating swift containers for private wiki [23:40:19] Logged the message, Master [23:41:16] mutante: are you using mwscript? 
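[editor's note] The bookkeeping above boils down to: the new wiki name must appear in each relevant dblist (all, private, s3, and per Reedy also special) before the sync scripts will accept it. A local sketch of that membership check using temp files; the loop is an illustrative stand-in for refresh-dblist/sync-dblist, not the real deployment tooling:

```shell
#!/bin/sh
# Sketch only: demonstrates the dblist membership checks discussed
# above with temp files, not the real production dblists.
set -e
dir=$(mktemp -d)
wiki="transitionteamwiki"

# The lists the new wiki has to land in, per the conversation.
for list in all private s3 special; do
    echo "$wiki" >> "$dir/$list.dblist"
done

# Verify membership the same way one would grep the real lists.
for list in all private s3 special; do
    grep -qx "$wiki" "$dir/$list.dblist" && echo "$list.dblist: ok"
done

rm -rf "$dir"
```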
[23:41:22] New patchset: Pyoungmeister; "using correct metavariable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57013 [23:41:29] mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --wiki=aawiki --backend=local-multiwrite --private [23:41:34] binasher: db1028, okay to hot swap replacement drive? wanna pull from production first? [23:41:36] AaronSchulz: yes [23:41:53] why aawiki? [23:42:10] https://wikitech.wikimedia.org/wiki/Add_a_wiki#Swift [23:42:10] AaronSchulz, apparently it's needed: [23:42:11] TEMP You need to put in --wiki=aawiki after the addWiki.php and before the langcode for now. Script is wonky --RobH 19:20, 29 September 2009 (UTC) This means literally "aawiki", not the name of the wiki you create!! [23:42:31] mutante: or did you just c/p that for irc? [23:42:31] This means literally "aawiki", not the name of the wiki y [23:42:36] i did what i pasted [23:42:40] that was in 2009, has that not been fixed yet? [23:42:52] mutante: what is the wiki? [23:43:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57013 [23:43:08] AaronSchulz: transitionteam [23:43:43] yeah, so you need that for the --wiki param [23:44:32] Thehelpfulone: There's not really anything to fix [23:44:42] it uses the current "config" of aawiki to bootstrap it [23:44:58] I think mwscript handles it now with addWiki in extensions/WikimediaMaintenance [23:45:25] cmjohnson1: can you move search1024 over ? [23:45:55] AaronSchulz: if i do that: Fatal error: /usr/local/apache/common-local/wikiversions.cdb has no version entry for `transitionteamwiki`. [23:46:16] Did you add it to wikiversions.dat? 
[23:46:19] sync-wikiversions [23:46:50] oh, yea, hold on [23:47:22] lesliecarr: okay [23:48:01] !log dzahn rebuilt wikiversions.cdb and synchronized wikiversions files: [23:48:01] Logged the message, Master [23:48:32] Database name transitionteamwiki is not listed in dblist [23:48:37] sigh, we just confirmed that [23:48:57] it's gone again from dblists [23:49:05] New patchset: Dr0ptp4kt; "Unified default lang redirect from m. & zero. Adding three carriers for testing, too." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [23:49:14] lesliecarr: chk it [23:50:17] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100% [23:50:23] brion, would you please review https://gerrit.wikimedia.org/r/55302 and if you agree, add your +1? [23:50:58] hrm [23:51:04] sfp not registering [23:51:06] can you reseat it ? [23:52:04] ahha! [23:52:08] stupid 4500 [23:52:10] ge versus xe [23:52:11] :) [23:52:28] heh [23:52:30] oops [23:53:09] yay [23:53:11] it is alive [23:53:14] IT'S ALIIIIVE [23:53:18] :) [23:53:37] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [23:55:11] !log search1024 is our guinea pig of using sfp-t's and is connected to asw2-a5-eqiad ge 0/0/18 [23:55:12] binasher: any issue with making StatCounter send multiple lines (per stat) in one datagram? [23:55:17] Logged the message, Mistress of the network gear. [23:55:26] thanks cmjohnson1 - RobH and I are all happy now :) [23:55:38] sync-common-all [23:55:51] yw...robh happy though? can't picture it [23:56:05] wait, what are we happy about? [23:56:07] sfp-t? [23:56:14] New patchset: Pyoungmeister; "trying the binasher namespacing model" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/57014 [23:56:16] \o/ [23:56:18] binasher: ^^ [23:56:22] do you think that will work? [23:57:26] New review: Dr0ptp4kt; "Patchset 11 adds three carriers for testing." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [23:58:55] Reedy: sync-common-all can take a couple minutes or more i suppose? [23:59:08] without output that is [23:59:18] mutante: It'll take ages [23:59:21] It's essentially scap [23:59:27] ok :p [23:59:33] well, it's running [23:59:37] Shouldn't need to run scap though [23:59:44] and that should have been all [23:59:48] hope it works afterwards [23:59:54] still missing wiki