[00:00:32] New patchset: Ryan Lane; "Switch from upstart to init for glusterd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52556 [00:01:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52556 [00:05:15] New patchset: Ryan Lane; "Fix reference to gluster's upstart file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52557 [00:06:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52557 [00:08:00] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Mar 7 00:07:53 UTC 2013 [00:08:40] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [00:08:40] RECOVERY - Puppet freshness on constable is OK: puppet ran at Thu Mar 7 00:08:37 UTC 2013 [00:09:11] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Mar 7 00:09:02 UTC 2013 [00:09:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52555 [00:09:40] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [00:09:40] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [00:10:00] RECOVERY - Puppet freshness on constable is OK: puppet ran at Thu Mar 7 00:09:52 UTC 2013 [00:10:10] RECOVERY - Puppet freshness on es3 is OK: puppet ran at Thu Mar 7 00:10:02 UTC 2013 [00:10:20] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Mar 7 00:10:14 UTC 2013 [00:10:41] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [00:10:41] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [00:11:01] RECOVERY - Puppet freshness on constable is OK: puppet ran at Thu Mar 7 00:10:53 UTC 2013 [00:11:21] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Mar 7 00:11:12 UTC 2013 [00:11:40] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [00:11:41] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [00:11:50] RECOVERY - Puppet freshness on constable is OK: puppet ran at Thu Mar 7 00:11:47 UTC 2013 [00:12:11] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Mar 7 00:12:02 UTC 2013 [00:12:20] RECOVERY - Puppet freshness on amssq38 is OK: puppet ran at Thu Mar 7 00:12:18 UTC 2013 [00:12:30] RECOVERY - Puppet freshness on knsq24 is OK: puppet ran at Thu Mar 7 00:12:19 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Thu Mar 7 00:12:20 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Thu Mar 7 00:12:20 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Thu Mar 7 00:12:21 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Thu Mar 7 00:12:21 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Thu Mar 7 00:12:22 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Thu Mar 7 00:12:24 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on amssq43 is OK: puppet ran at Thu Mar 7 00:12:24 UTC 2013 [00:12:31] RECOVERY - Puppet freshness on knsq19 is OK: puppet ran at Thu Mar 7 00:12:25 UTC 2013 [00:12:32] RECOVERY - Puppet freshness on amssq32 is OK: puppet ran at Thu Mar 7 00:12:26 UTC 2013 [00:12:32] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Thu Mar 7 00:12:26 UTC 2013 [00:12:33] RECOVERY - Puppet 
freshness on knsq22 is OK: puppet ran at Thu Mar 7 00:12:26 UTC 2013 [00:12:33] RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Thu Mar 7 00:12:27 UTC 2013 [00:12:34] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Thu Mar 7 00:12:27 UTC 2013 [00:12:34] RECOVERY - Puppet freshness on knsq23 is OK: puppet ran at Thu Mar 7 00:12:27 UTC 2013 [00:12:35] RECOVERY - Puppet freshness on amssq48 is OK: puppet ran at Thu Mar 7 00:12:27 UTC 2013 [00:12:35] RECOVERY - Puppet freshness on amssq49 is OK: puppet ran at Thu Mar 7 00:12:27 UTC 2013 [00:12:36] RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Thu Mar 7 00:12:28 UTC 2013 [00:12:37] RECOVERY - Puppet freshness on amssq60 is OK: puppet ran at Thu Mar 7 00:12:28 UTC 2013 [00:12:37] RECOVERY - Puppet freshness on knsq27 is OK: puppet ran at Thu Mar 7 00:12:29 UTC 2013 [00:12:37] RECOVERY - Puppet freshness on knsq16 is OK: puppet ran at Thu Mar 7 00:12:29 UTC 2013 [00:12:38] RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Thu Mar 7 00:12:29 UTC 2013 [00:12:38] RECOVERY - Puppet freshness on knsq28 is OK: puppet ran at Thu Mar 7 00:12:29 UTC 2013 [00:12:39] RECOVERY - Puppet freshness on knsq26 is OK: puppet ran at Thu Mar 7 00:12:29 UTC 2013 [00:12:40] RECOVERY - Puppet freshness on knsq18 is OK: puppet ran at Thu Mar 7 00:12:30 UTC 2013 [00:12:41] RECOVERY - Puppet freshness on amssq40 is OK: puppet ran at Thu Mar 7 00:12:30 UTC 2013 [00:12:41] RECOVERY - Puppet freshness on amssq45 is OK: puppet ran at Thu Mar 7 00:12:30 UTC 2013 [00:12:58] weeee [00:13:17] Will icinga-wm autorejoin? [00:13:22] or do we have to kick it on neon? [00:13:52] nm. [00:14:06] New patchset: Pyoungmeister; "setting 1 node per shard to innodb_file_per_table for conversion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52559 [00:15:46] !log reedy synchronized wmf-config/InitialiseSettings.php [00:15:51] Logged the message, Master [00:16:42] Reedy: mind if I sync CommonSettings.php? pushing out the addition of a config var that is currently inert (not checked by anything); trying to make deployment simpler. [00:16:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52559 [00:17:26] notpeter: "innodb_file_per_table.. This variable is available as of MySQL 4.1.1" - welcome to the future! [00:17:38] ori-l: if fenari wasn't hanging [00:17:38] hahahaha [00:18:10] the dump + load will probably reclaim some disk space too. maybe not a ton but probably at least a few gigs per shard [00:18:56] well, as space/time tradeoffs go, I dunno if it's woth it. [00:19:03] I mean total wall clock time [00:19:06] not execution time [00:19:23] Reedy: oh. heh. [00:19:39] notpeter: that's not why you're doing it though [00:20:11] * Reedy stabs nfs1 [00:21:22] notpeter: the space reclamation is just like earning points on a credit card that screws you with fees [00:22:40] bahahaha [00:22:48] New patchset: Pyoungmeister; "derp. 
need to pass down" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52563 [00:23:43] New patchset: Reedy; "Allow wikimedia blog on foundationwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52564 [00:23:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52563 [00:24:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52564 [00:25:59] * RobH deletes more stuff off wikitech while apergos isnt looking [00:29:38] yay, fresh puppet [00:30:24] servermon is far happier now than a fwe hours ago. [00:32:04] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/52543 [00:33:49] !log reedy synchronized php-1.21wmf11/extensions/RSS [00:33:54] Logged the message, Master [00:38:00] New review: Tim Starling; "Maybe you could also change search0x to search1000x in manifests/role/lucene.pp to prevent the same ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [00:39:33] !log puppet halted run on cp1003, left in locked state, killed all puppet instances and refired, cleared. [00:39:34] Logged the message, RobH [00:40:40] New patchset: Lcarr; "fixed typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52574 [00:40:47] !log puppet locked on ms1004, killall puppet, refired [00:40:52] Logged the message, RobH [00:41:53] RobH: I don't bother to even log when i do that it happens so much [00:42:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52574 [00:42:26] meh, ms1004 is full [00:42:30] so that wont fix it. [00:42:38] i fear deleting shit on thumb server =P [00:45:02] ahh old log files that are already compressed, why are you still there... [00:45:27] LeslieCarr: this is what i been doing all day, figured i should log one or two to show im alive ;] [00:45:33] hehe [00:49:15] !log restarted pybal on lvs3 [00:49:21] Logged the message, Mistress of the network gear. [00:52:01] LeslieCarr: the celsus icinga alert is false positive from IP change right? [00:52:07] cuz im gonna ack it then. [00:52:29] yeah [00:53:33] huh, it died [00:53:36] icinga, restarting. [00:54:02] !log kicked icinga [00:54:07] Logged the message, RobH [01:02:56] New patchset: Ram; "Bug: 45795 Add explicit property identifying null host." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [01:05:56] RoanKattouw_away: pmtpa parsoidcache is all ready [01:26:59] LeslieCarr: so how long are we keeping the tampa DC? Is there any vague scheduling on that? [01:41:41] Aaron|laptop it all depends on sue/the board and the budget meeting [01:46:38] * Aaron|laptop hands Ryan_Lane a box of Gluster [01:47:00] AaronSchulz: want to manage storage? :) [01:47:03] it's fun [01:47:05] a lot of fun [01:47:35] so would any block storage system work? [01:48:09] as long as it can be shared between clients [01:48:12] would be interesting to have a requirements page [01:48:13] and it isn't a SPOF [01:48:27] though there is little total data right? [01:48:37] right now 10TB [01:48:43] New patchset: Dzahn; "puppetize haproxy (for brewster), the very basics (RT-4660)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [01:48:43] oh [01:48:45] also, it needs to restrict access [01:48:51] and it needs to have read-only support [01:49:01] "restrict access"? 
[01:49:15] export a can only be accessed by client x,y,z [01:49:38] export b can be accessed by anything, but is read only [01:50:36] have you talked to any gluster dev about the problems labs has? [01:50:38] yes [01:50:46] there are no magic nobs I take it? :) [01:50:49] "gluster 3.4 will likely work better for you" [01:50:56] gluster has almost no knobs [01:51:01] better as in "actually work consistently"? [01:51:05] glusterd is single threaded [01:51:21] and every single change you make to any volume requires all bricks to talk to each other [01:51:29] the more volumes you have, the worse things become [01:51:44] gluster 3.4 will have a multi-threaded glusterd service [01:52:11] so in reality it probably will help [01:53:09] there's other things that I very much dislike, though [01:53:11] performance at least [01:53:22] !log installing package upgrades on sodium [01:53:22] that'll only help performance of glusterd [01:53:28] Logged the message, Master [01:53:37] glusterfsd service is what runs for the gluster filesystem itself [01:53:53] and the glusterfs service runs for nfs [01:54:14] * Aaron|laptop lols at the thumbnails errors on http://gluster.org/community/documentation/index.php/GlusterFS_Concepts [01:54:15] the gluster service on the client eats up tons of cpu [01:54:20] heh [01:54:48] the gluster service on the client used to eat up a ton of memory due to memory leaks too, but thankfully those were fixed in the last point release [01:55:18] any major issue with gluster results in an outage, though [01:55:25] and outages result in split-brained files [01:55:37] we're using replica=2, so we can't use the quorum features [01:56:07] you can't relax consistency either? [01:56:21] meaning? [01:56:37] oh. I see what you mean [01:56:43] well, that would result in corruption [01:57:38] though realistically the end-result is the same, I guess [01:58:03] depends on the nature of the "issue" [01:58:09] because it's nearly impossible to fix split-brain issues now, so we just have a bunch of files with input/output errors [01:58:26] and whether blocks can be known to be broken (via some hash or something) or not [01:58:38] * Aaron|laptop has to read up on the design of gluster [01:58:41] well, it's not really a block based filesystem [01:58:48] it works at the file level [01:59:02] so blocks map to local files or something? [01:59:12] files map to files :) [01:59:21] and a client directly writes to two spots [01:59:30] if one is down, then when it comes back up, it's updated from the other [01:59:57] if one goes down, the client switches to the second [02:00:08] if that then goes down, and the other one comes back up [02:00:15] then the client continues to write.... [02:01:34] if using replica=3 and a quorum it'll block writing unless there's a quorum [02:01:38] telling the client its read-only [02:02:44] we could probably switch to replica=3 [02:02:54] but that's a massive waste of space for normal block data [02:03:27] to bad it's not one of those systems that uses solomon-reed codes to avoid 3 copies [02:03:30] *too [02:03:31] and realistically with replica=3 we're approaching netapp prices for hardware [02:03:38] Thehelpfulone: i got the requested list.. RT-4656, but having trouble attaching a 15kb file.. sigh [02:04:50] we're really just using a DFS to avoid a SPOF [02:04:52] Ryan_Lane: so what's the problem if just one spot goes down and then up again? [02:05:05] Aaron|laptop: none [02:05:13] theoretically [02:05:25] so why is stuff going down so much? 
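For reference, the volume-level controls being discussed here look roughly like the following on the gluster command line. This is only a sketch: the volume names, brick paths and client addresses are made up, and the option names are from 3.3-era "gluster volume set help", so they should be checked against the running version.

    # three-way replication is what makes quorum enforcement possible
    gluster volume create projectvol replica 3 server1:/brick server2:/brick server3:/brick
    gluster volume start projectvol
    # restrict which clients may mount this export
    gluster volume set projectvol auth.allow 10.4.0.10,10.4.0.11
    # refuse writes when fewer than a majority of replicas are reachable,
    # instead of letting the survivors diverge into split-brain
    gluster volume set projectvol cluster.quorum-type auto
    # a separate read-only export (ssh keys, xml dumps) would instead get:
    gluster volume set dumpsvol features.read-only on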
[02:05:28] though in practice I've seen some issues with availability [02:05:50] gluster doesn't have a concept of read-only [02:06:01] so, we're using gluster's nfs support for this [02:06:15] glusterfsd services are decoupled from the glusterd service [02:06:19] Aaron|laptop: because gluster's actual implementation is best described by the same acronym as "point of sales" [02:06:24] it seems glusterfs services are not [02:06:32] glusterfs services share nfs [02:07:11] (we need read-only for the ssh keys and for the xml dumps) [02:07:30] so, when the glusterd service died, it would restsrt [02:07:32] *restart [02:08:02] if you were lucky and tried to login when that happened, especially on lucid, it would hang, or deny [02:08:09] lucid would never come back [02:08:16] which is why I just rebuilt bastion1 [02:09:12] so, we haven't had a full gluster outage in a few weeks [02:09:24] we changed how we were interacting with glusterd [02:09:38] previously it was swap-deathing itself [02:09:55] now we're just having the nfs issue [02:12:48] so nfs problems are the reason you want to switch? [02:13:19] or at least you sounded like you wanted to switch to something else this mourning :) [02:14:04] Ryan_Lane: It looks like something is up with the search engine on wikitech (labsconsole). Any pages that were imported don't show up (e.g. looking for "moreb" doesn't suggest "morebots") [02:14:08] known issue? [02:14:15] not known issue [02:14:21] known issue now :) [02:14:27] is there a maintenance script to update the search index? [02:14:48] updateSearchIndex.php [02:15:04] *** Couldn't write to the searchUpdate.labswiki.pos! [02:15:06] o.O [02:15:17] that looks wrong [02:15:34] is labs using the normal search hooks or some OAI thing? [02:15:41] normal mediawiki search [02:16:26] it needs to write to a file? [02:17:01] in the maintenance/ dir by default [02:17:38] well, I tried changing that [02:17:40] *** Couldn't write to the /tmp/searchUpdate.labswiki.pos! [02:17:41] heh [02:18:13] New review: MZMcBride; "Reverted by ." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [02:18:56] you just want to update from some time range probable (-s and -e) [02:19:04] it doesn't really need to write for that case I guess, heh [02:19:58] *probably [02:20:04] ah [02:20:05] yep [02:20:10] I just set the timestamp to 2001 [02:20:35] actually, 2000 :) [02:20:36] well 200101010000 or whatever [02:20:39] Krinkle: should work now [02:21:19] * Aaron|laptop missed a 00 [02:21:19] ... [02:21:22] or not [02:21:49] any other maintenance scripts I should run? [02:21:52] nope [02:22:08] I get a few on Special:Search , but that may be because I edited the pages [02:22:13] I don't see them in prefix search [02:22:14] yeah [02:22:15] indeed [02:22:24] lemme isolate the api request for easier testing [02:22:29] thanks [02:22:36] https://wikitech.wikimedia.org/w/api.php?format=json&action=opensearch&search=moreb&namespace=0&suggest= [02:22:38] I'm sure there's some specific script for this [02:22:48] https://wikitech.wikimedia.org/w/api.php?format=json&action=opensearch&search=User:Mo&namespace=0&suggest= [02:22:56] the latter has been edited and the former has not [02:22:58] neither shows up yet [02:23:48] Ryan_Lane: did the script do anything? [02:23:52] yeah [02:23:59] it said it indexed all the pages [02:24:18] * Aaron|laptop likes how there is search update code in the base Maintenance class... 
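The maintenance run being described above is, roughly, the following. This is a sketch only: the maintenance path is illustrative, and it just uses the -s/-e/-p options mentioned in the conversation.

    # reindex everything touched since 2000 by passing an old start timestamp
    # (-e can bound the other end of the range)
    php /srv/mediawiki/maintenance/updateSearchIndex.php -s 20000101000000
    # the script also records how far it got; -p points that position file
    # somewhere writable, which is what the /tmp path above was for
    php /srv/mediawiki/maintenance/updateSearchIndex.php -s 20000101000000 -p /tmp/searchUpdate.labswiki.pos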
[02:29:09] yeah that script is broken if -p is given too [02:29:17] :D [02:29:19] realpath() returns false [02:29:24] *isn't [02:29:25] it's ok, only third parties use that [02:29:37] ;) [02:29:38] !log LocalisationUpdate completed (1.21wmf11) at Thu Mar 7 02:29:38 UTC 2013 [02:29:44] Logged the message, Master [02:32:13] bleh [02:32:15] I'm going to run maintenance/rebuildall.php [02:32:50] I'm actually not sure why realpath() gives false, even if I use __DIR__ [02:33:11] the dir perms look fine (missing x bit can cause realpath to fail) [02:33:36] doesn't work as root either, wtf [02:33:43] :D [02:37:20] hm. can't seem to make search work properly [02:38:44] this realpath thing must be a php bug, ugh [02:38:50] it's known to have bugs [02:39:41] :D [02:40:03] I guess I'll need to go through the code to see what the search is doing and why it isn't working [02:41:19] Ryan_Lane: does output the list of titles? [02:41:45] yep [02:43:33] though not all of them [02:44:01] but, some that aren't showing up in the search were in the list [02:44:12] I was just about to ask about that [02:44:50] for some reason I remember seeing this in the past [02:47:13] "It will not update the search index for the pages that do not appear in http://www.mediawiki.org/wiki/Special:RecentChanges." [02:47:14] seriously? [02:47:23] no fucking wonder [02:47:49] you can only update the search index for pages in recent changes [02:48:13] Ryan_Lane: did you run rebuildRecentChanges? [02:48:22] yes.... [02:48:26] but think about that ;) [02:53:50] !log LocalisationUpdate completed (1.21wmf10) at Thu Mar 7 02:53:50 UTC 2013 [02:53:56] Logged the message, Master [06:01:52] New review: Nemo bis; "@Reedy: who are you replying to? I meant the password protection without expiry (fixed in the meanwh..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37441 [07:58:19] Nikerabbit, now you're there? [08:39:10] New review: Mattflaschen; "Let's use a separate slow-parse file for private wikis:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49678 [08:40:40] New review: MaxSem; "Just filtering blacklisted wiki names is dangerous, as people will forget to update such blacklist w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49678 [08:43:02] MaxSem: yes [08:43:35] Nikerabbit, https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=vanadium&service=Solr [08:46:00] MaxSem: and what I'm supposed to get out of it? I only see CRITICAL [08:46:08] ...and its cause [08:46:28] MaxSem: average request time for what? [08:46:46] for searches. since it's your installation, let's decide what to do about it [08:47:02] MaxSem: what exactly is profiling my searches? [08:47:19] ? [08:47:40] Icinga;) [08:47:57] MaxSem: and how does it do that? [08:48:18] gets the stats from solr [08:48:26] oki [08:48:44] so it's a reflection of your real preformance (or a lack of it):) [08:49:08] so... we could raise the threshold for this particular service [08:49:23] but it's friggin slow already [08:49:59] okay so [08:50:23] MaxSem: let's rise the threshold a bit, I will fix bug http://bugzilla.wikimedia.org/43778 and see how it acts after that [08:50:46] MaxSem: also, better profiling than just "average" would be nice [08:51:02] whether there is only few hugely slow or whether it is across the board [08:51:45] do you know a better metric in admin/stats.jsp or whatever? 
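The stats page the check scrapes can be inspected by hand; a rough sketch, in which the host, port and field name are all assumptions (Solr's default admin stats page exposes per-handler averages such as avgTimePerRequest):

    # dump the request-handler statistics the icinga check is built on
    curl -s 'http://vanadium.eqiad.wmnet:8983/solr/admin/stats.jsp' | grep -i avgTimePerRequest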
[08:52:32] MaxSem: haven't studied those [08:52:33] if you just want to know what's going on, use standard MediaWiki profiling with graphite [08:52:45] MaxSem: are these milliseconds: https://noc.wikimedia.org/cgi-bin/report.py?db=all&sort=real&limit=5000&prefix=Solr ? [08:52:50] icinga is just for cases where shit hit fan [08:53:18] not sure [08:54:31] MaxSem: how do I find that in graphite? [08:54:36] Nikerabbit, https://graphite.wikimedia.org/dashboard/temporary-23 [08:55:26] ugh, it looks scary [08:55:34] yeah, even with this few dataplots [08:55:59] or is that just dust in my screen [08:56:25] hmm updates timing out... that is bad [08:56:46] I wonder why is that [08:57:58] MaxSem: I will also be setting up Solr for translatewiki.net production to see how it compares [08:58:45] meanwhile, I'll tweak monitoring [09:32:08] New patchset: awjrichards; "Update X-CS handling to new k/v pair spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [10:04:07] New review: Platonides; "The filter can read the blacklist directly from private.dblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49678 [11:13:49] New review: Aklapper; "Hmm, now that you say I think I remember something. Let's abandon this change, if I find out how to ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52393 [11:14:12] Change abandoned: Aklapper; "Adds more confusion than before, see comment by Nemo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52393 [11:15:38] New review: Aklapper; "34 is the width of a column - needed to extend it so the string isn't cropped" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52400 [12:18:39] New review: Peachey88; "Wouldn't it be easier to do it on wgDebugLogFile compared to wg DebugLogGroups, So all the private l..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52608 [12:46:29] New patchset: Aklapper; "[bug 45770] Update used parameters in statesToRun and resolutionsToRun arrays" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52616 [12:47:23] New review: Aklapper; "Superseded by https://gerrit.wikimedia.org/r/#/c/52616/ - marking as abandoned." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52396 [12:47:43] Change abandoned: Aklapper; "Superseded by https://gerrit.wikimedia.org/r/#/c/52616/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52396 [14:53:49] New review: Platonides; "$wgDebugLogGroups has priority over $wgDebugLogFile. It may make sense to remove most of them for pr..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52608 [15:00:45] !log Restarted gmetad on nickel [15:00:51] Logged the message, Master [15:28:14] !log Reinstalled cp3003 [15:28:20] Logged the message, Master [15:30:08] does udp2log need a config change when adding a new type ? [15:30:50] what do you mean by a type? [15:33:07] see https://gerrit.wikimedia.org/r/52608 [15:33:32] that new privatewiki-slow-parse log [15:34:30] hmm, don't know [15:48:32] paravoid: are you around? [15:48:37] yes [15:49:07] i want to reseat disk4 on ms-be1004. is this ok? 
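The "osdmap e170240: 144 osds: 138 up, 138 in" line quoted above comes from the cluster status; finding which OSDs (and hence which disks) are the missing ones is standard ceph CLI, nothing WMF-specific:

    # overall health, including the osdmap summary
    ceph -s
    # per-OSD view with host placement and up/down state
    ceph osd tree | grep -w down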
[15:49:31] yes [15:50:04] k..cool also found a few failed disk in 1011/1009/1006 [15:50:29] oh really osdmap e170240: 144 osds: 138 up, 138 in [15:51:30] odd [15:52:10] 1002, 1004, 1006, 1008, 1009, 1011 [15:52:14] let me have a look [15:59:20] 1002 sdl, 1004 sde, 1006 sdf, 1008 sdk, 1009 sdh, 1011 sdi [15:59:27] cmjohnson1: these seem to be all broken [15:59:44] I/O errors [16:00:05] note that sda is the VD with the SSDs, so it's off-by-one to get the bay nr. [16:00:30] next hardware issue [16:00:33] bad drives? ;-) [16:01:04] that's quite a lot suddenly [16:01:24] statistically though it might make sense [16:01:36] although I wonder why we haven't had the same ratio in pmtpa/swift [16:02:05] we are not 100% r720s in tampa [16:02:22] this looks like disk-related [16:02:34] but the r720xd replacement includes disks, right? [16:02:58] yes...they all have the same 3TB disks [16:03:11] w/the exception of the flex bays [16:04:13] New patchset: Mark Bergsma; "Revert "make check_http (80) and check_tcp (8080) on install hosts a critical (paging) service"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52623 [16:04:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52623 [16:05:03] mark: I disagreed with that on RT too [16:05:08] i saw [16:06:51] and even if that were a critical service to labs, that needs to become a properly designed HA setup before it can become a critical service [16:07:07] !log reseating disk4 on ms-be1004 [16:07:12] Logged the message, Master [16:21:28] New review: MaxSem; "Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [16:26:34] cmjohnson1: it still snowing there? [16:26:46] robh: no it stopped last night [16:26:51] it is all melting today [16:27:02] good =] [16:27:09] so the cameras were still in recieving [16:27:31] well, they better not charge us for delivery on upcoming bill [16:27:35] i'll have to check, thx for info [16:28:06] yep...I need some help w/the ticket for equinix...but will ping you about it shortly in chat session w/dell [16:28:13] \o/ [16:28:24] cool, im working from home today, so will be online and available from now onward [16:28:34] no commute \o/ [16:29:20] gotta love that! [16:29:35] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [16:29:44] robh: the hpm servers have both connections so all disk drives will work [16:31:59] awesome! [16:32:12] so we'll be taking a hpm ssd model and putting it in frack, ill drop ticket in a few minutes [16:34:29] ok...jeff_green can you put a ticket in the eqiad queue for the frack db's you need moved [16:34:53] cmjohnson1: sure. but we can't start that until April [16:35:16] that's okay..i will stall it but don't want to forget ;-] [16:35:20] k [16:36:55] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=4665 [16:37:15] woot [16:37:31] That is for samarium, jeff's new public log host [16:37:53] Jeff_Green: I put in that ticket to plug it into whatever firewall has the least number of connections [16:38:08] Is that ok, or did it need dual connections? [16:38:17] single is ok, and that's probably fine [16:38:23] cool [16:38:31] although can you let me know which it ends up in? 
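A sketch of how drives like these are usually confirmed and mapped to bays on this hardware. The device name is taken from the list above; the MegaCli binary name and adapter selection vary by install, so treat them as placeholders.

    # kernel-side evidence for one of the suspects (sde on ms-be1004)
    dmesg | grep -i 'i/o error' | grep -w sde
    # controller-side view of slots and drive states; remember sda is the SSD
    # virtual disk, hence the off-by-one when turning sdX into a bay number
    megacli -PDList -aALL | egrep 'Slot Number|Firmware state'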
[16:38:42] I asked in ticket for chris to drop a new network ticket for it [16:38:43] with all the info [16:38:47] k [16:38:59] (i'd link them all together with refers too personally, i like link history of tickets) [16:39:22] cool [16:55:00] robh: need you help w/equinix ticket for cameras [16:55:16] oh the request for mounts? [16:55:45] yes...what type of request would this be? [16:55:53] its a smart hands [16:56:04] request a metal junction box + conduit for each camera [16:56:11] and mark on the pillars where we want it [16:56:17] these PoE right? [16:56:19] yep [16:56:25] k [16:56:37] so once the junction boxes and conduit are in place, you can run each camera to ports0-3 on any switch in row c [16:56:39] that makes it easier [16:56:48] those are the PoE ports on a EX4200 [16:57:01] (may be more than that, but on existing cameras we have not gone above port 3) [16:57:21] well that only leaves 2 switches so 8 or 5 [16:58:01] that works [17:17:37] cmjohnson1: ms-be1004's disk is still borked [17:19:28] ok...thx...working on replacing them all [17:19:34] thanks [17:31:58] robh: external osm servers in eqiad are not in public vlan yet...unless you changed it already [17:32:12] eww, i did not [17:32:16] mind fixing them? [17:32:24] (figured you may want more switch work ;) [17:34:26] cmjohnson1: its not 'omg fix now' paravoid will be at a hackathon this weekend [17:34:32] so as long as they are working by end of day, its fine. [17:34:45] I think we're going to start in labs first anyway [17:34:52] so it probably doesn't matter [17:36:24] !log reedy synchronized wmf-config/InitialiseSettings.php [17:36:25] ok...cool [17:36:28] Logged the message, Master [17:36:39] reedy@fenari:/home/wikipedia/common$ sync-file wmf-config/InitialiseSettings.php [17:36:39] No syntax errors detected in /home/wikipedia/common/wmf-config/InitialiseSettings.php [17:36:39] copying to apaches [17:36:39] reedy@fenari:/home/wikipedia/common$ [17:36:44] That's scarily quiet! [17:37:07] jeff_green..your new shiny server has been moved just needs network port enabled. [17:37:20] cmjohnson1: nice. thanks! [17:37:28] Jeff_Green: And if you end up not needing the SSDs, let us know. [17:37:33] will do [17:39:17] New patchset: Reedy; "Bug 45819 - Fix RSS plugin on wmfwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52631 [17:40:18] robh: wmf4084...can i move this to c7? Not sure if you still need it in a db rack [17:40:46] dude, i dont recall wtf we put it there for [17:40:55] maybe cuz it was only rack with networkign working [17:40:58] and then we didnt need it [17:41:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52631 [17:41:02] but i dunno, so it can move. [17:41:36] ok..figured as much but just wanted to confirm [17:41:37] thx [17:49:40] Damn. No peter or asher [17:50:14] Reedy: hm? [17:50:41] Wanting a mariadb host to see if it's cleverer with a query [17:50:49] Oh [17:50:50] 19:02 Ryan_Lane: binasher: shutting down mysql on db1043, converting to mariadb [17:51:24] Reedy: there's a mariadb => true in puppet for the mariadb hosts afaik [17:51:37] SAL was quicker :D [17:55:36] would it be reasonable to use https://github.com/puppetlabs/puppet-postgresql in our repo, or it's too huge? [17:57:24] is it a matter of size? [17:57:43] huge software requires more care:) [17:57:50] frankly, I think you're overreaching a bit [17:57:59] with all this OSM stuff [17:58:08] like? 
[17:58:12] you've asked for an ops person to be present and you got it [17:58:28] I won't attend just for my merge powers [17:58:50] definitely [17:58:51] I'll do the research and pick the suitable tools for this [17:59:13] you're more than welcome to help but I think you're doing far too much work without having discussed anything with me first [18:01:38] I've been given a task to make this happen - I'm doing it [18:01:45] http://osm.wmflabs.org/osm/slippymap.html :) [18:02:01] You still need ops to make it happen [18:02:38] yup [18:06:05] sigh [18:12:50] * Aaron|home wonders what DeleteJob and MoveJob are [18:13:37] they delete and move [18:15:01] /home/wikipedia/common/php-1.21wmf11/includes/limit.sh: line 77: /sys/fs/cgroup/memory/mediawiki/job/32466/tasks: No such file or directory [18:17:17] Aaron|home: translate [18:18:01] Reedy: figured, I don't like how vague the names are [18:18:08] mmm [18:18:19] Nikerabbit, ^^^ [18:18:23] Easily fixed [18:18:27] With a minor workaround [18:18:36] DeleteJob -> TranslateDeleteJob [18:18:47] class DeleteJob extends TranslateDeleteJob {} [18:18:55] to save breaking the current job queue entries [18:19:00] Aaron|home: are they stuck somewhere? [18:19:04] we had some failures [18:19:12] MaxSem: what? [18:19:16] https://bugzilla.wikimedia.org/show_bug.cgi?id=44865 [18:19:17] no, I was just commenting on the naming [18:19:20] Reedy: https://gerrit.wikimedia.org/r/#/c/52588/ [18:19:21] Nikerabbit, ^^^ [18:19:27] ah ok [18:19:37] Nikerabbit: RenderJob, MoveJob and DeleteJob are too generic in their names [18:19:56] patches welcome ;) [18:20:09] I'll do it then? :p [18:20:15] this is sooo open sourcy:P [18:20:41] is there need to keep BC class aliases for a while? [18:21:00] probably wouldn't hurt, to let jobqueue process those jobs [18:21:10] or a b/c jobtypes entry [18:21:34] well, $wgJobClasses [18:21:49] Nikerabbit: class DeleteJob extends TranslateDeleteJob {} [18:21:50] Aaron|home: ah that works too [18:24:44] Reedy: WorkJob! [18:25:36] Could not fetch review information from gerrit [18:25:37] fatal: internal server error [18:25:38] grr [18:25:47] gerrit being really slow today? [18:29:37] Nikerabbit: https://gerrit.wikimedia.org/r/52638 [18:29:52] Reedy: now you cr that redis thing ;) [18:30:20] Aaron|home: https://gerrit.wikimedia.org/r/51675 [18:30:22] gogogogo [18:33:48] heh, I actually have Postgre on this box [18:34:04] (note that calling it "Postgre" is normally a sin) [18:36:23] Reedy: https://www.mediawiki.org/wiki/Extension:Translate#Recent_changes [18:37:05] Nikerabbit: There'll be a few of those [18:37:18] Lets add gerrit [18:37:26] thx [18:39:02] Reedy: wasn't there a patch for that already [18:39:03] !log reedy synchronized wmf-config/InitialiseSettings.php [18:39:09] Logged the message, Master [18:39:23] Not that I saw [18:40:12] well isn't that great, it doesn't like protrel urls [18:40:40] That, or is it exact matching, not partial.. [18:41:23] !log reedy synchronized wmf-config/InitialiseSettings.php [18:41:30] Logged the message, Master [18:43:17] https://gerrit.wikimedia.org/r/#/c/51675/15/maintenance/sqlite/archives/patch-archive-ar_id.sql ugh [18:43:34] Reedy: can you just made the _id columns a generic unique index for all the db types? 
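The limit.sh error quoted above is about the per-job memory cgroup; the mechanism it leans on is plain cgroup-v1 plumbing, roughly the wrapper below. This is a sketch of the technique, not the actual limit.sh code — the group path just mirrors the one in the error message, and the 1 GiB cap is arbitrary.

    #!/bin/bash
    # create a memory group for this job and cap it
    G=/sys/fs/cgroup/memory/mediawiki/job/$$
    mkdir -p "$G"
    echo $((1024*1024*1024)) > "$G/memory.limit_in_bytes"
    # move ourselves into the group; the "No such file or directory" above means
    # this tasks file had disappeared out from under the wrapper
    echo $$ > "$G/tasks"
    # then run the real command inside the capped group
    exec "$@"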
[18:43:57] it would still be the PK for mysql (being the first unique index) and it would simplify the sqlite patches [18:44:10] you add columns + unique indexes with ALTER then [18:44:16] *you could [18:44:27] mutante: hey , could you possibly add me on the CC list for https://rt.wikimedia.org/SelfService/Display.html?id=4676 ? I don't have access to it. That is a request to grant marktraceur shell access on the continuous integration box. [18:47:25] New patchset: Reedy; "Enable translate gitweb rss feed on mediawikiwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52642 [18:47:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52642 [19:17:28] !log pgehres synchronized php-1.21wmf11/extensions/CentralNotice/ 'Updating CEntralNotice to master, resolving bug 45846' [19:17:34] Logged the message, Master [19:19:59] hashar: added you as a requestor, that should give you access [19:20:23] mutante: nice thanks! [19:21:47] New patchset: Lcarr; "fixing up nagios-nrpe-server init on icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52647 [19:23:00] mutante: and it got approved by manager :] [19:24:01] Leslie, icinga died aain today and I restarted it manually.. it left its pid file around again, still don't know what causes it [19:24:09] hrm [19:30:31] New patchset: Bsitu; "Add EventLogging configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52648 [19:32:33] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52648 [19:36:16] paravoid: ms-be1011...i reseated drive 7 and it did not rebuild, most likely foregin cfg issue. status is unconfigured good. [19:37:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52647 [19:37:25] !log bsitu synchronized wmf-config/CommonSettings.php 'Add Eventlogging configuration to Echo' [19:37:30] Logged the message, Master [19:40:30] New review: Hashar; "This should really not be made public for some over reasons too beside the potential disclosures of ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49678 [19:40:44] binasher: I can finally use linkedin again ;) [19:41:08] woo! [19:41:12] thanks :) [19:42:14] New review: Platonides; "Which ones?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49678 [19:44:21] slowest "git review" ever [19:44:31] but maybe its office wifi .. [19:46:20] New patchset: Dzahn; "merge ./misc/nagios.pp into icinga.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52651 [19:57:10] New patchset: Lcarr; "icinga nagios cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52653 [19:57:20] mutante: ^^ ? [19:58:30] New patchset: Catrope; "Use the service IP for parsoidcache in Tampa as well" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52654 [19:58:57] partytime!!! [20:03:33] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52654 [20:05:29] !log catrope synchronized wmf-config/CommonSettings.php 'fa0a56a89 - parsoidcache VIP in pmtpa' [20:05:35] Logged the message, Master [20:17:54] LeslieCarr, you want I should merge this icinga patch? [20:20:46] New review: Andrew Bogott; "Looks good, I will merge shortly." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/52026 [20:35:03] oh teh little one ? 
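On the "foreign cfg / unconfigured good" state reported for ms-be1011's drive 7, the usual sequence on these controllers is roughly the following. The MegaCli binary name and adapter number are assumptions, so this is a sketch rather than a recipe.

    # does the reseated drive carry a foreign (previous-array) configuration?
    megacli -CfgForeign -Scan -a0
    # either import it, keeping the old array metadata ...
    megacli -CfgForeign -Import -a0
    # ... or clear it, so the drive drops back to unconfigured-good and can be
    # re-added or left to rebuild
    megacli -CfgForeign -Clear -a0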
[20:36:27] yeah [20:37:06] yes plz [20:37:07] thanks [20:38:09] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52026 [20:42:09] For the E3 deploy, I +2'd the core submodule update https://gerrit.wikimedia.org/r/#/c/52659/ , and jenkins-bot "Starting gate-and-submit jobs" has taken over 8 minutes [20:43:05] hashar's eating or something [20:43:10] http://integration.mediawiki.org/zuul/status seems to indicate it's gone on to test subsequent gerrit commits... ? [20:43:20] jeremyb_ hashar is standing next to me [20:43:29] that counts as or something [20:44:00] :) [20:45:47] binasher: https://gerrit.wikimedia.org/r/#/c/52458/ [20:49:45] FYI zuul/jenkins finally failed with "ERROR: Couldn't find any revision to build". hashar thinks it's a bug [20:50:10] Jeff_Green, I tidied https://wikitech.wikimedia.org/wiki/Manual_for_ops_on_duty up a bit but if you get some time to expand that, that would be good [20:52:12] New patchset: MaxSem; "Set mobile URL template for beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52667 [20:53:09] Thehelpfulone: hey there's nothing on there about how to handle documentation requests, so I don't know how to handle it :-P [20:53:29] I'll write a section just for you if you really want ;-) [20:53:32] hahaa [20:54:06] also could someone enable $wgVectorUseIconWatch = true; [20:54:06] on labs? I'm missing my little star to watchlist pages [20:54:06] hahaha, no please don't. i'm busy trying to figure out how to monitor service clusters in icinga. [20:55:47] kaldari, did you have jenkins CI problems with your core submodule updates? Just curious [20:56:01] CI? [20:57:33] kaldari says "yes, the gate-and-submit jobs were really slow", but E2 finally succeeded [21:02:17] We're still waiting on this one to merge: https://gerrit.wikimedia.org/r/#/c/52668/ [21:04:02] kaldari, http://etherpad.wikimedia.org/E3-2013-03-07-deploy has our three extensions. They submodules are updated in core for wmf11, and Jenkins is grinding away for wmf10 https://gerrit.wikimedia.org/r/#/c/52669/ [21:04:21] cool, I'll go ahead and deploy them for wmf11 [21:04:37] kaldari <3 [21:05:59] kaldari your 52669 had the same Jenkins "No candidate revisions . ERROR: Couldn't find any revision to build" failures for PHPUnit tests that we had [21:06:36] lesliecarr: oob link is finished [21:06:44] yay!!! [21:09:12] New review: Hashar; "if you say so :-]" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/52667 [21:09:33] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52667 [21:09:52] So 1) Jenkins CI quite slow, 2) failures to get revision to build so voting tests fail. [21:19:29] spagewmf: I am The Keymaster! [21:20:29] kaldari, hashar: good news, Jenkins CI succeeded on the wmf10 extension submodule update and merged (4 minutes) [21:21:02] kaldari, since scap does both branches, can you update 1.21wmf10 as well before scap? [21:21:41] I could, but I don't have a wmf10 deployment branch, so it'll take me a while [21:22:36] spagewmf: great! 4 minutes is about normal. 
Most of the time is actually spent in the parser wich is usually a bit more than 3 mns [21:22:43] New patchset: Dzahn; "add mholmquist to jenkins admins / sudo to restart jenkins/psql on gallium (RT-4676)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52671 [21:22:44] we weren't deploying anything to wmf10 ourselves [21:22:55] kaldari OK, well while we test test, I'll get started in fenari:/home/wikipedia/common/php-1.21wmf10 [21:23:04] hello [21:23:22] I was directed here, and i was hoping to ask a question about cacheing and the Main Page [21:23:48] we currently have a proposal for a bot, the link is here: http://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Joe%27s_Null_Bot_2 [21:24:30] what this would do is purge the Main Page 4 times and hour for when we implement "Today's articles for improvement" on the Main page [21:24:34] New review: Hashar; "Sounds good. Thank you for all the paperwork!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/52671 [21:24:58] marktraceur: you are going to be granted access on the cont int server with additional privileges :-] [21:25:03] Woo! [21:25:15] https://gerrit.wikimedia.org/r/#/c/52671/ [21:25:19] ping daniel about it :-] [21:25:22] * marktraceur does a continuously integrated jig [21:25:42] will have to give you some trip'n trick beforehand though [21:25:46] Right. [21:25:54] hashar: Anytime you want to chat about it, I'm game [21:26:51] kaldari, there's a glitch, we need to bump an extension. Sorry [21:27:06] ok [21:27:11] not scapping yet [21:27:12] Any WMF input about the server side reprecussions of this proposed bot would be appreciated, please give any relevant information at: http://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Joe%27s_Null_Bot_2 [21:30:24] NickPenguin: why ? [21:30:37] 1 refresh per 15 minutes isn't a big deal, but i want ot know why [21:30:56] LeslieCarr: Forces a reshuffle of the list to present them in pseudorandom order [21:31:10] ah [21:31:11] LeslieCarr: otherwise the one order is fixed. [21:31:32] ah gotcha [21:31:35] (for logged out users) [21:31:47] let's not mention our logged in user performance data [21:32:36] drdee: ping [21:38:19] kaldari, the change is GettingStarted 3c981ea Disable tour for everyone , we're waiting for the Jenkins CI jobs to complete [21:41:55] Krinkle, you have a change in 1.21wmf10 f3fb906b "mediawiki.jqueryMsg.test: Fix expected number.". "How to deploy" says "go yell at the culprit" :) Since it's a test I assume it's OK to deploy along with our extensions [21:42:21] Yes [21:42:28] test is not deployed in the cluster [21:42:33] the directory is unaccessible [21:42:55] didn't want to waste a sync for it, go ahead :) [21:43:18] Ryan_Lane: busy ? [21:43:24] want to party with the upgrades ? [21:46:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52653 [21:48:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52651 [21:50:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52185 [21:50:08] hashar: what else did you want me to review? [21:50:22] ohouuuouou :)] [21:51:07] notpeter: there is the huge puppet change at https://gerrit.wikimedia.org/r/#/c/51677/ [21:51:20] latest patchset (#15) is more or less working properly [21:51:43] not sure how bad it is going to be for the production cluster though [21:52:05] lucene seems to be running more or less properly on the beta boxes. 
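For the Main Page purge bot proposed earlier (Joe's Null Bot 2), the request it would send four times an hour is a single API purge. A minimal sketch against the stock MediaWiki API; the real bot would log in with its bot account first, which is omitted here.

    # action=purge must be POSTed; the purge forces a reparse, which reshuffles
    # the pseudorandom TAFI ordering for logged-out readers
    curl -s -d 'action=purge&titles=Main%20Page&format=json' https://en.wikipedia.org/w/api.php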
[21:52:16] kaldari so sorry for the delay, we are set on wmf10 and wmf11, do you want to start the mtoher of all scaps? [21:52:17] might have to move the indexer to another instance that will have more memory [21:52:31] the java -Xm4000m is killing the instance when using the index-updater job [21:52:40] we have one bug we want to fix [21:52:51] hashar: you can make that a param if you want? [21:53:01] http://etherpad.wikimedia.org/Upgrade-now-bitches [21:53:05] notpeter: I probably should [21:53:09] kaldari after me? No, please, after you :) [21:54:25] Best etherpad title in quite a while [21:54:48] hashar: in the change you made to search.pp, did you mean to have a value for $sync_conf_initialisesettings in the labs case? [21:54:59] haha [21:55:03] kaldari Matt (superm401) and I are joining the EventLogging workshop behind you, let us know when you scap. Thx++ [21:55:17] that's fine [21:55:25] I won't do anything in the meantime [21:57:43] https://salt.readthedocs.org/en/latest/ref/modules/all/salt.modules.apt.html [22:02:42] New patchset: Reedy; "Add libjpeg-turbo-progs to imagescalers for image rotation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [22:05:43] notpeter: so for production we only care about InitialiseSettings.php but on beta we need InitialiseSettings.php AND InitialiseSettings-labs.php [22:06:10] !log reboot formey after dist-upgrade via salt [22:06:16] Logged the message, Master [22:06:36] notpeter: so each file is a different variable. On production the -labs.php is set to an empty file to skip it. Maybe most of that sync-conf-from-common should be made a new task in lucene.jobs.sh [22:09:14] hashar: ok, cool [22:10:13] killing that [22:12:00] hashar, FYI https://gerrit.wikimedia.org/r/#/c/52674/ had Jenkins CI failures , both on DatabaseTest::testStoredFunctions "MySQL or Postgres required". It only sets up SQLite AIUI [22:13:09] hashar: ok, I'm gonna deploy [22:13:26] notpeter: I will probably submit some more patches next week :-D [22:13:47] <^demon> About to bring gerrit down for a minute or two for a reboot of manganese. Please don't panic. [22:13:51] spagewmf: they should be skipped whenever running on a sqlite backend. [22:13:57] * LeslieCarr panics!!!!! [22:15:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [22:16:32] LeslieCarr: http://www.officeplayground.com/Stress-Balls-C9.aspx [22:16:56] koosh balls aren't stress balls! [22:17:01] heh [22:17:02] also i can't believe they still exist [22:17:28] Everything still exists. I still have a Tamagotchi. I just call it "FitBit". [22:17:29] moar reboots [22:17:37] hehehe [22:18:04] !log upgrading all analytics machiens [22:18:10] Logged the message, Mistress of the network gear. [22:18:12] hashar agreed, should I file a bug? [22:18:14] New review: Reedy; "This is to go along with https://gerrit.wikimedia.org/r/52707" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [22:18:21] wtg Leslie! [22:18:21] morebots that's awesome [22:18:22] I am a logbot running on wikitech-static. [22:18:22] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:18:22] To log a message, type !log . [22:18:36] hehe [22:18:53] spagewmf: I guess the branch is screwed and tests are not running there [22:19:02] morebots, what is the answer ? [22:19:02] I am a logbot running on wikitech-static. [22:19:02] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:19:02] To log a message, type !log . 
[22:19:09] morebots that's not a very good answer [22:19:10] I am a logbot running on wikitech-static. [22:19:10] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [22:19:10] To log a message, type !log . [22:19:19] spagewmf: though on that change , that is because Zuul can't find the revision and I have ZERO idea why [22:19:35] * Aaron|home watches morebots troll [22:20:59] !log salt pkg.upgrade on zirconium [22:21:04] Logged the message, Master [22:22:02] RobH: for some reason, i thought the rdb100[12] high perf misc + ssd servers had more that 32GB.. were they supposed to, or am i imagining that? [22:22:46] its what we put on the ticket, same as normal hpm but with additional ssds [22:22:46] for some reason I can't load gerrit at all currently: 503 Service Unavailable [22:22:51] but we can upgrade ram as well. [22:23:03] https://gerrit.wikimedia.org/r/ [22:23:41] kaldari yeah, ^demon: About to bring gerrit down for a minute or two for a reboot of manganese. Please don't panic. [22:23:51] <^demon> It's coming back up now. [22:24:01] Do all the apaches have imagescaler stuff installed? mw1114 is an api box, but has imagemagick etc [22:24:03] paravoid: ^ [22:25:48] !log salt pkg.upgrade on spence [22:25:54] Logged the message, Master [22:28:13] !log rebooting marmontel [22:28:19] Logged the message, Master [22:30:53] New patchset: Pyoungmeister; "correcting hash structure mistake" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52712 [22:30:53] New patchset: Lcarr; "fixing nrpe to be the new standard everywhere. Huzzah!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52713 [22:31:50] !log salt pkg.upgrade on yvon [22:31:56] Logged the message, Master [22:32:13] kaldari, how goes it? gerrit is back up for me [22:32:34] back up now, but my pull seems to be stuck [22:34:32] there it goes, finally [22:34:36] !log installing package upgrades on yvon despite (defunct apt-get procs) [22:34:42] Logged the message, Master [22:36:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52713 [22:37:50] New patchset: Demon; "We don't actually run any slaves, turn formey into a replicationdest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52715 [22:38:20] !log rebooting yvon [22:38:30] Logged the message, Master [22:41:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52712 [22:42:10] LeslieCarr: I'm merging your stuff [22:42:26] oh thank you [22:42:57] notpeter or LeslieCarr I need a small guide to the mess in puppet, who can help me? [22:43:38] not right now, i am smacking down security holes [22:43:45] and writing angered emails about it [22:44:42] maybe RobH can? [22:44:44] * pgehres hopes not to get an email [22:46:13] hehe [22:47:13] haha [22:47:42] RobH: ? [22:47:47] Ryan_Lane: host willams? [22:47:53] ? [22:48:07] pgehres: don't add unauthorized apt sources, especially ones that are unresponsive and prevent security upgrades from being installed [22:48:43] LeslieCarr: I do not have enough rights on any hosts to do that [22:48:52] good [22:48:54] ! [22:49:51] RobH: do you remember what williams was for ? [22:50:35] looks like some secondary bastion/misc fenari like host [22:50:40] but no, i dont recall specifically. [22:50:56] if its out of date, and we cannot confirm folks are using it properly, we should take it down. [22:51:12] oh wait, my screne session is showing me garbled data. 
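The package upgrades being !logged through here are driven from the salt master with the stock pkg module; a rough sketch of the invocations, with the minion names as examples only.

    # a single box
    salt 'zirconium.wikimedia.org' pkg.upgrade
    # glob targeting covers a whole class of minions
    salt 'mc*.pmtpa.wmnet' pkg.upgrade
    # -E switches to regular-expression matching when a glob is not enough
    salt -E '^(spence|yvon)\.wikimedia\.org$' pkg.upgrade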
[22:51:24] i had fenari motd in split terminal window ;_; [22:52:06] I somehow recall williams as being associated with fundraising. [22:52:19] ugh, ... otrs ... at least sometime in the past [22:52:35] oh, perhaps it was the old otrs server. [22:53:19] oh, its the current. [22:53:25] wah :p [22:53:27] mutante: it appears to be the actual otrs server [22:53:29] logsout [22:53:45] funny, yet sad. [22:53:50] * RobH also logs out [22:57:26] yeah don't kill our OTRS server mutante :( [22:59:31] didnt touch it [23:01:58] Ryan_Lane: does salt know regex well enough ? can i do mc?.pmtpa.wmnet [23:02:25] salt -E [23:02:39] -E is for regex [23:03:03] LeslieCarr: mc?.pmtpa.wmnet doesn't do what you want if it's interpreted as a regex ;) [23:03:08] yeah [23:03:09] heh [23:03:13] it does work with globbing, though [23:03:16] which is the default [23:04:02] !log dist-upgrading zhen [23:04:07] Logged the message, Master [23:04:12] !log dist-upgrading pmtpa memcache [23:04:18] Logged the message, Mistress of the network gear. [23:11:40] <^demon> !log gerrit: disabling hooks-its plugin for the time being. It's working (yay) but it's overly spammy until we fix it next week. [23:11:46] Logged the message, Master [23:14:47] !log installing kernel on streber [23:14:52] Logged the message, Master [23:25:18] !log upgrading solr boxes [23:25:24] Logged the message, Mistress of the network gear. [23:30:26] New patchset: Matanya; "The new isn't needed any more." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52720 [23:31:09] LeslieCarr: please review ^ [23:31:24] about to run scap! [23:33:58] New patchset: Lcarr; "ensuring /etc/icinga exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52721 [23:34:24] thanks matanya :) [23:34:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52720 [23:35:31] np LeslieCarr. I'd like to help more here. anything you need related to nagios/icinga? [23:35:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52721 [23:35:53] thanks - so i am trying to slowly clean up all the old nagios bits and making sure they're all named icinga (or at least mostly) [23:36:08] and of course, i'm finding that there's a lot more edge cases and the like than i expected [23:36:28] hopefully inthe next few days everything will be completely out of nagios.pp [23:36:30] shoot me [23:36:34] haha [23:36:35] :) [23:39:36] New review: Apmon; "It makes sense to use the standard postgresql module (on puppetmaster) even if it seems a little com..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [23:47:32] !log bravely rebooting streber - expect short RT and Observium outage [23:47:38] Logged the message, Master [23:47:44] !log doing tmh upgrades [23:47:49] Logged the message, Mistress of the network gear. [23:50:16] New patchset: Matanya; "Fixed the path to icinga cmd file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52726 [23:52:39] New review: Lcarr; "it's actually manually sent to /var/lib/nagios for compatibility with the old system - we need to fi..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/52726 [23:52:52] one more for LeslieCarr ^ :) [23:53:26] yeah, saw that, we have it sent to the old location sadly … need to fix that [23:53:49] where is that configured? I can fix that, I guess [23:53:58] !log kaldari Started syncing Wikimedia installation... 
: [23:54:04] Logged the message, Master [23:54:42] i think in files/icinga/icinga.conf [23:54:54] icinga.cfg that is [23:55:21] and i want to say nrpe does some actions to there as well [23:55:23] maybe [23:55:27] definitely maybe [23:55:28] :) [23:55:55] I'll try to figure out in all the mess :) [23:58:59] New patchset: Demon; "Fix various noc files to be pep8 compliant" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52729
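On the command-file question the log ends on: the path Icinga actually honours is whatever command_file is set to in icinga.cfg, and anything that submits external commands (the web CGIs, the nrpe-adjacent tooling mentioned above) writes into that named pipe. A sketch for checking and exercising it, assuming a Debian-style /etc/icinga layout; the detail about commands still being sent to the old /var/lib/nagios location is as described by Leslie, not verified here.

    # where is the external command pipe really configured?
    grep '^command_file' /etc/icinga/icinga.cfg
    # submit a harmless external command by hand into whatever that path is
    CMDFILE=$(awk -F= '/^command_file/{print $2}' /etc/icinga/icinga.cfg)
    printf '[%s] ENABLE_NOTIFICATIONS\n' "$(date +%s)" > "$CMDFILE"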