[02:35:47] New patchset: Tim Starling; "1/100 sampling for banner impressions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6779
[02:36:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6779
[02:47:45] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6779
[02:47:48] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6779
[03:28:25] New patchset: Asher; "don't try to cache large media objects in the frontend instance; set stream buffer 10M in frontend; enable streaming from the backend for objects > 64M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6780
[03:28:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6780
[03:29:27] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6780
[03:29:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6780
[03:34:21] New patchset: Asher; "beresp.stream_pass_bufsize isn't actually in varnish 3.0.2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6781
[03:34:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6781
[03:34:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6781
[03:34:45] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6781
[03:37:07] New patchset: Asher; "dash vs. underscore. underscore wins." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6782
[03:37:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6782
[03:37:28] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6782
[03:37:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6782
[07:20:02] !log power cycled db45 (crashed dewiki slave)
[07:20:06] Logged the message, Master
[08:00:59] !log upgrading/rebooting the last couple sq* servers
[08:01:02] Logged the message, Master
[08:39:14] New patchset: Dzahn; "minor fixes to language and license columns" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6787
[08:39:15] New patchset: Dzahn; "enhance siteinfo() fetching - debug error codes" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6788
[08:39:15] New patchset: Dzahn; "needed to handle wikis with API siteinfo but not API stats, fix sorting by http" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6789
[08:40:13] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6787
[08:40:15] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6787
[08:41:35] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6788
[08:41:37] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6788
[08:42:36] New patchset: Dzahn; "needed to handle wikis with API siteinfo but not API stats, fix sorting by http, fix red gerrit marks" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6789
[08:43:17] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6789
[08:43:19] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/6789
[09:27:02] !log rebooting bits varnish sq68-70 one by one..
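The commit message on r/6780 describes a common Varnish pattern: skip caching of large media objects in the frontend instance and stream them from the backend instead. As a hedged illustration only (this is not the merged change; the 64 MB threshold and the `Content-Length` test are assumptions), a Varnish 3.0.x VCL fragment implementing that idea might look like:

```vcl
import std;  # vmod_std, for std.integer()

sub vcl_fetch {
    # Objects larger than ~64MB: stream from the backend and do not
    # cache them in this (frontend) instance.
    if (std.integer(beresp.http.Content-Length, 0) > 67108864) {
        set beresp.do_stream = true;  # send bytes to the client as they arrive
        return (hit_for_pass);        # mark uncacheable so later requests pass
    }
}
```

Asher's follow-up r/6781 notes that `beresp.stream_pass_bufsize` does not actually exist in Varnish 3.0.2, which is consistent with this sketch leaving the stream-buffer tuning out.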
[09:27:05] Logged the message, Master
[10:42:00] New patchset: Dzahn; "adding interactive server upgrade-helper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6791
[10:42:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6791
[10:43:26] New patchset: Dzahn; "adding interactive server upgrade-helper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6791
[10:43:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6791
[10:44:51] New review: Dzahn; "just putting a helper script in misc/scripts/" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6791
[10:44:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6791
[11:09:50] !log squids - sq* done. all latest kernel and 0 pending upgrades.
[11:09:53] Logged the message, Master
[11:10:42] * ^demon hands mutante a cookie :)
[11:11:21] ty, demon ;)
[11:11:42] <^demon> yw. I mean, who says no to a cookie :)
[11:12:31] just my browser, sometimes :)
[11:13:49] the helper script gives you a kitten if it detects you are all done. makes it more fun :p
[11:18:13] !log continuing with upgrades/reboots in amssq* on the side during the day
[11:18:16] Logged the message, Master
[11:20:16] <^demon> mutante: `ack --thppt`
[11:21:52] oh? hashar wants to make sure it is removed in favor of grep-ack?
[11:24:49] <^demon> ack-grep, app::ack, betterthangrep, it's all the same thing :)
[11:24:58] <^demon> A much much faster version designed for searching source code.
[11:28:36] yea, adding classes for them is fine; it is the discussion of what should be in base and what should not that always draws different opinions (and stuff like ack vs. grep-ack). just like global vim config and stuff..
[11:29:45] <^demon> *nod*
[11:30:23] <^demon> Useful tools are useful, but cluttering the base install with a bunch of tools -> $maintainability--
[11:30:43] and editors, re: Joe
[11:31:05] <^demon> Joe was only because brion liked it, iirc ;-)
[11:31:23] better joe than mc ;)
[11:31:38] just saying i can imagine the next editor request
[11:32:39] <^demon> "provide me with a google docs bridge so I can edit site configuration from my browser"
[11:32:42] <^demon> ;-)
[11:32:51] i don't hate joe, i think it's ok to offer labs users one vim alternative.. but maybe not 10
[11:33:29] hehe @ google docs config.. yea
[11:34:22] that's why i'd like to see private etherpads, hah
[11:41:49] New patchset: Dzahn; "minor fix and tabbing in upgrade-helper script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6792
[11:42:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6792
[11:42:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6792
[11:42:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6792
[12:43:18] New review: Demon; "What was the reason for this again?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4083
[12:49:25] !log pushing out virtual host for wikimania2013 wiki. sync / apache-graceful/all
[12:49:29] Logged the message, Master
[12:51:35] Reedy: Thehelpfulone ^^
[12:52:09] ooh a ping
[12:52:10] :D
[12:52:32] Thehelpfulone: it's on the Apaches now
[12:52:50] ok
[12:53:17] you can go ahead with the wiki install i guess; it now brings you to that start page
[12:53:27] instead of wikimediafoundation.org
[12:53:45] the wiki's been installed partly already I think mutante
[12:53:54] there was an email to that newprojects mailing list
[12:54:07] http://wikimania2013.wikimedia.org/wiki/Main_Page it just needs all the relevant settings etc copied
[12:54:12] ok, shift+reload and see the difference
[12:54:16] ok
[12:56:04] do you know which ones those are? https://bugzilla.wikimedia.org/show_bug.cgi?id=36477 is the bug - see my first comment.
[12:57:35] Thehelpfulone: updated BZ, but no i don't know which exact settings you need, Reedy will for sure
[12:58:28] yep sure no problem, casey's comments were "This wiki should have the same settings as the other Wikimania wikis for the logo, project name, extra namespaces, ForceUIMsgasContentMessage, anon editing restrictions, etc. (previous bugs: bug 18740, bug 13547)"
[13:00:59] that can wait for Reedy though, who should be in the office in a few hours
[13:01:16] mutante, is the initial crat made through the database?
[13:01:47] <^demon> addwiki.php doesn't make an initial crat, no.
[13:02:23] Thehelpfulone: ^demon knows better than i do about anything after DNS and Apache config
[13:02:28] heh
[13:02:47] <^demon> DNS? I know zilch.
[13:02:57] ^demon: yeah I meant on previous requests a user was made a crat to be able to assign rights etc
[13:02:59] usually that is like the point of giving it from ops to devs/devops
[13:03:13] <^demon> Thehelpfulone: Perhaps someone edited the database and made them a crat?
[13:03:20] yeah that's what I said :P
[13:03:45] so who do I need to ask to be made the initial crat? I'm taking on the role of handing out rights etc
[13:04:08] <^demon> Any +shell user can, but since Reedy made the wiki I'd ask him.
[13:04:55] ok
[13:16:07] hey apergos, are you around?
[13:19:21] yes
[13:19:43] sorry, I was afk redesigning some code that I had written in a completely braindead way
[13:19:49] np
[13:19:54] (let me grab ottomata)
[13:19:54] what's up?
[13:20:22] i'm here
[13:20:43] ottomata and i are curious to hear what happened with the fund-raising filter yesterday
[13:20:52] and what actions we can take to make our lives easier in the future
[13:21:05] and is storage3 an additional filter box?
[13:21:41] I'm absolutely the wrong person to ask. I can tell you what I saw and what I did, but the person who knows about this is (I think) jeff and he's on ... ny? sf? time
[13:21:54] so the locke logs are copied to storage3 periodically
[13:22:18] the partition where they were to be copied was not accessible, spewing a bunch of raid related messages in the log
[13:22:21] so copies failed
[13:22:37] apergos, how are they copied?
[13:22:42] cron on storage3?
[13:22:50] so locke started getting full-ish on that one partition with the logs, and someone using that data saw that it wasn't available
[13:22:55] uh huh, a cron job on storage3
[13:22:58] k
[13:23:11] and they are deleted on locke by the same cron?
[13:23:24] yes
[13:23:35] from /a/squid/archive?
[13:23:35] of course we don't delete if the copy doesn't happen
[13:23:55] you will hate me but I already don't remember the directory
[13:23:59] it's in the log maybe
[13:24:04] (sysadmin log)
[13:24:09] yeah, i think that is it
[13:24:20] it looks full and there is a logrotate file putting stuff there
[13:24:24] ottomata: i think you need to apply for a storage3 account :)
[13:24:24] so then I said I would try to reboot storage3 and see if we got the partitions back,
[13:24:37] since it was already broken as things were
[13:24:49] and on reboot those partitions didn't come up
[13:25:02] so I skipped the mount for the two raid partitions
[13:25:18] and that's where storage3 is, out of action as far as log storage.
[13:25:31] at that point I manually copied the logs from locke to hume,
[13:25:37] gzipped them there,
[13:25:49] what's hume?
[13:25:56] (ah, after hupping udp2log)
[13:26:04] and then removed the copies from locke.
[13:26:15] did you copy from the live log files then?
[13:26:20] and not the archived ones?
[13:26:27] (if you needed to hup)
[13:26:30] hume is a host we use for .. well it has had copies of logs in the past but typically it is for more computationally intensive tasks by developers, long running scripts and such
[13:26:38] ok
[13:27:05] no. I moved the logs to a different directory, hupped the udp2log, then copied those logs, which were no longer being updated, to hume, then gzipped them, then removed them from locke
[13:27:32] this is ordinarily what the cron script does
[13:27:52] I simply did it manually, with the exception that instead of copying to storage3 I put them in a directory on hume (see the sysadmin log)
[13:28:12] all clear?
[13:28:16] really?
[13:28:23] what?
[13:28:26] it doesn't copy the already zipped logs from /a/squid/archive?
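apergos's manual procedure (move the logs aside, HUP the writer so it reopens fresh files, gzip the quiescent copies, then remove them from the source) can be sketched in shell. Every path and filename below is a hypothetical stand-in, not the real layout on locke:

```shell
# Simulate the log directories (stand-ins for the real paths on locke).
mkdir -p demo_logs/live demo_logs/archive
echo "impression data" > demo_logs/live/bannerImpressions.log

# 1. Move the live log aside so the writer will reopen a fresh file.
mv demo_logs/live/bannerImpressions.log demo_logs/archive/

# 2. HUP udp2log so it reopens its output files (skipped in this sketch;
#    on the real host something like: kill -HUP "$(pgrep -f udp2log)").

# 3. Compress the now-quiescent copy; it would then be copied off-host
#    (to hume in this incident, normally storage3) and removed at the source.
gzip demo_logs/archive/bannerImpressions.log
ls demo_logs/archive   # -> bannerImpressions.log.gz
```

The key ordering point is that the HUP happens between the move and the copy, so nothing is still appending to the files being shipped.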
[13:28:34] there is a logrotate script that is moving files there
[13:28:35] sorry
[13:28:39] I read the cron job and did what it does
[13:28:41] /var/log/archive
[13:28:42] hm
[13:29:04] this is for bannerimpressions and one other log which (again) I already forget what it is
[13:29:19] I copied a total of 4 files, two of each type.
[13:29:21] ah no, /a/squid/archive
[13:29:21] is that
[13:29:25] hm
[13:29:38] ah, they have their own directory
[13:29:55] /a/squid/fundraising
[13:30:02] quick irrelevant Q
[13:30:03] obviously the ones now on hume should *not* be removed
[13:30:09] why are all these logs in a directory called 'squid'?
[13:30:13] the logs are not all from squid
[13:30:14] we need to keep them until jeff can get things sorted
[13:30:27] do not look at me, thanks :-P
[13:31:11] haha
[13:31:28] i have such a strong urge to organize this whole system
[13:31:34] but i'd probably break lots of things doing it
[13:31:34] oh sorry, it runs on locke
[13:31:37] not on storage3
[13:31:40] the cron?
[13:31:43] as who?
[13:31:46] rotate_logs_and_copy_to_storage3.pl, here's the script name
[13:32:00] file_mover I guess
[13:32:23] the only way for me to be sure is to go look everything up again
[13:32:59] yeah, found it
[13:33:01] file_mover
[13:33:30] found this
[13:33:30] http://wikitech.wikimedia.org/view/Fundraising_Analytics/Impression_Stats
[13:33:36] great
[13:34:18] I didn't disable the cron job; it runs and fails right now
[13:34:32] a bit spammy but presumably it will get sorted out later today
[13:35:30] so now you know as much as (or more than) me
[13:36:10] ok cool
[13:36:17] thanks apergos
[13:36:24] yw
[13:36:26] yeah thanks!
[13:36:28] any idea where this whole /a thing came from?
[13:36:36] i had thought it was erik for 'analytics'
[13:36:38] but i guess not
[13:36:39] all that stuff well predates me
[13:36:46] years and years old.
[13:37:11] could be /a for apache. or /a because we chose a letter at random, for all I know
[13:50:16] so, apergos, who are you waiting on to fix the hume/storage3 stuff?
[13:50:28] I hope jeff will look at it
[13:50:32] iirc it's his "baby"
[13:50:37] what's his sn?
[13:50:47] nick?
[13:51:02] he's not on right now, wrong timezone
[13:51:42] aye, just curious what it is
[13:51:47] he has had issues with storage3 in the past so I bet he has some fallback plans up his sleeve
[13:52:37] aye ok
[13:52:50] including adding monitoring for disk space there :p ?
[13:53:23] Jeff_Green is the nick
[13:53:36] it's not a matter of disk space
[13:53:44] it is, afaict, a hardware issue
[13:56:47] oh hm, ok
[13:57:28] you did look at the sysadmin log, right? :-P
[14:01:35] how do I do that?
[14:01:44] the nagios notices?
[14:03:38] RT-2907: rsync: change_dir "/archive/udplogs" failed: No such file or directory (2)
[14:03:41] there he is :)
[14:04:05] nice timing
[14:04:43] ugh, storage3? wtf
[14:05:00] there were backup failures
[14:05:09] rsync -ar --delete /archive/udplogs/ file_mover@hume.wikimedia.org:
[14:05:13] /archive/udplogs/
[14:05:19] storage3 is fubar
[14:05:22] and rsync -ar /archive/jenkins_builds/ logmover@storage3.pmtpa.wmnet:
[14:05:33] ottomata just asked about it before you joined
[14:05:41] put it in an RT for now
[14:05:48] root@storage3:~# mount /a
[14:05:48] mount: special device UUID=cc471f5e-062d-4d73-83a2-8946667a13e5 does not exist
[14:05:59] huh
[14:06:07] yeah,
[14:06:12] see the sysadmin log
[14:06:28] looking
[14:06:35] where is this sysadmin log?
[14:06:44] http://wikitech.wikimedia.org/view/Server_admin_log
[14:07:02] i am crying that this was not a pageable event
[14:07:05] ah, and those are the !log messages people type in here?
[14:07:10] yes they are
[14:07:23] well I don't think I broke anything
[14:07:36] nothing filled up before I got called in
[14:08:11] i don't think the issue is full disk
[14:08:14] no
[14:08:19] i think it's, as usual, RAID
[14:08:21] yes
[14:08:37] megacli, like I say, was just returning 0 and no contents
[14:08:56] I didn't try to dick around in any RAID BIOSes or whatever
[14:09:00] has rob h been alerted?
[14:09:12] I doubt it
[14:09:25] afaik dicking around with RAID BIOSes remotely is problematic anyway
[14:09:34] at least on these machines
[14:09:35] well I could have looked at it
[14:10:22] but it seemed like I wouldn't have been able to do much beyond look
[14:10:58] I'm going to delete this ticket daniel posted and start with the RAID failure
[14:11:07] ok. I didn't see the ticket
[14:11:22] he entered one for the backup failures, but that's just a symptom
[14:11:27] you have the useful info at this point
[14:11:34] eh yeah, i just made that earlier when i read email
[14:11:48] before i saw you talking about further issues
[14:11:51] yeah
[14:11:58] new one going in now
[14:12:01] so right now storage3 is up without /a /archive mounted, cause not detected as ready
[14:12:30] yeah
[14:12:35] all kinds of hell breaks loose in this case
[14:12:38] possibly :-P
[14:12:50] i'm not sure I accounted for this in the backup scripts
[14:12:52] well no rsyncs go, of course
[14:13:34] as long as the ones that "rsync -var --delete /blah offhost:/blah" fail rather than purging everything on offhost
[14:13:36] all the rsyncs are into lower level dirs I guess
[14:14:04] yeah
[14:14:07] that's good!
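The failure mode discussed above is the safe one: rsync aborts when `change_dir` fails on the source, before any `--delete` pass can purge the destination. A hypothetical crontab sketch of the two jobs (the schedules and users are assumptions; the rsync commands are the ones pasted in-channel):

```crontab
# Illustrative schedules only -- the real cron lives under
# /home/file_mover/scripts/ and is driven by rotate_logs_and_copy_to_storage3.pl.
# udplogs mirror: --delete propagates removals, but a change_dir failure
# on the source aborts the run before anything is deleted on the far end.
0 * * * *  file_mover  rsync -ar --delete /archive/udplogs/ file_mover@hume.wikimedia.org:/archive/udplogs/
# jenkins build archive: no --delete, so a failed run can never purge the copy.
30 * * * * logmover    rsync -ar /archive/jenkins_builds/ logmover@storage3.pmtpa.wmnet:
```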
[14:14:14] rob was on a little while ago, but dunno if on-site (or chris)
[14:14:18] /home/file_mover/scripts/rotate_logs_and_copy_to_storage3.pl died: rsync: change_dir "/archive/incoming_udplogs" failed: No such file or directory (2)
[14:14:26] this seemed to be the desired result
[14:14:28] yep--that's good
[14:15:45] do you guys know if nagios monitors disk space on all partitions by default?
[14:15:54] on any puppetized machine?
[14:16:02] the earlier failures are interesting (in an academic sort of way, not that interesting for fixing the real life issue)
[14:16:28] [ 5.341656] megasas: Waiting for FW to come to ready state
[14:16:28] [ 5.381673] megasas: FW in FAULT state!!
[14:16:36] ain't it grand
[14:16:39] apergos: which?
[14:16:49] the earlier cron failures I mean, before my reboot
[14:16:57] oh. looking
[14:17:27] '/archive/udplogs' and '/archive/old_udplogs/' are identical (not copied) at /usr/local/bin/impression_log_rotator line 42
[14:17:27] impression_log_rotator died: Input/output error
[14:17:38] yeah
[14:17:51] whatever, we don't care right now but it is curious
[14:17:53] it looks like all the RAID fell completely offline before you rebooted
[14:17:59] it seemed to be off
[14:18:10] that's why I rebooted; nothing was going to happen leaving it there
[14:18:17] I couldn't do any sort of diagnostics
[14:18:31] identical may have been both dirs having no content
[14:18:35] and a reboot had a possibility of clearing up some issue, or leaving things no worse off...
[14:18:44] probably better, really
[14:19:08] both empty, that's sure possible :-D
[14:19:15] the 'identical' errors suggest the kernel wasn't in a healthy state about that partition
[14:19:19] eh no
[14:19:20] :-D
[14:19:31] whereas after the reboot it's got the story straight :-(
[14:19:42] yeah. no happy ending
[14:20:26] !log stopped cron jobs on storage3 because of RAID failure
[14:20:28] Logged the message, Master
[14:20:42] I will take this opportunity to rant about RAID just once. I hate RAID.
[14:20:59] really? why, what kind of RAID was it?
[14:24:03] megacli, so some kind of LSI hardware RAID
[14:24:17] why: because it's a false sense of security
[14:24:39] RAID controllers are garbage; in my experience they have like a 20% failure rate
[14:25:05] http://adminzen.org/backup/ :p
[14:26:11] for storage though, do you really need a dedicated controller?
[14:26:20] if you are just raiding for redundancy and not performance
[14:26:37] md raid would be fine?
[14:27:46] it'd be better imo
[14:27:51] but this is also for mysql
[14:28:51] ha, on a machine called storage3?
[14:28:58] is it actually serving queries?
[14:29:42] not really
[14:29:53] then meh, md is fine too
[14:30:09] the FR mysql architecture is db1008 --> db1025/storage3
[14:30:33] it's an offsite slave, does dumps and other non-mysql backups
[14:30:45] yeah, md would probably be fine
[14:36:00] yeah, mysql was not running on the box when I got on it
[14:36:10] yawp
[14:36:23] (nor did I start it. uh uh.)
[14:36:35] the long term plan is to carve out some space on the netapp
[14:36:44] hmmmm
[14:36:51] I guess people do that
[14:37:54] it'd be better than relying on storage3 at least
[14:38:20] and i don't want to double-book db1025, which is the only other option atm
[14:40:12] Change abandoned: Hashar; "I am not sure why that was needed. Probably to manually setup branch bases documentation. That sur..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4083
[14:40:40] agree
[14:43:43] Jeff_Green: got me a question about something and your fingerprints are on it:
[14:43:47] what is hume?
[14:43:53] hahaha
[14:44:12] notpeter: not sure!
lemme check my notes
[14:44:16] heh
[14:44:19] "misc computing jobs that need a bit more cpu or memory, and we don't want them to run on fenari"
[14:44:31] ah
[14:44:35] mostly we don't want stuff like that on fenari, so that's where we put them
[14:44:47] I got asked this very thing about an hour ago
[14:45:12] notpeter: i think there was preexisting fundraising stuff on hume which I retooled a bit
[14:45:31] gotcha gotcha
[14:46:16] see, once you touch something it's yours forever, until the next person touches it
[14:46:17] notpeter: which fingerprints were you referring to?
[14:46:23] make sure to dump your notes on http://wikitech.wikimedia.org/view/Hume :-D
[14:46:36] you had comments in site.pp
[14:46:41] anyway, cool!
[14:47:11] i'm drawing a complete blank, looking!
[14:47:58] Jeff_Green: I now know enough to know that I don't care
[14:48:00] no worries!
[14:48:17] yeah but now I care
[14:48:20] :-P
[14:48:44] oh interestin
[14:48:45] g
[14:49:10] wow, somewhere in the recesses of my brain there's history on this
[14:49:26] I fired up a mysql instance for some kind of testing
[14:51:35] huh!
[14:58:31] hashar: you've got a message in #wikimedia-labs ;)
[15:00:38] Thehelpfulone: aka here :-D
[15:00:51] I am not part of the ops team so I can't +2 operations/puppet changes
[15:01:10] ok
[15:01:10] you will have to find an op to take care of your changes :-]
[15:01:13] but notpeter is? ;)
[15:01:57] notpeter: ah, hume was used for testing mysql schema mods leading up to the civicrm upgrade. I believe I can tear down that stuff now that they're done
[15:07:43] Thehelpfulone: yeah. someone gave this monkey access to the +2 button
[15:08:08] they must have been under the influence when they gave it to you ;)
[15:08:09] [15:42:22] https://gerrit.wikimedia.org/r/#/c/6727 and https://gerrit.wikimedia.org/r/#/c/6584/ if you can :)
[15:10:28] if you could oblige, notpeter ^
[15:14:59] Thehelpfulone: so, i will definitely merge the second one, but I actually think that the first one could be done better. can you set it up to just notify => Service["lighttpd"] instead?
[15:15:12] without the exec definition, I mean
[15:15:17] !log updating firmware on storage3
[15:15:20] Logged the message, RobH
[15:15:40] (or if that's not reasonable, let me know why)
[15:15:58] Jeff_Green: so i am going to update the drac firmware, then bios, then the raid controller
[15:16:04] and see if i cannot clear the error
[15:16:35] ok
[15:17:01] notpeter: sure I'll have a go, I'm new to all this git gerrit stuff, so just replace notify => Exec["service-lighttpd-reload"]; with notify => Service["lighttpd"]?
[15:18:39] yeah, although I guess that will do a restart instead of a reload
[15:18:45] lemme look at the puppet docs a little
[15:19:22] notify might not do a restart
[15:19:29] i think it depends on how the service is set up
[15:19:42] i think if it is hasreload => true
[15:19:45] or something like that
[15:19:47] it would reload
[15:19:50] that's just a guess though
[15:22:56] ottomata: yeah, I'm trying to figure that out
[15:22:59] oh puppet docs
[15:23:05] you're so.... mildly ok
[15:24:10] ottomata: ah, sadly, no. looks like there's a ticket to create such functionality in puppet
[15:24:33] Thehelpfulone: okie dokie. I can merge that, as it doesn't seem like there's much of a better way.
*sigh*
[15:24:52] ok
[15:25:50] aye, rats
[15:25:58] ok
[15:26:01] there is another way though
[15:26:04] if you really want to restart
[15:26:08] instead of notifying the service
[15:26:12] you can set up an exec that reloads
[15:26:15] that gets notified
[15:26:23] ottomata: yeah, that's what Thehelpfulone did
[15:26:29] and it will totally work
[15:26:36] but I thought there was a prettier way
[15:26:37] I didn't do that, credits to jeremyb :)
[15:26:42] ahh, ok
[15:26:42] ah, ok
[15:26:42] cool
[15:27:06] actually, maybe better than notifying
[15:27:09] is to subscribe the exec
[15:27:29] well, hmm, that would only be better if you are only using this exec for one file
[15:27:34] so you can do
[15:27:43] subscribe => File[/whatever]
[15:27:45] on the exec
[15:27:51] with refreshonly => true
[15:27:57] which will only run the exec if the file changes
[15:28:12] OR, conversely, the file can notify the exec to reload, like i guess you are doing now
[15:28:34] ja
[15:28:38] should I re-run puppet now notpeter?
[15:28:50] do either of you guys know how to set up new nagios monitoring and/or triggers?
[15:29:07] i'm not messing with it right now, but I might soon and I want to know how it is done
[15:29:10] properly
[15:29:43] Thehelpfulone: wait, what is this for?
[15:29:51] ottomata: yes, i can tell you about setting up monitoring
[15:29:57] mailman on wmf labs
[15:30:37] notpeter: around?
[15:30:39] then why is it in the prod branch?
[15:30:42] hexmode: yes
[15:30:51] hexmode: is this about the ticket you reopened?
[15:30:51] :)
[15:30:57] yes
[15:31:02] how likely is it?
[15:31:50] that will probably be done as part of the upgrade to precise, which should be on the sooner side. ma_rk already has precise installing in the cluster, unless it's really pressing to get it done
[15:32:04] notpeter: because we're using the production config?
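The two patterns being weighed above can be sketched side by side. This is a minimal illustration only: resource titles and file paths are assumptions, not the actual manifests in r/6727, and it presumes a `Service['lighttpd']` is declared elsewhere:

```puppet
# (a) notify the service itself -- simplest, but this restarts rather than
#     reloads, since Puppet has no built-in "reload on refresh" behavior:
file { '/etc/lighttpd/conf-enabled/10-cgi.conf':
    source => 'puppet:///files/lighttpd/10-cgi.conf',
    notify => Service['lighttpd'],
}

# (b) ottomata's suggestion: a reload-only exec that subscribes to the file,
#     so it fires only when that file changes (refreshonly => true means it
#     never runs on an ordinary puppet run):
exec { 'service-lighttpd-reload':
    command     => '/usr/sbin/service lighttpd reload',
    refreshonly => true,
    subscribe   => File['/etc/lighttpd/conf-enabled/10-cgi.conf'],
}
```

As the discussion notes, (b) is only cleaner when the exec serves a single file; otherwise having each file `notify` the exec amounts to the same thing.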
I don't know the exact reasoning but I know that's what we've done so far :P
[15:32:09] err: Could not apply complete catalog: Found 1 dependency cycle:
[15:32:10] (Exec[service-lighttpd-reload] => Class[Webserver::Static] => Lighttpd_config[10-cgi] => File[/etc/lighttpd/conf-enabled/10-cgi.conf] => Exec[service-lighttpd-reload])
[15:32:11] Try the '--graph' option and opening the resulting '.dot' file in OmniGraffle or GraphViz
[15:32:14] that's the error message I got
[15:34:15] Thehelpfulone, can I look?
[15:34:18] is this change in gerrit?
[15:34:23] yes
[15:34:32] linky?
[15:34:33] https://gerrit.wikimedia.org/r/#/c/6727 and https://gerrit.wikimedia.org/r/#/c/6584/
[15:34:35] were the changes
[15:34:52] Just curious, what's the difference between the squids and the memcaches?
[15:35:11] squid = text, memcached = db stuff?
[15:38:53] Thehelpfulone
[15:39:06] i think it is because you are defining the exec in webserver::static
[15:39:24] and when you do
[15:39:24] lighttpd_config { "10-cgi": require => Class["webserver::static"] }
[15:39:35] it requires that class happen first
[15:39:41] hmm
[15:39:44] thinking through it
[15:39:57] i would do it differently, but i'm thinking through why puppet is annoyed by this
[15:40:25] * Thehelpfulone pokes jeremyb ^ with the above
[15:40:28] hmm,
[15:40:35] I didn't write that, just poking someone to merge it :)
[15:40:38] well, while i think about it, here's how i'd do it
[15:40:40] ah ok
[15:40:54] should I add my comments as a review?
[15:41:00] hexmode: does that sound reasonable?
[15:41:15] oh, it has already been merged
[15:41:18] notpeter: catching up ...
1s
[15:41:30] welp, i'd actually define the exec in the lighttpd_config define
[15:41:35] and name it something different for each file
[15:41:38] um
[15:42:00] kk
[15:42:13] notpeter: ok, yeah, I'll see what mar k's idea of precise is
[15:42:34] example
[15:42:42] here's a virtual host define i wrote for apache once
[15:42:42] https://gist.github.com/2628514
[15:42:50] hexmode: it's pretty high prio for a number of reasons, so I'm pretty sure it'll happen quickly
[15:43:20] notpeter: :) excellent! it'll probably help with some svg bugs, too :)
[15:44:45] ottomata: if you make a new commit to overwrite the existing one that would be good
[15:45:44] ok… this is in the test branch
[15:45:45] aye ok
[15:46:02] ottomata: I'm going to revert the old one. thanks!
[15:46:44] um, naw, leave it,
[15:46:53] i'll just commit
[15:46:54] and change
[15:46:56] what's there now
[15:47:12] too late
[15:47:14] sorry
[15:47:20] too hasty :P
[15:47:27] hehe, k
[15:47:36] also... again... why was that in the prod branch and not the test branch if it was for labs?
[15:47:36] quick q
[15:47:40] this is what confuses me.
[15:47:49] what's the deal with this install thing?
[15:47:58] do you guys have to do 2 commits to get a file in place?
[15:48:02] define lighttpd_config($install="false") {
[15:48:15] ottomata: I'm assuming that that's a legacy thing
[15:48:23] well, install is false by default
[15:48:29] and if install is false, it only creates a symlink
[15:48:37] and doesn't actually put the lighttpd conf file in place
[15:48:51] yeah, I don't get it
[15:49:06] needs more grand unified class for lighttpd
[15:49:18] not app in generic, config in webserver.pp
[15:49:48] man, i know this is probably the typical experience of a new guy coming into an old org
[15:49:54] but AGHHH so many things need more organizing
[15:49:58] yep!
[15:50:12] i'd love to take a month and reorganize so many of these puppet things
[15:50:14] I can only do so many...
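ottomata's "define the exec in the lighttpd_config define and name it something different for each file" suggestion can be sketched as follows. This is a hypothetical reconstruction, not his actual r/6798 patch; the paths and resource names are assumptions. Because each instance of the define owns a uniquely named exec, nothing has to reach into an exec declared inside `webserver::static`, which is what produced the dependency cycle:

```puppet
define lighttpd_config() {
    # Each config file notifies its own per-instance reload exec.
    file { "/etc/lighttpd/conf-enabled/${name}.conf":
        source => "puppet:///files/lighttpd/${name}.conf",
        notify => Exec["reload-lighttpd-${name}"],
    }
    exec { "reload-lighttpd-${name}":
        command     => '/usr/sbin/service lighttpd reload',
        refreshonly => true,   # only runs when the file above changes
    }
}

# Usage: no require on the class that used to hold the shared exec.
lighttpd_config { '10-cgi': }
```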
and udp2log took up a lot of my revamp energy
[15:50:19] yeah
[15:50:26] i really want to work on these webserver ones
[15:50:33] I say go for it!
[15:50:35] :)
[15:50:36] i was using them on a labs instance and almost gave up
[15:50:53] ha, i'll ask diederik if I can; we're really just waiting for hardware right now before we can do more important stuff
[15:51:24] you want to work with me on tuning the monitoring for the udp2log stuff?
[15:51:36] let's talk about it
[15:51:40] yeah
[15:51:49] (sorry for slow response, drdee_...)
[15:52:03] how often do you get a false positive?
[15:52:06] i think monitoring disk space could be useful, but i defer to your opinion
[15:52:20] sure. that's easy to set up and always good to know
[15:52:30] when udp2log fails, does that mean that all filters will fail?
[15:52:41] (I don't know why that's not one of our default checks, tbh)
[15:52:46] or does each filter have its own udp2log instance?
[15:53:13] most all run in the same instance
[15:53:20] (with the exception of the aft thing)
[15:53:40] so maybe it's not a false positive then
[15:54:15] well, I was looking at the logs that get spit out, and they have data from the time/date that you sent me that alert from
[15:54:21] so, something strange is going on
[15:55:39] drdee_: tell us exactly which ones you think are false positives
[15:55:40] and we can check
[15:55:50] but i don't think i have seen any that are really false positives
[15:55:57] most of the notices we've seen are from filters not running
[15:55:59] which does happen
[15:56:07] but usually they are only down for a few seconds
[15:56:08] i think
[15:56:12] because udp2log starts them back up
[15:57:05] yeah, but then we should wait longer before sending warnings
[15:57:21] the problem right now is that we are getting flooded with warnings
[15:57:30] and that means everybody is ignoring them
[15:57:37] drdee_: it currently waits 3 minutes (or retries 3 times, really)
[15:57:43] so we should increase the signal-to-noise ratio
[15:57:46] yeah
[15:58:11] !log Going to power cycle storage3 several times to troubleshoot hardware issue
[15:58:14] Logged the message, Master
[15:58:30] well, or
[15:58:36] we should figure out why the procs are flapping
[15:58:41] cause that is not cool
[15:58:43] if they are
[15:59:15] notpeter, when you reverted that commit
[15:59:18] did you merge that into the test repo?
[15:59:22] i pull and still see the same
[16:01:32] notpeter: so how about waiting 5 or 10 minutes?
[16:04:29] ottomata: it was never in the test repo...
[16:04:49] drdee_: sure, I can turn down the sensitivity
[16:05:28] ottomata: everything's all merged up and up to date
[16:05:32] lunch. bbiab
[16:09:03] notpeter
[16:09:04] this commit
[16:09:06] says test branch
[16:09:07] https://gerrit.wikimedia.org/r/#/c/6727/
[16:13:24] jeff_green: r u about?
[16:13:39] i am
[16:13:59] storage3 is the raid controller card... i updated ticket #2909
[16:14:14] card is dead?
[16:14:19] yes
[16:14:45] have you tried reseating the card?
[16:15:16] no.. let me do that now. ping you in a few
[16:15:22] ok thanks
[16:16:08] New patchset: Ottomata; "Reloading lighttpd whenever a lighttpd_config file changes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6798
[16:16:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6798
[16:16:39] !log shutting down storage3 to reseat RAID card
[16:16:42] Logged the message, Master
[16:19:17] notpeter, Thehelpfulone, check out that change
[16:19:19] see if it makes sense
[16:20:29] notpeter: would you have a sec to look at this: http://rt.wikimedia.org/Ticket/Display.html?id=2888
[16:28:05] cmjohnson1: if we can track down a replacement controller, it sounds as though the perc stores config data on the disks, so maybe there's hope of recovering the RAID (?)
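The tuning knobs behind "it currently waits 3 minutes (or retries 3 times)" and drdee_'s request to wait 5-10 minutes map to a Nagios service definition. This fragment is a hedged sketch (the service and command names are assumptions, and a real definition would also carry host and contact settings, typically via a `use` template):

```nagios
define service {
    service_description  udp2log filter procs
    check_command        check_udp2log_procs
    check_interval       1    ; minutes between normal checks
    retry_interval       1    ; minutes between rechecks while a failure is "soft"
    max_check_attempts   5    ; raising this from 3 delays the alert, cutting
                              ; noise from filters that udp2log restarts itself
}
```

An alert only fires after `max_check_attempts` consecutive failures, so increasing it (or `retry_interval`) is the direct way to trade alert latency for signal-to-noise.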
[16:30:17] jeff_green...ok...let's see if this reseat will work...booting up now [16:30:21] ok [16:39:16] jeff_green: just sent you console pic to your email [16:39:39] Reedy: when you see this, please can you finish the setup of wikimania2013 wiki? [16:39:50] hope you had a safe flight ;) [16:41:06] cmjohnson1: you're going to make me cry? [16:42:04] As per the bug, someone needs to tell me what's wanted on it [16:42:54] cmjohnson1: could you also reseat the SAS cables and drives? [16:43:32] Reedy: I thought I did? "There's also some [16:43:33] basic setup that needs to take place as per wikimania2012 wiki's bug request, [16:43:33] https://bugzilla.wikimedia.org/show_bug.cgi?id=28520 - See Casey's comments. [16:43:33] Please also install the translate extension." [16:44:10] that was my first comment :) [16:48:24] jeff_green: everything reseated [16:48:53] * pgehres crosses fingers [16:50:04] you must have broke something cmjohnson1, both robla and LeslieCarr just came on at the same time :P [16:50:16] hah! [16:50:24] Thehelpfulone: he can't break storage3 any more i don't think [16:51:09] notpeter: i don't follow. why do you think it was on the prod branch? [16:51:14] haha [16:52:02] notpeter: (6727/6796) [16:52:05] cmjohnson1: if it comes up the same, I think a call in to Dell would be wise [16:52:17] oh, he's lunching [16:52:25] we are out of contract with them for storage3 [16:52:28] * apergos peeks in [16:52:33] there's archival data on there we can't replace, because we've had nowhere else to store it [16:52:40] then pay them cash for support :-) [16:52:41] the card? ouch [16:53:33] I don't know enough about these controllers to know how to finesse the situation where the controller loses config, without losing the data [16:53:39] but if it's possible Dell should [16:54:06] wow, this is SAS? didn't know people still used that. 
(i thought mostly SSD if you need so fast) [16:54:31] we need to keep track of what the disk config was [16:54:41] Jeff_Green: maybe pull all the drives and do some test runs with fresh drives? [16:54:46] we've done it with other arrays, I *expect* it's doable with dell [16:55:56] jeremyb: why wouldn't you use SAS? [16:56:41] apart from the protocol, SAS typically has better failure rates and performance over SATA [16:56:42] Jeff_Green: i just assumed people used SSD instead now. maybe i'm wrong. haven't bought a new system in a while [16:56:57] SSD is still spendy and small unfortunately [16:57:21] * pgehres imagines a multi-TB array of SSDs :-) [16:57:23] I wait patiently for the day anything with cables and power cords is landfilled [16:58:12] hahaha [16:58:30] Jeff_Green: be green! recycle them! [17:00:12] pgehres: only if they promise never to make anything with a platter or a power cable out of the scrap [17:01:10] you don't want one of those HDD clocks? [17:03:04] i used a HDD platter/bearing once to make a spinning guitar speaker [17:04:04] that's an acceptable reuse I suppose [17:05:59] notpeter: ohhhh, maybe you thought it was prod because Thehelpfulone said so? that's surely not actually true. (i'm very nearly certain) [17:07:33] notpeter: oh, no you said prod branch before he did. so i'm again confused [17:07:38] yeah I don't know :P [17:07:46] is it in the production branch or the test branch? [17:13:01] i committed my change to prod (if you are still talking about the same thing) [17:13:22] Reedy: did you need any other configuration settings before you can fix it? [17:13:33] I just need time [17:13:57] Also, rather than saying "see other bug" it's more useful to just copy paste it [17:15:15] Jeff_Green: could be a spinning bicycle decoration [17:15:32] Thehelpfulone: it's clearly in test not prod [17:16:08] jeremyb: i like bikes! [17:16:19] Reedy: ok sorry :) [17:29:17] notpeter: back? 
[17:31:51] drdee_: hey, yeah [17:32:18] would you have 3 seconds to look at rt 2888? [17:32:25] yeah [17:32:28] I can do that now [17:32:30] if anyone's got some uploadwizard knowledge, a code review of https://gerrit.wikimedia.org/r/#/c/6722/ (non urgent) would be appreciated [17:32:43] notpeter: sweet [17:33:00] <^demon|away> Thehelpfulone: If reviews are non-urgent, you should use the "add a reviewer" box. [17:33:42] and who do I choose ^demon? [17:34:14] <^demon> Well, anyone who usually works on that code is a good bet. [17:36:28] is there a history option on gerrit? [17:36:42] <^demon> History of...? [17:36:57] so I can see who usually works on that code - commits [17:37:03] <^demon> That's all in gitweb. [17:37:06] Look at the git log? [17:37:17] Gitweb is rubbish :P [17:37:24] <^demon> For example: https://gerrit.wikimedia.org/gitweb/mediawiki/extensions/UploadWizard.git [17:37:25] gitweb? [17:37:31] ah [17:37:38] <^demon> Damianz is rubbish. [17:41:01] Damianz is an ass. [17:41:03] Slight difference. [17:41:10] back in a bit [17:41:20] <^demon> jeremyb: No, don't leave me [17:41:40] drdee_: do you have an ssh key for ayush? he currently doesn't have shell on any of our boxes as far as I can tell [17:41:45] ^demon: i must! [17:42:03] notpeter: mmmmm, let me ask [17:42:15] <^demon> jeremyb: Well ok. Hurry back though before I get lonely :) [17:42:39] ^demon: ok, i'll leave you with a parting question then [17:43:17] ^demon: bugzilla (at least upstream and probably the WMF instance too) allows requesting review from the wind (from no one in particular). can't do that with gerrit? ;-( [17:43:41] <^demon> Nope. [17:43:43] ^demon: do you think it would be worth having a dummy user that you could request review from which means "up for grabs, anyone can review"? [17:43:53] <^demon> There is a feature request for "default reviewers" for certain types of changes. 
[17:43:57] <^demon> Which would be super cool :) [17:44:02] oooh [17:44:09] but i think not exactly the same question [17:44:27] maybe some of the defaults could be dummies ;P [17:44:31] <^demon> You can query for "everything that hasn't been reviewed" too. [17:44:39] yeah... [17:44:53] i'm thinking more on the side of the person waiting for review [17:44:57] drdee_: just get it attached to the ticket, plx [17:45:14] notpeter: okay, i'll do that when i receive it [17:45:51] <^demon> jeremyb: True...for the person who doesn't know who to ask it's not an easy question. Probably something we could come up with a list of by fetching permissions for all projects. [17:45:58] <^demon> Maybe some kind of "suggest a reviewer" tool. [17:46:04] * ^demon is just throwing ideas out there [17:46:46] <^demon> I'm already wanting to build a tool that gets you info about all the projects and such, no reason not to put reviewer info in there too. [17:46:54] if you don't want to bother anyone in particular but you just want to mention that you don't want to wait a {week,month,year} either then there's no place to do that other than IRC/email. i think. (and sometimes that IRC/email comes after you've already waited a while) [17:46:59] i like suggest a reviewer [17:47:28] i think i found some project info lacking when i tried getting it out of a git clone [17:47:52] <^demon> I also kind of like the idea of a weekly nag to the list. Something like "Nobody's looked at these 8 changesets in the past ~month, somebody should respond this week" [17:47:53] e.g. refs/notes/review or refs/meta/config (refs are off the top of my head, may be wrong) [17:48:14] <^demon> refs/notes/review isn't very useful outside of gerrit. [17:48:18] sure. or just get metrics working again [17:48:23] <^demon> refs/meta/config is actually kind of useful but not to humans. [17:48:28] hah [17:48:55] <^demon> Did you see my e-mail to the list about `gerrit query`? 
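[The `gerrit query` feature ^demon mentions here is Gerrit's SSH query interface. A hedged example of its use, assuming the conventional Gerrit SSH port and an account with a registered key (which is exactly the limitation jeremyb notes next, that it can't be used without an account):]

```shell
# list open changes on a project as JSON, one object per line
ssh -p 29418 USERNAME@gerrit.wikimedia.org \
    gerrit query --format=JSON status:open project:operations/puppet
```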
[17:49:03] <^demon> That's kind of cool too for data nerds ;-) [17:49:31] i didn't (unless it was a while ago). i think robla mailed about it though? [17:49:43] <^demon> I e-mailed wikitech-l about it this morning. [17:49:54] anyway, the problem with that is that it can't (i don't think) be used by people that don't have accounts [17:50:08] oh, haven't even started catching up on mail [17:50:24] and now i'm off [18:13:30] binasher: hey, what do you think about setting up a poison pill in mysql repl so that the TS replication stops when a master switch happens but WMF slaves are unaffected? I was thinking have an empty DB on the WMF side that doesn't exist in the TS replicas and whenever time comes to do a master switch then after it's read only for a bit and you're all set to switch, send an update or delete or something to that DB that doesn't exist in TS. [18:14:38] the end was: prod does have the DB so query works, TS doesn't and query fails [18:14:53] discussed briefly with nosy and she likes it but has to discuss with the rest of the admins [18:14:58] jeremyb: do you work with the ts admins? if they want that, maybe [18:15:10] i don't but see what i just wrote ;) [18:15:47] note they were last night replicating from a box that hadn't been a master for 2+ months [18:16:20] also i think that fixes the issue of accidentally sending them a big alter they're not ready for [18:16:21] that's fine [18:16:38] actually, i'd rather them not replicate from the actual masters [18:17:00] but from the second level masters that eqiad / analytic slaves replicate from [18:17:11] well sure, but then you have to !log the master switches for intermediates too [18:17:25] and the poison pill still works then [18:18:43] (maybe they are logged anyway. 
just saying now someone really needs to see them ;) [18:35:45] jeff_green: spoke with dell they are not charging for the help....they agree it's most likely card but running a report for them [18:36:04] also said you should be able to add a new card and import the foreign cfg [18:36:04] cmjohnson1: ah, fantastic! [18:36:19] very cool [18:38:24] υαυ [18:38:26] er [18:38:27] yay :-D [18:38:41] cmjohnson1: hey, want to break the site ^H^H^H do the uplink again today? ;) [18:40:16] * jeremyb introduces LeslieCarr to ^W ;) [18:41:05] hehe [18:41:41] lesliecarr: that sounds like fun! sure [18:50:27] notpeter: i attached ssh key to rt ticket 2888 [18:52:22] sweet! thank you [18:53:27] paravoid: hover over s2 on http://noc.wikimedia.org/dbtree/ [19:14:53] drdee [19:15:01] bayes will be eol soon [19:15:16] so u sure u want him to be on it? [19:18:29] o.0 [19:36:09] woosters: stat1 is cool as well [19:36:18] yay [19:40:50] will wikistats take better machine then? :) [20:01:10] New patchset: Pyoungmeister; "readding blondel to mysql.pp and site.pp to start as db9 slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6811 [20:01:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6811 [20:03:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6811 [20:03:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6811 [20:11:26] !log rebooting db1019 [20:11:29] Logged the message, Master [20:15:18] !log rebooting db45 [20:15:21] Logged the message, Master [20:20:14] I can ignore the blondel page right? [20:20:20] yes you can apergos [20:20:25] sweet [20:20:30] !log restarted irc bot [20:20:33] Logged the message, Mistress of the network gear. 
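[The master-switch "poison pill" jeremyb sketches above at 18:13-18:18 could be as small as one statement. All database and table names below are hypothetical; the idea is only that the object exists on WMF replicas but not on the Toolserver ones:]

```sql
-- once, on the WMF side only (so Toolserver replicas never have it):
CREATE DATABASE repl_canary;
CREATE TABLE repl_canary.pill (switched_at DATETIME);

-- at switchover time, after the old master is read-only:
-- replicates cleanly inside WMF, but errors the SQL thread on any
-- Toolserver slave, stopping TS replication exactly at the switch point
INSERT INTO repl_canary.pill VALUES (NOW());
```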
[20:20:34] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 298 seconds [20:21:28] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 279 seconds [20:21:49] LeslieCarr: that's ambiguous... (gerrit-wm, etc.) [20:21:56] oh yes [20:22:02] !log (above) restarted nagios-wm on spence [20:22:05] Logged the message, Mistress of the network gear. [20:22:05] thanks jeremyb [20:22:21] sure ;) [20:24:46] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [20:25:04] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [20:25:04] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [20:25:05] notpeter puppetized the exim fwiw [20:25:40] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [20:29:14] I plead the fifth [20:29:28] I did some, and then ma_rk did a lot [20:29:34] so I only kinda know that code at this point [20:31:17] oh, the fifth is an option? [20:31:42] always [20:33:55] PROBLEM - SSH on storage3 is CRITICAL: Connection refused [20:37:47] !log attempting a live online schema change for zuwikitionary.recentchanges on the prod master [20:37:50] Logged the message, Master [20:39:55] PROBLEM - MySQL Slave Running on db1019 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table _recentchanges_new already exists on query. Default d [20:40:31] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 211 seconds [20:40:49] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 231 seconds [20:43:12] * AaronSchulz wonders what binasher is up to [20:50:30] RobH: are you in DC today ? [20:50:35] Jeff_Green: I am still waiting to hear back from Dell Support. 
Assuming it is the card, here is an option...srv217 has been down and out for awhile and not even powered on [20:50:43] cmjohnson1: are you in DC today as well ? [20:50:54] LeslieCarr: nope, planned to be once my networking stuff gets in [20:51:07] like later today or later in the week ? [20:51:11] it has the same card, I want to check with Ma_rk before removing though...but if he okays then we could use that card to swap and get storage3 up [20:51:19] cmjohnson1: i'm fine with that if you've got a procedure that doesn't wipe the stuff on storage3 [20:51:21] yes Lesliecarr I am [20:51:25] LeslieCarr: not later today, but can be tomorrow, if we need to do something [20:51:30] oh i guess it's already 5pm there, probably not going in tonight ;) [20:51:31] whatcha need me to do? [20:51:42] RobH: could you go in tomorrow for the FPC5 reseat and possible replacement ? [20:51:50] I am going to go w/ Dell's suggestion and swap the card boot into raid bios and import the foreign cfg [20:51:56] cool [20:52:10] LeslieCarr: sure can, so will do that with you my early afternoon? [20:52:16] sounds good [20:52:17] i'd be inclined to wipe any config on the card before you pull it from the spare box [20:52:36] probably a good idea [20:52:51] cool--thanks for dealing with this [20:53:00] yep...np [20:53:37] lesliecarr: whats up? [20:54:08] cmjohnson1: retry the uplink [20:54:13] how long will you be there until ? [20:54:19] till 6 [20:54:31] oh so probably should get this right now :) [20:55:30] yep..ez to forget about us east coasters! 
[20:56:01] more like easy to get caught up in stuff and then it's way too late out there [20:56:08] let's just fix this whole "sun" thing [20:56:50] cool...i think i need more cable though ;-] [21:00:51] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [21:01:44] hehehe [21:02:12] ok, so we were going to plug this in to 0/1/45 on the asw-a4 side [21:02:53] and 2.2 on the csw1 side, right ? [21:06:41] yep..same as last week [21:06:52] 45 or 46? [21:06:59] last week it was 46 on a4 [21:07:15] ph 46 :) [21:07:16] thanks [21:07:38] okay....r u ready? [21:08:35] lesliecarr: done [21:08:45] almost ready [21:08:46] :) [21:09:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [21:09:31] so if all of a sudden your internet dies [21:09:34] please unplug it right away :) [21:09:47] okay [21:13:18] PROBLEM - Backend Squid HTTP on amssq49 is CRITICAL: Connection refused [21:13:27] PROBLEM - Frontend Squid HTTP on amssq49 is CRITICAL: Connection refused [21:16:15] lol [21:18:02] lesliecarr: still here... [21:18:14] no loops [21:18:19] ok can you unplug the cable real quick [21:18:25] turns out foundries suck ;) [21:18:47] not actually looped, just that i need to pull it out of the trunk and then put it back in [21:19:14] o.0 [21:20:14] ok, cmjohnson1 can you hook a4 back up again ? 
[21:20:37] done [21:20:58] hrm [21:21:01] wait [21:21:06] let me do that again [21:21:26] ok [21:22:19] not showing link :( [21:24:51] RECOVERY - Frontend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds [21:24:51] RECOVERY - Backend Squid HTTP on amssq49 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds [21:25:09] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2300 [21:29:43] !log shutting down storage3 for troubleshooting [21:29:46] Logged the message, Master [21:32:57] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:59] cmjohnson1: i'm not seeing link [21:38:04] (doh forgot to hit send) [21:40:18] PROBLEM - Backend Squid HTTP on amssq50 is CRITICAL: Connection refused [21:40:27] PROBLEM - Frontend Squid HTTP on amssq50 is CRITICAL: Connection refused [21:41:12] lesliecarr: working? [21:41:18] not seeing link [21:42:33] want me to take out and reinsert [21:42:56] cmjohnson1: I'm heading out, if anything comes up and you need me call my cell [21:43:22] okay, thx [21:43:34] yes please [21:44:55] !log moved default resolution for upload from eqiad to pmtpa [21:44:58] Logged the message, Master [21:48:16] cmjohnson1: still not seeing link up :( [21:49:11] notpeter: has ayush access to stat1? 
[21:50:57] RECOVERY - mysqld processes on db45 is OK: PROCS OK: 1 process with command name mysqld [21:51:15] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:55] lesliecarr: just reinserted...got hung up with storage3 [21:52:05] check 4 link [21:52:21] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 109.21 ms [21:52:40] ok [21:52:57] PROBLEM - NTP on knsq23 is CRITICAL: NTP CRITICAL: Offset unknown [21:53:01] still nothing [21:53:33] PROBLEM - MySQL Replication Heartbeat on db45 is CRITICAL: CRIT replication delay 135503 seconds [21:54:00] PROBLEM - MySQL Slave Delay on db45 is CRITICAL: CRIT replication delay 135073 seconds [21:55:21] lesliecarr: try it now [21:55:25] bet you get traffic [21:55:37] or at least a link [21:55:41] i see it up on the asw-a4 side [21:55:45] not up on the csw1 side [21:57:20] ok....i plugged that in the wrong spot...should be up [21:57:50] oh interesting, so on csw1 it's on 1/26 and not in 2/2 [21:58:15] at least that's the link that just came up [21:58:46] getting some pages [21:58:52] searches [21:59:03] wiktionary-lb-esams [21:59:12] upload-esams [21:59:12] yep, why you leave nagios ? [21:59:18] !log was still upgrading/rebooting amssq* and knsq* hosts on the side (slow,b/c upload squids). expect temp. nagios squid reports tomorrow as well. out for now. [21:59:22] Logged the message, Master [22:00:21] on vibrate now [22:00:44] cmjohnson1: hey, so is the thing plugged into 1/26 on csw1 asw-a4 ? [22:01:20] checking [22:01:55] cmjohnson1: that port came up when you said you have it plugged in , so there might be some mislabeled patch panels [22:02:20] cmjohnson1: db1019 is down (doesn't respond to ping, though occasionally accepts tcp connections to 3306!) and db1019.mgmt doesn't respond to a ping either. can you check it out? 
[22:02:42] test.wikipedia.org appears to be having a problem - looks like srv193 has rather high load at the moment [22:03:01] fenari nfs seems broken [22:03:04] ^ [22:03:05] Ops! [22:03:10] wtf is wrong with fenari and nfs-home [22:03:15] nfs prob would kill test.wiki too [22:03:21] ohho [22:03:31] We were in the middle of a scap and started seeing connection timeouts to 10.0.5.8 (nfs-home) [22:03:34] and even 'no route to host' [22:03:34] lol [22:03:38] Wasn't me this time! ;) [22:03:38] <^demon> Quick everyone log onto fenari ;-) [22:03:46] Then fenari shell just started being unresponsive [22:03:48] open moar shells till you get on [22:03:53] RoanKattouw: you are scapping during the mobile deployment window? [22:04:06] we have untested changes up on fenari... [22:04:14] Oh, we're deploying pagetriage [22:04:19] ! [22:04:20] Deployment window scheduling fail I guess? [22:04:23] If so that would be our fault [22:04:34] we have a weekly window on mondays from 3-4pm PST [22:04:43] why is the site dying? [22:04:58] Aaah, we only had until 3 [22:04:59] ssh: connect to host nfs-home port 22: Connection timed out [22:05:00] nfs [22:05:00] Right [22:05:06] lesliecarr: it is mrjp a3 port2/ which goes to line card 1/25 [22:05:10] oh ok [22:05:11] We technically scapped before 3pm I guess [22:05:13] cool [22:05:19] so i was the one who didn't get where it was :) [22:05:23] i'm disabling nagios notifications [22:05:43] hehehe RoanKattouw sorry, i also didn't see you on the calendar because i was looking at week of april 30 [22:05:48] So 1) your untested changes shouldn't have been there yet when we started scapping and 2) it's not like it's actually pushing them out anyway cause the network is broken [22:06:03] awjr: We have the room for 2-4 so I thought we had the window for 2-4 too, but apparently we didn't. 
Sorry about that [22:06:14] binasher db1019 is in eqiad..i can't do anything from here (pmtpa) [22:06:21] Anyway now we're both stuck cause nothing is working [22:06:24] derp [22:06:30] binasher: +1 [22:06:39] RoanKattouw: no worries, we both failed. i wouldn't have started early if i realized you guys were still in your window, i read the calendar wrong :( plus yeah, world…exploding. [22:06:52] what's exploding? [22:06:54] can we just work on fixing it instead of discussing who failed? :-) [22:07:12] Who failed is more interesting :D [22:07:17] paravoid: Our failures are unrelated to the problem :) [22:07:18] what can we (ops) do? [22:07:28] nfs1 and nfs2 both don't respond to a ping [22:07:30] Make it so that 'ssh fenari' actually does things? [22:07:36] cmjohnson1: can you check those? nfs1 first [22:07:44] well, it's not clear that any changes actually got pushed. i'm not sure what to do now, though, since fenari's broken. [22:08:23] i had shells open there, but they froze when i tried to run a thing. [22:08:26] binasher: looking at it [22:08:34] oh [22:08:37] fenari is back [22:08:56] and i can get to the apaches again [22:09:01] the theory is that the dual deployment created an overload and brought nfs{1,2} to its knees [22:09:15] that's a reasonable theory [22:09:23] raindrift1: Did you change i18n in your update at all [22:09:24] resulting in fenari being unreachable [22:09:35] well, im not sure that makes sense, it's not like we ran scap at the same time [22:09:39] RoanKattouw: i believe so, yes. [22:09:39] If not you can sync-dir instead of scap, and the mobile folks will be unaffected [22:09:42] 193 is still dead [22:09:44] we just happened to have changes up on fenari [22:10:19] maybe we don't have i18n updates, actually. lemme check. [22:10:24] binasher: both nfs's respond to ping locally [22:10:54] no, no i18n changes for us this time around. 
[22:10:58] they do from bast1001 as well [22:11:04] srv193 load average: 27.45, 34.28, 21.62 [22:11:27] fenari right after it started responding again: 37.34, 21.90, 9.43 [22:11:34] fenari now: 3.18, 13.29, 8.03 [22:11:51] On that note. Will someone review/merge and push https://gerrit.wikimedia.org/r/#/c/6156/ [22:12:47] re-enabling nagios notifications [22:13:11] awjr: OK we're sync-dir'ing just our extension dir now [22:13:33] RoanKattouw: ok cool, i'll just wait to do anything else until you guys are done - just lmk [22:13:44] sorry for the midair collision [22:13:48] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:13:48] No worries [22:13:57] PROBLEM - Frontend Squid HTTP on amssq51 is CRITICAL: Connection refused [22:14:17] awjr: You have changes in InitialiseSettings.php , we need to switch one var there [22:14:28] RoanKattouw: that should be a safe change [22:14:31] OK [22:14:48] So we can make our change and then sync out both InitialiseSettings changes (leaving CommonSettings alone)? [22:15:00] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [22:15:14] awjr: ? [22:15:21] RoanKattouw yeah - and if you need CommonSettings, we can just back it up and restore it [22:15:26] OK, thanks [22:15:29] No we just need IS, not CS [22:15:42] really ? [22:15:47] wait sorry [22:15:49] ignore the really [22:16:44] If someone reviews and merges 6156, sync'ing dirs/files won't do so many simultaneous requests [22:17:25] nfs2 has been up 326 days on 2.6.32-32-server [22:17:39] nfs1 was reporting network fails from nfs2 [22:18:25] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6156 [22:18:46] !log rebooting nfs2 to new kernel [22:18:49] Logged the message, Master [22:19:19] RoanKattouw: are you doing a deploy right now? 
[22:19:32] paravoid: Done now, but awjr is also deploying things [22:19:39] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:19:41] and i can ping db1019.mgmt from pmtpa again, and db1019 appears fine [22:20:15] PROBLEM - Apache HTTP on mw45 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [22:20:15] i think there was definitely a network gremlin [22:20:32] once i finish testing these changes, i am going to need to run scap - will that be a problem? [22:20:42] PROBLEM - Apache HTTP on srv197 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [22:20:42] PROBLEM - Apache HTTP on srv206 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error [22:21:05] May 7 22:20:54 10.0.2.197 apache2[6500]: PHP Warning: require_once(/usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/PageTriage.php) [function.require-once]: failed to open stream: Permission denied in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 2488 [22:21:15] Permission denied?!? 
[22:21:25] Looking [22:22:22] reedy@fenari:~$ ls -al /usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/PageTriage.php [22:22:22] -rw-r--r-- 1 mwdeploy mwdeploy 11600 2012-05-03 22:20 /usr/local/apache/common-local/php-1.20wmf2/extensions/PageTriage/PageTriage.php [22:22:56] binasher: drbd does not seem to be back up on nfs2 [22:24:27] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [22:24:54] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [22:24:54] RECOVERY - Apache HTTP on srv206 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [22:26:51] RECOVERY - Frontend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds [22:26:51] RECOVERY - Backend Squid HTTP on amssq51 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.219 seconds [22:28:46] http://www.drbd.org/users-guide/s-resolve-split-brain.html [22:28:57] RECOVERY - Backend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 636 bytes in 0.327 seconds [22:29:20] paravoid: i'm attempting to fix [22:29:33] RECOVERY - Frontend Squid HTTP on amssq50 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds [22:31:02] root@nfs1:~# drbdadm primary nfshome [22:31:02] 1: State change failed: (-2) Refusing to be Primary without at least one UpToDate disk [22:35:16] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6156 [22:35:19] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6156 [22:36:36] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.196 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:37:48] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [22:40:07] !log deleting 14k tmp files from 
spence's /home/nagios [22:40:10] Logged the message, Master [22:42:02] paravoid: i'm not sure how to get nfs1 to be the primary again [22:42:49] the primary option to drbdadm only works before use "Promote the resource´s device into primary role. You need to do this before any access to the device, such as creating or mounting a file system." [22:46:33] paravoid: i was trying to follow http://www.drbd.org/users-guide/s-resolve-split-brain.html [22:46:46] though it needs some adjustment - it seems to be for a newer version of drbd [22:48:35] !log running an osc against plwiktionary.recentchanges on master [22:48:38] Logged the message, Master [22:49:17] !log upgrading glusterfs on labstore1-4 [22:49:20] Logged the message, Master [22:51:05] * Damianz runs screaming [22:52:57] RECOVERY - MySQL Slave Running on db1019 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:54:20] paravoid: last admin log about drbd was 19:20 mark: Migrated DRBD sync between nfs1 and nfs2 from protocol C (sync) to A (async) [22:54:25] Damianz: that's not when to go screaming [22:54:26] from 12/2011 [22:54:29] Damianz: this is when to go screaming [22:54:37] !log upgrading glusterfs on virt1-5 [22:54:40] Logged the message, Master [22:54:54] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 7285 seconds [22:55:31] Ryan_Lane: Pre-emptive screaming for when you trigger a self heal [22:55:45] oh. I'm not doing that [22:56:04] Yay [22:56:05] binasher: okay, should we wait until he wakes up and ask him? [22:56:17] i think so [22:56:23] okay [22:56:25] fair enough [22:56:42] DatabaseBase::makeList: empty input [22:56:42] Backtrace: [22:56:50] on enwiki trying to block an IP [22:56:59] there's a long page, if anybody wants to see a pastebin [22:57:00] Reedy, hashar: I think everything's in order for now, wanna continue our meeting? 
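[For reference, the manual split-brain recovery that the DRBD guide binasher links above describes looks roughly like the sketch below, using the nfshome resource name from the log. The 8.3-era spelling of the discard flag is shown, since he notes the guide "seems to be for a newer version of drbd". This is a sketch of the documented procedure, not a runbook; picking the discard side wrongly destroys data, which is presumably why they chose to wait for mark:]

```shell
# on the node whose data will be DISCARDED (the split-brain victim):
drbdadm secondary nfshome
drbdadm -- --discard-my-data connect nfshome   # drbd 8.3 syntax
# (newer releases spell it: drbdadm connect --discard-my-data nfshome)

# on the surviving node, if it has already dropped the connection:
drbdadm connect nfshome

# only once its disk state is UpToDate again can a node be promoted;
# that is why "drbdadm primary nfshome" was refused earlier
drbdadm primary nfshome
```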
[22:57:02] !log restarting glusterd processes on virt1-5 [22:57:05] Logged the message, Master [22:57:11] er, s/continue/start/ even :) [22:57:33] http://pastebin.com/5ZByA9J4 [22:58:15] On what wiki? [22:58:17] Doing what? [22:58:21] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 6938 seconds [22:58:32] Joan: 05/07/12 [18:56:55] on enwiki trying to block an IP [22:58:55] I don't think you want this channel. [23:01:22] paravoid: yup [23:02:52] binasher, around? [23:03:11] hey [23:03:25] heya .. do you have a sec to recap the state on the varnish large file streaming issue? [23:03:31] we switched back to pmtpa and did some further config tweaks? [23:04:23] further config tweaks didn't help, upload is currently being served from pmtpa and esams on squid [23:05:33] I see, back to squid for now. that definitely helped; I can play all the files reliably now. [23:05:59] i've reached out to martin gydeland re: the commercially funded streaming work they're doing [23:06:13] I'll close the bug as at least the end user impact is resolved for now [23:06:52] cool, thanks. ct mentioned a varnish focused company in norway that could help with custom dev as well [23:08:11] it's the same company [23:08:14] so do you want that MediaWiki patch I suggested done urgently? [23:08:19] ah, ok [23:08:21] Eloquence: it's actually Varnish Inc. [23:08:53] Varnish Software AS .. i don't think they do the Inc thing over there :) [23:09:07] well, yes :) [23:09:11] lol [23:09:12] TimStarling: it would be great if you could still do it, but isn't urgent now [23:09:21] the point is, it's the people who wrote it and maintain it [23:10:02] have we asked wikia how they do it (if they do accept large video files) ? 
[23:10:06] they use varnish i believe [23:10:08] they say the brightcove cdn is using (and funded) the streaming patches for wide scale video streaming [23:10:17] LeslieCarr: they don't run their own varnish [23:10:21] oh ok [23:10:23] +clear [23:11:16] AaronSchulz: are you on the ops mailing list? [23:11:18] apparently a large cable tv provider is also using it for VOD streaming and is able to saturate a 10G link per server via video streaming [23:11:24] yes [23:11:54] so you know the patch I'm talking about? using a different upload hostname for large files? [23:12:07] yeah [23:12:10] I didn't know there was actually a patch [23:12:12] * AaronSchulz checks [23:12:15] there's no patch [23:12:46] I mean a patch that we would like to exist at some point in the future [23:12:49] that's why I can't find it ;) [23:13:03] TimStarling: or at least a different hostname for video [23:13:09] it's the things that don't exist that are always the hardest to find... [23:13:27] AaronSchulz: is that something you'd be interested in doing? [23:14:11] binasher: ideally it would be good if we could still serve both kinds of content from both hosts, so that we can shift the load around but maintain backwards compatibility [23:14:42] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [23:14:43] fastly's varnish build which wikia uses has a bunch of changes that they are keeping closed source / commercial.. 
too bad varnish isn't gpl licensed [23:14:51] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [23:15:00] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 781 seconds [23:15:19] TimStarling: I guess [23:15:30] if varnish isn't good enough for the use [23:15:45] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [23:16:02] there's no isVideo() method in the interface, but there is getSize(), so it would be a bit easier to split by size [23:16:15] makes sense [23:16:18] TimStarling: if we can get at least the tiny frontend varnish instances working well with video streams, we could also have them route requests to dedicated backends for files with specific extensions [23:16:21] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [23:17:15] we could have a list of MIME types or extensions that are sent to some different backend, if that is more useful [23:17:24] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 1 seconds [23:18:05] but video files can be smaller than still images so I'm not sure how much sense it would make [23:18:26] * AaronSchulz would tend to stick to size [23:18:31] agree [23:26:06] RECOVERY - MySQL Slave Delay on db45 is OK: OK replication delay 0 seconds [23:27:09] RECOVERY - MySQL Replication Heartbeat on db45 is OK: OK replication delay 0 seconds [23:49:23] TimStarling: are you talking about the uploadwizard by any chance? [23:49:38] no [23:49:58] heh ok [23:59:24] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 185 seconds
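[The MediaWiki patch Tim and Aaron settle on above, splitting upload traffic by file size via getSize() rather than by MIME type, amounts to picking the upload hostname from the file's size. A minimal illustration in Python; the hostnames and the threshold are hypothetical, chosen only to show the shape of the split, with both hosts still able to serve both kinds of content as Tim asks for:]

```python
# Sketch of size-based upload-host selection: files over a threshold get a
# dedicated hostname so large objects can be routed to streaming-capable
# backends, while either host can still serve any file for compatibility.
LARGE_FILE_THRESHOLD = 64 * 1024 * 1024  # hypothetical 64 MB cutoff

def upload_host(size_bytes, small_host="upload.wikimedia.org",
                large_host="upload-large.wikimedia.org"):
    """Return the hostname a file of size_bytes should be served from."""
    return large_host if size_bytes > LARGE_FILE_THRESHOLD else small_host
```

[A small video then routes like any thumbnail, which is the point of Tim's observation that video files can be smaller than stills: the split follows size, not media type.]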