[00:02:53] New patchset: Cmjohnson; " Changing mac address for ms-be7 to reflect new server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33673 [00:05:35] lesliecarr: can you +2 my change please [00:07:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33673 [00:08:14] cmjohnson1: yes [00:08:15] oh asher got it [00:08:39] cool..lesliecarr: i will have one more fix for you in abit [00:10:30] New patchset: Cmjohnson; "Adding wtp1 to netboot config to use lvm.cfg partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33675 [00:11:01] lesliecarr ^ [00:12:30] binasher: do you know ldap well ? [00:12:51] LeslieCarr: not really :( is it openldap or opendj? [00:12:54] LeslieCarr: please don't swear! [00:13:02] oh opendj [00:16:23] binasher: could you +2 that change for me plz [00:16:47] binasher: did robh tty about the sandisk for labsdb1001 and 1002 in eqiad? [00:20:15] cmjohnson1: nope [00:20:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33675 [00:21:01] !log tstarling synchronized php-1.21wmf4/extensions/UniversalLanguageSelector [00:21:07] Logged the message, Master [00:21:24] ok..he only has the intel 710's there...i have enough for the 2 db's there. Do you want me to send him the sand disk (binasher) [00:21:32] sandisk [00:21:33] I know LDAP well [00:21:37] not sand disk [00:21:38] I know very little of OpenDJ [00:22:16] cmjohnson1: do you have enough even after labsdb3? [00:22:39] yes...and enough for a couple of spare for both sites [00:26:50] New patchset: Tim Starling; "Fix small.dblist nonexistent DB names" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33680 [00:27:18] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33680 [00:27:27] New patchset: Tim Starling; "Disable ULS toolbar for anons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [00:28:07] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [00:28:15] New patchset: Reedy; "Expose s4-s7 dblists and small.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33681 [00:28:44] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [00:28:47] !log tstarling synchronized small.dblist [00:28:53] Logged the message, Master [00:28:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33681 [00:29:08] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:29:14] Logged the message, Master [00:30:44] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:30:51] Logged the message, Master [00:34:08] New patchset: Tim Starling; "Set $wgULSEnableAnon=false again" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33682 [00:34:36] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33682 [00:35:04] !log tstarling synchronized wmf-config/CommonSettings.php [00:35:11] Logged the message, Master [00:35:22] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:35:29] Logged the message, Master [00:36:32] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:11] PROBLEM - Squid on brewster is CRITICAL: Connection refused [00:45:02] New patchset: Cmjohnson; " Removing wtp1 from the lvm.cfg partman recipe" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/33686 [00:45:52] binashser: please +2 this change...it didn't work..just going to do a manual partition..thx [00:47:19] New patchset: Reedy; "Kill off old unused dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33687 [00:49:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33687 [00:52:08] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33686 [00:52:41] thanks paravoid [01:02:05] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [01:05:47] TimStarling: Any idea about "Could not open input file: MWScript.php" in the log/norotate/updateSpecialPages.log cronlogs? [01:06:23] Running the same script on hume as apache works fine [01:07:25] AaronSchulz: Function: FlaggedRevsStats::getEditReviewTimes Error: 2013 Lost connection to MySQL server during query (10.0.6.21) [01:08:09] I guess that at the top of the file has something to do with it... [01:08:15] The file seems ot just start at enwikiquote [01:08:27] strange [01:08:49] so it was run from some other server? [01:09:04] that would explain it, if it was run from somewhere without NFS mounted [01:09:24] The cron entries aren't handled by puppet [01:09:33] It looks like it was just put there manually on hume [01:09:53] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:20] !log aaron synchronized php-1.21wmf4/thumb.php 'debug logging' [01:10:29] Though, how could it write to a log file n NFS if NFS isn't mounted? [01:10:40] the plot thickens [01:11:54] wtf [01:12:31] * AaronSchulz looks up $_REQUEST again [01:12:37] It also seemingly "just broke" [01:12:51] Logged the message, Master [01:13:17] * AaronSchulz likes how $params in thumb.php has cookie info and stuff [01:15:44] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [01:17:55] !log aaron synchronized php-1.21wmf4/thumb.php [01:18:03] Logged the message, Master [01:32:04] !log aaron synchronized php-1.21wmf4/thumb.php [01:32:11] Logged the message, Master [01:32:34] binasher: have you ever had [01:32:40] An error has been detected while trying to use the specified Ubuntu │ [01:32:40] │ archive mirror. [01:33:13] cmjohnson1: hmm, it might mean that something is down on brewster [01:33:42] yeah, squid was down on it.. i just started it back up [01:33:47] i wonder what happened to it [01:33:49] !log aaron synchronized php-1.21wmf4/thumb.php [01:33:56] Logged the message, Master [01:33:58] ah, it died again [01:34:04] that's odd [01:34:07] oh, / is full [01:34:47] again? 
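What took squid down on brewster here was simply a full root filesystem. A minimal triage sketch for that situation (the service name and the exact commands are assumptions; the log only records that / was full and squid was started again):

    # see how full / is and what is eating it
    df -h /
    du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
    # squid cannot write its cache/logs on a full disk; once space is freed, bring it back
    sudo service squid start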
[01:34:53] !log aaron synchronized php-1.21wmf4/thumb.php 'more debugging' [01:34:59] Logged the message, Master [01:35:14] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:39] can we migrate brewster to one of the new misc servers we have in Tampa (r320's) [01:36:53] RECOVERY - Squid on brewster is OK: TCP OK - 0.003 second response time on port 8080 [01:37:28] cmjohnson1: that's probably a good idea, especially if we can put extra disk in one [01:37:29] sh [01:37:35] should be ok for now [01:38:19] great thx [01:39:36] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 240 seconds [01:40:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 302 seconds [01:42:54] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:46:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:57] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [01:49:37] !log aaron synchronized php-1.21wmf4/thumb.php 'remove hacks' [01:49:44] Logged the message, Master [01:49:54] New patchset: Reedy; "Add s4-s7 to comment of output cluster lists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33692 [01:50:05] New patchset: Asher; "prevent build-new from blowing away all incremental status files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33693 [01:50:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33693 [01:50:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33692 [01:58:20] RECOVERY - SSH on wtp1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:59:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 245 seconds [02:01:12] binasher: just read the last line of your email :) [02:02:02] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 367 seconds [02:05:36] TimStarling: odd, if I request http://commons.wikimedia.org/w/thumb.php?f=Aqueduc_Luynes.jpg&w=400 enough times it fails saying no width is set [02:06:14] and the exception entry just has "/w/thumb.php?f=Aqueduc_Luynes.jpg"...it's like the &w=400 just randomly disappears sometimes [02:07:12] it doesn't correlate with particular scalars [02:21:41] RECOVERY - NTP on wtp1 is OK: NTP OK: Offset -0.02122247219 secs [02:25:23] binasher: around? [02:25:57] pgehres: slightly, what's up? [02:26:25] our pmtpa fundraising slave has some unusual CPU activity the past hour [02:26:39] was wondering if you had a sec to poke db78 and see if I should page Jeff [02:27:27] pgehres: i don't have access to the fr dbs [02:27:41] even as a root? :-( [02:27:53] i'm not an fr root [02:27:57] i have no fr access [02:28:26] no ssh and don't know the mysql pw's [02:28:30] ah, i didn't know Jeff had removed wmf roots from them [02:28:52] !log LocalisationUpdate completed (1.21wmf4) at Fri Nov 16 02:28:52 UTC 2012 [02:28:56] i can't say i mind ;) [02:28:59] Logged the message, Master [02:29:03] i can imagine [02:29:15] dyk if anyone other than Jeff has access? [02:29:30] pgehres: paravoid has ssh. idk about mysql passwd [02:29:38] but he's surely asleep ;) [02:29:52] i'd settle for a top right now to figure out what its doing [02:30:10] it's not too late for Jeff, so I will poke him [02:30:16] thanks [02:30:32] yeah, it's fine in Jeff_Green's TZ ;) (that's my TZ too!) 
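For the intermittent thumb.php failure Aaron describes above (the &w=400 parameter apparently vanishing on some requests), a crude reproduction loop is enough to measure how often it happens; this is only a sketch, not the instrumentation that was actually used:

    # hit the scaler entry point repeatedly and tally the response codes
    for i in $(seq 1 50); do
        curl -s -o /dev/null -w '%{http_code}\n' \
            'http://commons.wikimedia.org/w/thumb.php?f=Aqueduc_Luynes.jpg&w=400'
    done | sort | uniq -c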
[02:30:52] indeed, but he left like 3H ago to deal with offspring [02:31:26] PROBLEM - Puppet freshness on mw60 is CRITICAL: Puppet has not run in the last 10 hours [02:31:31] right [02:32:09] whut [02:32:29] PROBLEM - Puppet freshness on mw61 is CRITICAL: Puppet has not run in the last 10 hours [02:33:06] Jeff_Green: run top on db78 please ;) [02:33:16] pgehres [02:33:17] it's the nightly mysqldump run [02:33:35] that's an impressively long dump [02:34:05] you should have seen it at the lunch buffet earlier [02:34:13] that's what you get for storing raw web data in a db :-P [02:34:27] * pgehres glares to the left [02:35:14] * binasher is doubly grateful for not having fr db access  [02:35:17] sorry for the false alarm then, i am edgy today [02:35:18] 348G faulkner [02:35:23] 113G pgehres [02:35:29] it's a race! [02:36:01] funny how a well designed table can cause that number to be so much lower [02:36:11] * pgehres can't wait until analytics takes over this crap [02:36:19] ha yes [02:36:46] Jeff_Green: i would be happy to archive my raw table to disk and truncate the table [02:37:08] !log Forcing puppet run on wtp1 to reinstall Parsoid [02:37:16] Logged the message, Mr. Obvious [02:37:16] go puppet go! [02:37:35] pgehres: i'm only concerned in terms of the the living db footprint, which is rather large [02:37:43] indeed [02:37:53] dumped the faulknerdb is only 14G [02:38:04] gzip for the win [02:38:07] yup [02:38:14] you could do myisam with table compression [02:38:16] i suppose [02:38:25] but eh. this is the least of our concerns really [02:38:28] he tried myisam last year [02:38:37] * binasher is never never touching that db [02:38:49] binasher: don't be a snob :-P [02:39:07] Jeff_Green: do we have a public list somewhere of who has what access to fr? [02:39:10] that's like saying binasher: stop being binasher [02:39:12] * pgehres marks binasher on the list of the first to go [02:39:34] myisam with table compression is actually pretty good for this purpose, write once (i.e. daily), compress, concatenate, laugh if the table crashes. [02:39:47] jeremyb: nope [02:39:55] top secret! [02:39:56] Jeff_Green: you mean page compression, right? [02:40:04] Jeff_Green: srsly? [02:40:05] no i mean table compression [02:40:16] i wonder what that is [02:40:24] entire table, compressed [02:40:36] it's an actually-useful myisam feature [02:40:52] uhuh [02:40:54] ;P [02:41:29] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [02:41:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:41:53] srsly. i used it in CL's webstats. ran good. scaled to 100x what I ever imagined it would do [02:43:02] jeremyb: re. the list, it's not that it's top secret. but is there such a list anywhere for any system around here? [02:43:33] Jeff_Green: admins.pp [02:43:56] Jeff_Green: ewwww, CL is myisam??? [02:44:05] no, not CL [02:44:40] Jeff_Green: do we dump off of db1013? [02:44:45] Jeff_Green: (also, of course site.pp iirc) [02:44:47] or just 1025 and 78? [02:44:52] CL's web stats uses myisam for a specific type of table [02:45:33] jeremyb: admins.pp has a block for fundraising [02:45:38] ah [02:45:51] that will tell you who has shell access to fundraising machines which aren't in the new frack payments cluster [02:46:09] Jeff_Green: and if they *are* in frack? [02:46:21] there's a separate puppet instance which is actually top secret. 
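A sketch of the MyISAM table-compression workflow Jeff_Green describes (write a day's table once, pack it, keep it read-only); the database path and table name are invented for the example:

    # pack a finished, no-longer-written MyISAM table in place
    # (with the table flushed so mysqld is not writing to it)
    myisampack /var/lib/mysql/frlogs/impressions_20121115.MYI
    # rebuild the indexes so MySQL can read the packed table efficiently
    myisamchk -rq /var/lib/mysql/frlogs/impressions_20121115.MYI
    # the table is read-only from here on; if it ever crashes, restore it from the archived copy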
[02:46:41] * jeremyb didn't realize there were any fr machines outside of frack [02:46:54] heh, frack is new and shiny [02:46:59] Jeff_Green: and so e.g. faidon has access to all of the above? [02:47:01] and only in eqiad [02:47:06] jeremyb: yes [02:47:20] ok [02:48:07] right now, it's essentially me, faidon, leslie, and mark as the ops role [02:48:14] RECOVERY - Puppet freshness on mw60 is OK: puppet ran at Fri Nov 16 02:47:51 UTC 2012 [02:48:32] (err sorry for binging everyone's IRC handles there) [02:48:37] !log LocalisationUpdate completed (1.21wmf3) at Fri Nov 16 02:48:37 UTC 2012 [02:48:44] Logged the message, Master [02:48:48] oh and daniel [02:48:59] ok, so leslie and mar k and muntant [02:49:20] err, s/n// [02:52:46] RECOVERY - Puppet freshness on mw61 is OK: puppet ran at Fri Nov 16 02:52:28 UTC 2012 [03:31:34] gerrit down for anyone else? [03:32:31] which port? [03:32:44] 443 [03:33:00] K4-71_3 commited a change but we can't review it [03:33:09] i suppose there is the api [03:33:12] I was about to say: "Me too". [03:33:16] apache's hanging [03:33:18] but that sounds like a lot of work [03:33:42] i guess you're saying ssh is fine then? [03:33:57] anyway, yes something needs fixing [03:34:14] why isn't nagios complaining? [03:42:11] Gerrit WFM [03:48:05] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:58:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [04:12:20] Change abandoned: Tim Starling; "Superseded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33318 [04:21:32] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [04:54:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [05:03:23] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:10:26] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [05:27:58] TimStarling: http://meta.wikimedia.org/?diff=4568837&oldid=4566973&rcid=3697528 [05:28:08] I'm pretty sure that that's nonsense [05:28:52] <^demon> "server hard drives that are destroyed due to angry politicians" [05:28:56] <^demon> What was your first clue? [05:28:57] ^ [05:29:01] (exactly) [05:29:25] the fact that he/she calls our infrastructure fragile [05:29:44] close duplicate [05:31:09] it's not exactly the first time somebody has said wikipedia should be moved to some distributed system like freenet or bittorrent or whatever [05:31:27] <^demon> Also, "A simple example is political pages on wikipedia which are often manipulated by competing parties." [05:31:37] <^demon> I'm not entirely sure what that has to do with infrastructure. [05:32:27] <^demon> TimStarling: Indeed. Moving us on to a P2P network and/or backing revisions in Git seem to be a perenial proposal. [05:34:10] <^demon> Even more fun, combine both proposals into one: http://lists.wikimedia.org/pipermail/wikitech-l/2008-December/040516.html [05:36:29] I think I have found an earlier one... [05:37:02] http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/833 [05:37:05] perennial * [05:37:34] "Nowadays distributed software solutions are the height of fashion. Why not devise a distributed Wikipedia ? Programmers ?" [05:37:46] from August 2001 [05:38:16] > Wikipedia is a great idea combined with a new, revolutionary software [05:38:19] and it has a lot of brilliant committed authors. Her growth is [05:38:22] explosive. 
But there are also weaknesses (Wikinesses ?) brought into [05:38:25] light be some of us. [05:39:51] * Jasper_Deng_busy considers whether to reply to what he just linked [05:40:10] do I really need to spend my time telling how wrong that person is on a technical level? [05:40:34] I thought it was robo-spam. [05:40:37] you can probably find an FAQ entry somewhere [05:45:52] <^demon> Brooke: I didn't have a squiggly underline, so I didn't bother seeing if it was incorrect ;-) [05:48:33] March 2002: http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/2091 [06:13:43] lol: http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/18418 [06:14:15] there you go Jasper_Deng_busy, I guess I answered this myself back in 2004 [06:15:01] TimStarling: I would also think that the OP (in what I linked) is wrong b/c we have 3 datacenters, across 2 countries [06:16:28] no replication of images though [06:17:08] and our plans are to set up image replication to another datacentre in the US, not to another country [06:17:20] so even that plan wouldn't make our paranoid friend happy [06:17:49] the "lack of innovation" charge is also quite spurious [06:18:32] well, it is innovation narrowly defined [06:19:03] he's saying that there is a lack of innovation, where innovation is defined as distributed computing [06:20:38] TimStarling: on this note, doesn't Google manage to replicate between such large numbers of datacenters? [06:21:46] probably [06:22:16] they probably use a secret mechanism, but is cost the only barrier to the WMF doing so? [06:22:24] (although I personally see no need for it) [06:22:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:22:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:22:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:23:22] so we spent about a year implementing Swift support, because (among other things) it had a replication feature [06:23:42] images are now stored into swift (as well as NFS) [06:24:23] but when someone actually tried to turn on replication of images to eqiad, he discovered that the feature was actually half-written and completely inappropriate for our needs [06:24:29] it was about 200 lines of code [06:25:13] but, luckily, the world moves on, and it turns out that ceph, previously a research project, has had a lot of funding in the last few years and is now looking like a serious alternative to swift [06:25:38] it has a swift-compatible API (the few accidental incompatibilities have been fixed in the client by Aaron already) [06:25:53] and it appears to have a working replication feature [06:26:24] so maybe some time in the next year or so, we will have image replication [06:26:27] But nothing as crazy as described by that person, right? 
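On "swift-compatible API": the client side is plain HTTP, so the same requests work whether Swift itself or Ceph's radosgw is answering them, which is what makes the swap thinkable. A minimal sketch with made-up endpoint, credentials and container name:

    # v1-style auth: trade an account/user and key for a token and storage URL
    curl -si -H 'X-Auth-User: mw:thumb' -H 'X-Auth-Key: secret' \
        https://ms-fe.example.wmnet/auth/v1.0 | grep -i 'x-auth-token\|x-storage-url'
    # then list a container with the returned token
    curl -s -H "X-Auth-Token: $TOKEN" "$STORAGE_URL/wikipedia-commons-local-thumb"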
[06:28:32] well, that person is paranoid and exaggerates the risks [06:29:08] if some politician orders all WMF hard drives destroyed, then there is a copy in the Internet Archive of most stuff [06:29:21] IA is distributed and has content outside the US [06:30:07] maybe the politician will also order missile strikes on foreign IA installations in order to destroy Wikimedia more completely [06:30:29] by then, everyone will probably download our DB [06:30:51] right, well there are plenty of copies of the article text around, it would be hard to destroy all of those [06:31:15] less copies of the images, for those you more or less rely on the initial contributors keeping copies of their own work [06:31:28] then maybe they can be contacted to reconstruct the project as it once was [06:31:42] also, with missile strikes on amsterdam etc., the user database would be lost [06:31:54] so you would have to reconstruct that somehow [06:32:38] probably the result could be described as a "broken shell" as the commenter puts it [06:32:58] but the chance of it actually happening seems fairly low to me [06:33:05] and of no immediate concern [06:33:07] hrmmmmm, can you name your source please? [06:33:09] > According to a credible source, Dr. Stallman is our love slave. [06:37:01] yeah, well he was very positive about the whole idea of wikipedia [06:38:45] this person is also concerned about things like DNS poisoning and IP blocking, apparently [06:39:13] the WMF should fund the freedombox obviously [06:39:14] which of course is reasonably common [06:40:38] I just wonder whether he really knows what he's talking about [06:40:47] (the person whose comment I linked to) [06:41:00] no, not really [06:41:17] take china for example [06:41:32] they block us routinely, with various means [06:41:43] they control access to certain pages and search terms [06:42:05] but they don't really care that much about controlling circumvention techniques [06:42:30] as far as they are concerned, as long as 99% of the population is kept in the dark, they've won the battle [06:42:41] talk about the 1% [06:43:02] they have realistic expectations [06:43:14] they don't expect to be able to control the thoughts of every would-be activist [06:43:44] they expect to be able to shut down the group that the activist forms, if and when it becomes a threat to the party [06:44:15] gtg [06:47:45] I'm still impressed that we manage to make do with so little [06:48:05] hit rate! 
[06:48:22] and domas and co [07:40:43] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:16:47] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:00] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [10:22:29] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:30:19] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [12:07:47] New patchset: ArielGlenn; "ms-be7 reconfigured with ssds sdm and sdn; try fixing up sdmn/sdn3 also" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33711 [12:32:07] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [12:35:52] It would look like bits is periodically not serving css/js to a large number of people [12:35:59] Not sure if it's got anything to do with the 12.04 upgrades [12:37:21] !log Running sync common on srv252, lots of Failed opening required fatals... [12:37:28] Logged the message, Master [12:40:00] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [12:41:44] Reedy: there is a longbackread in this channel about that [12:41:49] lol [12:42:04] drwxr-xr-x 17 mwdeploy mwdeploy 4096 Oct 9 23:29 php-1.20wmf12 [12:42:04] drwxr-xr-x 17 mwdeploy mwdeploy 4096 Oct 18 19:43 php-1.21wmf1 [12:42:04] drwx------ 17 mwdeploy mwdeploy 4096 Nov 2 18:38 php-1.21wmf2 [12:42:04] drwx------ 17 mwdeploy mwdeploy 4096 Nov 12 15:26 php-1.21wmf3 [12:42:05] drwx------ 17 mwdeploy mwdeploy 4096 Nov 16 02:13 php-1.21wmf4 [12:42:16] oh seriously, I thought the perms stuff got fixed [12:42:17] Has anyone tried to fix the directory permissions? [12:42:24] ah [12:42:24] they said that the configs weren't being read because of that [12:42:29] yeah [12:42:32] nothing is [12:42:36] (I was not here at the time) [12:42:38] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [12:42:38] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:42:41] er what hosts? [12:42:42] drwx------ 5 mwdeploy mwdeploy 4096 Nov 16 00:34 wmf-config [12:42:53] Just srv252 that I can see in the logs at the moment [12:43:10] only srv252 in the last 1000 log lines [12:44:52] dirs done [12:45:47] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:46:10] lemme know if something like that is still showing up [12:46:23] PROBLEM - check_gcsip on payments1002 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1001 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1003 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1004 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:49:06] apergos: Can you do wmf-config on that host too please? 
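The fix being applied to srv252 above is just restoring world-readable bits on deployment directories that ended up mode 0700; in shell terms it amounts to something like this (the exact common-local path on the apache is an assumption, based on the rsync paths quoted later in the log):

    # put group/other read and traversal bits back on the affected trees
    chmod -R g+rX,o+rX /usr/local/apache/common-local/php-1.21wmf{2,3,4}
    chmod -R g+rX,o+rX /usr/local/apache/common-local/wmf-config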
[12:49:54] should have caught that [12:50:13] did tests too for good measure though I think we don't need that [12:50:25] heh [12:50:34] don't see anything else right off [12:50:38] i'll give it a couple of minutes and see if the errors have quietened down [12:50:42] thanks [12:50:53] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:51:03] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [12:51:09] heh lookie there [12:53:54] Change abandoned: ArielGlenn; "will bork the existing partitions this way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33711 [12:55:24] RECOVERY - check_gcsip on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.489 second response time [12:55:24] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.700 second response time [12:55:24] RECOVERY - check_gcsip on payments1003 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.739 second response time [12:55:24] RECOVERY - check_gcsip on payments1004 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.488 second response time [12:55:24] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.706 second response time [12:55:24] RECOVERY - check_gcsip on payments1002 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.508 second response time [12:55:24] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.684 second response time [12:55:25] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.693 second response time [13:01:46] New patchset: ArielGlenn; "ms-be7 added as 720xd with ssds as sdm/n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33715 [13:04:09] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33715 [13:05:20] Yup, that's clearing up [13:05:32] just not enough errors going on that it disappears from the last 1k linrs ;) [13:07:14] ok [13:07:16] yay [13:28:09] apergos: Good Morn'n [13:28:23] hello [13:28:38] i see you have ms-be7 going...did you change the boot disk in bios? [13:29:03] I did but it's not pxe booting, just hanging [13:29:09] I just went back into the bios to check but [13:29:17] it's changed [13:29:25] i.e. nic first [13:29:32] so why it hangs, I have no idea [13:30:48] that is odd [13:30:59] okay for me to look at it [13:31:01] ? [13:31:17] well I'm just now trying a pxe boot again [13:31:26] so as soon as that fails again, sure [13:33:53] was doublechecking the mac but it looks right [13:34:43] wow well [13:34:54] Nov 16 13:34:11 brewster dhcpd: DHCPDISCOVER from 90:b1:1c:18:bc:65 via 10.0.0.202: network 10.0/16: no free leases [13:35:12] oh wow [13:35:27] NIC.Integrated.1-1-1 Ethernet = 90:B1:1C:18:BC:65 [13:35:32] heh [13:35:46] wanna fix that in puppet and push it around? (or I can, don't care) [13:36:04] the first one in the *list* is 67, but nic 1 is 65 :-) [13:36:34] don't care ...go ahead and get it done [13:36:39] sure thing [13:37:04] i would not have caught that the first time either ...who puts the 3rd nic first? 
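The diagnosis above comes down to comparing the MAC address in brewster's "no free leases" line with the one that had been configured; as a sketch (log and config paths are guesses, and the real host entry is generated from puppet, change 33718):

    # on brewster: which MAC is the new box actually PXE-booting from?
    grep 'DHCPDISCOVER.*no free leases' /var/log/syslog | tail -3
    # and which MAC did we configure for it?
    grep -ri 'ms-be7' /etc/dhcp3/ /etc/dhcp/ 2>/dev/null
    # here the box was sending 90:b1:1c:18:bc:65 (NIC.Integrated.1-1-1) while the entry
    # had been created from the first MAC in the list (ending in :67), hence no matching lease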
[13:37:15] hahaha dunno [13:37:29] will have to check that on the next 10 [13:38:10] New patchset: ArielGlenn; "fix mac for ms-be7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33718 [13:39:07] yeah look for the one that says 1-1 [13:39:21] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33718 [13:39:52] so these aren't ready for first puppet run yet, I'm going to pxe and then I have to get a filesystem tweak working on there [13:40:23] if I can get it working for ms-be7 the rest can do just install outa the box and "just work" (well as soon as they get moved to the right stanzas in netboog.cfg and site.pp) [13:41:40] okay [13:41:55] will you be able to tweak ms-be6...or did you do that already? [13:42:46] it's already done by hand [13:42:54] but we don't want to do that for them all [13:45:20] better [13:46:50] are those other 10 here yet? [13:46:52] er, there [13:47:04] no, they are not here or there? [13:47:09] boo [13:47:15] next wekk I guess? [13:47:20] *week [13:47:34] maybe next week but American Thanksgiving will disrupt taht [13:47:48] oh yeah [13:48:24] well maybe if they arrive early in the week a few can get racked [13:48:27] I will be checking in with my Dell contact to see if he has an update later today...i will send you an email unless i see you active on irc [13:48:37] ok [13:48:44] if I can them Mon or Tues...they will be up before the holiday. [13:48:47] (I will do thebackead here either way) [13:48:54] I am leaving for eqiad next weekend [13:49:02] cool [13:49:11] wait, permanent leaving? [13:49:18] yep [13:49:22] wow congrats [13:49:25] got a place picked out? [13:49:47] i do, went up a few weeks ago for a couple of days [13:50:21] sweeeet [13:50:57] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [13:51:01] wow [13:51:05] uh oh [13:51:11] I didn't clean up certs or anything [13:51:15] where are you gonna live? [13:51:39] in Ashburn...a community called Brambleton...about 5 miles from DC [13:51:51] pics? 
[13:52:10] none that I took [13:53:07] http://www.brambleton.com/ [13:53:27] arrgg [13:53:54] apergos: if you don't clean certs it is a pita [13:54:03] Unable to install GRUB in /dev/sda [13:54:14] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:14] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:14] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:23] it's not supposed to go in /sda [13:54:31] it is supposed to go in /dev/sdm [13:54:41] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: Connection refused by host [13:54:59] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: Connection refused by host [13:54:59] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-object-server on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-container-server on ms-be7 is CRITICAL: Connection refused by host [13:55:34] yeah I know [13:55:35] PROBLEM - swift-object-updater on ms-be7 is CRITICAL: Connection refused by host [13:55:35] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: Connection refused by host [13:55:35] PROBLEM - SSH on ms-be7 is CRITICAL: Connection refused [13:55:44] PROBLEM - swift-account-server on ms-be7 is CRITICAL: Connection refused by host [13:56:04] apergos: in bios or raid bios you need to select the disk that will be the boot disk...not bios boot order [13:56:15] grrr [13:56:21] ok, guess I'd better do that then [13:56:34] you need to select disk 12 the first ssd [13:56:50] right [13:58:12] so mark i will never be more than 15 minutes away...no big commute time for me [13:58:13] now I can go clean up the cert :-P [13:58:23] cool [14:00:46] in raid bios you said? [14:01:15] i believe so [14:06:09] here we go again [14:08:46] it will work [14:08:47] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:19] fyi: just received this " The remaining 10 shipped today. " [14:10:25] yay! [14:10:31] and that means they will arrive on...? [14:12:10] didn't give me a tracking number yet...just requested it...i will let you know but normally 1-2 days max [14:13:11] so maybe even monday! [14:13:40] yep..maybe! that would be great [14:13:47] yeah :-) [14:14:29] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:15:04] c'mon grub install correctly [14:15:27] that's what I'm sayin [14:15:33] nope [14:15:43] * apergos goes back to the raid bios again [14:17:56] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:21:16] it's set correctly afaict [14:22:17] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:53] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [14:27:59] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:28:57] New review: Tarheel95; "What's the status on this?" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/31302 [14:42:36] New patchset: Cmjohnson; "Adding db42 to decommission.pp list to remove from Nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33720 [14:43:50] New patchset: Matthias Mullie; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [14:51:32] PROBLEM - NTP on ms-be7 is CRITICAL: NTP CRITICAL: No response from NTP server [14:51:47] it's still trying to put grub on /dev/sda no matter what I do [14:53:45] hrm...the recipe was right...it worked on be6 [14:54:04] apergos: is it failing install? [14:54:12] yes, it fails [14:54:47] I'm looking at the log now hoping to find something useful [14:55:08] ok form me to poke at it for a few? [14:55:19] lemme finish looking then I'll give it to you [14:55:28] k [14:55:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:04:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:06:14] cmjohnson1: getting off, you'll want to powercycle the box [15:06:24] okay [15:06:43] can you check this for me when you get a chance https://gerrit.wikimedia.org/r/33720 [15:06:45] thanks [15:06:55] common.cfg [15:07:01] common.cfg:d-i grub-installer/bootdev string /dev/sda [15:11:12] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:56] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [15:16:04] !log pooling ms-fe1 back with the modified rewrite.py [15:16:12] Logged the message, Master [15:19:35] on ms-be6 grub is installed on /dev/sdm and /dev/sdn as it should be [15:21:59] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [15:24:22] apergos: so it's definitely not h/w or server related. it appears to be using the wrong netboot.cfg [15:25:41] there's only one of those [15:26:13] the partman recipe [15:26:37] it partitions them fine, I see the partitions [15:26:45] md0, using sdm/n as it should [15:29:22] if you go to a shell from the installer [15:29:32] is there a way to tell which config it is using for partitioning? 
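One concrete way to answer that question from inside debian-installer, as a sketch assuming the stock d-i tools on the installer's second console:

    # on tty2 of the running installer
    cat /proc/mdstat                                   # confirm md0 really is built from sdm/sdn
    debconf-get grub-installer/bootdev                 # what the preseed handed to grub-installer
    grep -i 'grub-installer\|bootdev' /var/log/syslog | tail   # what grub-installer then tried to do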
[15:29:57] you can cat /proc/mdstat and see that md0 is using the right things [15:30:00] uhhh [15:30:07] I don't know [15:33:22] !log reedy synchronized php-1.21wmf4/extensions/EducationProgram/ [15:33:28] Logged the message, Master [15:34:19] !log reedy synchronized php-1.21wmf3/extensions/EducationProgram/ [15:34:25] Logged the message, Master [15:35:31] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:55] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [16:23:58] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Fri Nov 16 16:23:28 UTC 2012 [16:24:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:24:07] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:24:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:41:28] New review: Cmcmahon; "I'd like to have this in place as soon as possible, if it looks right" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/33721 [16:48:32] mw52 is making noise now [16:48:55] or was, 6 minutes ago [16:49:00] !log Running sync-common on mw52 [16:49:06] Logged the message, Master [16:55:40] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Cannot make SSL connection [17:00:10] RECOVERY - check_minfraud_primary on payments1 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.223 second response time [17:00:37] New patchset: Hashar; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:02:17] New review: Cmcmahon; "Sounds reasonable to me" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/33721 [17:03:27] apergos: a whole new error http://p.defau.lt/?K1ot_9vgcsB80WaK4qso2g [17:03:42] oh joy [17:04:13] is this with the same disks we had? [17:04:20] yes [17:07:02] New patchset: Hashar; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:09:30] New review: Hashar; "Patchset 3:" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/33721 [17:13:04] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:15:33] !log hashar synchronized wmf-config/CommonSettings.php 'Redis-setup is unavailable for wmflabs, use memcached {{gerrit|33721}}' [17:15:40] Logged the message, Master [17:15:52] New patchset: Reedy; "Update to master" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33735 [17:16:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33735 [17:28:49] New patchset: Hashar; "beta: disable OnlineStatusBar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33738 [17:29:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33738 [17:40:58] !log depooling ms-fe1 [17:41:06] Logged the message, Master [17:42:19] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:50:08] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:13] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:06:23] notpeter: i have a question...busy? 
[18:17:36] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:29] hi Andrew, ping me when you are back. thks [18:41:58] cmjohnson1: sup [18:45:43] nopeter: could you look at the partman recipe ms-be-with-ssd-last.cfg....for some reason the grub installer keeps wanting to put the OS on /dev/sda and and not /sdm where it should go [18:45:54] *when you get a chance [18:46:28] sure! [18:47:40] gimme like 30 minutes and I'll take a look [18:48:27] sure..ok [18:48:28] thx [18:48:48] RECOVERY - mysqld processes on es2 is OK: PROCS OK: 1 process with command name mysqld [18:56:09] !log temp stopping puppet on brewster [18:56:15] Logged the message, notpeter [19:17:29] New patchset: Pyoungmeister; "repooling es2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33745 [19:18:29] can some look at this..did this mornign https://gerrit.wikimedia.org/r/33720 [19:22:54] cmjohnson1: for some reason I thought we weren't decomming db42 [19:23:01] but you probably know better than I do [19:23:19] we are not decoming it completely...just want nagios to stop reporting it...it will need to be repurposed [19:23:27] cmjohnson1: also, where is this partman conf that you want me to look at [19:23:54] ms-be-with-ssd-last.cfg. [19:23:55] we could just ack it in nagios? [19:24:11] notpeter: okay...didn't know to do that [19:24:24] cmjohnson1: on what boxdo you have nagios credentials? [19:24:41] don't know [19:24:43] ah, there's the file, sorry [19:29:05] cmjohnson1: I think you can add [19:29:06] d-i grub-installer/only_debian boolean true [19:29:13] to the partman recipe [19:29:42] or, if that doesn't work [19:29:47] looks like this will: [19:29:48] d-i grub-installer/only_debian boolean false [19:29:54] d-i grub-installer/with_other_os boolean false [19:30:32] d-i grub-installer/bootdev string (sdm1) [19:30:45] (form https://help.ubuntu.com/12.04/installation-guide/example-preseed.txt ) [19:32:31] d-i grub-installer/only_debian boolean true [19:32:31] d-i grub-installer/with_other_os boolean true [19:32:31] are in common.cfg now [19:35:35] what's the problem? [19:35:49] ms-be6 was installed fine with that conf, wasn't it? [19:37:34] apergos: ^^^ [19:37:42] yes [19:37:47] I wish I knew what the problem is [19:37:51] but it won't install [19:37:58] what do you mean it won't install? [19:38:11] specifically, grub tries to install on /dev/sda and fails [19:38:35] apergos: we should replicate the same sequence as ms-be6 and see if it works [19:38:35] logs? [19:39:30] Nov 16 14:51:50 main-menu[541]: (process:18004): Error: /dev/sda: unrecognised disk label [19:39:38] Nov 16 14:51:50 main-menu[541]: (process:18004): grub-probe: error: unknown filesystem. [19:39:49] and then it fails out [19:41:13] ~/wikimedia/puppet/files/autoinstall$ grep sda * [19:41:13] common.cfg:d-i grub-installer/bootdev string /dev/sda [19:41:40] yes, I know [19:42:05] but as you point out [19:42:08] it worked on ms-be6 [19:42:15] what is broken? [19:42:20] for ms-be7 [19:42:20] it worked on ms-be6 because you first partitioned it on sda/b [19:42:38] so you had a partition table on sda [19:43:42] and yet there is no grub bootloader installed on /dev/sda on ms-be6 [19:44:14] I checked for that [19:46:50] puppet calls out to parted and recreates MBR though, isn't it? 
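What everyone is circling here is that common.cfg pins grub-installer/bootdev to /dev/sda globally, so the ssd recipe needs its own override; per the example-preseed page notpeter links, that would look roughly like the following (whether a later per-recipe file actually wins over common.cfg was exactly the open question, so treat this as a sketch rather than the change that was merged):

    # append an override to the ssd recipe so grub targets the first SSD instead of /dev/sda
    cat >> files/autoinstall/ms-be-with-ssd-last.cfg <<'EOF'
    d-i grub-installer/only_debian boolean true
    d-i grub-installer/with_other_os boolean false
    d-i grub-installer/bootdev string /dev/sdm
    EOF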
[19:47:18] I don't know if it writes over the mbr [19:47:19] yes it does, and it even formats it into a GPT label [19:47:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33745 [19:48:20] !log py synchronized wmf-config/db.php 'es2 going live' [19:48:27] Logged the message, Master [19:48:32] did you try changing grub-installer/bootdev? [19:48:54] no, not yet [19:49:15] ok [19:49:20] what made the grub loader be installed on sdm and sdn o ms-be6? [19:49:21] if that fails too, ping me and I'll have a look [19:49:37] it seems to me we just want that to happen without the "extra" sda [19:54:36] AaronSchulz: any objection to me deleting all metrics from graphite that haven't been updated in the last 14 days? [19:54:48] nope [20:02:56] !log depooling srv214-srv218 from api pool for upgrade to precise [20:03:03] Logged the message, notpeter [20:03:29] !log depooling srv231-srv247 from apaches pool for upgrade to precise [20:03:36] Logged the message, notpeter [20:06:13] binasher: did you get a change to look at job_attempts? [20:06:46] AaronSchulz: ba, no.. i'll do that shortly [20:06:55] its already been merged, right? [20:07:07] yeah, it should have just been +2ed ;) [20:08:35] Change abandoned: Cmjohnson; "easier way to do this" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33720 [20:10:53] ok, just deleted over 12k metrics from from graphite [20:11:10] now let me look at that schema change [20:13:44] New patchset: Cmjohnson; "Fixing partman recipe for msbe cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33749 [20:13:58] apergos: review plz ^ [20:14:45] looks ok to me, no idea where it should actually go or what happens if we give directives multiple times though [20:15:01] (I've been asking google but not finding anything so far) [20:15:13] PROBLEM - Host srv216 is DOWN: PING CRITICAL - Packet loss = 100% [20:15:14] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33749 [20:15:59] notpeter: I saw at somepoint you disabled puppet on brewster [20:16:04] we need to run it [20:16:09] ok, yeah [20:16:10] go for it [20:16:16] sorry I didn't re-enable [20:16:32] running manually, that's fine [20:18:01] mc1016 mac addr change [20:18:03] notpeter: ^^ [20:18:19] yeah [20:18:23] that was for some testing [20:18:52] paravoid: https://gerrit.wikimedia.org/r/#/c/33751/ :) [20:19:42] PROBLEM - Apache HTTP on srv218 is CRITICAL: Connection refused [20:19:51] PROBLEM - Apache HTTP on srv215 is CRITICAL: Connection refused [20:19:51] PROBLEM - Memcached on srv215 is CRITICAL: Connection refused [20:19:51] PROBLEM - Memcached on srv218 is CRITICAL: Connection refused [20:20:00] PROBLEM - SSH on srv218 is CRITICAL: Connection refused [20:20:09] PROBLEM - SSH on srv215 is CRITICAL: Connection refused [20:20:09] PROBLEM - Memcached on srv214 is CRITICAL: Connection refused [20:20:36] PROBLEM - SSH on srv214 is CRITICAL: Connection refused [20:20:36] PROBLEM - Apache HTTP on srv214 is CRITICAL: Connection refused [20:20:54] RECOVERY - Host srv216 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [20:21:46] notpeter: which es server were you working on yesterday? 
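Graphite keeps one whisper file per metric, so the sort of cleanup binasher just logged (dropping metrics not updated in 14 days) can be done as an age-based sweep over the whisper tree; a sketch using the upstream default path, since the log does not say how it was actually done:

    # list candidates first, then delete; mtime on a .wsp file tracks the last datapoint written
    find /opt/graphite/storage/whisper -name '*.wsp' -mtime +14 -print
    find /opt/graphite/storage/whisper -name '*.wsp' -mtime +14 -delete
    # tidy up directories that are now empty
    find /opt/graphite/storage/whisper -type d -empty -delete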
[20:21:48] PROBLEM - Host srv238 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv237 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv236 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv233 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv231 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:49] PROBLEM - Host srv235 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:49] PROBLEM - Host srv232 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:50] PROBLEM - Host srv239 is DOWN: PING CRITICAL - Packet loss = 100% [20:22:22] binasher: es2 [20:22:29] is it back in? [20:22:31] yes [20:22:42] I looked at it, tested it on one apache [20:22:45] looked good [20:23:24] cool, go ahead with es1, but this time copy from a host in eqiad that isn't getting traffic [20:23:32] AaronSchulz: yay! [20:24:04] okay, away for a few hours [20:24:09] binasher: ok [20:24:11] paravoid: tailing the swift log to debug 42047 showed to much annoying stuff :) [20:24:19] binasher: any reason in particular? [20:24:22] just curious [20:24:33] there were a spike in db query failures to es3 throughout the day yesterday.. the 7200 rpm sata disks in the es1 cluster suck for this [20:24:46] ah, ok [20:24:48] PROBLEM - Memcached on srv216 is CRITICAL: Connection refused [20:24:53] i was pretty tempted to kill the copy yesterday, probably should have [20:24:57] RECOVERY - SSH on srv218 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:24:58] gah [20:24:59] ok [20:25:01] sorry :/ [20:25:06] RECOVERY - SSH on srv215 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:33] RECOVERY - SSH on srv214 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:42] PROBLEM - Apache HTTP on srv216 is CRITICAL: Connection refused [20:26:01] binasher: did I tell you that I adore https://gdash.wikimedia.org/dashboards/jobq/ [20:26:09] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [20:27:30] RECOVERY - Host srv238 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [20:27:30] RECOVERY - Host srv236 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:27:30] RECOVERY - Host srv237 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [20:27:30] RECOVERY - Host srv233 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms [20:27:30] RECOVERY - Host srv231 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [20:27:31] RECOVERY - Host srv232 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [20:27:31] RECOVERY - Host srv239 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [20:27:32] RECOVERY - Host srv235 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [20:27:57] binasher: that weird thumbnail bug I was talking about last night is https://bugzilla.wikimedia.org/show_bug.cgi?id=42047 [20:28:15] PROBLEM - SSH on srv234 is CRITICAL: Connection refused [20:28:15] PROBLEM - Memcached on srv234 is CRITICAL: Connection refused [20:28:24] PROBLEM - Apache HTTP on srv234 is CRITICAL: Connection refused [20:29:16] apergos: nope...failed still trying for /sda [20:29:45] ok so either it's a problem with multiple directives, ec etc [20:29:50] or its some other random thing [20:29:52] or it's [20:30:01] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1012629 that bug [20:30:08] could be the bug [20:30:11] I guess that about covers all the possibilities [20:30:17] I take it it tries and fails? [20:30:34] same old error? 
[20:30:42] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:43] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:47] i don't know I watched the grub installer and didn't look like it tried anything but /sda [20:30:55] !log purging zhwiki refreshLinks2 jobs [20:31:00] PROBLEM - Memcached on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:02] Logged the message, Master [20:31:03] ok. but I mean it complained with the same error message? [20:31:07] binasher: that's a good way to start :) [20:31:09] yes [20:31:12] ok [20:31:15] AaronSchulz: exactly! [20:31:27] PROBLEM - Memcached on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:36] PROBLEM - Memcached on srv235 is CRITICAL: Connection refused [20:31:36] PROBLEM - Memcached on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:36] PROBLEM - Memcached on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:45] RECOVERY - SSH on srv234 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:32:12] PROBLEM - Apache HTTP on srv232 is CRITICAL: Connection refused [20:32:12] PROBLEM - Apache HTTP on srv235 is CRITICAL: Connection refused [20:32:12] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused [20:32:57] PROBLEM - Memcached on srv233 is CRITICAL: Connection refused [20:32:57] PROBLEM - Memcached on srv231 is CRITICAL: Connection refused [20:33:06] PROBLEM - Memcached on srv232 is CRITICAL: Connection refused [20:35:30] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [20:36:33] PROBLEM - Host srv242 is DOWN: PING CRITICAL - Packet loss = 100% [20:37:40] hrm, plwiktionary has 238k jobs in its queue [20:39:12] grrr not again [20:39:21] New patchset: MaxSem; "Kill mobileRedirect.php, not used since forever" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33786 [20:39:42] PROBLEM - NTP on srv215 is CRITICAL: NTP CRITICAL: No response from NTP server [20:40:00] PROBLEM - SSH on srv241 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv243 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv244 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv245 is CRITICAL: Connection refused [20:40:18] PROBLEM - SSH on srv247 is CRITICAL: Connection refused [20:40:18] PROBLEM - SSH on srv246 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv241 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv246 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv243 is CRITICAL: Connection refused [20:40:19] PROBLEM - Apache HTTP on srv245 is CRITICAL: Connection refused [20:40:19] PROBLEM - Apache HTTP on srv247 is CRITICAL: Connection refused [20:40:20] PROBLEM - Apache HTTP on srv244 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv243 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv244 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv245 is CRITICAL: Connection refused [20:40:48] !log running patch-job_attempts.sql migration on all wikis [20:40:54] PROBLEM - Memcached 
on srv246 is CRITICAL: Connection refused [20:40:57] Logged the message, Master [20:41:04] PROBLEM - Memcached on srv247 is CRITICAL: Connection refused [20:41:04] PROBLEM - Memcached on srv241 is CRITICAL: Connection refused [20:41:12] PROBLEM - Memcached on srv240 is CRITICAL: Connection refused [20:41:29] New patchset: Pyoungmeister; "setting srv214-218 and srv231-247 to use appalicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33803 [20:41:30] PROBLEM - SSH on srv240 is CRITICAL: Connection refused [20:41:30] PROBLEM - NTP on srv218 is CRITICAL: NTP CRITICAL: No response from NTP server [20:41:57] PROBLEM - Apache HTTP on srv240 is CRITICAL: Connection refused [20:41:57] PROBLEM - Host srv214 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:15] RECOVERY - Host srv242 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [20:42:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33803 [20:44:18] AaronSchulz: done [20:44:48] PROBLEM - NTP on srv216 is CRITICAL: NTP CRITICAL: No response from NTP server [20:45:51] PROBLEM - Memcached on srv242 is CRITICAL: Connection refused [20:46:27] RECOVERY - SSH on srv240 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:27] RECOVERY - SSH on srv241 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv243 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv245 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv244 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv246 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:54] PROBLEM - Apache HTTP on srv242 is CRITICAL: Connection refused [20:47:03] RECOVERY - SSH on srv247 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:47:39] RECOVERY - Host srv214 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [20:47:57] PROBLEM - NTP on srv234 is CRITICAL: NTP CRITICAL: No response from NTP server [20:50:03] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [20:50:03] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [20:51:15] PROBLEM - NTP on srv239 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:15] PROBLEM - NTP on srv233 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:24] PROBLEM - SSH on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:24] PROBLEM - NTP on srv232 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:42] RECOVERY - Apache HTTP on srv215 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.015 seconds [20:51:42] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [20:52:09] PROBLEM - Apache HTTP on srv214 is CRITICAL: Connection refused [20:52:54] RECOVERY - SSH on srv214 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:52:54] PROBLEM - NTP on srv237 is CRITICAL: NTP CRITICAL: No response from NTP server [20:52:58] yes, nagios-wm, everything's critial. 
I get it [20:53:03] PROBLEM - NTP on srv236 is CRITICAL: NTP CRITICAL: Offset unknown [20:53:03] PROBLEM - NTP on srv231 is CRITICAL: NTP CRITICAL: Offset unknown [20:53:03] PROBLEM - NTP on srv235 is CRITICAL: NTP CRITICAL: No response from NTP server [20:53:03] PROBLEM - NTP on srv238 is CRITICAL: NTP CRITICAL: No response from NTP server [20:53:12] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [20:55:18] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [20:58:27] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [20:58:27] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [20:59:39] PROBLEM - NTP on srv245 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:39] PROBLEM - NTP on srv241 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:57] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:00:33] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:00:34] PROBLEM - NTP on srv243 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:09] PROBLEM - NTP on srv246 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:09] PROBLEM - NTP on srv247 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:18] RECOVERY - NTP on srv231 is OK: NTP OK: Offset 0.07543969154 secs [21:01:18] PROBLEM - NTP on srv240 is CRITICAL: NTP CRITICAL: Offset unknown [21:03:06] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:36] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [21:06:15] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK HTTP/1.1 200 OK - 456 bytes in 0.003 seconds [21:06:15] PROBLEM - NTP on srv242 is CRITICAL: NTP CRITICAL: No response from NTP server [21:06:38] New patchset: Pyoungmeister; "pulling es1 for converstion to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33811 [21:06:51] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:07:45] RECOVERY - NTP on srv245 is OK: NTP OK: Offset -0.09133017063 secs [21:08:03] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [21:08:48] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [21:08:56] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33811 [21:09:41] !log py synchronized wmf-config/db.php 'pulling es1' [21:09:48] Logged the message, Master [21:10:09] RECOVERY - Apache HTTP on srv218 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:11:12] PROBLEM - NTP on srv214 is CRITICAL: NTP CRITICAL: Offset unknown [21:13:00] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [21:14:03] RECOVERY - NTP on srv236 is OK: NTP OK: Offset -0.03819358349 secs [21:14:31] RECOVERY - NTP on srv240 is OK: NTP OK: Offset -0.04238283634 secs [21:14:31] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [21:14:39] RECOVERY - NTP on srv242 is OK: NTP OK: Offset -0.01423549652 secs [21:15:42] RECOVERY - NTP on srv215 is OK: NTP OK: Offset -0.04974234104 secs [21:16:00] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:16:18] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK HTTP/1.1 200 OK - 454 
bytes in 0.011 seconds [21:18:18] New patchset: Hashar; "(bug 41183) move beta logs out of /home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33813 [21:18:56] LeslieCarr: stooooop lunching :-D [21:19:18] RECOVERY - NTP on srv214 is OK: NTP OK: Offset -0.03077101707 secs [21:20:57] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [21:22:00] RECOVERY - NTP on srv237 is OK: NTP OK: Offset -0.03861808777 secs [21:23:12] RECOVERY - NTP on srv241 is OK: NTP OK: Offset -0.03002393246 secs [21:23:57] RECOVERY - NTP on srv239 is OK: NTP OK: Offset -0.0932238102 secs [21:24:06] RECOVERY - NTP on srv216 is OK: NTP OK: Offset -0.04292368889 secs [21:27:24] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [21:27:33] RECOVERY - NTP on srv232 is OK: NTP OK: Offset -0.03715598583 secs [21:28:09] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [21:29:03] RECOVERY - NTP on srv234 is OK: NTP OK: Offset -0.07721054554 secs [21:29:57] PROBLEM - mysqld processes on es1 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:30:33] RECOVERY - NTP on srv238 is OK: NTP OK: Offset -0.03103542328 secs [21:30:53] !log starting innobackupex from es1004 to es1 [21:31:00] Logged the message, notpeter [21:32:21] RECOVERY - NTP on srv246 is OK: NTP OK: Offset -0.02613461018 secs [21:33:51] RECOVERY - NTP on srv218 is OK: NTP OK: Offset -0.04030764103 secs [21:35:21] RECOVERY - NTP on srv233 is OK: NTP OK: Offset -0.03011929989 secs [21:39:51] RECOVERY - NTP on srv243 is OK: NTP OK: Offset -0.03958690166 secs [21:40:45] RECOVERY - NTP on srv247 is OK: NTP OK: Offset -0.03112268448 secs [21:51:50] RECOVERY - NTP on srv235 is OK: NTP OK: Offset -0.03366339207 secs [21:57:30] hashar: what's up [21:57:45] other than delicious mexican food [21:57:50] which is down [21:57:52] in my belly [21:58:13] \O/ [21:58:29] so I got a tiny change for your rt duty :-] https://gerrit.wikimedia.org/r/#/c/33813/ [21:58:37] which is for the 'beta' project on labs. [21:59:00] we write udp2log files under /home which is a shared NFS instance of 18GB shared among ALL instances [21:59:13] whenever beta start logging a ton of stuff, that block any projects :-] [21:59:32] Mmm mexican [21:59:39] so that patch is all about writing udp2log logs to /data/project which is per project :-] [22:00:02] hashar: You should totally ship that to logstash and work with the A guy on making it awesome :D [22:00:03] * hashar gets a glass of wine [22:00:13] Damianz: logstash ? [22:00:19] * Damianz gets a bottle of wine to go with his chicken dinner [22:00:27] http://logstash.net/ < [22:00:27] ohhh [22:00:32] yet another logging tool [22:00:34] so whmmhm [22:00:35] We have an instance in labs with instances syslogs going to it [22:00:41] \O/ [22:01:16] I think someone is working on getting rid of udp2log [22:03:36] I am afraid LeslieCarr has been scared by my puppet change is now running around in the office [22:03:51] oh, i wasn't looking at irc without the pinging [22:03:52] (screaming) *** noooo another hasharr's change nooo *** [22:04:07] argh [22:04:12] LeslieCarr: i keep forgetting to ping people sorry :( [22:04:13] hashar: so it's not referenced anywhere [22:04:31] hop cause that is on labs. 
We apply the class by using OpenStackManager on labsconsole.wmflabs.og [22:04:36] ok cool [22:04:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33813 [22:04:43] so that should be safe as far as production is concerned [22:04:45] \O/ [22:04:46] done [22:04:49] I luuuve role classes [22:04:50] hashar: re: udp2log. yes, it may get replaced, but it's going to be a few months before we can stop worrying about it (and may end up keeping it in the pipeline) [22:04:54] so easy to get them merged in [22:05:09] Damianz: what robla said ^^^^^ [22:05:28] robla: ahh good to know :-] I guess it is not that much high priority since udp2log has been a success over the last few years [22:05:37] (plus we know the developer who is an awesome guy) [22:05:43] LeslieCarr's talk of delicious mexican food makes me wonder if there's a delicious mexican option that he wasn't aware of. [22:05:59] s/he/I/ [22:06:00] robla: tropisueno [22:06:36] ah, that one. I've never done that for lunch [22:07:02] robla: about the report for our CI sprint I am afraid we are a bit late. Been to busy but we scheduled some time with zelko to write the report down on monday [22:14:08] LeslieCarr: worked like a charm. Thanks a ton !!! [22:17:42] !log catrope synchronized php-1.21wmf3/resources/jquery/jquery.tablesorter.js 'Deploying 2405956a8d6f305f6d3f9b0a1bda4deb7e917ec8' [22:17:50] Logged the message, Master [22:18:00] !log repooling srv214-srv218 [22:18:06] Logged the message, notpeter [22:18:12] notpeter: FYI: [22:18:14] srv234: rsync: change_dir#3 "/apache/common-local/php-1.21wmf3/resources" failed: No such file or directory (2) [22:18:16] srv234: rsync error: errors selecting input/output files, dirs (code 3) at main.c(643) [Receiver=3.0.9 [22:18:17] ditto srv235 [22:18:19] yeah [22:18:27] they're getting their first sync right now [22:18:38] but thanks for the heads up! [22:18:43] I appreciate more eyes [22:18:46] Alright [22:18:51] !log catrope synchronized php-1.21wmf4/resources/jquery/jquery.tablesorter.js 'Deploying 9cb43905cd7189aa9a42f942f24a7aea50b4f0e9' [22:18:53] I just stumbled upon it because I was deploying a bug fix [22:18:57] Logged the message, Master [22:19:13] when upgrading several hundred servers, the posibility of me missing sometihng is non-trivial [22:19:43] Yeah [22:20:24] Although you've upgraded more servers and in larger batches than anyone ever did around here, and thanks to better automation it has never led to out-of-sync servers AFAIK [22:20:25] New patchset: Reedy; "Disable validation statistics maintenance report update" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33824 [22:20:44] We used to have that problem from time to time before we automated this properly [22:20:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33824 [22:21:48] !log reedy synchronized wmf-config/InitialiseSettings.php [22:21:55] Logged the message, Master [22:25:43] RoanKattouw: yeah, it's hard to automate fully. 
I'll be more able to do that with newer servers [22:26:08] I think that I have pretty much cobbled together a fully automated, extremely abusive shellscript for this [22:26:10] New patchset: Reedy; "Revert "Disable validation statistics maintenance report update"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33826 [22:26:21] Well for a while we've had the thing where puppet makes sure that a server is in sync before it starts Apache [22:26:33] That works /really/ well to eliminate most occurrences of out-of-sync servers [22:26:36] yeah [22:27:01] Before we had that, Rob would catch heat every time he revived a bunch of dead Apaches [22:27:23] Because of course if you're doing a batch of >10 of them, one of them is bound to get screwed up in some way [22:27:32] yeah [22:29:10] And no one had time to revive Apaches in short order, so the dead ones would pile up and get revived in a batch every now and then. Which meant that the impact was larger because they were more out of date. The really fun one was the first revival round after the Monobook->Vector switch, when some revived boxes served the wrong skin using a different MW release [22:29:32] ugh [22:29:36] Yeah [22:29:42] Our users actually noticed that time :) [22:29:49] Other times the bugs would be more subtle [22:31:15] yeah, I've caused some of those :) [22:31:50] which has lead me to get this to a semi-automated point that involves a lot of overkill (multiple syncs... cuz why not!) [22:32:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33826 [22:32:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30062 [22:32:42] notpeter: I am Roan Kattouw and I support your paranoia [22:32:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [22:32:59] heheheheh [22:33:08] Yes notpeter, the wikipedians really ARE after you [22:33:19] firing off shell scripts is easy :) [22:33:23] oh, I know.... [22:33:30] they're after my data [22:37:11] * hashar waves, have a nice weekend everyone [22:40:17] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:56] hashar: have a wonderful weekend! [22:41:12] notpeter: will surely have one :) [22:43:53] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [22:43:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:44:58] !log repooling srv231-srv247 [22:45:04] Logged the message, notpeter [22:46:08] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [22:56:18] Ryan_Lane: I don't know what's up re HTTPS for logged-in users. csteipp: do you? [22:56:24] (Re wikitext-l post of 5 mins ago) [22:56:27] no? [22:56:51] we needed a mechanism to redirect people back to https when they hit http links [22:56:55] Ah, yeah.. 
I'll answer on list [22:56:57] I haven't been involved in recent HTTPS efforts [22:56:59] great :) [22:57:02] Yeah the whole insecure cookie thing [22:57:11] RoanKattouw: me either, since I've been waiting on mediawiki changes :) [22:57:13] I mean I know what technology is required, I just don't know what its current status is [22:57:40] And I would be more involved, except, you know, VE release and school and all that [22:58:33] !log maxsem synchronized php-1.21wmf4/extensions/TimedMediaHandler 'https://gerrit.wikimedia.org/r/#/c/33819/' [22:58:41] Logged the message, Master [22:59:16] RoanKattouw: you don't sound too busy [22:59:25] ;) [22:59:45] * Damianz thinks Roan is under 30 being worked on items and can take more [23:00:00] !log maxsem synchronized php-1.21wmf3/extensions/TimedMediaHandler 'https://gerrit.wikimedia.org/r/#/c/33819/' [23:00:08] Logged the message, Master [23:12:59] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:41] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [23:37:40] New patchset: Reedy; "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33845 [23:44:26] New patchset: Reedy; "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33845 [23:45:51] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:42] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
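
The "pulling es1 for converstion to innodb" change (gerrit 33811, merged at 21:08 and synced to wmf-config/db.php at 21:09 above) is the depooling step before the host is rebuilt: the box is dropped from the external storage load arrays so nothing reads from it while mysqld is stopped and the innobackupex copy from es1004 runs (21:30). Below is a minimal sketch of what that kind of db.php edit looks like, assuming an LBFactoryMulti-style $wgLBFactoryConf; the cluster name, host list and load weights are illustrative only, not the actual production values:

    <?php
    // Hypothetical excerpt from wmf-config/db.php (LBFactoryMulti-style config).
    // Cluster name, hostnames and load weights are illustrative only.
    $wgLBFactoryConf['externalLoads'] = array(
        'cluster1' => array(
            // 'es1' => 1,   // depooled: host is being rebuilt with InnoDB
            'es2' => 1,
            'es3' => 1,
            'es4' => 1,
        ),
    );

Once the copy finishes and mysqld is back up on the rebuilt host, the reverse edit repools it.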
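On the 22:56–22:58 HTTPS thread ("we needed a mechanism to redirect people back to https when they hit http links", plus the insecure-cookie concern): the general idea is to set a marker cookie that is deliberately not flagged Secure when someone logs in over HTTPS, so that later plain-HTTP hits can see it and bounce the browser back to HTTPS, while the real session cookies stay Secure. The snippet below is only a generic sketch of that idea in plain PHP — it is not MediaWiki's implementation (which was still pending per the discussion), and the cookie name is made up:

    <?php
    // Generic sketch of "redirect logged-in users back to HTTPS"; not MediaWiki code.
    // The marker cookie is intentionally not Secure, so the browser also sends it
    // over plain HTTP; actual session cookies should remain Secure (and HttpOnly).

    const FORCE_HTTPS_COOKIE = 'prefersHTTPS'; // illustrative name

    function markBrowserAsHttps() {
        // Call on successful login over HTTPS. The only requirement for the
        // marker is that it is NOT Secure, so it is visible on HTTP requests.
        setcookie( FORCE_HTTPS_COOKIE, '1', time() + 30 * 86400, '/', '', false, false );
    }

    function redirectToHttpsIfMarked() {
        $isHttps = isset( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] !== 'off';
        if ( !$isHttps && isset( $_COOKIE[FORCE_HTTPS_COOKIE] ) ) {
            // 302, not 301: the preference should not outlive the cookie.
            header( 'Location: https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], true, 302 );
            exit;
        }
    }

Wiring this into a real login flow (and clearing the marker on logout) is left out; the point is just the redirect-back mechanism being discussed.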
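For the "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" patchsets (gerrit 33824 and its revert 33826 earlier in the log, then 33845 at 23:37): MediaWiki's updateSpecialPages.php maintenance script walks the $wgSpecialPageCacheUpdates array and runs each registered callback, so removing the ValidationStatistics entry stops that report from being regenerated on the cron run. One plausible shape for such a config change, purely as a sketch — it assumes the entry is keyed 'ValidationStatistics' as in the commit message, and the real patch may be structured differently:

    <?php
    // Sketch only: stop updateSpecialPages.php from regenerating the
    // ValidationStatistics report. Assumes the FlaggedRevs entry is keyed
    // 'ValidationStatistics'; the actual wmf-config change may differ.
    if ( isset( $wgSpecialPageCacheUpdates['ValidationStatistics'] ) ) {
        unset( $wgSpecialPageCacheUpdates['ValidationStatistics'] );
    }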