[00:02:53] New patchset: Cmjohnson; " Changing mac address for ms-be7 to reflect new server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33673 [00:05:35] lesliecarr: can you +2 my change please [00:07:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33673 [00:08:14] cmjohnson1: yes [00:08:15] oh asher got it [00:08:39] cool..lesliecarr: i will have one more fix for you in abit [00:10:30] New patchset: Cmjohnson; "Adding wtp1 to netboot config to use lvm.cfg partman recipe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33675 [00:11:01] lesliecarr ^ [00:12:30] binasher: do you know ldap well ? [00:12:51] LeslieCarr: not really :( is it openldap or opendj? [00:12:54] LeslieCarr: please don't swear! [00:13:02] oh opendj [00:16:23] binasher: could you +2 that change for me plz [00:16:47] binasher: did robh tty about the sandisk for labsdb1001 and 1002 in eqiad? [00:20:15] cmjohnson1: nope [00:20:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33675 [00:21:01] !log tstarling synchronized php-1.21wmf4/extensions/UniversalLanguageSelector [00:21:07] Logged the message, Master [00:21:24] ok..he only has the intel 710's there...i have enough for the 2 db's there. Do you want me to send him the sand disk (binasher) [00:21:32] sandisk [00:21:33] I know LDAP well [00:21:37] not sand disk [00:21:38] I know very little of OpenDJ [00:22:16] cmjohnson1: do you have enough even after labsdb3? [00:22:39] yes...and enough for a couple of spare for both sites [00:26:50] New patchset: Tim Starling; "Fix small.dblist nonexistent DB names" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33680 [00:27:18] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33680 [00:27:27] New patchset: Tim Starling; "Disable ULS toolbar for anons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [00:28:07] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33534 [00:28:15] New patchset: Reedy; "Expose s4-s7 dblists and small.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33681 [00:28:44] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [00:28:47] !log tstarling synchronized small.dblist [00:28:53] Logged the message, Master [00:28:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33681 [00:29:08] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:29:14] Logged the message, Master [00:30:44] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:30:51] Logged the message, Master [00:34:08] New patchset: Tim Starling; "Set $wgULSEnableAnon=false again" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33682 [00:34:36] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33682 [00:35:04] !log tstarling synchronized wmf-config/CommonSettings.php [00:35:11] Logged the message, Master [00:35:22] !log tstarling synchronized wmf-config/InitialiseSettings.php [00:35:29] Logged the message, Master [00:36:32] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:11] PROBLEM - Squid on brewster is CRITICAL: Connection refused [00:45:02] New patchset: Cmjohnson; " Removing wtp1 from the lvm.cfg partman recipe" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/33686 [00:45:52] binashser: please +2 this change...it didn't work..just going to do a manual partition..thx [00:47:19] New patchset: Reedy; "Kill off old unused dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33687 [00:49:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33687 [00:52:08] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33686 [00:52:41] thanks paravoid [01:02:05] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [01:05:47] TimStarling: Any idea about "Could not open input file: MWScript.php" in the log/norotate/updateSpecialPages.log cronlogs? [01:06:23] Running the same script on hume as apache works fine [01:07:25] AaronSchulz: Function: FlaggedRevsStats::getEditReviewTimes Error: 2013 Lost connection to MySQL server during query (10.0.6.21) [01:08:09] I guess that at the top of the file has something to do with it... [01:08:15] The file seems ot just start at enwikiquote [01:08:27] strange [01:08:49] so it was run from some other server? [01:09:04] that would explain it, if it was run from somewhere without NFS mounted [01:09:24] The cron entries aren't handled by puppet [01:09:33] It looks like it was just put there manually on hume [01:09:53] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:20] !log aaron synchronized php-1.21wmf4/thumb.php 'debug logging' [01:10:29] Though, how could it write to a log file n NFS if NFS isn't mounted? [01:10:40] the plot thickens [01:11:54] wtf [01:12:31] * AaronSchulz looks up $_REQUEST again [01:12:37] It also seemingly "just broke" [01:12:51] Logged the message, Master [01:13:17] * AaronSchulz likes how $params in thumb.php has cookie info and stuff [01:15:44] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [01:17:55] !log aaron synchronized php-1.21wmf4/thumb.php [01:18:03] Logged the message, Master [01:32:04] !log aaron synchronized php-1.21wmf4/thumb.php [01:32:11] Logged the message, Master [01:32:34] binasher: have you ever had [01:32:40] An error has been detected while trying to use the specified Ubuntu │ [01:32:40] │ archive mirror. [01:33:13] cmjohnson1: hmm, it might mean that something is down on brewster [01:33:42] yeah, squid was down on it.. i just started it back up [01:33:47] i wonder what happened to it [01:33:49] !log aaron synchronized php-1.21wmf4/thumb.php [01:33:56] Logged the message, Master [01:33:58] ah, it died again [01:34:04] that's odd [01:34:07] oh, / is full [01:34:47] again? 
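What took squid down on brewster here was simply a full root filesystem. A minimal triage sketch for that situation (the service name and the exact commands are assumptions; the log only records that / was full and squid was started again):

    # see how full / is and what is eating it
    df -h /
    du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
    # squid cannot write its cache/logs on a full disk; once space is freed, bring it back
    sudo service squid start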
[01:34:53] !log aaron synchronized php-1.21wmf4/thumb.php 'more debugging' [01:34:59] Logged the message, Master [01:35:14] PROBLEM - Host wtp1 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:39] can we migrate brewster to one of the new misc servers we have in Tampa (r320's) [01:36:53] RECOVERY - Squid on brewster is OK: TCP OK - 0.003 second response time on port 8080 [01:37:28] cmjohnson1: that's probably a good idea, especially if we can put extra disk in one [01:37:29] sh [01:37:35] should be ok for now [01:38:19] great thx [01:39:36] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 240 seconds [01:40:11] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 302 seconds [01:42:54] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:46:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:57] RECOVERY - Host wtp1 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [01:49:37] !log aaron synchronized php-1.21wmf4/thumb.php 'remove hacks' [01:49:44] Logged the message, Master [01:49:54] New patchset: Reedy; "Add s4-s7 to comment of output cluster lists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33692 [01:50:05] New patchset: Asher; "prevent build-new from blowing away all incremental status files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33693 [01:50:28] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33693 [01:50:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33692 [01:58:20] RECOVERY - SSH on wtp1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:59:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 245 seconds [02:01:12] binasher: just read the last line of your email :) [02:02:02] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 367 seconds [02:05:36] TimStarling: odd, if I request http://commons.wikimedia.org/w/thumb.php?f=Aqueduc_Luynes.jpg&w=400 enough times it fails saying no width is set [02:06:14] and the exception entry just has "/w/thumb.php?f=Aqueduc_Luynes.jpg"...it's like the &w=400 just randomly disappears sometimes [02:07:12] it doesn't correlate with particular scalars [02:21:41] RECOVERY - NTP on wtp1 is OK: NTP OK: Offset -0.02122247219 secs [02:25:23] binasher: around? [02:25:57] pgehres: slightly, what's up? [02:26:25] our pmtpa fundraising slave has some unusual CPU activity the past hour [02:26:39] was wondering if you had a sec to poke db78 and see if I should page Jeff [02:27:27] pgehres: i don't have access to the fr dbs [02:27:41] even as a root? :-( [02:27:53] i'm not an fr root [02:27:57] i have no fr access [02:28:26] no ssh and don't know the mysql pw's [02:28:30] ah, i didn't know Jeff had removed wmf roots from them [02:28:52] !log LocalisationUpdate completed (1.21wmf4) at Fri Nov 16 02:28:52 UTC 2012 [02:28:56] i can't say i mind ;) [02:28:59] Logged the message, Master [02:29:03] i can imagine [02:29:15] dyk if anyone other than Jeff has access? [02:29:30] pgehres: paravoid has ssh. idk about mysql passwd [02:29:38] but he's surely asleep ;) [02:29:52] i'd settle for a top right now to figure out what its doing [02:30:10] it's not too late for Jeff, so I will poke him [02:30:16] thanks [02:30:32] yeah, it's fine in Jeff_Green's TZ ;) (that's my TZ too!) 
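For the intermittent thumb.php failure Aaron describes above (the &w=400 parameter apparently vanishing on some requests), a crude reproduction loop is enough to measure how often it happens; this is only a sketch, not the instrumentation that was actually used:

    # hit the scaler entry point repeatedly and tally the response codes
    for i in $(seq 1 50); do
        curl -s -o /dev/null -w '%{http_code}\n' \
            'http://commons.wikimedia.org/w/thumb.php?f=Aqueduc_Luynes.jpg&w=400'
    done | sort | uniq -c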
[02:30:52] indeed, but he left like 3H ago to deal with offspring [02:31:26] PROBLEM - Puppet freshness on mw60 is CRITICAL: Puppet has not run in the last 10 hours [02:31:31] right [02:32:09] whut [02:32:29] PROBLEM - Puppet freshness on mw61 is CRITICAL: Puppet has not run in the last 10 hours [02:33:06] Jeff_Green: run top on db78 please ;) [02:33:16] pgehres [02:33:17] it's the nightly mysqldump run [02:33:35] that's an impressively long dump [02:34:05] you should have seen it at the lunch buffet earlier [02:34:13] that's what you get for storing raw web data in a db :-P [02:34:27] * pgehres glares to the left [02:35:14] * binasher is doubly grateful for not having fr db access  [02:35:17] sorry for the false alarm then, i am edgy today [02:35:18] 348G faulkner [02:35:23] 113G pgehres [02:35:29] it's a race! [02:36:01] funny how a well designed table can cause that number to be so much lower [02:36:11] * pgehres can't wait until analytics takes over this crap [02:36:19] ha yes [02:36:46] Jeff_Green: i would be happy to archive my raw table to disk and truncate the table [02:37:08] !log Forcing puppet run on wtp1 to reinstall Parsoid [02:37:16] Logged the message, Mr. Obvious [02:37:16] go puppet go! [02:37:35] pgehres: i'm only concerned in terms of the the living db footprint, which is rather large [02:37:43] indeed [02:37:53] dumped the faulknerdb is only 14G [02:38:04] gzip for the win [02:38:07] yup [02:38:14] you could do myisam with table compression [02:38:16] i suppose [02:38:25] but eh. this is the least of our concerns really [02:38:28] he tried myisam last year [02:38:37] * binasher is never never touching that db [02:38:49] binasher: don't be a snob :-P [02:39:07] Jeff_Green: do we have a public list somewhere of who has what access to fr? [02:39:10] that's like saying binasher: stop being binasher [02:39:12] * pgehres marks binasher on the list of the first to go [02:39:34] myisam with table compression is actually pretty good for this purpose, write once (i.e. daily), compress, concatenate, laugh if the table crashes. [02:39:47] jeremyb: nope [02:39:55] top secret! [02:39:56] Jeff_Green: you mean page compression, right? [02:40:04] Jeff_Green: srsly? [02:40:05] no i mean table compression [02:40:16] i wonder what that is [02:40:24] entire table, compressed [02:40:36] it's an actually-useful myisam feature [02:40:52] uhuh [02:40:54] ;P [02:41:29] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [02:41:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:41:53] srsly. i used it in CL's webstats. ran good. scaled to 100x what I ever imagined it would do [02:43:02] jeremyb: re. the list, it's not that it's top secret. but is there such a list anywhere for any system around here? [02:43:33] Jeff_Green: admins.pp [02:43:56] Jeff_Green: ewwww, CL is myisam??? [02:44:05] no, not CL [02:44:40] Jeff_Green: do we dump off of db1013? [02:44:45] Jeff_Green: (also, of course site.pp iirc) [02:44:47] or just 1025 and 78? [02:44:52] CL's web stats uses myisam for a specific type of table [02:45:33] jeremyb: admins.pp has a block for fundraising [02:45:38] ah [02:45:51] that will tell you who has shell access to fundraising machines which aren't in the new frack payments cluster [02:46:09] Jeff_Green: and if they *are* in frack? [02:46:21] there's a separate puppet instance which is actually top secret. 
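A sketch of the MyISAM table-compression workflow Jeff_Green describes (write a day's table once, pack it, keep it read-only); the database path and table name are invented for the example:

    # pack a finished, no-longer-written MyISAM table in place
    # (with the table flushed so mysqld is not writing to it)
    myisampack /var/lib/mysql/frlogs/impressions_20121115.MYI
    # rebuild the indexes so MySQL can read the packed table efficiently
    myisamchk -rq /var/lib/mysql/frlogs/impressions_20121115.MYI
    # the table is read-only from here on; if it ever crashes, restore it from the archived copy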
[02:46:41] * jeremyb didn't realize there were any fr machines outside of frack [02:46:54] heh, frack is new and shiny [02:46:59] Jeff_Green: and so e.g. faidon has access to all of the above? [02:47:01] and only in eqiad [02:47:06] jeremyb: yes [02:47:20] ok [02:48:07] right now, it's essentially me, faidon, leslie, and mark as the ops role [02:48:14] RECOVERY - Puppet freshness on mw60 is OK: puppet ran at Fri Nov 16 02:47:51 UTC 2012 [02:48:32] (err sorry for binging everyone's IRC handles there) [02:48:37] !log LocalisationUpdate completed (1.21wmf3) at Fri Nov 16 02:48:37 UTC 2012 [02:48:44] Logged the message, Master [02:48:48] oh and daniel [02:48:59] ok, so leslie and mar k and muntant [02:49:20] err, s/n// [02:52:46] RECOVERY - Puppet freshness on mw61 is OK: puppet ran at Fri Nov 16 02:52:28 UTC 2012 [03:31:34] gerrit down for anyone else? [03:32:31] which port? [03:32:44] 443 [03:33:00] K4-71_3 commited a change but we can't review it [03:33:09] i suppose there is the api [03:33:12] I was about to say: "Me too". [03:33:16] apache's hanging [03:33:18] but that sounds like a lot of work [03:33:42] i guess you're saying ssh is fine then? [03:33:57] anyway, yes something needs fixing [03:34:14] why isn't nagios complaining? [03:42:11] Gerrit WFM [03:48:05] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:58:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [04:12:20] Change abandoned: Tim Starling; "Superseded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33318 [04:21:32] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [04:54:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [05:03:23] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:10:26] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [05:27:58] TimStarling: http://meta.wikimedia.org/?diff=4568837&oldid=4566973&rcid=3697528 [05:28:08] I'm pretty sure that that's nonsense [05:28:52] <^demon> "server hard drives that are destroyed due to angry politicians" [05:28:56] <^demon> What was your first clue? [05:28:57] ^ [05:29:01] (exactly) [05:29:25] the fact that he/she calls our infrastructure fragile [05:29:44] close duplicate [05:31:09] it's not exactly the first time somebody has said wikipedia should be moved to some distributed system like freenet or bittorrent or whatever [05:31:27] <^demon> Also, "A simple example is political pages on wikipedia which are often manipulated by competing parties." [05:31:37] <^demon> I'm not entirely sure what that has to do with infrastructure. [05:32:27] <^demon> TimStarling: Indeed. Moving us on to a P2P network and/or backing revisions in Git seem to be a perenial proposal. [05:34:10] <^demon> Even more fun, combine both proposals into one: http://lists.wikimedia.org/pipermail/wikitech-l/2008-December/040516.html [05:36:29] I think I have found an earlier one... [05:37:02] http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/833 [05:37:05] perennial * [05:37:34] "Nowadays distributed software solutions are the height of fashion. Why not devise a distributed Wikipedia ? Programmers ?" [05:37:46] from August 2001 [05:38:16] > Wikipedia is a great idea combined with a new, revolutionary software [05:38:19] and it has a lot of brilliant committed authors. Her growth is [05:38:22] explosive. 
But there are also weaknesses (Wikinesses ?) brought into [05:38:25] light be some of us. [05:39:51] * Jasper_Deng_busy considers whether to reply to what he just linked [05:40:10] do I really need to spend my time telling how wrong that person is on a technical level? [05:40:34] I thought it was robo-spam. [05:40:37] you can probably find an FAQ entry somewhere [05:45:52] <^demon> Brooke: I didn't have a squiggly underline, so I didn't bother seeing if it was incorrect ;-) [05:48:33] March 2002: http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/2091 [06:13:43] lol: http://article.gmane.org/gmane.science.linguistics.wikipedia.misc/18418 [06:14:15] there you go Jasper_Deng_busy, I guess I answered this myself back in 2004 [06:15:01] TimStarling: I would also think that the OP (in what I linked) is wrong b/c we have 3 datacenters, across 2 countries [06:16:28] no replication of images though [06:17:08] and our plans are to set up image replication to another datacentre in the US, not to another country [06:17:20] so even that plan wouldn't make our paranoid friend happy [06:17:49] the "lack of innovation" charge is also quite spurious [06:18:32] well, it is innovation narrowly defined [06:19:03] he's saying that there is a lack of innovation, where innovation is defined as distributed computing [06:20:38] TimStarling: on this note, doesn't Google manage to replicate between such large numbers of datacenters? [06:21:46] probably [06:22:16] they probably use a secret mechanism, but is cost the only barrier to the WMF doing so? [06:22:24] (although I personally see no need for it) [06:22:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:22:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:22:44] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:23:22] so we spent about a year implementing Swift support, because (among other things) it had a replication feature [06:23:42] images are now stored into swift (as well as NFS) [06:24:23] but when someone actually tried to turn on replication of images to eqiad, he discovered that the feature was actually half-written and completely inappropriate for our needs [06:24:29] it was about 200 lines of code [06:25:13] but, luckily, the world moves on, and it turns out that ceph, previously a research project, has had a lot of funding in the last few years and is now looking like a serious alternative to swift [06:25:38] it has a swift-compatible API (the few accidental incompatibilities have been fixed in the client by Aaron already) [06:25:53] and it appears to have a working replication feature [06:26:24] so maybe some time in the next year or so, we will have image replication [06:26:27] But nothing as crazy as described by that person, right? 
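On "swift-compatible API": the client side is plain HTTP, so the same requests work whether Swift itself or Ceph's radosgw is answering them, which is what makes the swap thinkable. A minimal sketch with made-up endpoint, credentials and container name:

    # v1-style auth: trade an account/user and key for a token and storage URL
    curl -si -H 'X-Auth-User: mw:thumb' -H 'X-Auth-Key: secret' \
        https://ms-fe.example.wmnet/auth/v1.0 | grep -i 'x-auth-token\|x-storage-url'
    # then list a container with the returned token
    curl -s -H "X-Auth-Token: $TOKEN" "$STORAGE_URL/wikipedia-commons-local-thumb"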
[06:28:32] well, that person is paranoid and exaggerates the risks [06:29:08] if some politician orders all WMF hard drives destroyed, then there is a copy in the Internet Archive of most stuff [06:29:21] IA is distributed and has content outside the US [06:30:07] maybe the politician will also order missile strikes on foreign IA installations in order to destroy Wikimedia more completely [06:30:29] by then, everyone will probably download our DB [06:30:51] right, well there are plenty of copies of the article text around, it would be hard to destroy all of those [06:31:15] less copies of the images, for those you more or less rely on the initial contributors keeping copies of their own work [06:31:28] then maybe they can be contacted to reconstruct the project as it once was [06:31:42] also, with missile strikes on amsterdam etc., the user database would be lost [06:31:54] so you would have to reconstruct that somehow [06:32:38] probably the result could be described as a "broken shell" as the commenter puts it [06:32:58] but the chance of it actually happening seems fairly low to me [06:33:05] and of no immediate concern [06:33:07] hrmmmmm, can you name your source please? [06:33:09] > According to a credible source, Dr. Stallman is our love slave. [06:37:01] yeah, well he was very positive about the whole idea of wikipedia [06:38:45] this person is also concerned about things like DNS poisoning and IP blocking, apparently [06:39:13] the WMF should fund the freedombox obviously [06:39:14] which of course is reasonably common [06:40:38] I just wonder whether he really knows what he's talking about [06:40:47] (the person whose comment I linked to) [06:41:00] no, not really [06:41:17] take china for example [06:41:32] they block us routinely, with various means [06:41:43] they control access to certain pages and search terms [06:42:05] but they don't really care that much about controlling circumvention techniques [06:42:30] as far as they are concerned, as long as 99% of the population is kept in the dark, they've won the battle [06:42:41] talk about the 1% [06:43:02] they have realistic expectations [06:43:14] they don't expect to be able to control the thoughts of every would-be activist [06:43:44] they expect to be able to shut down the group that the activist forms, if and when it becomes a threat to the party [06:44:15] gtg [06:47:45] I'm still impressed that we manage to make do with so little [06:48:05] hit rate! 
[06:48:22] and domas and co [07:40:43] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:16:47] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:00] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [10:22:29] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:30:19] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [12:07:47] New patchset: ArielGlenn; "ms-be7 reconfigured with ssds sdm and sdn; try fixing up sdmn/sdn3 also" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33711 [12:32:07] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [12:35:52] It would look like bits is periodically not serving css/js to a large number of people [12:35:59] Not sure if it's got anything to do with the 12.04 upgrades [12:37:21] !log Running sync common on srv252, lots of Failed opening required fatals... [12:37:28] Logged the message, Master [12:40:00] New patchset: Nemo bis; "(bug 15434) Periodical run of currently disabled special pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33713 [12:41:44] Reedy: there is a longbackread in this channel about that [12:41:49] lol [12:42:04] drwxr-xr-x 17 mwdeploy mwdeploy 4096 Oct 9 23:29 php-1.20wmf12 [12:42:04] drwxr-xr-x 17 mwdeploy mwdeploy 4096 Oct 18 19:43 php-1.21wmf1 [12:42:04] drwx------ 17 mwdeploy mwdeploy 4096 Nov 2 18:38 php-1.21wmf2 [12:42:04] drwx------ 17 mwdeploy mwdeploy 4096 Nov 12 15:26 php-1.21wmf3 [12:42:05] drwx------ 17 mwdeploy mwdeploy 4096 Nov 16 02:13 php-1.21wmf4 [12:42:16] oh seriously, I thought the perms stuff got fixed [12:42:17] Has anyone tried to fix the directory permissions? [12:42:24] ah [12:42:24] they said that the configs weren't being read because of that [12:42:29] yeah [12:42:32] nothing is [12:42:36] (I was not here at the time) [12:42:38] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [12:42:38] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:42:41] er what hosts? [12:42:42] drwx------ 5 mwdeploy mwdeploy 4096 Nov 16 00:34 wmf-config [12:42:53] Just srv252 that I can see in the logs at the moment [12:43:10] only srv252 in the last 1000 log lines [12:44:52] dirs done [12:45:47] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:45:47] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:46:10] lemme know if something like that is still showing up [12:46:23] PROBLEM - check_gcsip on payments1002 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1001 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1003 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:46:23] PROBLEM - check_gcsip on payments1004 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [12:49:06] apergos: Can you do wmf-config on that host too please? 
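The fix being applied to srv252 above is just restoring world-readable bits on deployment directories that ended up mode 0700; in shell terms it amounts to something like this (the exact common-local path on the apache is an assumption, based on the rsync paths quoted later in the log):

    # put group/other read and traversal bits back on the affected trees
    chmod -R g+rX,o+rX /usr/local/apache/common-local/php-1.21wmf{2,3,4}
    chmod -R g+rX,o+rX /usr/local/apache/common-local/wmf-config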
[12:49:54] should have caught that [12:50:13] did tests too for good measure though I think we don't need that [12:50:25] heh [12:50:34] don't see anything else right off [12:50:38] i'll give it a couple of minutes and see if the errors have quietened down [12:50:42] thanks [12:50:53] PROBLEM - check_gcsip on payments1 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments2 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments4 is CRITICAL: Connection timed out [12:50:53] PROBLEM - check_gcsip on payments3 is CRITICAL: Connection timed out [12:51:03] RECOVERY - Apache HTTP on srv252 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [12:51:09] heh lookie there [12:53:54] Change abandoned: ArielGlenn; "will bork the existing partitions this way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33711 [12:55:24] RECOVERY - check_gcsip on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.489 second response time [12:55:24] RECOVERY - check_gcsip on payments4 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.700 second response time [12:55:24] RECOVERY - check_gcsip on payments1003 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.739 second response time [12:55:24] RECOVERY - check_gcsip on payments1004 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.488 second response time [12:55:24] RECOVERY - check_gcsip on payments2 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.706 second response time [12:55:24] RECOVERY - check_gcsip on payments1002 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.508 second response time [12:55:24] RECOVERY - check_gcsip on payments3 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.684 second response time [12:55:25] RECOVERY - check_gcsip on payments1 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.693 second response time [13:01:46] New patchset: ArielGlenn; "ms-be7 added as 720xd with ssds as sdm/n" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33715 [13:04:09] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33715 [13:05:20] Yup, that's clearing up [13:05:32] just not enough errors going on that it disappears from the last 1k linrs ;) [13:07:14] ok [13:07:16] yay [13:28:09] apergos: Good Morn'n [13:28:23] hello [13:28:38] i see you have ms-be7 going...did you change the boot disk in bios? [13:29:03] I did but it's not pxe booting, just hanging [13:29:09] I just went back into the bios to check but [13:29:17] it's changed [13:29:25] i.e. nic first [13:29:32] so why it hangs, I have no idea [13:30:48] that is odd [13:30:59] okay for me to look at it [13:31:01] ? [13:31:17] well I'm just now trying a pxe boot again [13:31:26] so as soon as that fails again, sure [13:33:53] was doublechecking the mac but it looks right [13:34:43] wow well [13:34:54] Nov 16 13:34:11 brewster dhcpd: DHCPDISCOVER from 90:b1:1c:18:bc:65 via 10.0.0.202: network 10.0/16: no free leases [13:35:12] oh wow [13:35:27] NIC.Integrated.1-1-1 Ethernet = 90:B1:1C:18:BC:65 [13:35:32] heh [13:35:46] wanna fix that in puppet and push it around? (or I can, don't care) [13:36:04] the first one in the *list* is 67, but nic 1 is 65 :-) [13:36:34] don't care ...go ahead and get it done [13:36:39] sure thing [13:37:04] i would not have caught that the first time either ...who puts the 3rd nic first? 
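The diagnosis above comes down to comparing the MAC address in brewster's "no free leases" line with the one that had been configured; as a sketch (log and config paths are guesses, and the real host entry is generated from puppet, change 33718):

    # on brewster: which MAC is the new box actually PXE-booting from?
    grep 'DHCPDISCOVER.*no free leases' /var/log/syslog | tail -3
    # and which MAC did we configure for it?
    grep -ri 'ms-be7' /etc/dhcp3/ /etc/dhcp/ 2>/dev/null
    # here the box was sending 90:b1:1c:18:bc:65 (NIC.Integrated.1-1-1) while the entry
    # had been created from the first MAC in the list (ending in :67), hence no matching lease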
[13:37:15] hahaha dunno [13:37:29] will have to check that on the next 10 [13:38:10] New patchset: ArielGlenn; "fix mac for ms-be7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33718 [13:39:07] yeah look for the one that says 1-1 [13:39:21] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33718 [13:39:52] so these aren't ready for first puppet run yet, I'm going to pxe and then I have to get a filesystem tweak working on there [13:40:23] if I can get it working for ms-be7 the rest can do just install outa the box and "just work" (well as soon as they get moved to the right stanzas in netboog.cfg and site.pp) [13:41:40] okay [13:41:55] will you be able to tweak ms-be6...or did you do that already? [13:42:46] it's already done by hand [13:42:54] but we don't want to do that for them all [13:45:20] better [13:46:50] are those other 10 here yet? [13:46:52] er, there [13:47:04] no, they are not here or there? [13:47:09] boo [13:47:15] next wekk I guess? [13:47:20] *week [13:47:34] maybe next week but American Thanksgiving will disrupt taht [13:47:48] oh yeah [13:48:24] well maybe if they arrive early in the week a few can get racked [13:48:27] I will be checking in with my Dell contact to see if he has an update later today...i will send you an email unless i see you active on irc [13:48:37] ok [13:48:44] if I can them Mon or Tues...they will be up before the holiday. [13:48:47] (I will do thebackead here either way) [13:48:54] I am leaving for eqiad next weekend [13:49:02] cool [13:49:11] wait, permanent leaving? [13:49:18] yep [13:49:22] wow congrats [13:49:25] got a place picked out? [13:49:47] i do, went up a few weeks ago for a couple of days [13:50:21] sweeeet [13:50:57] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [13:51:01] wow [13:51:05] uh oh [13:51:11] I didn't clean up certs or anything [13:51:15] where are you gonna live? [13:51:39] in Ashburn...a community called Brambleton...about 5 miles from DC [13:51:51] pics? 
[13:52:10] none that I took [13:53:07] http://www.brambleton.com/ [13:53:27] arrgg [13:53:54] apergos: if you don't clean certs it is a pita [13:54:03] Unable to install GRUB in /dev/sda [13:54:14] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:14] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:14] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: Connection refused by host [13:54:23] it's not supposed to go in /sda [13:54:31] it is supposed to go in /dev/sdm [13:54:41] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: Connection refused by host [13:54:59] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: Connection refused by host [13:54:59] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-object-server on ms-be7 is CRITICAL: Connection refused by host [13:55:08] PROBLEM - swift-container-server on ms-be7 is CRITICAL: Connection refused by host [13:55:34] yeah I know [13:55:35] PROBLEM - swift-object-updater on ms-be7 is CRITICAL: Connection refused by host [13:55:35] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: Connection refused by host [13:55:35] PROBLEM - SSH on ms-be7 is CRITICAL: Connection refused [13:55:44] PROBLEM - swift-account-server on ms-be7 is CRITICAL: Connection refused by host [13:56:04] apergos: in bios or raid bios you need to select the disk that will be the boot disk...not bios boot order [13:56:15] grrr [13:56:21] ok, guess I'd better do that then [13:56:34] you need to select disk 12 the first ssd [13:56:50] right [13:58:12] so mark i will never be more than 15 minutes away...no big commute time for me [13:58:13] now I can go clean up the cert :-P [13:58:23] cool [14:00:46] in raid bios you said? [14:01:15] i believe so [14:06:09] here we go again [14:08:46] it will work [14:08:47] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:19] fyi: just received this " The remaining 10 shipped today. " [14:10:25] yay! [14:10:31] and that means they will arrive on...? [14:12:10] didn't give me a tracking number yet...just requested it...i will let you know but normally 1-2 days max [14:13:11] so maybe even monday! [14:13:40] yep..maybe! that would be great [14:13:47] yeah :-) [14:14:29] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:15:04] c'mon grub install correctly [14:15:27] that's what I'm sayin [14:15:33] nope [14:15:43] * apergos goes back to the raid bios again [14:17:56] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:21:16] it's set correctly afaict [14:22:17] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:53] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [14:27:59] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:28:57] New review: Tarheel95; "What's the status on this?" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/31302 [14:42:36] New patchset: Cmjohnson; "Adding db42 to decommission.pp list to remove from Nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33720 [14:43:50] New patchset: Matthias Mullie; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [14:51:32] PROBLEM - NTP on ms-be7 is CRITICAL: NTP CRITICAL: No response from NTP server [14:51:47] it's still trying to put grub on /dev/sda no matter what I do [14:53:45] hrm...the recipe was right...it worked on be6 [14:54:04] apergos: is it failing install? [14:54:12] yes, it fails [14:54:47] I'm looking at the log now hoping to find something useful [14:55:08] ok form me to poke at it for a few? [14:55:19] lemme finish looking then I'll give it to you [14:55:28] k [14:55:53] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:04:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:06:14] cmjohnson1: getting off, you'll want to powercycle the box [15:06:24] okay [15:06:43] can you check this for me when you get a chance https://gerrit.wikimedia.org/r/33720 [15:06:45] thanks [15:06:55] common.cfg [15:07:01] common.cfg:d-i grub-installer/bootdev string /dev/sda [15:11:12] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:56] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [15:16:04] !log pooling ms-fe1 back with the modified rewrite.py [15:16:12] Logged the message, Master [15:19:35] on ms-be6 grub is installed on /dev/sdm and /dev/sdn as it should be [15:21:59] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [15:24:22] apergos: so it's definitely not h/w or server related. it appears to be using the wrong netboot.cfg [15:25:41] there's only one of those [15:26:13] the partman recipe [15:26:37] it partitions them fine, I see the partitions [15:26:45] md0, using sdm/n as it should [15:29:22] if you go to a shell from the installer [15:29:32] is there a way to tell which config it is using for partitioning? 
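One concrete way to answer that question from inside debian-installer, as a sketch assuming the stock d-i tools on the installer's second console:

    # on tty2 of the running installer
    cat /proc/mdstat                                   # confirm md0 really is built from sdm/sdn
    debconf-get grub-installer/bootdev                 # what the preseed handed to grub-installer
    grep -i 'grub-installer\|bootdev' /var/log/syslog | tail   # what grub-installer then tried to do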
[15:29:57] you can cat /proc/mdstat and see that md0 is using the right things [15:30:00] uhhh [15:30:07] I don't know [15:33:22] !log reedy synchronized php-1.21wmf4/extensions/EducationProgram/ [15:33:28] Logged the message, Master [15:34:19] !log reedy synchronized php-1.21wmf3/extensions/EducationProgram/ [15:34:25] Logged the message, Master [15:35:31] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:55] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [16:23:58] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Fri Nov 16 16:23:28 UTC 2012 [16:24:07] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:24:07] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:24:07] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:41:28] New review: Cmcmahon; "I'd like to have this in place as soon as possible, if it looks right" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/33721 [16:48:32] mw52 is making noise now [16:48:55] or was, 6 minutes ago [16:49:00] !log Running sync-common on mw52 [16:49:06] Logged the message, Master [16:55:40] PROBLEM - check_minfraud_primary on payments1 is CRITICAL: CRITICAL - Cannot make SSL connection [17:00:10] RECOVERY - check_minfraud_primary on payments1 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.223 second response time [17:00:37] New patchset: Hashar; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:02:17] New review: Cmcmahon; "Sounds reasonable to me" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/33721 [17:03:27] apergos: a whole new error http://p.defau.lt/?K1ot_9vgcsB80WaK4qso2g [17:03:42] oh joy [17:04:13] is this with the same disks we had? [17:04:20] yes [17:07:02] New patchset: Hashar; "Redis-setup is unavailable for wmflabs, use memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:09:30] New review: Hashar; "Patchset 3:" [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/33721 [17:13:04] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33721 [17:15:33] !log hashar synchronized wmf-config/CommonSettings.php 'Redis-setup is unavailable for wmflabs, use memcached {{gerrit|33721}}' [17:15:40] Logged the message, Master [17:15:52] New patchset: Reedy; "Update to master" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33735 [17:16:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33735 [17:28:49] New patchset: Hashar; "beta: disable OnlineStatusBar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33738 [17:29:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33738 [17:40:58] !log depooling ms-fe1 [17:41:06] Logged the message, Master [17:42:19] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [17:50:08] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:13] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:06:23] notpeter: i have a question...busy? 
[18:17:36] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:29] hi Andrew, ping me when you are back. thks [18:41:58] cmjohnson1: sup [18:45:43] nopeter: could you look at the partman recipe ms-be-with-ssd-last.cfg....for some reason the grub installer keeps wanting to put the OS on /dev/sda and and not /sdm where it should go [18:45:54] *when you get a chance [18:46:28] sure! [18:47:40] gimme like 30 minutes and I'll take a look [18:48:27] sure..ok [18:48:28] thx [18:48:48] RECOVERY - mysqld processes on es2 is OK: PROCS OK: 1 process with command name mysqld [18:56:09] !log temp stopping puppet on brewster [18:56:15] Logged the message, notpeter [19:17:29] New patchset: Pyoungmeister; "repooling es2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33745 [19:18:29] can some look at this..did this mornign https://gerrit.wikimedia.org/r/33720 [19:22:54] cmjohnson1: for some reason I thought we weren't decomming db42 [19:23:01] but you probably know better than I do [19:23:19] we are not decoming it completely...just want nagios to stop reporting it...it will need to be repurposed [19:23:27] cmjohnson1: also, where is this partman conf that you want me to look at [19:23:54] ms-be-with-ssd-last.cfg. [19:23:55] we could just ack it in nagios? [19:24:11] notpeter: okay...didn't know to do that [19:24:24] cmjohnson1: on what boxdo you have nagios credentials? [19:24:41] don't know [19:24:43] ah, there's the file, sorry [19:29:05] cmjohnson1: I think you can add [19:29:06] d-i grub-installer/only_debian boolean true [19:29:13] to the partman recipe [19:29:42] or, if that doesn't work [19:29:47] looks like this will: [19:29:48] d-i grub-installer/only_debian boolean false [19:29:54] d-i grub-installer/with_other_os boolean false [19:30:32] d-i grub-installer/bootdev string (sdm1) [19:30:45] (form https://help.ubuntu.com/12.04/installation-guide/example-preseed.txt ) [19:32:31] d-i grub-installer/only_debian boolean true [19:32:31] d-i grub-installer/with_other_os boolean true [19:32:31] are in common.cfg now [19:35:35] what's the problem? [19:35:49] ms-be6 was installed fine with that conf, wasn't it? [19:37:34] apergos: ^^^ [19:37:42] yes [19:37:47] I wish I knew what the problem is [19:37:51] but it won't install [19:37:58] what do you mean it won't install? [19:38:11] specifically, grub tries to install on /dev/sda and fails [19:38:35] apergos: we should replicate the same sequence as ms-be6 and see if it works [19:38:35] logs? [19:39:30] Nov 16 14:51:50 main-menu[541]: (process:18004): Error: /dev/sda: unrecognised disk label [19:39:38] Nov 16 14:51:50 main-menu[541]: (process:18004): grub-probe: error: unknown filesystem. [19:39:49] and then it fails out [19:41:13] ~/wikimedia/puppet/files/autoinstall$ grep sda * [19:41:13] common.cfg:d-i grub-installer/bootdev string /dev/sda [19:41:40] yes, I know [19:42:05] but as you point out [19:42:08] it worked on ms-be6 [19:42:15] what is broken? [19:42:20] for ms-be7 [19:42:20] it worked on ms-be6 because you first partitioned it on sda/b [19:42:38] so you had a partition table on sda [19:43:42] and yet there is no grub bootloader installed on /dev/sda on ms-be6 [19:44:14] I checked for that [19:46:50] puppet calls out to parted and recreates MBR though, isn't it? 
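What everyone is circling here is that common.cfg pins grub-installer/bootdev to /dev/sda globally, so the ssd recipe needs its own override; per the example-preseed page notpeter links, that would look roughly like the following (whether a later per-recipe file actually wins over common.cfg was exactly the open question, so treat this as a sketch rather than the change that was merged):

    # append an override to the ssd recipe so grub targets the first SSD instead of /dev/sda
    cat >> files/autoinstall/ms-be-with-ssd-last.cfg <<'EOF'
    d-i grub-installer/only_debian boolean true
    d-i grub-installer/with_other_os boolean false
    d-i grub-installer/bootdev string /dev/sdm
    EOF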
[19:47:18] I don't know if it writes over the mbr [19:47:19] yes it does, and it even formats it into a GPT label [19:47:23] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33745 [19:48:20] !log py synchronized wmf-config/db.php 'es2 going live' [19:48:27] Logged the message, Master [19:48:32] did you try changing grub-installer/bootdev? [19:48:54] no, not yet [19:49:15] ok [19:49:20] what made the grub loader be installed on sdm and sdn o ms-be6? [19:49:21] if that fails too, ping me and I'll have a look [19:49:37] it seems to me we just want that to happen without the "extra" sda [19:54:36] AaronSchulz: any objection to me deleting all metrics from graphite that haven't been updated in the last 14 days? [19:54:48] nope [20:02:56] !log depooling srv214-srv218 from api pool for upgrade to precise [20:03:03] Logged the message, notpeter [20:03:29] !log depooling srv231-srv247 from apaches pool for upgrade to precise [20:03:36] Logged the message, notpeter [20:06:13] binasher: did you get a change to look at job_attempts? [20:06:46] AaronSchulz: ba, no.. i'll do that shortly [20:06:55] its already been merged, right? [20:07:07] yeah, it should have just been +2ed ;) [20:08:35] Change abandoned: Cmjohnson; "easier way to do this" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33720 [20:10:53] ok, just deleted over 12k metrics from from graphite [20:11:10] now let me look at that schema change [20:13:44] New patchset: Cmjohnson; "Fixing partman recipe for msbe cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33749 [20:13:58] apergos: review plz ^ [20:14:45] looks ok to me, no idea where it should actually go or what happens if we give directives multiple times though [20:15:01] (I've been asking google but not finding anything so far) [20:15:13] PROBLEM - Host srv216 is DOWN: PING CRITICAL - Packet loss = 100% [20:15:14] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33749 [20:15:59] notpeter: I saw at somepoint you disabled puppet on brewster [20:16:04] we need to run it [20:16:09] ok, yeah [20:16:10] go for it [20:16:16] sorry I didn't re-enable [20:16:32] running manually, that's fine [20:18:01] mc1016 mac addr change [20:18:03] notpeter: ^^ [20:18:19] yeah [20:18:23] that was for some testing [20:18:52] paravoid: https://gerrit.wikimedia.org/r/#/c/33751/ :) [20:19:42] PROBLEM - Apache HTTP on srv218 is CRITICAL: Connection refused [20:19:51] PROBLEM - Apache HTTP on srv215 is CRITICAL: Connection refused [20:19:51] PROBLEM - Memcached on srv215 is CRITICAL: Connection refused [20:19:51] PROBLEM - Memcached on srv218 is CRITICAL: Connection refused [20:20:00] PROBLEM - SSH on srv218 is CRITICAL: Connection refused [20:20:09] PROBLEM - SSH on srv215 is CRITICAL: Connection refused [20:20:09] PROBLEM - Memcached on srv214 is CRITICAL: Connection refused [20:20:36] PROBLEM - SSH on srv214 is CRITICAL: Connection refused [20:20:36] PROBLEM - Apache HTTP on srv214 is CRITICAL: Connection refused [20:20:54] RECOVERY - Host srv216 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [20:21:46] notpeter: which es server were you working on yesterday? 
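Graphite keeps one whisper file per metric, so the sort of cleanup binasher just logged (dropping metrics not updated in 14 days) can be done as an age-based sweep over the whisper tree; a sketch using the upstream default path, since the log does not say how it was actually done:

    # list candidates first, then delete; mtime on a .wsp file tracks the last datapoint written
    find /opt/graphite/storage/whisper -name '*.wsp' -mtime +14 -print
    find /opt/graphite/storage/whisper -name '*.wsp' -mtime +14 -delete
    # tidy up directories that are now empty
    find /opt/graphite/storage/whisper -type d -empty -delete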
[20:21:48] PROBLEM - Host srv238 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv237 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv236 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv233 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:48] PROBLEM - Host srv231 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:49] PROBLEM - Host srv235 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:49] PROBLEM - Host srv232 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:50] PROBLEM - Host srv239 is DOWN: PING CRITICAL - Packet loss = 100% [20:22:22] binasher: es2 [20:22:29] is it back in? [20:22:31] yes [20:22:42] I looked at it, tested it on one apache [20:22:45] looked good [20:23:24] cool, go ahead with es1, but this time copy from a host in eqiad that isn't getting traffic [20:23:32] AaronSchulz: yay! [20:24:04] okay, away for a few hours [20:24:09] binasher: ok [20:24:11] paravoid: tailing the swift log to debug 42047 showed to much annoying stuff :) [20:24:19] binasher: any reason in particular? [20:24:22] just curious [20:24:33] there were a spike in db query failures to es3 throughout the day yesterday.. the 7200 rpm sata disks in the es1 cluster suck for this [20:24:46] ah, ok [20:24:48] PROBLEM - Memcached on srv216 is CRITICAL: Connection refused [20:24:53] i was pretty tempted to kill the copy yesterday, probably should have [20:24:57] RECOVERY - SSH on srv218 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:24:58] gah [20:24:59] ok [20:25:01] sorry :/ [20:25:06] RECOVERY - SSH on srv215 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:33] RECOVERY - SSH on srv214 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:42] PROBLEM - Apache HTTP on srv216 is CRITICAL: Connection refused [20:26:01] binasher: did I tell you that I adore https://gdash.wikimedia.org/dashboards/jobq/ [20:26:09] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [20:27:30] RECOVERY - Host srv238 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [20:27:30] RECOVERY - Host srv236 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:27:30] RECOVERY - Host srv237 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [20:27:30] RECOVERY - Host srv233 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms [20:27:30] RECOVERY - Host srv231 is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [20:27:31] RECOVERY - Host srv232 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [20:27:31] RECOVERY - Host srv239 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [20:27:32] RECOVERY - Host srv235 is UP: PING OK - Packet loss = 0%, RTA = 1.55 ms [20:27:57] binasher: that weird thumbnail bug I was talking about last night is https://bugzilla.wikimedia.org/show_bug.cgi?id=42047 [20:28:15] PROBLEM - SSH on srv234 is CRITICAL: Connection refused [20:28:15] PROBLEM - Memcached on srv234 is CRITICAL: Connection refused [20:28:24] PROBLEM - Apache HTTP on srv234 is CRITICAL: Connection refused [20:29:16] apergos: nope...failed still trying for /sda [20:29:45] ok so either it's a problem with multiple directives, ec etc [20:29:50] or its some other random thing [20:29:52] or it's [20:30:01] https://bugs.launchpad.net/ubuntu/+source/debian-installer/+bug/1012629 that bug [20:30:08] could be the bug [20:30:11] I guess that about covers all the possibilities [20:30:17] I take it it tries and fails? [20:30:34] same old error? 
[20:30:42] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:42] PROBLEM - Apache HTTP on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:43] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:47] i don't know I watched the grub installer and didn't look like it tried anything but /sda [20:30:55] !log purging zhwiki refreshLinks2 jobs [20:31:00] PROBLEM - Memcached on srv236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:02] Logged the message, Master [20:31:03] ok. but I mean it complained with the same error message? [20:31:07] binasher: that's a good way to start :) [20:31:09] yes [20:31:12] ok [20:31:15] AaronSchulz: exactly! [20:31:27] PROBLEM - Memcached on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:36] PROBLEM - Memcached on srv235 is CRITICAL: Connection refused [20:31:36] PROBLEM - Memcached on srv237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:36] PROBLEM - Memcached on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:31:45] RECOVERY - SSH on srv234 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:32:12] PROBLEM - Apache HTTP on srv232 is CRITICAL: Connection refused [20:32:12] PROBLEM - Apache HTTP on srv235 is CRITICAL: Connection refused [20:32:12] PROBLEM - Apache HTTP on srv238 is CRITICAL: Connection refused [20:32:57] PROBLEM - Memcached on srv233 is CRITICAL: Connection refused [20:32:57] PROBLEM - Memcached on srv231 is CRITICAL: Connection refused [20:33:06] PROBLEM - Memcached on srv232 is CRITICAL: Connection refused [20:35:30] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [20:36:33] PROBLEM - Host srv242 is DOWN: PING CRITICAL - Packet loss = 100% [20:37:40] hrm, plwiktionary has 238k jobs in its queue [20:39:12] grrr not again [20:39:21] New patchset: MaxSem; "Kill mobileRedirect.php, not used since forever" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33786 [20:39:42] PROBLEM - NTP on srv215 is CRITICAL: NTP CRITICAL: No response from NTP server [20:40:00] PROBLEM - SSH on srv241 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv243 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv244 is CRITICAL: Connection refused [20:40:09] PROBLEM - SSH on srv245 is CRITICAL: Connection refused [20:40:18] PROBLEM - SSH on srv247 is CRITICAL: Connection refused [20:40:18] PROBLEM - SSH on srv246 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv241 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv246 is CRITICAL: Connection refused [20:40:18] PROBLEM - Apache HTTP on srv243 is CRITICAL: Connection refused [20:40:19] PROBLEM - Apache HTTP on srv245 is CRITICAL: Connection refused [20:40:19] PROBLEM - Apache HTTP on srv247 is CRITICAL: Connection refused [20:40:20] PROBLEM - Apache HTTP on srv244 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv243 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv244 is CRITICAL: Connection refused [20:40:45] PROBLEM - Memcached on srv245 is CRITICAL: Connection refused [20:40:48] !log running patch-job_attempts.sql migration on all wikis [20:40:54] PROBLEM - Memcached 
on srv246 is CRITICAL: Connection refused [20:40:57] Logged the message, Master [20:41:04] PROBLEM - Memcached on srv247 is CRITICAL: Connection refused [20:41:04] PROBLEM - Memcached on srv241 is CRITICAL: Connection refused [20:41:12] PROBLEM - Memcached on srv240 is CRITICAL: Connection refused [20:41:29] New patchset: Pyoungmeister; "setting srv214-218 and srv231-247 to use appalicationserver role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33803 [20:41:30] PROBLEM - SSH on srv240 is CRITICAL: Connection refused [20:41:30] PROBLEM - NTP on srv218 is CRITICAL: NTP CRITICAL: No response from NTP server [20:41:57] PROBLEM - Apache HTTP on srv240 is CRITICAL: Connection refused [20:41:57] PROBLEM - Host srv214 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:15] RECOVERY - Host srv242 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [20:42:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33803 [20:44:18] AaronSchulz: done [20:44:48] PROBLEM - NTP on srv216 is CRITICAL: NTP CRITICAL: No response from NTP server [20:45:51] PROBLEM - Memcached on srv242 is CRITICAL: Connection refused [20:46:27] RECOVERY - SSH on srv240 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:27] RECOVERY - SSH on srv241 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv243 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv245 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv244 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:45] RECOVERY - SSH on srv246 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:46:54] PROBLEM - Apache HTTP on srv242 is CRITICAL: Connection refused [20:47:03] RECOVERY - SSH on srv247 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:47:39] RECOVERY - Host srv214 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [20:47:57] PROBLEM - NTP on srv234 is CRITICAL: NTP CRITICAL: No response from NTP server [20:50:03] RECOVERY - Apache HTTP on srv236 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [20:50:03] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [20:51:15] PROBLEM - NTP on srv239 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:15] PROBLEM - NTP on srv233 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:24] PROBLEM - SSH on srv214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:24] PROBLEM - NTP on srv232 is CRITICAL: NTP CRITICAL: No response from NTP server [20:51:42] RECOVERY - Apache HTTP on srv215 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.015 seconds [20:51:42] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [20:52:09] PROBLEM - Apache HTTP on srv214 is CRITICAL: Connection refused [20:52:54] RECOVERY - SSH on srv214 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:52:54] PROBLEM - NTP on srv237 is CRITICAL: NTP CRITICAL: No response from NTP server [20:52:58] yes, nagios-wm, everything's critial. 
I get it [20:53:03] PROBLEM - NTP on srv236 is CRITICAL: NTP CRITICAL: Offset unknown [20:53:03] PROBLEM - NTP on srv231 is CRITICAL: NTP CRITICAL: Offset unknown [20:53:03] PROBLEM - NTP on srv235 is CRITICAL: NTP CRITICAL: No response from NTP server [20:53:03] PROBLEM - NTP on srv238 is CRITICAL: NTP CRITICAL: No response from NTP server [20:53:12] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [20:55:18] RECOVERY - Apache HTTP on srv214 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [20:58:27] RECOVERY - Apache HTTP on srv241 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.002 seconds [20:58:27] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [20:59:39] PROBLEM - NTP on srv245 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:39] PROBLEM - NTP on srv241 is CRITICAL: NTP CRITICAL: Offset unknown [20:59:57] RECOVERY - Apache HTTP on srv245 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:00:33] RECOVERY - Apache HTTP on srv216 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:00:34] PROBLEM - NTP on srv243 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:09] PROBLEM - NTP on srv246 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:09] PROBLEM - NTP on srv247 is CRITICAL: NTP CRITICAL: No response from NTP server [21:01:18] RECOVERY - NTP on srv231 is OK: NTP OK: Offset 0.07543969154 secs [21:01:18] PROBLEM - NTP on srv240 is CRITICAL: NTP CRITICAL: Offset unknown [21:03:06] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:36] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.009 seconds [21:06:15] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK HTTP/1.1 200 OK - 456 bytes in 0.003 seconds [21:06:15] PROBLEM - NTP on srv242 is CRITICAL: NTP CRITICAL: No response from NTP server [21:06:38] New patchset: Pyoungmeister; "pulling es1 for converstion to innodb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33811 [21:06:51] RECOVERY - Apache HTTP on srv242 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [21:07:45] RECOVERY - NTP on srv245 is OK: NTP OK: Offset -0.09133017063 secs [21:08:03] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.004 seconds [21:08:48] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [21:08:56] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33811 [21:09:41] !log py synchronized wmf-config/db.php 'pulling es1' [21:09:48] Logged the message, Master [21:10:09] RECOVERY - Apache HTTP on srv218 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:11:12] PROBLEM - NTP on srv214 is CRITICAL: NTP CRITICAL: Offset unknown [21:13:00] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [21:14:03] RECOVERY - NTP on srv236 is OK: NTP OK: Offset -0.03819358349 secs [21:14:31] RECOVERY - NTP on srv240 is OK: NTP OK: Offset -0.04238283634 secs [21:14:31] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.007 seconds [21:14:39] RECOVERY - NTP on srv242 is OK: NTP OK: Offset -0.01423549652 secs [21:15:42] RECOVERY - NTP on srv215 is OK: NTP OK: Offset -0.04974234104 secs [21:16:00] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.006 seconds [21:16:18] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK HTTP/1.1 200 OK - 454 
bytes in 0.011 seconds [21:18:18] New patchset: Hashar; "(bug 41183) move beta logs out of /home" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33813 [21:18:56] LeslieCarr: stooooop lunching :-D [21:19:18] RECOVERY - NTP on srv214 is OK: NTP OK: Offset -0.03077101707 secs [21:20:57] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.008 seconds [21:22:00] RECOVERY - NTP on srv237 is OK: NTP OK: Offset -0.03861808777 secs [21:23:12] RECOVERY - NTP on srv241 is OK: NTP OK: Offset -0.03002393246 secs [21:23:57] RECOVERY - NTP on srv239 is OK: NTP OK: Offset -0.0932238102 secs [21:24:06] RECOVERY - NTP on srv216 is OK: NTP OK: Offset -0.04292368889 secs [21:27:24] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [21:27:33] RECOVERY - NTP on srv232 is OK: NTP OK: Offset -0.03715598583 secs [21:28:09] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.013 seconds [21:29:03] RECOVERY - NTP on srv234 is OK: NTP OK: Offset -0.07721054554 secs [21:29:57] PROBLEM - mysqld processes on es1 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:30:33] RECOVERY - NTP on srv238 is OK: NTP OK: Offset -0.03103542328 secs [21:30:53] !log starting innobackupex from es1004 to es1 [21:31:00] Logged the message, notpeter [21:32:21] RECOVERY - NTP on srv246 is OK: NTP OK: Offset -0.02613461018 secs [21:33:51] RECOVERY - NTP on srv218 is OK: NTP OK: Offset -0.04030764103 secs [21:35:21] RECOVERY - NTP on srv233 is OK: NTP OK: Offset -0.03011929989 secs [21:39:51] RECOVERY - NTP on srv243 is OK: NTP OK: Offset -0.03958690166 secs [21:40:45] RECOVERY - NTP on srv247 is OK: NTP OK: Offset -0.03112268448 secs [21:51:50] RECOVERY - NTP on srv235 is OK: NTP OK: Offset -0.03366339207 secs [21:57:30] hashar: what's up [21:57:45] other than delicious mexican food [21:57:50] which is down [21:57:52] in my belly [21:58:13] \O/ [21:58:29] so I got a tiny change for your rt duty :-] https://gerrit.wikimedia.org/r/#/c/33813/ [21:58:37] which is for the 'beta' project on labs. [21:59:00] we write udp2log files under /home which is a shared NFS instance of 18GB shared among ALL instances [21:59:13] whenever beta start logging a ton of stuff, that block any projects :-] [21:59:32] Mmm mexican [21:59:39] so that patch is all about writing udp2log logs to /data/project which is per project :-] [22:00:02] hashar: You should totally ship that to logstash and work with the A guy on making it awesome :D [22:00:03] * hashar gets a glass of wine [22:00:13] Damianz: logstash ? [22:00:19] * Damianz gets a bottle of wine to go with his chicken dinner [22:00:27] http://logstash.net/ < [22:00:27] ohhh [22:00:32] yet another logging tool [22:00:34] so whmmhm [22:00:35] We have an instance in labs with instances syslogs going to it [22:00:41] \O/ [22:01:16] I think someone is working on getting rid of udp2log [22:03:36] I am afraid LeslieCarr has been scared by my puppet change is now running around in the office [22:03:51] oh, i wasn't looking at irc without the pinging [22:03:52] (screaming) *** noooo another hasharr's change nooo *** [22:04:07] argh [22:04:12] LeslieCarr: i keep forgetting to ping people sorry :( [22:04:13] hashar: so it's not referenced anywhere [22:04:31] hop cause that is on labs. 
We apply the class by using OpenStackManager on labsconsole.wmflabs.og [22:04:36] ok cool [22:04:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33813 [22:04:43] so that should be safe as far as production is concerned [22:04:45] \O/ [22:04:46] done [22:04:49] I luuuve role classes [22:04:50] hashar: re: udp2log. yes, it may get replaced, but it's going to be a few months before we can stop worrying about it (and may end up keeping it in the pipeline) [22:04:54] so easy to get them merged in [22:05:09] Damianz: what robla said ^^^^^ [22:05:28] robla: ahh good to know :-] I guess it is not that much high priority since udp2log has been a success over the last few years [22:05:37] (plus we know the developer who is an awesome guy) [22:05:43] LeslieCarr's talk of delicious mexican food makes me wonder if there's a delicious mexican option that he wasn't aware of. [22:05:59] s/he/I/ [22:06:00] robla: tropisueno [22:06:36] ah, that one. I've never done that for lunch [22:07:02] robla: about the report for our CI sprint I am afraid we are a bit late. Been to busy but we scheduled some time with zelko to write the report down on monday [22:14:08] LeslieCarr: worked like a charm. Thanks a ton !!! [22:17:42] !log catrope synchronized php-1.21wmf3/resources/jquery/jquery.tablesorter.js 'Deploying 2405956a8d6f305f6d3f9b0a1bda4deb7e917ec8' [22:17:50] Logged the message, Master [22:18:00] !log repooling srv214-srv218 [22:18:06] Logged the message, notpeter [22:18:12] notpeter: FYI: [22:18:14] srv234: rsync: change_dir#3 "/apache/common-local/php-1.21wmf3/resources" failed: No such file or directory (2) [22:18:16] srv234: rsync error: errors selecting input/output files, dirs (code 3) at main.c(643) [Receiver=3.0.9 [22:18:17] ditto srv235 [22:18:19] yeah [22:18:27] they're getting their first sync right now [22:18:38] but thanks for the heads up! [22:18:43] I appreciate more eyes [22:18:46] Alright [22:18:51] !log catrope synchronized php-1.21wmf4/resources/jquery/jquery.tablesorter.js 'Deploying 9cb43905cd7189aa9a42f942f24a7aea50b4f0e9' [22:18:53] I just stumbled upon it because I was deploying a bug fix [22:18:57] Logged the message, Master [22:19:13] when upgrading several hundred servers, the posibility of me missing sometihng is non-trivial [22:19:43] Yeah [22:20:24] Although you've upgraded more servers and in larger batches than anyone ever did around here, and thanks to better automation it has never led to out-of-sync servers AFAIK [22:20:25] New patchset: Reedy; "Disable validation statistics maintenance report update" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33824 [22:20:44] We used to have that problem from time to time before we automated this properly [22:20:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33824 [22:21:48] !log reedy synchronized wmf-config/InitialiseSettings.php [22:21:55] Logged the message, Master [22:25:43] RoanKattouw: yeah, it's hard to automate fully. 
I'll be more able to do that with newer servers [22:26:08] I think that I have pretty much cobbled together a fully automated, extremely abusive shellscript for this [22:26:10] New patchset: Reedy; "Revert "Disable validation statistics maintenance report update"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33826 [22:26:21] Well for a while we've had the thing where puppet makes sure that a server is in sync before it starts Apache [22:26:33] That works /really/ well to eliminate most occurrences of out-of-sync servers [22:26:36] yeah [22:27:01] Before we had that, Rob would catch heat every time he revived a bunch of dead Apaches [22:27:23] Because of course if you're doing a batch of >10 of them, one of them is bound to get screwed up in some way [22:27:32] yeah [22:29:10] And no one had time to revive Apaches in short order, so the dead ones would pile up and get revived in a batch every now and then. Which meant that the impact was larger because they were more out of date. The really fun one was the first revival round after the Monobook->Vector switch, when some revived boxes served the wrong skin using a different MW release [22:29:32] ugh [22:29:36] Yeah [22:29:42] Our users actually noticed that time :) [22:29:49] Other times the bugs would be more subtle [22:31:15] yeah, I've caused some of those :) [22:31:50] which has lead me to get this to a semi-automated point that involves a lot of overkill (multiple syncs... cuz why not!) [22:32:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33826 [22:32:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30062 [22:32:42] notpeter: I am Roan Kattouw and I support your paranoia [22:32:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/33475 [22:32:59] heheheheh [22:33:08] Yes notpeter, the wikipedians really ARE after you [22:33:19] firing off shell scripts is easy :) [22:33:23] oh, I know.... [22:33:30] they're after my data [22:37:11] * hashar waves, have a nice weekend everyone [22:40:17] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [22:40:56] hashar: have a wonderful weekend! [22:41:12] notpeter: will surely have one :) [22:43:53] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [22:43:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:44:58] !log repooling srv231-srv247 [22:45:04] Logged the message, notpeter [22:46:08] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [22:56:18] Ryan_Lane: I don't know what's up re HTTPS for logged-in users. csteipp: do you? [22:56:24] (Re wikitext-l post of 5 mins ago) [22:56:27] no? [22:56:51] we needed a mechanism to redirect people back to https when they hit http links [22:56:55] Ah, yeah.. 
I'll answer on list [22:56:57] I haven't been involved in recent HTTPS efforts [22:56:59] great :) [22:57:02] Yeah the whole insecure cookie thing [22:57:11] RoanKattouw: me either, since I've been waiting on mediawiki changes :) [22:57:13] I mean I know what technology is required, I just don't know what its current status is [22:57:40] And I would be more involved, except, you know, VE release and school and all that [22:58:33] !log maxsem synchronized php-1.21wmf4/extensions/TimedMediaHandler 'https://gerrit.wikimedia.org/r/#/c/33819/' [22:58:41] Logged the message, Master [22:59:16] RoanKattouw: you don't sound too busy [22:59:25] ;) [22:59:45] * Damianz thinks Roan is under 30 being worked on items and can take more [23:00:00] !log maxsem synchronized php-1.21wmf3/extensions/TimedMediaHandler 'https://gerrit.wikimedia.org/r/#/c/33819/' [23:00:08] Logged the message, Master [23:12:59] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [23:18:41] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [23:37:40] New patchset: Reedy; "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33845 [23:44:26] New patchset: Reedy; "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33845 [23:45:51] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:42] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
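
The "pulling es1 for converstion to innodb" change (gerrit 33811, merged at 21:08 and synced to wmf-config/db.php at 21:09 above) is the depooling step before the host is rebuilt: the box is dropped from the external storage load arrays so nothing reads from it while mysqld is stopped and the innobackupex copy from es1004 runs (21:30). Below is a minimal sketch of what that kind of db.php edit looks like, assuming an LBFactoryMulti-style $wgLBFactoryConf; the cluster name, host list and load weights are illustrative only, not the actual production values:

    <?php
    // Hypothetical excerpt from wmf-config/db.php (LBFactoryMulti-style config).
    // Cluster name, hostnames and load weights are illustrative only.
    $wgLBFactoryConf['externalLoads'] = array(
        'cluster1' => array(
            // 'es1' => 1,   // depooled: host is being rebuilt with InnoDB
            'es2' => 1,
            'es3' => 1,
            'es4' => 1,
        ),
    );

Once the copy finishes and mysqld is back up on the rebuilt host, the reverse edit repools it.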
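On the 22:56–22:58 HTTPS thread ("we needed a mechanism to redirect people back to https when they hit http links", plus the insecure-cookie concern): the general idea is to set a marker cookie that is deliberately not flagged Secure when someone logs in over HTTPS, so that later plain-HTTP hits can see it and bounce the browser back to HTTPS, while the real session cookies stay Secure. The snippet below is only a generic sketch of that idea in plain PHP — it is not MediaWiki's implementation (which was still pending per the discussion), and the cookie name is made up:

    <?php
    // Generic sketch of "redirect logged-in users back to HTTPS"; not MediaWiki code.
    // The marker cookie is intentionally not Secure, so the browser also sends it
    // over plain HTTP; actual session cookies should remain Secure (and HttpOnly).

    const FORCE_HTTPS_COOKIE = 'prefersHTTPS'; // illustrative name

    function markBrowserAsHttps() {
        // Call on successful login over HTTPS. The only requirement for the
        // marker is that it is NOT Secure, so it is visible on HTTP requests.
        setcookie( FORCE_HTTPS_COOKIE, '1', time() + 30 * 86400, '/', '', false, false );
    }

    function redirectToHttpsIfMarked() {
        $isHttps = isset( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] !== 'off';
        if ( !$isHttps && isset( $_COOKIE[FORCE_HTTPS_COOKIE] ) ) {
            // 302, not 301: the preference should not outlive the cookie.
            header( 'Location: https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'], true, 302 );
            exit;
        }
    }

Wiring this into a real login flow (and clearing the marker on logout) is left out; the point is just the redirect-back mechanism being discussed.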
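For the "Disable updating of ValidationStatistics in SpecialPageCacheUpdates" patchsets (gerrit 33824 and its revert 33826 earlier in the log, then 33845 at 23:37): MediaWiki's updateSpecialPages.php maintenance script walks the $wgSpecialPageCacheUpdates array and runs each registered callback, so removing the ValidationStatistics entry stops that report from being regenerated on the cron run. One plausible shape for such a config change, purely as a sketch — it assumes the entry is keyed 'ValidationStatistics' as in the commit message, and the real patch may be structured differently:

    <?php
    // Sketch only: stop updateSpecialPages.php from regenerating the
    // ValidationStatistics report. Assumes the FlaggedRevs entry is keyed
    // 'ValidationStatistics'; the actual wmf-config change may differ.
    if ( isset( $wgSpecialPageCacheUpdates['ValidationStatistics'] ) ) {
        unset( $wgSpecialPageCacheUpdates['ValidationStatistics'] );
    }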