[00:03:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24511 [00:03:45] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:07:51] New patchset: Andrew Bogott; "Add a hard-coded region to the automatic instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24513 [00:08:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24513 [00:09:50] New patchset: Faidon; "swift: passthrough HTTP redirects from thumbhandler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24514 [00:10:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:10:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24514 [00:10:51] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24513 [00:16:01] !log configured pt-heartbeat on es2 and es3 shards [00:16:11] Logged the message, Master [00:18:45] RECOVERY - mysqld processes on es6 is OK: PROCS OK: 1 process with command name mysqld [00:27:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [00:34:41] New review: Asher; "This would currently clash with the memlock ulimit." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23872 [01:00:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [01:24:17] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:42:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 278 seconds [01:44:05] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 299 seconds [01:45:26] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [01:45:35] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds [01:45:53] PROBLEM - Host carbon is DOWN: PING CRITICAL - Packet loss = 100% [01:47:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:59:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.876 seconds [02:08:08] New review: Krinkle; "There is still issues to be resolved. See bug report." [operations/mediawiki-config] (master); V: -1 C: -2; - https://gerrit.wikimedia.org/r/21322 [02:32:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:40:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [02:46:20] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:46:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:34:14] Change abandoned: Jgreen; "sigh. thanks gerrit." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/11979 [03:38:30] New patchset: Jgreen; "redoing pgehres->deploy commit per RT #3143" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24522 [03:39:28] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24522 [03:40:16] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [04:35:54] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [04:56:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:01:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [05:28:17] PROBLEM - Lucene on search1015 is CRITICAL: Connection timed out [05:35:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:39:14] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Puppet has not run in the last 10 hours [05:39:14] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Puppet has not run in the last 10 hours [05:40:17] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [05:44:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.935 seconds [06:07:08] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [06:08:29] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [06:12:14] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [06:13:44] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [06:19:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:32:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.072 seconds [06:32:56] PROBLEM - LVS 
Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:34:08] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:39:17] Change abandoned: Hashar; "We need to make Jenkins available in Precise." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24427 [07:06:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:17:26] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:17:26] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [07:17:26] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:17:26] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [07:17:26] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:17:26] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [07:18:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.108 seconds [07:53:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:05:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.974 seconds [08:14:22] New patchset: Hashar; "(bug 40419) extension assets not available on beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24525 [08:39:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.060 seconds [09:09:42] New review: Dereckson; "@Krinkle These lines were to ask the semi protection. The last bug requesting that were a 2006 one. ..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23059 [09:14:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:27:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:19] PROBLEM - Puppet freshness on es1005 is CRITICAL: Puppet has not run in the last 10 hours [09:38:22] PROBLEM - Puppet freshness on es1006 is CRITICAL: Puppet has not run in the last 10 hours [09:38:22] PROBLEM - Puppet freshness on es1008 is CRITICAL: Puppet has not run in the last 10 hours [09:38:22] PROBLEM - Puppet freshness on es1007 is CRITICAL: Puppet has not run in the last 10 hours [09:38:22] PROBLEM - Puppet freshness on es1009 is CRITICAL: Puppet has not run in the last 10 hours [09:38:22] PROBLEM - Puppet freshness on es1010 is CRITICAL: Puppet has not run in the last 10 hours [09:41:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [10:04:19] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:12:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [10:59:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:13:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.220 seconds [11:25:49] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:46:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [12:35:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [12:47:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:47:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:49:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [12:56:07] mark: re. ipv6 and geoip, does the client always access geoip.wikimedia directly? or is it ever used in internal subrequests? [12:56:57] what is an internal subrequest? [12:57:12] i.e. from client-->proxy-->webserver-->geoip [12:57:19] no [12:57:25] geoip only looks at the connecting ip [12:57:37] so you can't pass it a parameter with an ip, like a webserver script would have to do [12:57:37] iirc it looks at a header it gets from the proxy [12:57:44] that is true [12:57:54] but I don't see how it makes a difference? [12:57:56] so it *could* be used in theory however we want [12:58:21] i hesitated to draw any conclusions about this whole issue because I didn't fully understand how its used here [12:58:32] i really don't know how the fundraising team is using it [12:58:39] so that's why I said it should be figured out first [12:58:51] but when used in any sane way, it *should* not have a problem ;) [12:59:11] sure. also I don't think zack's point was that it necessarily *is* broken. [12:59:28] zack just doesn't understand it [12:59:33] exactly [12:59:51] anyway, geoiplookup has an ipv4 address only, so however way it is contacted, that first client ip is always an ipv4 address [12:59:56] anyway, if it were an internal request it could matter b/c we wouldn't have the followup ipv4 request to work with [12:59:59] a quick google didn't show an easy explanation of how ipv6 versus ipv4 works [13:00:23] we'd have only the header from the proxy with the ipv6 IP [13:00:29] oh wait [13:00:35] isn't this because the bits.wikimedia.org/geoiplookup thing [13:00:47] bits. DOES have an ipv6 ip [13:00:48] it might be. 
I really don't know [13:00:53] yes, that's got to be it [13:01:11] we really need to break down each type of request that's used to understand this stuff [13:01:52] so now it uses the bits.wikimedia.org domain, and that IS contacted over ipv6 [13:02:37] looking for a banner . . . [13:02:48] nothing up atm afaik [13:03:21] hrm, you need a "jimmy-light" (like a batman light) [13:03:28] ha [13:03:59] turn on the light, the banner comes on, getting donations and scaring small children :) [13:04:18] maybe we're being adblocked or something [13:08:13] sent a followup email [13:21:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [13:41:11] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [13:48:50] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:48] New patchset: Cmjohnson; "adding Howie Fung authorization to access stat1 per rt 3577" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24542 [13:52:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24542 [13:54:29] notpeter: can you review https://gerrit.wikimedia.org/r/24542 [14:03:05] heyaaa, mutante, are you around? can you help me with a couple more cisco things? 
[14:03:24] i've got all but 2 reinstalled, not sure why these two are giving me a hard time [14:03:45] at 7 am in SF, unlikely [14:03:51] ah he's there, ok [14:04:57] yeah..too early for those guys...check back in 3 hours [14:09:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:22] if anyone wanna update some packages in our apt repo, I got a request filled with https://rt.wikimedia.org/Ticket/Display.html?id=3579 [14:21:38] which are bug 40414 & bug 40426 [14:21:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.847 seconds [14:26:02] RECOVERY - Puppet freshness on analytics1005 is OK: puppet ran at Fri Sep 21 14:25:51 UTC 2012 [14:37:08] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [14:44:03] Any ops available to review this: https://gerrit.wikimedia.org/r/#/c/24428/ [14:44:22] thanks in advance :) [14:45:41] RECOVERY - NTP on analytics1005 is OK: NTP OK: Offset -0.03102123737 secs [14:46:03] <^demon> Hydriz: Looks ok. Why would ops review that? [14:46:07] <^demon> That's an extension change :) [14:46:08] er that doesn't seem like an ops thing [14:46:40] not sure, but ops seems to be the one with the powers to merge [14:46:55] <^demon> Insofar as they can merge anything. [14:47:00] and it was deployed on Wikimedia (though removed) [14:47:18] <^demon> Anyone in https://gerrit.wikimedia.org/r/#/admin/groups/53,members can review it. [14:47:29] <^demon> And as with all extensions, anyone in the 'mediawiki' group can review it [14:47:57] oh, I see [14:48:11] so I probably might just need to sit around and wait [14:49:29] !log labs: cleaned out a 4GB file out of labs-nfs1:/export/home/deployment-prep [14:49:40] Logged the message, Master [14:55:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:36] hashar: shouldn't that be in the labsconsole SAL? 
[14:58:23] oh maybe it should [14:58:28] but the logging bot is dead there [15:02:19] well then fix the bot ;) [15:11:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [15:13:48] lesliecarr: can you merge this plz...https://gerrit.wikimedia.org/r/24542 [15:24:27] * jeremyb suggests cmjohnson1 leave some whitespace before his URLs ;-) [15:25:15] jeremyb: okay...i hate whitespace [15:25:34] either i have it or I don't....ugh..:-P [15:25:37] is kinda important! [15:25:50] > Iceweasel doesn't know how to open this address, because the protocol (plz...https) isn't associated with any program. [15:26:47] also, the machine's name is Macintosh? seems a little too generic. ;-) [15:43:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.373 seconds [16:00:01] New patchset: Demon; "Refactor gerrit2 account stuff so we can reuse it on other hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24557 [16:01:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24557 [16:01:39] New review: Demon; "This is a work-in-progress. Need to test some things on labs first. Also want to check with producti..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/24557 [16:18:46] New review: Demon; "Manganese said uid 999, but I'm assuming that was random and we never picked an unused uid." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/24557 [16:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:43] New patchset: Helder.wiki; "(bug 22911) Configure Extension:SubpageSortkey for enwikibooks and ptwikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24561 [16:37:59] New review: Helder.wiki; "Please check if this is the way to add new configuration to this file (e.g. I wasn't sure about usin..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/24561 [16:43:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.940 seconds [17:05:09] New review: RobH; "Not sure what Chris meant when he mentioned whitespace issue before the URL. I put a few suggestion..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24542 [17:16:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:17:58] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [17:17:58] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:17:58] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [17:17:58] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [17:17:58] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [17:17:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:32:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [17:48:22] hiyyyya mutante, you around? [18:02:54] New review: Dzahn; "please do not use UID 607, it is already in use and we recently had to fix duplicate UIDs. The next ..." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24542 [18:04:09] ah ha! mutante is around, I know it! [18:04:20] i'm running to a cafe, be back in a bit, [18:04:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:16:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.723 seconds [18:29:41] anyone available to help me w/git? [18:31:07] getting some weird stuff....if i blow out my repo i get this [18:31:11] chrisjohnson:puppet chrisj$ git reset --hard origin [18:31:11] HEAD is now at 89e4fcf redoing pgehres->deploy commit per RT #3143 [18:31:26] ^ not sure what this is about [18:33:50] that sounds normal, right? [18:33:57] it is just telling you what commit it reset to [18:34:11] i am not pgehres [18:34:49] right, but his is probably the most recent commit [18:35:03] cmjohnson1: that is just the commit msg, it was done by Jeff Green [18:35:06] even still ...when i fetch my change to fix....make changes, amend and then after review i get rebase stuff [18:35:32] ok...i am confused on why i can't make changes [18:43:02] I assume you have a change in gerrit that you want to amend, that's how you got stuck where you are? [18:43:30] yes [18:43:38] ok [18:43:42] i have a change I am trying to amend [18:44:31] i fetched the change, fixed, amended and git review...then i get rebase b.s. [18:50:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:31] New patchset: Cmjohnson; "adding Howie Fung authorization to access stat1 per rt 3577" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24542 [18:54:57] apergos: thx that worked [18:55:14] oh yay [18:55:24] hey maybe the lab instructions could be updated [18:55:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24542 [18:55:52] i think the difference was -R flag [18:56:06] not in the labs instruction [18:56:32] huh [18:56:59] The -R is important here. It tells git-review to not rebase your change against master, which clutters diffs between patch set 1 and 2. [18:59:55] ls [19:04:21] mutante: fixed the uid...can you check and merge [19:04:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.763 seconds [19:04:45] New patchset: Faidon; "swift: propagate thumb.php's 404 error message" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24576 [19:05:12] ok lol "Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request" [19:05:41] New patchset: Faidon; "swift: passthrough HTTP redirects from thumbhandler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24514 [19:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24576 [19:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24514 [19:13:24] New patchset: Asher; "fix eqiad es regexes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24577 [19:14:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24577 [19:15:10] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:17:34] RECOVERY - Puppet freshness on es1006 is OK: puppet ran at Fri Sep 21 19:17:24 UTC 2012 [19:17:35] RECOVERY - Puppet freshness on es1005 is OK: puppet ran at Fri Sep 21 19:17:26 UTC 2012 [19:18:01] RECOVERY - Puppet freshness on es1009 is OK: puppet ran at Fri Sep 21 19:17:36 UTC 2012 [19:22:06] binasher: hahahaha [19:22:39] binasher: AaronSchulz: i was going to spin up a couple of apaches in eqiad so that you can test stuff against them. would that be helpful? 
how does 5 sound? [19:24:11] notpeter: having a few up would be good to start working on / testing deployment stuff [19:24:21] kk [19:24:50] 2 or 3 would be ok right now [19:25:22] ok, I'll spin up 1017-1019 [19:27:09] binasher: I made some comments on patch #7 on https://gerrit.wikimedia.org/r/#/c/17512/ [19:27:28] RECOVERY - Puppet freshness on es1007 is OK: puppet ran at Fri Sep 21 19:27:03 UTC 2012 [19:28:42] AaronSchulz: is that fa_sha1 index actually going to be used? [19:29:00] by the api yes, for finding duplicates [19:29:11] it won't be behind $wgMiserMode anymore [19:37:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:31] RECOVERY - Puppet freshness on es1008 is OK: puppet ran at Fri Sep 21 19:37:01 UTC 2012 [19:39:10] PROBLEM - Puppet freshness on es1010 is CRITICAL: Puppet has not run in the last 10 hours [19:39:54] AaronSchulz: i think just an index on fa_sha1(8) would be enough [19:41:00] did you run a query? [19:42:53] it isn't the same and has a lot of duplication, but i was looking at fj_path_sha1 on enwiki.. i should try on commons [19:42:54] hmm, that around ~10 hex chars, seems to be pretty unique for git :) [19:43:29] binasher: that column is supposed to be dropped :) [19:43:59] binasher: I wouldn't be surprised if 8 was fine...how many rows do you want it to scan per result-row? [19:44:20] e.g., x:1 [19:45:01] apergos: still lots of "Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response", did you look at that some more or were you stumped? [19:50:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.597 seconds [19:51:01] RECOVERY - Puppet freshness on es1010 is OK: puppet ran at Fri Sep 21 19:51:00 UTC 2012 [19:58:51] ottomata1: re [20:01:41] !change 24542 | cmjohnson1 [20:01:41] cmjohnson1: https://gerrit.wikimedia.org/r/#q,24542,n,z [20:01:53] approved, but has path conflict [20:02:52] hewo [20:02:55] ack, 1! 
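Note on the fa_sha1(8) exchange above: whether an 8-character prefix index is "enough" comes down to how many rows share each prefix, which is the "x:1" ratio asked about at 19:43:59. The following is a minimal Python sketch of that measurement, not anything taken from the channel. The function name rows_per_prefix and the sample data are hypothetical, and the sample uses locally generated SHA-1 hex digests rather than values from any real table, so the numbers it prints say nothing about the actual column.

```python
import hashlib


def rows_per_prefix(hashes, prefix_len):
    """Average number of rows sharing each distinct prefix of length prefix_len.

    A value close to 1.0 means an index on that prefix would touch roughly
    one row per matching result row.
    """
    if not hashes:
        raise ValueError("need at least one hash")
    distinct = {h[:prefix_len] for h in hashes}
    return len(hashes) / len(distinct)


# Hypothetical sample: SHA-1 hex digests of made-up file names.
sample = [hashlib.sha1(f"file-{i}".encode()).hexdigest() for i in range(10000)]

# Compare a few candidate prefix lengths.
for n in (2, 4, 8):
    print(n, round(rows_per_prefix(sample, n), 3))
```

Against a real table the same ratio could presumably be obtained directly in SQL, for example COUNT(*) / COUNT(DISTINCT LEFT(column, 8)), which is likely what "did you run a query?" was getting at.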
[20:03:36] ottomata: so, whats up with the ana servers [20:03:49] heyaaa [20:03:56] ja so, 1002 and 1007 are not being nice [20:04:16] 1002 is hanging just like 1010 was yesterday [20:04:22] even if I try to boot it pxe [20:04:35] ok, i will take a look [20:04:37] 1007 just doesn't reboot, at least, i can't see any change in console when I powercycle [20:04:55] so if they are just reinstalled and boot you are happy, right [20:05:07] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [20:05:08] yup [20:05:21] ok, on it [20:05:28] ottomata: were they already imaged? [20:05:34] or is this the 1st time [20:05:54] this is the 2nd time they've been installed, but this time with precise [20:06:13] i successfully reinstalled 7 others [20:06:18] these two are just being annoying [20:10:50] New patchset: Demon; "Raise gerrit's apache timeout to 10 minutes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24583 [20:11:52] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:12:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24583 [20:12:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:13:02] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24583 [20:15:47] !log analytics1007 seems dead. 
powercycle but never see any console output on mgmt at all [20:15:58] Logged the message, Master [20:16:39] ottomata: 1007 is either dead or the mgmt interface / console redirection settings have changed [20:17:36] !log apt-get upgrade on manganese, restart apache for adjusted 5->10m gerrit timeout [20:17:46] Logged the message, Master [20:19:39] Change abandoned: Cmjohnson; "too many errors..doing it again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24542 [20:22:19] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:23:15] Jeff_Green: mutante do you have any documentation about puppet modules ? [20:23:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:23:31] I am considering migrating the contint classes to a module ;-D [20:23:41] oh--I haven't worked with modules yet [20:24:04] hashar: http://docs.puppetlabs.com/puppet/2.7/reference/modules_fundamentals.html [20:24:12] New review: Demon; "We went ahead and deployed this so Jeff could test our theory about his timeouts. Can't see it causi..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/24583 [20:24:36] hashar: i read that..but actually i havent written one yet [20:24:59] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:25:07] :-] [20:25:13] guess I will have to RTFM hehe [20:25:29] hashar: look at the ones peter wrote [20:25:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:41] ohh [20:25:56] there is a "modules" directory in the operations puppet repo now [20:26:02] so I will be able to ask n0tpeter about it since our timezones overlap nicely [20:26:02] New patchset: Andrew Bogott; "Update instance status in response to compute.instance.metadatachange" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24586 [20:26:15] I need the 5 minute 'why bother' talk on modules [20:26:21] it has applicationserver, ntp, salt... [20:26:34] every time I've looked at it, it seems like a ton of overhead for a gain I haven't yet managed to identify [20:27:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:27:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24586 [20:27:36] Jeff_Green, i think the gain is isolation and, um modularization [20:27:42] stuff in manifest has dependencies all over the place [20:27:54] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24586 [20:28:01] if you write your module nice, you should be able to use it anywhere (even in another system) and it would work [20:28:12] i guess, but can't you write a manifest the same way? 
[20:28:19] so ideally, a module wouldn't have anything wmf specific in it [20:28:33] yeah, but it is harder to extract as a 'module' [20:28:39] since your files and templates are all next to all the others [20:28:39] http://forge.puppetlabs.com/ [20:28:46] because the files/templates are interleaved on the filesystem? [20:28:50] this kinda keeps the files and templates and manifests all together [20:28:52] hashar: you can find tons of modules on "puppet forge" [20:29:02] <^demon> !log restarted gerrit like 10m ago to deploy I03de7c74 (apache timeout increase) [20:29:12] Logged the message, Master [20:29:13] ottomata: i see. that makes sense. [20:30:15] ottomata: i have the same issues you have with these 2 boxes :/ [20:30:46] 1002, one boot option was disabled, i enabled it again, rebooted again.. it still hangs etc. [20:30:47] mutante: yeah will end up reading them :-] [20:30:48] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:30:59] mutante: I was kind of expecting some magic command like: puppet module create template [20:31:05] mutante: will look at it next week [20:31:24] hashar: http://forge.puppetlabs.com/modules?q=jenkins&commit=Go :) [20:31:39] well [20:31:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:31:49] hashar: nothing with puppet is ever magical [20:33:40] mutante, aye [20:33:41] hm [20:33:57] do we need a data center doctor to take a look? [20:34:14] ottomata: send the doctor to 1007. yes. 
1002 give me another try [20:34:36] but 1007, just dont have any output ..so cant do much [20:34:56] New patchset: Asher; "moving ES writes from es1 (cluster23) to es2 (cluster24) and es3 (cluster25)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24587 [20:35:09] k [20:35:11] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:35:24] RobH, you around? [20:36:05] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/24587 [20:36:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:36:12] ottomata: sup? [20:36:45] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:37:19] mutante and I are having trouble with a couple of the analytics ciscos [20:37:32] RobH: analytics1007. we cant get any console output via mgmt. powercycle, but we never see output.. [20:37:35] one in particular just doesn't respond to powercycle or anything really [20:37:42] as opposed to other hosts in that range [20:37:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:37:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.869 seconds [20:37:57] hrmm [20:38:13] well, it should work just like all the others, you guys confirmed the bios settings are the same as the rest? 
[20:38:31] dont even see it booting before the OS [20:38:33] the ciscos, unlike the dell, have pretty much nothing i can see more than you can remotely [20:38:36] no Bios screen [20:38:48] unless we can check them via mgmt console commands [20:38:48] but i can of course check for actual hardware failure that may prevent boot entirely [20:39:01] im going to be in the datacenter on monday [20:39:12] if you wanna drop a ticket, i can pull and console it, confirm operation, etc [20:39:15] we will just create a ticket then [20:39:21] so the mgmt works, but cannot post? [20:39:21] thanks [20:39:43] yeah, we can log into mgmt [20:39:52] sounds right, cuz i tested mgmt on all hosts [20:39:55] yes, can connect to mgmt, sending power cycle command, then connecting to host.. from there no output at all [20:39:56] but i didnt post each [20:40:03] since the ciscos take like 5 minutes to post =P [20:40:11] yeaa, even more :) [20:40:20] its cause those analytics people have THAT much RAM :) [20:40:36] well, we heard they like ram so we put more ram in the ram. [20:40:37] and this must have worked at some point, because they have been installed before [20:40:51] yes, i actually installed that one :p [20:40:53] interesting, note that in ticket pls =] [20:42:14] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 249 seconds [20:42:14] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 249 seconds [20:42:32] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 266 seconds [20:42:59] PROBLEM - MySQL Slave Delay on es4 is CRITICAL: CRIT replication delay 293 seconds [20:43:26] PROBLEM - MySQL Slave Delay on es1002 is CRITICAL: CRIT replication delay 319 seconds [20:44:55] ottomata: RobH : RT-3582 ..done [20:45:15] cool, will poke it on monday [20:45:28] ottomata: if you want to add something to it.. 
go ahead [20:45:31] thanks Rob [20:45:41] PROBLEM - Apache HTTP on mw1018 is CRITICAL: Connection refused [20:50:48] looks good mutante, thanks [20:51:03] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:51:37] bah we need a pip class [20:51:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:53:42] New patchset: Hashar; "jenkins job builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:54:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24584 [20:54:41] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.617 second response time [20:54:53] boum, just discovered we don't want to use pip :/ [20:54:57] but need debian packages! [20:55:19] !log removing srv190 from apaches pool to use to test build as an imagescaler [20:55:29] Logged the message, notpeter [20:58:26] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [20:59:22] Change abandoned: Hashar; "Replaced by https://gerrit.wikimedia.org/r/24620" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24584 [20:59:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24620 [21:02:28] New patchset: Pyoungmeister; "setting srv190 as in imagescaler to test precise build" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24621 [21:03:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24621 [21:03:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24621 [21:06:14] PROBLEM - Host srv190 is DOWN: PING CRITICAL - Packet loss = 100% [21:06:32] RECOVERY - Host srv190 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [21:07:50] New patchset: Pyoungmeister; "re-adding srv190 mac to dhcpd.conf, as it's not decom and I want to abuse it for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24622 [21:08:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24622 [21:09:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24622 [21:10:08] PROBLEM - Apache HTTP on srv190 is CRITICAL: Connection refused [21:11:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:22] New patchset: Andrew Bogott; "If I'm using an arbitrary event, might as well give it an accurate name." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24623 [21:15:16] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24623 [21:26:11] PROBLEM - Host srv190 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:29] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:27:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [21:30:59] RECOVERY - Host srv190 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [21:32:59] New patchset: Pyoungmeister; "fix weird cp/paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24625 [21:33:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24625 [21:34:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24625 [21:38:17] binasher: so did fa_sha1(8) look reasonable after inspection? [21:39:59] PROBLEM - SSH on srv190 is CRITICAL: Connection refused [21:40:19] AaronSchulz: very much so, if the deprecated commonswiki.filejournal.fj_path_sha1 column is anything to go by [21:40:29] AaronSchulz: binasher: mw1017-1019 are yours to play with [21:40:33] they need db access.... [21:40:40] also, they're not in mediawiki-installations [21:40:51] so they might need to be sync'd by hand from time to time [21:41:24] binasher: did you comment on that patch? [21:41:37] AaronSchulz: of ~8.5 mil rows, 6095231 had a sha1. of those, 2322823 were unique [21:42:22] New patchset: Andrew Bogott; "Arbitrary respacing change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24626 [21:43:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24626 [21:43:28] from 3k different sample queries, using 4 characters would match an avg of 40.127 uniques per query. first 6 = 1.03 per, and first 8 was all unique [21:43:42] sweet [21:43:48] AaronSchulz: this has me wanting to do the same test against revision.rev_sha1 [21:43:53] we should probably start changing some other indexes [21:43:54] which i've started, but will take a lot longer [21:44:08] then i'm going to write up some guidelines from the results [21:44:25] enwiki rev_sha1 that is [21:44:28] New patchset: Cmjohnson; "adding Howie Fung to site.pp and admins.pp for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24627 [21:44:29] RECOVERY - SSH on srv190 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:45:21] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/24627 [21:48:17] mutante: https://gerrit.wikimedia.org/r/24627 [21:54:43] cmjohnson1: Syntax error at '.'; expected '}' at ./manifests/site.pp:2363 [21:54:58] cmjohnson1: the actual cause can be seen at the very end of all that spammy output [21:55:36] missing a , after dandreescu [21:56:16] k..thx [21:56:57] and after hfung it's a . [21:58:26] PROBLEM - NTP on srv190 is CRITICAL: NTP CRITICAL: No response from NTP server [21:59:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3291 [22:01:32] merging a change related to Nagios NRPE.. if Nagios starts complaining a lot after next run it will be that and need to restart the service.. watching it though [22:12:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.788 seconds [22:41:04] New review: Hashar; "good job :-) see inline comments. One change should be split in a different commit." 
[operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23059 [22:45:31] Change abandoned: Cmjohnson; "littered w/ syntax errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24627 [22:46:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:48:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [22:48:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [22:59:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.892 seconds [23:25:12] binasher: [23:25:12] CREATE TABLE object ( ROWID INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, created_at TEXT, [23:25:44] created_at TEXT [23:29:06] using sqlite there is such a fail that i don't think that even adds to it.. the python code should just float() those text columns [23:29:32] (ok it adds to it a bit) [23:32:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:29] it's worse than that [23:41:30] paravoid: all of the ms-be systems mount their drives with the nobarrier option, including the container drives [23:42:04] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [23:42:41] I think Ben (or ma rk?) followed the exact recommendations of the swift people [23:42:47] disabling barrier increases performance a bunch but is *only* ever recommended on systems with a bbu raid controller, or another type of storage that offers crash safe consistency.. otherwise you want it, because: "With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will [23:42:48] not occur."
[23:43:05] (from xfs.org) [23:44:32] barrier support was enabled by default from 2.6.17 onwards, disabling it on single sata drives will result in the sort of data corruption xfs was known for back in the day [23:45:05] swift does recommend it, but they also write swift. [23:45:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.339 seconds [23:46:47] yeah, the 0 sized files [23:47:03] I unfortunately remember [23:47:25] maybe the swift team thinks that doesn't matter since you'll have multiple copies of everything [23:49:24] watch a colo lose power and all container files with the most recent updates are lost at the same time [23:58:30] binasher: I noticed that http://ceph.com/docs/master/radosgw/swift/ has been filled in more [23:59:35] New patchset: Pyoungmeister; "reducing the number of threads on precise jobrunners from 12 to 5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24645