[05:24:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 05:21:10 PM UTC
[06:11:55] (PS1) Dzahn: install graphviz on bugzilla role [operations/puppet] - https://gerrit.wikimedia.org/r/103525
[06:13:01] (CR) Faidon Liambotis: "Didn't you *just* add this with Ifafb9a2b8f70a8b0c79facaf102745cfd5416b0c ? Can you please revert that instead (and try to be sure next ti" [operations/puppet] - https://gerrit.wikimedia.org/r/103496 (owner: Yurik)
[06:13:06] (CR) Faidon Liambotis: [C: -1] Zero partner config: Removed Opera support [operations/puppet] - https://gerrit.wikimedia.org/r/103496 (owner: Yurik)
[06:23:11] (PS1) Yurik: Revert "Added carrier 436-04 to zero" [operations/puppet] - https://gerrit.wikimedia.org/r/103526
[06:31:38] yurik: where did you call it "testing"?
[06:33:51] (PS2) Ori.livneh: Revert "Hack: cron job to clean up tifs from /tmp on app servers" [operations/puppet] - https://gerrit.wikimedia.org/r/103390
[06:35:05] how I hate that appservers parse images...
[06:35:12] how much*
[06:37:13] yeah. parsing wikitext is hard enough :)
[06:38:32] (CR) Faidon Liambotis: Revert "Added carrier 436-04 to zero" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/103526 (owner: Yurik)
[06:42:59] paravoid, last night i had a few hours with the partner, testing it
[06:43:13] they were inspecting their tcp dumps and seeing how it was routing
[06:43:20] it turned out they couldn't whitelist opera
[06:43:24] hence - revert
[06:43:36] just set up something in labs for that purpose
[06:43:53] production isn't for carrier testing
[06:44:07] paravoid, they were testing it against our production ips. how do you propose they test their system if they are whitelisting it...
[06:44:23] or are you saying they should set it up one way, test, and than switch???
[06:44:44] and they are also testing against our production URLs
[06:44:46] ....
[06:46:59] paravoid, i wish we were not tied in with the ops as we are, and i wish ESI would work, making the whole point moot. But the reality is - this is the only way we have at this point, so please help us out until we can cleanly separate varnish from backend
[06:47:17] this is completely besides the point
[06:47:32] it's a statement to which I completely agree, but it's also besides the point
[06:47:50] don't use production for testing, I'm not sure how can I say this in simpler words :)
[06:47:55] ok, how do you propose they test against labs?
[06:48:08] they need to whitelist all of our ips and often - URLs
[06:48:59] and at the same time - not allow any holes like some of them accidently did - like .*\.wikipedia\..* -- which obviously allowed for ppl to set up their own proxies and get free internet for everything
[06:57:49] so, paravoid, what are you proposing?
[07:23:26] (CR) Yurik: "There are no other ways to test other than in production. If you have any realistic alternatives, i will be happy to hear them, but noone " [operations/puppet] - https://gerrit.wikimedia.org/r/103526 (owner: Yurik)
[07:23:50] paravoid, ^
[07:31:52] (Abandoned) Yurik: Zero partner config: Removed Opera support [operations/puppet] - https://gerrit.wikimedia.org/r/103496 (owner: Yurik)
[07:40:31] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:35:24 PM UTC
[07:53:31] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:48:26 PM UTC
[07:54:31] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:50:14 PM UTC
[08:03:31] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:58:40 PM UTC
[08:25:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 05:21:10 PM UTC
[08:25:51] PROBLEM - MySQL Slave Running on db1034 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table _page_new already exists on query. Default database:
[08:26:21] meh
[08:27:51] RECOVERY - MySQL Slave Running on db1034 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[09:08:25] (CR) Faidon Liambotis: [C: 2 V: 2] Revert "Added carrier 436-04 to zero" [operations/puppet] - https://gerrit.wikimedia.org/r/103526 (owner: Yurik)
[10:41:31] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:35:24 PM UTC
[10:42:33] (CR) Ori.livneh: "In the interest of making the diff easier to read, I held back from applying cosmetic change. But since this is not getting reviews, might" [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793 (owner: Ori.livneh)
[10:42:48] (PS9) Ori.livneh: Rewrite for multithreading [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793
[10:54:12] (PS10) Ori.livneh: Rewrite for multithreading [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793
[10:54:31] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:48:26 PM UTC
[10:55:31] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:50:14 PM UTC
[11:04:31] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:58:40 PM UTC
[11:26:31] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 05:21:10 PM UTC
[12:56:21] apergos: PHP fatal error in /usr/local/apache/common-local/php-1.23wmf7/extensions/GlobalBlocking/SpecialGlobalBlock.php line 278:
[12:56:21] Call to a member function getPrefixedText() on a non-object
[12:56:56] is there a bug report for it yet?
[12:57:04] i don' know
[12:57:09] is it a know issue?
[12:57:30] I don't know about it but wikimedia-dev would be a more likely place
[12:57:40] since this is an mw issue
[12:57:47] thanks apergos
[12:57:53] sure
[12:58:14] do you know how would know about this in the dev team?
[12:58:47] no, you'll just have to ask
[12:59:01] ok
[13:24:12] I have been referred to this channel .... there are issues with labs ... the file system I am told
[13:24:25] is there anyone who can have a look and is willing to do so ?
[13:28:29] GerardM- Coren is the man for labs
[13:29:02] GerardM-: it is labstore4 is broken
[13:30:03] one man cannot provide 24*7 support
[13:30:12] does he not have a colleague ?
[13:30:30] (particularly not in the season when you are supposed to be jolly)
[13:30:33] GerardM-: it is xmess eve, and the wmf hired one
[13:30:54] still not ontop of all iirc
[13:31:22] anyhow, labstore3 came in action, and should be fixed
[13:31:42] !log reedy synchronized php-1.23wmf7/extensions/GlobalBlocking 'bug 58934 Icc47a2d6367c0b906e40e068635c9fda07108e0f'
[13:31:43] cool
[13:32:00] Logged the message, Master
[13:39:42] as far as I know an xfs repair fixed things up yesterday and labs should be fine
[13:39:46] is there a new problem?
[13:40:13] there is apergos
[13:42:31] PROBLEM - Puppet freshness on analytics1005 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:35:24 PM UTC
[13:43:26] can anyone have a look at: https://nl.wikipedia.org/w/index.php?title=Speciaal:Logboeken&type=block&page=User%3AHoogeveen123&uselang=en
[13:43:32] please note the "with an expiry time of 20:02, 1 January 1970"
[13:43:40] I didn't know wo could do time travel
[13:43:52] you dont need to repeat yourself across channels
[13:44:13] sorry
[13:44:26] the other channel seemed on vacation
[13:44:56] it's christmas eve in many locations
[13:44:56] well this channel is mainly for operations issues (problems with the servers)
[13:45:30] if no one is around right now, that should be reported as a bug
[13:46:52] matanya: what do you know about the labs issue?
[13:47:35] I don't see anything in the admin log
[13:47:40] apergos: i don't even seem to be able to log in, no ping
[13:47:42] no nothing
[13:48:05] well you said something about labstore3 being used, I don't know about that even
[13:48:18] actully there is ping now
[13:48:43] apergos: thats what i saw in my logs, let me scroll a sec
[13:50:56] apergos: i was wrong
[13:51:07] ok
[13:51:17] that is what you said. the current issue i see is : rm: cannot remove `catlib.py': Read-only file system
[13:51:26] for example
[13:52:04] read only? great
[13:52:37] yes, no one will break the system further :)
[13:53:49] (PS1) Cmjohnson: Adding some renamed analytics boxes to decommissioning.pp [operations/puppet] - https://gerrit.wikimedia.org/r/103547
[13:54:19] labstore4 is still the active storage server
[13:54:23] I just checked
[13:55:31] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:48:26 PM UTC
[13:56:31] PROBLEM - Puppet freshness on analytics1008 is CRITICAL: Last successful Puppet run was Mon 23 Dec 2013 04:50:14 PM UTC
[13:56:59] (CR) Cmjohnson: [C: 2] Adding some renamed analytics boxes to decommissioning.pp [operations/puppet] - https://gerrit.wikimedia.org/r/103547 (owner: Cmjohnson)
[13:57:20] Dec 24 06:17:01 labstore4 kernel: [55944.738322] XFS (dm-0): metadata I/O error: block 0x1924ac670 ("xfs_trans_read_buf_map") error 117 numblks 8
[13:57:21] nice
[13:57:29] ok well coren did an xfs repair yesterday
[13:57:38] apparently that didn't really get to the bottom of the issue
[13:57:41] so it failed again
[13:57:50] apergos: Seriously?
[13:57:51] well I don't know what else happened
[13:57:56] ah you are there
[13:57:58] yes, Coren
[13:58:06] Seriously
[13:58:14] I just got here. Coffee is fresh.
[13:58:17] I was about to say that I had no idea if you looked at it further
[13:58:18] ok
[13:58:29] well have your coffee, do what you need to do
[13:58:36] I'm just hearing about this now, it will wait a bit longr
[13:58:43] Wait, the NFS is up.
[13:59:32] I see nothing in today's syslog, only yesterdays' (well today at 6 am utc)
[13:59:50] I didn't see anything actually ro mounted either
[13:59:54] hi people
[14:00:21] AFAICT, everything is running fine. Where did you hear of this?
[14:00:51] Oh, wait, radonly filesystem? That's probably a project running on gluster.
[14:01:01] Which instance was that?
[14:01:22] how would i got about aquiring wikimedia logs of requests for Special:RecentChanges? i want to know which of the options are used most often (or at all)
[14:01:22] e.g. like https://en.wikipedia.org/w/index.php?title=Special:RecentChanges&hideliu=1&hidebots=0&hideanons=1 to show only bots, does anybody ever look at that?
[14:02:09] we have sampled logs (one of evry thousand hits)
[14:02:19] * Coren would have, often, if he knew about hideliu!
[14:02:53] as a regular wiki user I have certainly used that
[14:03:10] I"m sure others too, but as far as how often, dunno
[14:03:18] apergos: Where are you hitting the readonly filesystem?
[14:03:22] I"m not
[14:03:38] I'm just investigating matanya's report
[14:04:06] the only thing I have found so far is these xfs errors from 6 am UTC today (the latest ones, they stop after that)
[14:04:10] Coren: on tools mostly
[14:04:20] and since I don't know if you were working on it at that point, I know nothing :-D
[14:04:50] and bastion too Coren
[14:05:19] Coren: touch test
[14:05:20] touch: cannot touch `test': Read-only file system
[14:05:30] matanya: AFAICT, tools is working fine (that one's NFS)
[14:05:39] Ah! Bastion is gluster. Lemme go kick in its head.
[14:05:49] yes, now tools is ok
[14:05:59] but bastion sucks
[14:06:24] * matanya is in debugging mode today :)
[14:06:40] Yup. Sick gluster.
[14:07:32] when are replacing gluster?
[14:07:36] * Coren beats it up.
[14:07:45] matanya: We're not bringing gluster to eqiad.
[14:07:52] good start
[14:08:04] what instead?
[14:08:21] NFS. If you take out XFS, it's been rock solid for months now.
[14:09:38] with which FS?
[14:09:42] EXT?
[14:10:44] Yeah, ext4
[14:11:25] good, i like ext4 :)
[14:11:51] I was considering JFS for a while, but then I woke up.
[14:11:53] :-)
[14:12:18] To be fair, I really wish they'd sort the licensing crap out and get a good native zfs for linux.
[14:13:09] basstion gluster has woken up and shold be properly writable now.
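Editor's sketch (not part of the log): the `Read-only file system` errors reported above were found by attempting writes (`rm`, `touch`). On a POSIX host the mount flags can also be inspected without writing anything. The helper name `is_read_only` is made up for illustration. One caveat, stated as an assumption rather than a fact about the hosts in this log: a FUSE or network mount can return a read-only error on write without the flag being set, so an actual write attempt remains the definitive test.

```python
import os


def is_read_only(path):
    """Return True if the filesystem containing `path` is mounted read-only.

    Hypothetical helper for illustration; relies on POSIX statvfs mount flags.
    """
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_RDONLY)


if __name__ == "__main__":
    # Inspect the current directory, the same place `touch test` was tried.
    print(is_read_only("."))
```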
[14:13:47] btrfs
[14:17:41] (PS1) Cmjohnson: Removing puppet entries for db31|3|4|6|7 db47|9 db50|4|7 [operations/puppet] - https://gerrit.wikimedia.org/r/103550
[14:19:19] so those xfs errors form earlier today, are they from before or after you finished what you were doing, Coren?
[14:19:32] just so we know if something's still up
[14:20:13] btrfs is still not fully cooked
[14:20:27] I know, but it's getting there
[14:20:32] After, but they are a known issue with aborted readahead and recent kernels with XFS.
[14:20:51] Unrelated to the original issue, and mostly log noise.
[14:21:02] good to know
[14:21:04] thanks
[14:21:34] and there is zfs for linux, just user space
[14:21:37] (Post 3.4 kernels readahead agressively, and can cancel readaheads when they aren't deemed useful anymore but XFS still does its validation on the unread blocks (which are all 0))
[14:23:16] apergos: https://www.redhat.com/archives/dm-devel/2013-February/msg00104.html
[14:24:38] hmm
[14:31:02] (PS2) Cmjohnson: Removing puppet entries for db31|3|4|6|7 db47|9 db50|4|7 [operations/puppet] - https://gerrit.wikimedia.org/r/103550
[14:31:28] thanks coren, it's ok now
[14:41:23] (CR) Cmjohnson: [C: 2] Removing puppet entries for db31|3|4|6|7 db47|9 db50|4|7 [operations/puppet] - https://gerrit.wikimedia.org/r/103550 (owner: Cmjohnson)
[14:42:08] Coren: should labs tool work now ?
[14:42:20] I get
[14:42:24] Proxy Error
[14:42:25] The proxy server received an invalid response from an upstream server.
[14:42:27] The proxy server could not handle the request GET /widar/index.php.
[14:42:28] Reason: Error reading from remote server
[14:42:37] GerardM-: As far as I know, it has been working without problem.
[14:42:45] ok
[14:42:47] Ah, that's a problem with a specific tool. Lemme look at it.
[14:42:55] always blame the user :)
[14:42:56] in that case there are still problems ...
[14:43:08]
[14:44:01] hi andrewbogott
[14:44:40] 'morning
[14:46:01] I will be away for an hour ...
[14:46:04] GerardM-: I see it working intermitently, mostly capacity problems. It's still using the old-style apache though, and a switch to the lighttpd setup would solve that. Do you know who maintains it?
[14:46:19] yes ... Magnus
[14:46:26] that is the Widar tool
[14:46:50] is that what you are talking about Coren ?
[14:46:53] I can switch it to the new scheme easily enough, but I'd rather not do so without consulting with him first.
[14:47:06] he is in Germany for the holidays
[14:47:07] GerardM-: Yes; the error message you showed was related to that.
[14:47:13] * Coren ponders.
[14:47:26] he is happy when knowledgeable people work on his tool
[14:47:27] His setup is straightforward enough; lemme try it and see if it works.
[14:47:31] he explicitly told me that
[14:47:43] I think you qualify
[14:48:10] be back in an hour
[14:48:25] GerardM-: I just switch it. It should be fast and reliable this way.
[14:51:19] magic
[15:02:06] back in about 30 mins
[15:02:59] coren .. thank you
[15:03:05] I will inform Magnus
[15:03:17] do you mind if I blog about it ?
[15:05:54] did run it but it did not change anything
[15:06:03] will look into it when I am back
[15:27:22] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:31] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:39:11] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:43:23] back
[15:49:41] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:58:42] (CR) Qgil: "TTO you are of course right. I'm very sorry for the confusion." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103107 (owner: Yatinmaan)
[16:01:31] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:43] (PS3) Dan-nl: annotating-domain-whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102739
[16:12:01] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[16:28:41] apergos: Can you offer help with pxe-booting a cisco? All the bios settings look right to me but it just boots straight to the hdd anyway.
[16:29:01] geez I know next to nothing about those
[16:29:08] Oh, ok.
[16:29:23] I've done this a bunch of times before. Must be forgetting something :(
[16:29:30] I can look at the settings to see if another pair of eyes sees something you didn't
[16:29:40] but that's about it
[16:30:12] I'd appreciate it if you don't mind.
[16:30:31] I suppose another possibility is that it's trying pxe, failing to connect upstream and falling back on hdd.
[16:31:12] I don't see anything in the terminal output about that, but there's a lot ot read
[16:32:57] heya paravoid
[16:36:13] apergos: I think I found the issue.
[16:36:14] Dec 24 15:41:24 brewster dhcpd: DHCPDISCOVER from 88:43:e1:c2:99:8e via 208.80.154.131: network 208.80.154.128/26: no free leases
[16:36:32] So the networking is wrong :(
[16:37:11] ahh there we go
[16:37:16] i'm having network troubles too! :) but unrelated
[16:37:21] so not your settings and you weren't missing anything
[16:37:26] yeah, paravoid, I'm not so sure the analytics multicast stuff is fixe
[16:37:27] d
[16:37:50] apergos: So if I recall, that means the server is assigned to the wrong row?
[16:37:50] I still can't get multicast traffic across rows
[16:37:55] it depends
[16:38:03] whatever you did made some things better though
[16:38:06] in ganglia for sure
[16:38:09] hm.
[16:38:21] i'm not sure if the manual tests i'm doing (with iperf) are good tests
[16:38:37] not sure if they are supposed to work given whatever existing acl rules or network settings there are
[16:38:40] but they aren't working right now
[16:38:42] it probably means that the host ip that's assigned doesn't match the vlan the switch port is in
[16:38:44] and some ganglia data is stale
[16:38:45] *probably*
[16:39:58] ottomata: I would try tcpdump of the actual packets being sent/received by gmond
[16:40:08] well, and virt1002 doesn't even have a non-mgmt entry in dns
[16:40:08] that's guaranteed to tell you what's going on
[16:40:10] um
[16:40:22] if you have root on those boxes
[16:40:26] I'm so confused by all this
[16:40:38] yeah, i'm watching that, starting to wonder if this is multicast ttl being wrong again
[16:40:56] is virt1002 one of the ... what is it. one of the reassigned boxes of some sort?
[16:41:16] if it was converted from some old name then we can see if there is mgmt for the old name
[16:42:15] hello
[16:42:30] apergos: according to linux-host-entries.ttyS0-115200 I have virt1001 through 1009. And I'm told that all but 1009 are in row b.
[16:42:36] But I'm not sure what the reality is...
[16:42:51] hi paravoid, i'm looking into this, i *might* know what's wrong
[16:42:51] but
[16:42:54] first
[16:42:58] I could really use someone who understands this (including vlans) to audit these 9 boxes and figure out what's happening.
[16:43:30] weren't a bunch of these just renamed from analytics hosts or something?
[16:43:31] if I start a multicast listener in Row B and in Row C
[16:43:36] apergos: yes.
[16:43:40] in an arbitrary multicast group
[16:43:41] ugh
[16:43:45] (i'm doing 239.192.1.51)
[16:43:49] well that is probably part of it
[16:43:54] should I be able to send multicast packets to that grou
[16:44:01] and see them on the listeners in each row?
[16:44:44] " Removing old analytics dns files (an1001,1002,1005,1006,1008). Replacing mgmt entries with virt100[1-3][7-8]."
[16:44:54] conspicuously missing is virt1002
[16:44:55] hm
[16:44:58] :-D
[16:45:04] ottomata: no
[16:45:09] ottomata: depending on port
[16:45:15] oh?
[16:45:24] is that a network rule?
[16:45:38] yes, the routers have firewalls for the analytics VLANs
[16:45:51] i thought we had it so that analytics vlans could send any traffic to each other
[16:46:01] it was just that they couldn't do that out of any of the analytcis vlan
[16:46:03] s
[16:46:11] in october virt1002 became pc1002
[16:46:52] I see mgmt for virt1002 actually but no other ip address
[16:47:02] Yeah, but I think that more recently a different analytics box was reassigned to virt1002
[16:47:07] ohhhhhh hmm, ok paravoid, it works on port 8649
[16:47:17] ottomata: show configuration firewall family inet filter analytics-in4
[16:47:19] so now we are back to 'what ports are these on anyways'
[16:47:22] ok yeah, then, i think this is a multicast ttl issue with hadoop's ganglia lib
[16:47:26] it's fairly self-explanatory
[16:47:27] I totally don't understand why when pc and analytics swiped these boxes they took them from the /middle/ of the range.
[16:47:53] errr, paravoid, where do I run that?
[16:47:57] cr1-eqiad
[16:48:15] analytics1002.mgmt.eqiad.wmnet may have become virt1002.mgmt.eqiad.wmnet
[16:48:20] lemme see if there is an rt ticket
[16:49:25] apergos: there is, I made it. But now I can't find it
[16:49:29] yep but it doesn' t tell me which went to what
[16:49:35] https://rt.wikimedia.org/Ticket/Display.html?id=6546
[16:49:37] there's that
[16:49:43] Apparently I'm terrible and searching rt, I can never find what i'm looking for
[16:50:04] I'm not fond of the search facility
[16:50:41] ok, here was my original request: https://rt.wikimedia.org/Ticket/Display.html?id=6390
[16:51:29] oh bonded ports, more cables...
[16:51:36] did that happen?
[16:52:03] I think so. Rob created a bunch of other tickets for the subtasks… are they attached to that first ticket somehow?
[16:52:20] ah here's the secondary ntworking one
[16:52:28] yeah, 6481 and 6482
[16:52:44] 'did not make any changes to vlans'
[16:52:50] Yeah, I just read that
[16:53:03] So that sort of explains why virt1001 doesn't work, doesn't explain why there's no entry at all for virt1002
[16:53:32] I guess I'll just add virt1002 myself.
[16:53:37] And then… who understands about vlans?
[16:54:24] yes, I would say add an entry for any that are missing
[16:54:50] I can slug through it somewhat slowly
[16:54:55] (vlans)
[16:55:02] hm, no virt1003 either
[16:55:11] in the sf tz, leslie but dunno if she will be around
[16:55:24] mark might still be up?
[16:55:31] (CR) BryanDavis: "Trivial documentation typos" (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/103080 (owner: Aaron Schulz)
[16:55:55] I haven't seen signs of him being active today, he might be off
[16:56:42] what's up
[16:56:56] uhhh
[16:57:11] you booted virt1001?
[16:57:16] or tried to?
[16:57:35] it still must have its old name, and be in stored configs
[16:57:46] I'll double check all that later and clean it up I guess
[16:58:09] I'm trying to pxe-boot virt1001. That fails, and it comes up thinking it is analytics1001 (which it used to be.)
[16:58:31] The pxe boot is failing because of 'no free leases'
[16:58:39] paravoid, now you're mostly caught up :)
[16:58:45] andrewbogott is trying to sort out virt 1001,2,3,8,9 which were old analytics1001,2,5,8
[16:59:03] thought I read "no free lazers", luckily I was wrong
[16:59:17] https://rt.wikimedia.org/Ticket/Display.html?id=6482 this was done but no changes to vlan
[16:59:45] (PS1) Andrew Bogott: Add the mysteriously-missing virt1002 and virt1003 entries. [operations/dns] - https://gerrit.wikimedia.org/r/103563
[17:00:32] Man, when I see such an obvious whole in a list like ^ I can't help but wonder if someone knows something that I don't know...
[17:00:50] s/whole/hole/
[17:00:53] and you were told that 1001,2,3,8 are in row b, what about 1009?
[17:01:07] I believe that 1009 was just moved to b as well.
[17:01:29] ticket 6529
[17:01:37] the free leases is for public1-b-eqiad
[17:01:44] virt1001 is in private address space
[17:01:53] got it
[17:02:19] or, not moved, it was always there and just mislabled previously...
[17:02:34] "no free leases" is just a weird error message that means "I couldn't allocate this address on this subnet"
[17:02:48] paravoid: So you think that's a red herring?
[17:03:06] no, it's a misconfiguration for sure
[17:03:15] ok
[17:03:20] I'm assuming virt1001 is supposed to be on private address space, correct?
[17:03:21] I figured it didn't really have to do with # of leases
[17:03:26] yep
[17:03:35] 1000 is a public host, 1001-1009 should be private
[17:03:59] when things were renamed and new ips given out, the vlan configs weren't updated
[17:04:03] says on the ticket
[17:04:55] the switch port is configured on the wrong VLAN then
[17:08:38] So I guess I wait for LeslieCarr to come to work (if she's working today)
[17:08:49] to change the VLAN?
[17:08:51] no, i can do it
[17:09:03] chris johnson usually does these
[17:09:16] is it more than just virt1001?
[17:10:00] paravoid: It's probably for all of these: https://rt.wikimedia.org/Ticket/Display.html?id=6482
[17:10:17] If chris usually does it, why did he close that ticket without doing it I wonder?
[17:10:32] shit happens :)
[17:11:05] I need to have some wikidata related config for wikipedia beta updated. Anyone here that can help me with this?
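Editor's sketch (not part of the log): the diagnosis above rests on two checks, namely that the relay address in the dhcpd message lies in a public subnet, while virt1001 is meant to be on private address space. Both can be reproduced with Python's standard `ipaddress` module. The addresses are taken from the dhcpd line quoted in the log; the function name `describe_relay` is hypothetical.

```python
import ipaddress


def describe_relay(relay, subnet):
    """Report whether `relay` lies inside `subnet` and whether it is a private address."""
    addr = ipaddress.ip_address(relay)
    net = ipaddress.ip_network(subnet)
    return {"in_subnet": addr in net, "is_private": addr.is_private}


if __name__ == "__main__":
    # Values from the "no free leases" message quoted in the log.
    print(describe_relay("208.80.154.131", "208.80.154.128/26"))
```

A relay address that is inside the subnet but not private is consistent with the conclusion reached in the channel: the request arrived on a public VLAN, so the switch port, not the lease pool, was the thing to fix.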
[17:11:11] analytics2,5,6 and 8 already had private ips
[17:11:25] ge-3/0/7 up down virt1008:eth0
[17:11:25] ge-3/0/15 up up virt1008 eth0
[17:11:26] jeroendedauw, it's a no deploys week
[17:11:27] yay...
[17:11:41] ugh
[17:11:52] here should be two ports but not like that
[17:11:54] MaxSem: the change I want to happen is that it stops pulling Wikibase and WikibaseDataModel from master
[17:11:57] *there
[17:12:03] MaxSem: It should stick to the current HEAD
[17:12:09] Else things will break
[17:12:15] jeroendedauw, poke hashar
[17:12:33] * jeroendedauw goes on a quest to find the hashar
[17:12:54] MaxSem: no one else that can do this?
[17:13:00] no idea
[17:13:09] maybe Krinkle|detached
[17:13:56] what's so wrong with master WD?
[17:14:17] Nothing
[17:14:21] Yet
[17:14:25] they're all over the place
[17:14:26] all wrong
[17:14:31] hehe
[17:14:57] MaxSem: we will be introducing some new dependencies, beta does not have them, so will break
[17:15:14] paravoid: Are those software things or do they recall a dc visit?
[17:15:17] you're scaring me
[17:15:23] andrewbogott: just config change
[17:15:24] Obviously it should be updated to have the dependeencies, but we decided to not block on this
[17:15:28] *changes
[17:15:37] oh, great. thank you!
[17:18:29] MaxSem: the fragility of how dependencies are managed in WMF-verse scares me
[17:18:35] meanwhile… apergos, does https://gerrit.wikimedia.org/r/#/c/103563/ look safe to you?
[17:20:13] jeroendedauw: so, just fyi, this is a holiday break, please don't enact those changes that will break beta before someone has been able to deal with the dependency changes you introduce.
[17:20:14] labs-hosts1-b-eqiad
[17:20:15] yes
[17:20:57] andrewbogott
[17:21:04] 'k, thanks
[17:21:14] jeroendedauw: aren't we just pulling from one central repo that you manage manually now, anyways?
[17:21:52] (CR) Andrew Bogott: [C: 2] Add the mysteriously-missing virt1002 and virt1003 entries. [operations/dns] - https://gerrit.wikimedia.org/r/103563 (owner: Andrew Bogott)
[17:22:01] from the etherpad:
[17:22:01] == October 24 ==
[17:22:01] * deployment from single repo - ok now? Can we get a commitment to try this for the next deployment?
[17:22:04] ** Yes all good to go. Let's do this for the deployment on test on Oct 31st.
[17:22:31] so, you all (wikidata) should be in charge of that single repo, and thus we (WMF) don't care about your dependencies, is that not true?
[17:23:26] cmjohnson1: hey
[17:23:37] hey
[17:23:55] (PS1) Cmjohnson: Removing dns entries for decom'd db's rt6512 [operations/dns] - https://gerrit.wikimedia.org/r/103564
[17:24:01] cmjohnson1: asw-b-eqiad ge-3/0/7 has a description of "virt1008:eth0", which is the same as ge-3/0/15
[17:24:13] cmjohnson1: the former is down though, can you confirm that it's unplugged?
[17:26:29] paravoid: not in dc but that is confirmed. That virt1008 was from when I was hoping to get an1007 changed to virt1008. I will fix
[17:26:36] description now
[17:26:37] don't yet
[17:26:45] I'm making a bunch of other changes
[17:26:46] vlans
[17:26:56] ah..okay..then if you wanna make the change while you are there
[17:27:13] will do
[17:27:13] jeroendedauw: can you clarify?
[17:29:08] (CR) Cmjohnson: [C: 2] Removing dns entries for decom'd db's rt6512 [operations/dns] - https://gerrit.wikimedia.org/r/103564 (owner: Cmjohnson)
[17:33:08] jeroendedauw: ??
[17:33:34] jeroendedauw: I'm willing to be told I'm wrong :) I'm just curious what the state of things is.
[17:35:51] cmjohnson1: what should I set it to?
[17:36:08] "available"
[17:36:39] ok
[17:36:58] thank you
[17:39:06] andrewbogott: try again
[17:40:02] ok, rebooting virt1001
[17:41:21] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:35] wait, what?
[17:41:40] ottomata: that you?
[17:41:41] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:41] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:41] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:41] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:46] greg-g: oh, is beta now using the single repo?
[17:41:49] or is that me misconfiguring the vlans?
[17:42:13] jeroendedauw: that's my question, I guess. I hope it is, as any changes on production should be made to betacluster as well
[17:42:19] greg-g: the change I am asking for really ought to be trivial
[17:42:35] jeroendedauw: were you looking at the relevant config on production? the labs variant just has -labs in the file name
[17:42:38] greg-g: where can I see if this single repo is used?
[17:42:43] see last sentence
[17:42:48] paravoid: analytics1001 is me; virt1001 still thinks it's analytics1001.
[17:42:59] I don't know about the other ones...
[17:42:59] greg-g: I have not looked at any config - I do not know where it is
[17:43:30] oh, paravoid, i don't think so, those are nodes that are being mobied
[17:43:30] right?
[17:43:36] I have no idea
[17:43:37] yeah andrewbogott is the culprit! :)
[17:43:48] jeroendedauw: https://git.wikimedia.org/tree/operations%2Fmediawiki-config.git
[17:44:02] see the wmf-config directory
[17:44:09] so it was caused by my change, seems to match with nodes that are not in "b" anymore
[17:44:09] those are ciscos being given up for...labs?
[17:44:12] what a mess
[17:44:13] paravoid: I don't have the exact numbers in my head, but the other hosts you just modified were formerly analytics boxes as well.
[17:44:21] ok
[17:44:36] first decom, then repurpose next time please :)
[17:44:40] So it seems correct. Although maybe we need to tell monitoring that those analytics boxes don't exist anymore?
[17:44:58] jeroendedauw: I'm not sure we have the single repo situation on production, I see this in extensionlist: [17:45:01] $IP/extensions/Wikibase/client/WikibaseClient.php [17:45:03] they need to be decomissioned per our process [17:45:04] $IP/extensions/WikibaseDataModel/WikibaseDataModel.php [17:45:06] $IP/extensions/Wikibase/lib/WikibaseLib.php [17:45:09] $IP/extensions/Wikibase/repo/Wikibase.php [17:45:34] greg-g: we do not have the single instance thing on production, unless things actually happened, which I'd be VERY susprised about [17:45:43] paravoid: OK… I'm not sure why the process was so fraught this time. All I did was create an original ticket saying, I want to those servers for labs. [17:45:56] jeroendedauw: so, do you have the single repo set up? we can change our config once that is done on your side [17:47:01] basically: step 1) set up that single repo (may still need a gerrit repo created? not sure) step 2) do your team's testing with it 3) switch to it from the multi-repo state on Beta Cluster 4) test 5) switch on production during next deploy window 6) rejoice [17:47:08] greg-g: what exactly should I look at in this git repo and the mentioned dir? [17:47:24] jeroendedauw: extensionlist, as I said ;) [17:47:47] paravoid: Is there something I need to do now to prevent folks from getting paged? [17:48:11] greg-g: yes, I understand those steps. However what I want to have done now is labs simply no longer update Wikibase and WikibaseDataModel to latest master automatically [17:48:23] Just labs [17:48:36] so, here's the thing about labs: that's the point of it :) [17:48:43] Or rather all the testing things that automatically pull master [17:48:59] so, what I'm saying is: don't commit your breaking changes until the situation is fixed (as you all said you'd do back in October) [17:49:02] greg-g: I am well aware of that. 
This is a temporary measure [17:49:12] Which allows us to be not blocked on labs restrictions [17:49:34] what blocked you from doing what you said you'd do in October (making the single repo)? [17:49:39] I want to help push this through [17:49:39] those don't page us [17:49:53] paravoid: Also, virt1001 is still not getting an IP :( Brewster still says 'no free leases' [17:50:29] greg-g: somehow this single repo has not happened yet. I do not understand why. Some people seem extremely incapable of getting things done [17:50:31] basically: I'm not going to accept an emergency change that wouldn't have been needed if you all had done what you said you would do in October. [17:50:40] jeroendedauw: so, do it then. [17:50:55] jeroendedauw: have you started on it? have you requested help from anyone anywhere? [17:50:59] what's the specific blocker? [17:51:32] if there's a specific blocker on our side, I can push on it, but I don't know what you have done so far [17:51:43] andrewbogott: if it's still happening, then descriptions are probably wrong [17:51:44] greg-g: https://gerrit.wikimedia.org/r/#/c/95996/ [17:52:25] andrewbogott: since I'm entirely sure I moved virt1001 to the right vlan [17:52:35] jeroendedauw: that doesn't implement the single repo situation [17:52:42] jeroendedauw: that's something different (but related) [17:52:49] greg-g: ? [17:52:55] paravoid: descriptions = in the dns repo? [17:52:59] https://gerrit.wikimedia.org/r/#/c/95996/12/wmf-config/extension-list-wikidata [17:53:08] andrewbogott: no, on the switch ports [17:53:27] andrewbogott: can you copy the error you're seeing now? [17:53:30] jeroendedauw: there are 3 extensions there, the idea was that it'd be one, no? [17:53:31] is it still 208.80...? [17:53:44] greg-g: yeah. idk what that is about [17:54:03] hoo: you too :) [17:54:03] No, private now. [17:54:04] DHCPDISCOVER from 88:43:e1:c2:99:8e via 10.64.20.3: network 10.64.20/24: no free leases [17:54:10] ffs, why is this so difficult? 
I got this shit up and running over half a year ago on TravisCI >_> [17:54:11] jeroendedauw: so, then let's figure out what needs to happen next, if I start a conversation on wikidata-tech, can you reply with what you know? [17:54:15] (absolutely not this channel, heh.) [17:54:33] jeroendedauw: hi there, running a big website isn't as easy as turning on travisci [17:54:48] so, can you commit to helping me help you? [17:55:04] andrewbogott: oh, then different error [17:55:23] yes, apologies, apparently I was reading right-to-left [17:56:23] paravoid: those analytics boxes were decom'd accordingly. I even disabled notifications in icinga. not sure why they would report still [17:58:20] (PS1) Faidon Liambotis: dhcp: virt1001 is .eqiad.wmnet not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/103568 [17:58:22] andrewbogott: paravoid : need me for anything ? [17:58:26] or got it all yourself :) [17:58:28] nah [17:58:28] greg-g: switching the whole deployment thing to using a single repo is going to take some time. This moves extremely slowly, so I expect this to not be done in the next week, even if we both try to push it now [17:58:46] (CR) Faidon Liambotis: [C: 2 V: 2] dhcp: virt1001 is .eqiad.wmnet not .wikimedia.org [operations/puppet] - https://gerrit.wikimedia.org/r/103568 (owner: Faidon Liambotis) [17:59:20] greg-g: we however want to switch our dev process over to using the single repo approach. 
And we want to do this now as there is very little dev going on now, and we are still far from our next deployment [17:59:33] greg-g: the ONLY blocker is that beta will break by already pulling this [17:59:43] So it should just stick to the current version for now [17:59:49] That is all I am asking to have done [18:00:21] jeroendedauw: the thing of the matter is: you all promised to do this single repo many months ago, not doing it then shouting 'emergency, we must change our dev style NOW' isn't responsible [18:00:48] greg-g: so in whatever loop over the extension list that executes "git pull", one just needs to add a single if statement, so the two repos in question get skipped [18:00:53] andrewbogott: fixed, see above; try again now :) [18:01:09] greg-g: we are not shouting to have this deployed [18:01:20] you want me to undeploy it? [18:01:24] greg-g: in fact I am asking you to stop deploying our latest code [18:01:38] I want you to stop deploying master [18:01:56] greg-g: oh, are you asking me if I want to have Wikidata go off the test site altogether? [18:02:27] so, let me write this email, please respond there so others can participate [18:04:11] greg-g: can you answer my last question? [18:05:48] (PS4) Aaron Schulz: Make scap transport CDB files via JSON [operations/puppet] - https://gerrit.wikimedia.org/r/103080 [18:07:59] jeroendedauw: my question ("want to undeploy") was in reference to "not shouting to have this deployed" because it's an on/off thing on Beta Cluster. If it's on production, it's on the beta cluster pulling from master. That's the point of it. If there are breaking changes needed to be done, we should deal with them, but the timing is bad because of holiday breaks and the issue wouldn't be an issue if the single repo was done. 
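A minimal sketch of the "single if statement" idea from [18:00:48], assuming an update script that loops over extension directories and runs "git pull" in each. The function names, the directory layout and the injectable `run` parameter are hypothetical illustrations and are not taken from the actual Beta Cluster update scripts.

```python
import subprocess
from pathlib import Path

# The two repositories named in the discussion; everything else here is an assumption.
SKIP = frozenset({"Wikibase", "WikibaseDataModel"})


def repos_to_update(extension_names, skip=SKIP):
    """Return the extension names that should still be pulled, in order."""
    return [name for name in extension_names if name not in skip]


def update_extensions(extensions_dir, extension_names, skip=SKIP, run=subprocess.run):
    """Run `git pull` in every extension directory except the skipped ones.

    `run` is injectable so the loop can be exercised without touching git.
    """
    updated = []
    for name in repos_to_update(extension_names, skip):
        run(["git", "pull"], cwd=Path(extensions_dir) / name, check=True)
        updated.append(name)
    return updated
```

Keeping the filter in its own function means the skip list can be checked without a working tree; as greg-g points out, pinning those repos would still leave beta diverging from what production expects.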
Let's try to fix t [18:10:04] greg-g: ok, I did not realize beta is so bound to actual deployment [18:10:44] it's a required pre-step, yeah [18:10:50] greg-g: do you think this will likely be done in the next few days if we push it? [18:10:53] paravoid: installing! Only eight more to go :) [18:10:56] andrewbogott: :-) [18:11:12] paravoid: I'm not seeing "official" deb for logstash that you configured in reprepro yesterday when I use apt in labs. I'm still seeing the deb that Ori uploaded. I think that this is due to version numbers. The upstream version is "1.2.2-debian1" and the version Ori published is "1.2.2-8". [18:11:37] jeroendedauw: depends if someone who knows enough about beta cluster takes time away from the holidays (probably not hashar, he's out and solidly out, maybe Krinkle|detached ?) [18:11:43] bd808: no, it's just because I didn't run "reprepro update" because I wanted to do it after Nik confirmed it's okay for him [18:11:58] paravoid: Ah. Okey doke. [18:12:03] bd808: as he didn't comment on the changeset after my comment :) [18:12:21] (PS11) Ori.livneh: Rewrite for multithreading [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793 [18:13:31] jeroendedauw: thanks, btw, for discussing this and trying to work through it [18:13:44] (CR) Ori.livneh: [C: 2 V: 2] Rewrite for multithreading [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793 (owner: Ori.livneh) [18:31:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:33:11] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [18:34:01] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:57] paravoid: The logstash 1.2.2 deb uses a sysv init script. Should I stick with using their script in the puppet config or would it be ok to replace with an upstart script? 
[18:35:01] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [18:37:30] Actually their sysv init script is pretty decent. I think I'll just use it [18:37:42] yeah, no reason to go our own way imho [18:37:54] (but I haven't looked at their init script) [18:38:57] I had been looking at the upstart script they provide in git and it is pretty lame, but the debian package uses a sysv script and it is nicely configurable [18:42:31] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [18:43:22] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.54 ms [18:44:30] (PS13) Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 [18:59:26] (PS14) Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 [18:59:42] greg-g: ^ [19:01:41] aude: can you reply to that email thread about it so that those who can potentially review/merge can see? [19:02:38] * greg-g is on a call right now [19:03:03] done [19:03:45] i think this is all we need ... changes absolutely nothing for production, only changes beta [19:04:47] * greg-g nods [19:07:03] we might want to tweak with localisation stuff, but for now it stays same except in new files [19:07:19] so beta / production can have different [19:10:43] (CR) Aude: "ideally extension-list-wikidata is something that can be managed in our new git repo." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude) [19:20:18] (PS5) Aaron Schulz: Make scap transport CDB files via JSON [operations/puppet] - https://gerrit.wikimedia.org/r/103080 [19:20:26] (PS6) Aaron Schulz: Make scap transport CDB files via JSON [operations/puppet] - https://gerrit.wikimedia.org/r/103080 [19:37:12] (Abandoned) Jdlrobson: Disable VisualEditor experimental mode in mobile [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102188 (owner: Jdlrobson) [19:45:41] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:45:41] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:44] ^ that's me! [19:50:54] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [19:52:51] PROBLEM - SSH on virt1007 is CRITICAL: Connection refused [19:53:13] PROBLEM - puppet disabled on virt1007 is CRITICAL: Connection refused by host [19:53:13] PROBLEM - Disk space on virt1007 is CRITICAL: Connection refused by host [19:53:13] PROBLEM - DPKG on virt1007 is CRITICAL: Connection refused by host [19:53:22] PROBLEM - RAID on virt1007 is CRITICAL: Connection refused by host [19:56:29] paravoid: did you do your magic for virt1009 as well? I'm seeing a similar failure for that one now. DHCPDISCOVER from 88:43:e1:c2:50:7e via 10.64.20.3: network 10.64.20/24: no free leases [19:56:36] That's a box that was formerly analytics1008. 
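When the same "no free leases" message recurs for several hosts, it can help to pull the MAC address, relay and network out of each log line so they can be matched against the host being installed. The sketch below assumes only the message layout quoted in this log; the `NoFreeLeases` type and `parse_no_free_leases` function are hypothetical helpers, not part of dhcpd or of any tooling on brewster.

```python
import re
from typing import NamedTuple, Optional


class NoFreeLeases(NamedTuple):
    mac: str      # client hardware address
    relay: str    # address the request was relayed via
    network: str  # network reported as having no free leases


# Pattern written against the lines quoted in the log; other dhcpd messages are ignored.
PATTERN = re.compile(
    r"DHCPDISCOVER from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"via (?P<relay>\S+): network (?P<network>\S+): no free leases",
    re.IGNORECASE,
)


def parse_no_free_leases(line: str) -> Optional[NoFreeLeases]:
    """Return the parsed fields, or None if the line is not a 'no free leases' message."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return NoFreeLeases(match["mac"].lower(), match["relay"], match["network"])
```

For the line quoted at [19:56:29] this yields the MAC 88:43:e1:c2:50:7e on network 10.64.20/24, the pair that has to correspond to a host entry before the server will hand out an address.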
[20:02:30] (PS1) Ottomata: Adding run_interval and log_level parameters [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103577 [20:02:45] (CR) Ottomata: [C: 2 V: 2] Adding run_interval and log_level parameters [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103577 (owner: Ottomata) [20:05:21] PROBLEM - NTP on virt1007 is CRITICAL: NTP CRITICAL: No response from NTP server [20:05:34] (PS1) Ottomata: Added jmxtrans.default.erb template [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103578 [20:05:44] (CR) Ottomata: [C: 2 V: 2] Added jmxtrans.default.erb template [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103578 (owner: Ottomata) [20:07:08] (PS1) Ottomata: Subscribing jmxtrans service to /etc/default/jmxtrans [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103580 [20:07:26] (CR) Ottomata: [C: 2 V: 2] Subscribing jmxtrans service to /etc/default/jmxtrans [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103580 (owner: Ottomata) [20:07:53] (PS1) Ottomata: Using more frequent run_interval and log_level of info for jmxtrans [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/103582 [20:08:38] (CR) Ottomata: [C: 2 V: 2] Using more frequent run_interval and log_level of info for jmxtrans [operations/puppet/kafka] - https://gerrit.wikimedia.org/r/103582 (owner: Ottomata) [20:09:58] andrewbogott: virt1009? there's no dns entry for it, how can that possibly work? [20:10:18] (PS1) Ottomata: Updating jmxtrans and kafka modules for more frequent jmxtrans run interval [operations/puppet] - https://gerrit.wikimedia.org/r/103583 [20:10:34] (CR) Ottomata: [C: 2 V: 2] Updating jmxtrans and kafka modules for more frequent jmxtrans run interval [operations/puppet] - https://gerrit.wikimedia.org/r/103583 (owner: Ottomata) [20:11:00] apergos: you're right, I need to add it. 
[20:15:24] and… virt1008 is in here twice. *scowl* [20:16:23] oops :-D [20:17:08] (PS1) Andrew Bogott: Assign one of virt1008's IPs to virt1009. [operations/dns] - https://gerrit.wikimedia.org/r/103584 [20:24:44] (CR) Andrew Bogott: [C: 2] Assign one of virt1008's IPs to virt1009. [operations/dns] - https://gerrit.wikimedia.org/r/103584 (owner: Andrew Bogott) [20:37:39] (PS1) Ottomata: Moving jmx port variables into defaults.pp [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103585 [20:37:57] (CR) Ottomata: [C: 2 V: 2] Moving jmx port variables into defaults.pp [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103585 (owner: Ottomata) [20:39:53] (PS1) Ottomata: Adding namenode/jmxtrans.pp for sending NameNode jmx metrics [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103586 [20:41:44] (PS2) Ottomata: Adding namenode/jmxtrans.pp for sending NameNode jmx metrics [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103586 [20:42:31] (PS3) Ottomata: Adding namenode/jmxtrans.pp for sending NameNode jmx metrics [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103586 [20:43:19] (CR) Ottomata: [C: 2 V: 2] Adding namenode/jmxtrans.pp for sending NameNode jmx metrics [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103586 (owner: Ottomata) [21:20:59] (PS1) coren: Tool Labs: package installs [operations/puppet] - https://gerrit.wikimedia.org/r/103592 [21:26:27] (CR) coren: [C: 2] Tool Labs: package installs [operations/puppet] - https://gerrit.wikimedia.org/r/103592 (owner: coren) [21:27:44] ok… this time I see lots of DHCPOFFER on 10.64.20.9 to 88:43:e1:c2:5b:fa via 10.64.20.3 and DHCPOFFER on 10.64.20.9 to 88:43:e1:c2:5b:fa via 10.64.20.2 [21:27:53] but the host hangs rather than accepting the IP [21:53:33] LeslieCarr, if you are still working, can you check out the vlan settings for virt1009 (aka 88:43:e1:c2:50:7e) and verify that they make 
sense? [21:57:31] * andrewbogott hears the distinctive echo of an empty chatroom [22:03:58] (PS1) Ottomata: Allowing jmx to be passed in as parameter, set custom title, and use group_prefix [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103599 [22:03:59] echo echo echo [22:04:12] (CR) Ottomata: [C: 2 V: 2] Allowing jmx to be passed in as parameter, set custom title, and use group_prefix [operations/puppet/jmxtrans] - https://gerrit.wikimedia.org/r/103599 (owner: Ottomata) [22:07:01] (PS1) Andrew Bogott: Provide emptyish manifests for virt100[1-9] and labnet1001. [operations/puppet] - https://gerrit.wikimedia.org/r/103600 [22:08:19] (CR) Andrew Bogott: [C: 2] Provide emptyish manifests for virt100[1-9] and labnet1001. [operations/puppet] - https://gerrit.wikimedia.org/r/103600 (owner: Andrew Bogott) [22:09:52] (PS1) Ottomata: Removing bad jmxtrans objects, adding log_level and run_interval params [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103601 [22:09:53] (PS1) Ottomata: Adding jmxtrans/resourcemanager.pp [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103602 [22:10:17] (CR) Ottomata: [C: 2 V: 2] Removing bad jmxtrans objects, adding log_level and run_interval params [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103601 (owner: Ottomata) [22:10:30] (CR) Ottomata: [C: 2 V: 2] Adding jmxtrans/resourcemanager.pp [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103602 (owner: Ottomata) [22:16:53] (PS1) Ottomata: Removing manual inclusion of jmxtrans class [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103603 [22:17:07] (CR) Ottomata: [C: 2 V: 2] Removing manual inclusion of jmxtrans class [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103603 (owner: Ottomata) [22:21:39] (PS1) Ottomata: Removing no longer used parameters from jmxtrans classes [operations/puppet/cdh4] - 
https://gerrit.wikimedia.org/r/103604 [22:21:47] (CR) Ottomata: [C: 2 V: 2] Removing no longer used parameters from jmxtrans classes [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/103604 (owner: Ottomata) [22:30:31] PROBLEM - Puppet freshness on virt1007 is CRITICAL: Last successful Puppet run was Tue 24 Dec 2013 07:29:50 PM UTC [22:58:37] (PS2) BryanDavis: [WIP] Logstash puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/100395 [22:58:46] (CR) jenkins-bot: [V: -1] [WIP] Logstash puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/100395 (owner: BryanDavis) [23:09:04] (PS3) BryanDavis: [WIP] Logstash puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/100395 [23:09:49] (CR) jenkins-bot: [V: -1] [WIP] Logstash puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/100395 (owner: BryanDavis) [23:11:31] (PS4) BryanDavis: [WIP] Logstash puppet class [operations/puppet] - https://gerrit.wikimedia.org/r/100395 [23:13:18] andrewbogott: i'm back now [23:13:18] checking [23:16:35] andrewbogott: i see 88:43:e1:c2:50:7e showing up correctly for virt1009, however i don't see dns when i look it up on bast1001 --- guessing that's the problem ? [23:20:25] ok... [23:20:27] * andrewbogott looks at dns [23:21:53] Hm, it should be 10.65.3.180. I see it set in dns [23:23:18] brewster is saying "DHCPDISCOVER from 88:43:e1:c2:50:7e via 10.64.20.2: network 10.64.20/24: no free leases" [23:23:31] Can that be a dns problem? [23:25:24] (PS1) Aaron Schulz: [TEST] Disable the mergeCdbFileUpdates step for testing [operations/puppet] - https://gerrit.wikimedia.org/r/103610 [23:28:35] LeslieCarr: I think all the other times I've seen that 'no free leases' error someone has done some vlan magic and then it started working :/ [23:28:54] 10.65 ? 
[23:28:59] that's a management ip [23:29:08] yeah, that's the management ip [23:29:10] it needs its 10.64 range [23:29:13] (PS1) Kaldari: Updating schema ID for EchoInteraction [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103611 [23:29:38] 10.64.20.X [23:29:43] line 1256 of wmnet [23:30:49] (CR) Kaldari: [C: -2] "Don't merge until lightning deployment time." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/103611 (owner: Kaldari) [23:33:03] Oh, I see it has two mgmt entries... [23:34:43] (PS1) Andrew Bogott: Further attempts to get virt1009 a proper IP [operations/dns] - https://gerrit.wikimedia.org/r/103612 [23:34:51] LeslieCarr: ^ ? [23:35:11] weird [23:35:44] (CR) Lcarr: [C: 2] Further attempts to get virt1009 a proper IP [operations/dns] - https://gerrit.wikimedia.org/r/103612 (owner: Andrew Bogott) [23:36:42] (PS7) Ori.livneh: Make scap transport CDB files via JSON [operations/puppet] - https://gerrit.wikimedia.org/r/103080 (owner: Aaron Schulz) [23:37:59] (CR) Ori.livneh: [C: 2 V: 2] "likely to break things, should be fun" [operations/puppet] - https://gerrit.wikimedia.org/r/103080 (owner: Aaron Schulz) [23:38:36] (PS2) Ori.livneh: [TEST] Disable the mergeCdbFileUpdates step for testing [operations/puppet] - https://gerrit.wikimedia.org/r/103610 (owner: Aaron Schulz) [23:38:57] (CR) Ori.livneh: [C: 2 V: 2] [TEST] Disable the mergeCdbFileUpdates step for testing [operations/puppet] - https://gerrit.wikimedia.org/r/103610 (owner: Aaron Schulz) [23:41:24] LeslieCarr: So, while I'm running that test… suppose you can also magically fix virt1006? That one asks for an IP, brewster offers one up, and then it just… freezes. Brewster keeps offering, virt1006 just sits there. [23:41:33] weird [23:41:36] um... lemme try [23:41:54] thanks [23:42:49] this is weird [23:42:54] i'm unable to get to its management interface [23:42:58] is it a different password ? 
[23:43:02] admin@ [23:43:13] because of being a cisco [23:43:17] oh, ciscos [23:46:48] LeslieCarr: virt1009 is installing. So, all good there. [23:47:08] i'm on minute 327 of virt1006 rebooting [23:48:10] Yeah… it does get far enough to ask for an IP though, so it's not a total hardware crap-out [23:49:10] i just like that ciscos tried to replicate the router rebooting experience in a server [23:52:43] it looks like it doesn't attempt a tftp boot ? [23:52:49] from tailing the log on carbon [23:53:01] It doesn't get that far [23:53:08] It needs an IP first doesn't it? [23:54:17] At least, when I was watching the console it was hanging very early in the process. [23:58:14] (PS1) Ori.livneh: scap: place mergeCdbFileUpdates & refreshCdbJsonFiles in scriptpath [operations/puppet] - https://gerrit.wikimedia.org/r/103616 [23:58:36] hrm, what's the command to set bios one time to boot via ethernet ? [23:58:55] as far as I can tell you can only set it to boot to ethernet for good. [23:59:04] I always do that, then switch it back quick while the install is in process :) [23:59:09] Anyway, for that it's [23:59:11] scope bios [23:59:20] set boot_order pxe [23:59:22] (CR) Aaron Schulz: [C: 1] scap: place mergeCdbFileUpdates & refreshCdbJsonFiles in scriptpath [operations/puppet] - https://gerrit.wikimedia.org/r/103616 (owner: Ori.livneh) [23:59:26] commt [23:59:30] um… *commit [23:59:42] also boot-order, not boot_order. sheesh. [23:59:46] https://wikitech.wikimedia.org/wiki/Cisco_UCS_C250_M1 [23:59:47] (CR) Ori.livneh: [C: 2] scap: place mergeCdbFileUpdates & refreshCdbJsonFiles in scriptpath [operations/puppet] - https://gerrit.wikimedia.org/r/103616 (owner: Ori.livneh)