[03:14:53] !log puppet-diffs migrating VMs to eqiad1-r, then hopefully figuring out how to make them work in Jenkins again
[03:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs/SAL
[05:22:18] How do i get a list of tools I made if i can't remember the name of the tool account my bot is running on?
[05:28:22] nevermind, i figured out what my tool was named
[05:29:32] lol
[05:29:40] toolsadmin has a lisr
[05:29:42] List
[09:41:39] If you run `id` you can see which groups you are in and find it that way
[15:46:42] !log wikidocumentaries migrating project to eqiad1-r
[15:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidocumentaries/SAL
[15:47:09] !log wikidiff2-wmde-dev migrating project to eqiad1-r
[15:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidiff2-wmde-dev/SAL
[15:47:52] !log wpx migrating project to eqiad1-r
[15:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wpx/SAL
[16:22:37] !log tools.stashbot Testing SAL
[16:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[16:48:29] hi! So in the analytics project I have two instances that cannot ping each other (tried on both hosts) but the traffic with the rest of the hosts (in the project) is fine
[16:48:52] any idea why?
[16:58:24] elukey: which instances?
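The `id` tip from 09:41:39 can be sketched like this. The `tools.` group prefix is the Toolforge naming convention; the exact grep pattern is an assumption about that convention, not something confirmed in the log:

```shell
# List the current user's groups, one per line; on Toolforge, membership
# in a tool account shows up as a group named tools.<toolname> (assumed
# naming convention). Prints a fallback message when no tool groups exist.
id -Gn | tr ' ' '\n' | grep '^tools\.' || echo "no tool groups found"
```

Run on a Toolforge bastion, this lists every tool the logged-in user belongs to; toolsadmin (as noted at 05:29:40) shows the same list in a web UI.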
[17:00:41] hadoop-master-3.analytics.eqiad.wmflabs <-> kdc.analytics.eqiad.wmflabs
[17:01:05] it is friday evening for me so it could be a PEBCAK
[17:01:27] but I am a bit puzzled about the ping not working
[17:05:52] eqiad1-r
[17:06:02] gtirloni: err sorry, hadoop-worker-3.analytics.eqiad.wmflabs
[17:06:08] not master
[17:06:23] wonder if they have security groups set up to allow ICMP for 10/8 but not the new range or the right security groups
[17:07:21] FWIW I can reproduce the problem
[17:07:25] both hosts have 172.x ips and ping works generally
[17:07:32] but not on those
[17:08:24] let's see
[17:08:36] ping works for me on puppet-paladox to those instances.
[17:08:47] kdc is in the following security groups: default
[17:09:04] hadoop-worker-3 is in the following security groups: default
[17:09:12] okay
[17:09:17] elukey: thanks, checking
[17:09:23] so we should see a rule in the default security group allowing ICMP internally
[17:09:40] okay interesting, it allows from 0.0.0.0/0
[17:09:49] * elukey sends a beer to gtirloni
[17:10:00] and *any* traffic internally
[17:10:09] so we should be good on the security group front
[17:10:57] there is an IPv6 rule in here for some reason but we can ignore that
[17:11:15] interesting
[17:11:21] krenair@kdc:~$ ping hadoop-worker-3
[17:11:22] PING hadoop-worker-3.analytics.eqiad.wmflabs (172.16.2.243) 56(84) bytes of data.
[17:11:22] From kdc.analytics.eqiad.wmflabs (172.16.2.235) icmp_seq=9 Destination Host Unreachable
[17:12:28] iptables look okay
[17:12:39] yep I'm out of ideas
[17:12:40] that is the wrong ip Krenair
[17:12:45] hm?
[17:12:47] root@puppet-paladox:/home/paladox# ping hadoop-worker-3
[17:12:47] PING hadoop-worker-3.eqiad.wmflabs (172.16.2.243) 56(84) bytes of data.
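The security-group check discussed around 17:09 amounts to scanning the `default` group's rules for one that permits ICMP. On a real deployment you would read the rules with the OpenStack CLI (`openstack security group rule list default`); the pipe-separated sample below is a hypothetical stand-in for that output, not the project's actual rules:

```shell
# Sketch: filter protocol|remote-prefix rule pairs for anything that
# would let ICMP through. Sample data is illustrative only.
rules='icmp|0.0.0.0/0
tcp|172.16.0.0/21
any|172.16.0.0/21'
echo "$rules" | awk -F'|' '$1=="icmp" || $1=="any" {print "ICMP allowed from " $2}'
```

With this sample, both the explicit `icmp` rule and the allow-anything rule match, which mirrors the conclusion at 17:10:09 that the security groups were not the problem.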
[17:12:47] 64 bytes from hadoop-worker-3.analytics.eqiad.wmflabs (172.16.2.243): icmp_seq=1 ttl=64 time=0.380 ms
[17:13:09] well, yes, it got a Destination Host Unreachable, the packet won't be from the destination host paladox :D
[17:13:33] oh
[17:14:14] (thank all for checking btw)
[17:14:18] *thanks
[17:16:31] kdc can't find worker-3's mac address. if I force it, the packets go out and worker-3 replies, but the packets don't make their way to kdc
[17:17:03] nice! I didn't check that far
[17:17:05] wow
[17:17:28] okay this is out of my depth
[17:17:33] good luck
[17:19:46] I think we got a routing loop somewhere. I'm seeing TTLs expiring
[17:21:40] gtirloni: ip neigh is interesting on hadoop-worker-3
[17:22:04] ah no now it is better
[17:22:05] nevermind
[17:22:37] now it seems pinging
[17:22:42] gtirloni: did some magic?
[17:23:05] nope..
[17:23:06] 172.16.2.1 dev eth0 FAILED
[17:23:11] 172.16.2.2 dev eth0 FAILED
[17:23:19] I was of course checking the wrong instance
[17:23:23] worker reports the above --^
[17:23:26] nope, still looking around
[17:24:22] gtirloni could it be the cloud virt?
[17:24:23] I don't really understand the exact role that the hypervisors play in networking but it seems puppet-paladox, kdc, and hadoop-worker-3 are all on separate hypervisor hosts
[17:24:34] yup
[17:24:46] im thinking maybe cloudvirt1018.eqiad.wmnet?
[17:25:23] https://tools.wmflabs.org/openstack-browser/server/kdc.analytics.eqiad.wmflabs
[17:25:30] and
[17:25:30] https://tools.wmflabs.org/openstack-browser/server/hadoop-master-3.analytics.eqiad.wmflabs
[17:25:34] both run on the same virt
[17:25:54] does pinging hadoop-master-4.analytics.eqiad.wmflabs work elukey ?
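The neighbour-table check from 17:21:40 onward boils down to spotting FAILED (or INCOMPLETE) entries in `ip neigh` output, which is what the two `172.16.2.x dev eth0 FAILED` lines above are. A sketch of that filter; the sample lines are illustrative stand-ins, not the real host output:

```shell
# Sketch: flag broken neighbour (ARP) entries. On a live host you would
# pipe in `ip neigh` instead of this sample text.
neigh='172.16.2.1 dev eth0 FAILED
172.16.2.235 dev eth0 lladdr fa:16:3e:aa:bb:cc REACHABLE
172.16.2.2 dev eth0 FAILED'
echo "$neigh" | awk '/FAILED|INCOMPLETE/ {print "broken ARP entry: " $1}'
```

A FAILED entry means the kernel's ARP requests for that IP went unanswered, which matches the symptom at 17:16:31: kdc could not resolve worker-3's MAC address at all.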
[17:25:57] paladox, master is the wrong one
[17:26:00] thats on a different virt
[17:26:03] he corrected it to hadoop-worker-3
[17:26:07] oh
[17:26:40] https://tools.wmflabs.org/openstack-browser/server/hadoop-worker-3.analytics.eqiad.wmflabs
[17:26:41] ah ok
[17:26:54] so that's on cloudvirt1023
[17:27:08] yeah but I don't know how relevant this info is
[17:28:48] gtirloni: I removed your PERMANENT ip neigh setting to test if it was still working on kdc
[17:28:53] (fyi)
[17:28:59] ok
[17:29:08] and it works, it returns incomplete
[17:29:12] weird
[17:32:42] * elukey is too ignorant about this part of the infra
[17:35:33] elukey: don't feel bad. the neutron SDN layer is new to all of us. I was just reading lots of wikitech pages and then decided that there are probably better things for me to spend my morning on ;)
[17:35:55] bd808: :D
[17:37:09] gtirloni: so from hadoop-worker-3 ping to 172.16.2.1 and 172.16.2.2 fail too
[17:37:18] (if it can help)
[17:37:30] and arp -n shows them as "incomplete"
[17:37:55] PTR points to labs-ns1.wikimedia.org.
[17:38:31] mmmm ping labs-ns1.wikimedia.org. works, ping to the IPs doesn't
[17:38:58] no it turns out that I am not even able to read a DNS response on a friday evening
[17:39:02] * elukey cries in a corner
[17:39:21] nevermind, those IPs don't have PTRs (apparently)
[17:39:26] lol, thanks.. that's useful info
[17:40:42] I don't think anything can ping those 172.16.2.[12] IPs?
[17:41:20] Krenair yup
[17:41:25] i get 100% packet loss
[17:41:25] 3 packets transmitted, 0 received, 100% packet loss, time 2031ms
[17:41:31] can ping 172.16.0.1 which has no PTR
[17:42:32] yeah 172.16.0.1 is the default gw
[17:44:02] (https://phabricator.wikimedia.org/T202886 is about naming that)
[17:46:53] elukey: do you mind if I reboot kdc?
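The PTR confusion around 17:37–17:39 comes down to reverse DNS: a PTR lookup for an IPv4 address queries the octets in reverse order under `in-addr.arpa` (which is the name `dig -x` builds for you before querying). A minimal sketch of that name construction:

```shell
# Sketch: build the reverse-DNS (PTR) query name for a dotted-quad IPv4
# address, i.e. octets reversed under in-addr.arpa. Pure string work,
# no network access; `dig -x <ip>` does this plus the actual lookup.
reverse_name() {
  echo "$1" | awk -F. '{print $4"."$3"."$2"."$1".in-addr.arpa"}'
}
reverse_name 172.16.2.1   # -> 1.2.16.172.in-addr.arpa
```

If no PTR record exists at that name, the lookup returns NXDOMAIN, which is consistent with the conclusion at 17:39:21 that those 172.16.2.x gateway IPs simply had no PTRs.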
[17:57:07] I don't see the icmp packets leaving kdc and getting to the internal bridge so I'm guessing something is wrong with that VM in particular
[17:57:39] I'd like to shut it down and bring it back up, in the hope that it'll reset any l2 bridging shenanigans
[17:57:45] elukey: ^^
[18:00:03] gtirloni: sure
[18:02:05] I rebooted hadoop-worker-3 previously
[18:03:09] gtirloni: rebooting kdc now
[18:03:38] I've just rebooted it
[18:03:39] it's back
[18:04:21] but it didn't help
[18:04:42] I've tested a new security group that allowed everything but still no go
[18:04:58] tricky stuff
[18:05:51] I forgot to ask, when did this start?
[18:07:50] gtirloni: I am not sure since today I started experimenting with the host
[18:09:42] I could try to kill hadoop-worker-3 and re-spawn it
[18:17:04] !log admin restarted neutron-linuxbridge-agent on cloudvirt1018/1023
[18:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:42:26] gtirloni: gtg now, thanks for the help.. if the hosts are in the same situation on Monday I'll try to spawn a new instance
[18:43:10] ok, i'll keep looking a bit more. have a great weekend
[18:49:15] you too!
[20:08:25] !log cloud restarted neutron-linuxbridge-agent on cloudvirt1018 and cloudvirt1023
[20:08:26] andrewbogott: Unknown project "cloud"
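The intervention that got logged at 18:17:04 (restarting neutron-linuxbridge-agent on the affected hypervisors) can be sketched as a dry-run wrapper. Only the service name comes from the log lines above; the wrapper itself is illustrative and just prints what would run, since the real command needs root on a cloudvirt host:

```shell
# Dry-run sketch: on a real cloudvirt host this would (as root) restart
# the Neutron linuxbridge agent; here we only print the command.
restart_agent() {
  svc="neutron-linuxbridge-agent"
  echo "would run: systemctl restart $svc"
}
restart_agent
```

Note also the last exchange: stashbot only accepts `!log` for known project names, so `!log cloud ...` was rejected at 20:08:26 while the earlier `!log admin ...` form succeeded.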