[00:10:25] (03Abandoned) 10Faidon Liambotis: Mobile Cookie Vary caching optimizations [puppet] - 10https://gerrit.wikimedia.org/r/75316 (owner: 10Mark Bergsma) [00:16:59] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [00:34:00] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [00:41:09] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 40 failures [01:01:50] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=106.50 Read Requests/Sec=88.10 Write Requests/Sec=67.30 KBytes Read/Sec=44621.60 KBytes_Written/Sec=456.40 [01:08:29] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0] [01:13:09] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=141.80 Read Requests/Sec=149.10 Write Requests/Sec=5.10 KBytes Read/Sec=75598.40 KBytes_Written/Sec=80.05 [01:29:12] 7Blocked-on-Operations, 6Labs, 10Maps, 6Scrum-of-Scrums: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1343718 (10Yurik) [01:34:09] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=49.40 Read Requests/Sec=4.40 Write Requests/Sec=102.90 KBytes Read/Sec=1985.60 KBytes_Written/Sec=3134.05 [01:39:51] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=156.84 Read Requests/Sec=146.90 Write Requests/Sec=22.80 KBytes Read/Sec=74778.00 KBytes_Written/Sec=139.50 [01:46:50] PROBLEM - puppet last run on mw1256 is CRITICAL Puppet has 1 failures [01:54:00] any ops around? [01:54:12] labstore1001 is overloaded and labs seems crazy slow [01:59:11] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:04:10] RECOVERY - puppet last run on mw1256 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:21:39] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 09s) [02:22:00] Logged the message, Master [02:26:50] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-07 02:25:13+00:00 [02:26:56] Logged the message, Master [02:27:30] Is anyone here? [02:29:02] already tried that [02:37:00] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=113.90 Read Requests/Sec=108.90 Write Requests/Sec=69.80 KBytes Read/Sec=55220.80 KBytes_Written/Sec=478.50 [02:52:02] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=134.60 Read Requests/Sec=152.30 Write Requests/Sec=1.10 KBytes Read/Sec=77760.80 KBytes_Written/Sec=14.00 [02:52:23] andrewbogott: hey [02:52:41] hello! [02:57:29] ori: I don’t supposed you are an NFS expert? [02:57:29] I know almost nothing about NFS [02:57:29] whatever it is, it started at 00:50: http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Labs+NFS+cluster+eqiad [02:57:29] Me neither [02:57:29] hah, wow! [02:57:29] One of these graphs is not like the others https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring [02:57:29] I'm just looking for interesting things in /var/log with a timestamp close to that [02:57:39] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=50.40 Read Requests/Sec=0.10 Write Requests/Sec=68.10 KBytes Read/Sec=0.40 KBytes_Written/Sec=592.70 [02:58:06] Usually this kind of problem is caused by a client running amok. 
But it seems like that would clog the network before it got around to clogging IO [03:01:24] nfsd: peername failed (err 107)! <- that is what started happening at 1 [03:03:10] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=127.20 Read Requests/Sec=92.50 Write Requests/Sec=17.20 KBytes Read/Sec=46798.40 KBytes_Written/Sec=94.00 [03:04:23] err 107 means "kernel_getpeername failed on a new incoming tcp connection" [03:04:30] i think the reason it failed is file descriptor exhaustion [03:04:50] yeah [03:04:59] I can’t think of anything to do about that subtler than a reboot [05:13:00] !log service nfs-kernel-server restart on labstore1001 [05:13:02] !log rebooting labstore1001 [05:13:02] ohh. I was still looking. oh well. [05:13:02] sorry! [05:13:02] For what it’s worth, it doesn’t seem to actually be rebooting [05:13:03] yes it is :) [05:13:03] ah, there it goes [05:13:03] Krenair: thanks for fielding user inquiries :) [05:13:05] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:08] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=80.80 Read Requests/Sec=86.50 Write Requests/Sec=31.70 KBytes Read/Sec=43713.20 KBytes_Written/Sec=171.05 [05:13:08] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=85.70 Read Requests/Sec=110.30 Write Requests/Sec=23.10 KBytes Read/Sec=55206.80 KBytes_Written/Sec=152.40 [05:13:08] ori: Now I have a console on labstore1001 but I seem unable to ping it [05:13:08] or ping anything from it [05:13:08] Surely this is not true and I’m making some dumb mistake [05:13:08] check my work? [05:13:10] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=6.80 Read Requests/Sec=0.20 Write Requests/Sec=16.00 KBytes Read/Sec=2.00 KBytes_Written/Sec=168.70 [05:13:11] PROBLEM - puppet last run on ms-be1009 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw1074 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw1023 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw2026 is CRITICAL puppet fail [05:13:11] PROBLEM - puppet last run on mw1243 is CRITICAL Puppet has 1 failures [05:13:12] PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on mw1171 is CRITICAL Puppet has 2 failures [05:13:13] PROBLEM - puppet last run on mw2038 is CRITICAL Puppet has 1 failures [05:13:15] PROBLEM - puppet last run on mw1071 is CRITICAL puppet fail [05:13:15] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [05:13:16] RECOVERY - puppet last run on mw1074 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:13:17] PROBLEM - puppet last run on mw2188 is CRITICAL Puppet has 1 failures [05:13:17] (03CR) 10Glaisher: [C: 04-1] "See task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/216415 (https://phabricator.wikimedia.org/T101604) (owner: 10Glaisher) [05:13:17] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 1 failures [05:13:17] andrewbogott: your description seems accurate [05:13:17] I don't have console access so that's where I get off the troubleshooting train [05:13:17] PROBLEM - puppet last run on es1003 is CRITICAL Puppet has 1 failures [05:13:17] This has to be a new problem [05:13:17] I texted Faidon [05:13:17] The console just says ‘ Network is unreachable’ if I try to ping anything [05:13:17] although it shows the nics as ‘up' [05:13:17] Switch issue? [05:13:17] /cabling [05:13:17] Probably, but it’s a big-ass coincidence [05:13:17] Could be a fucked motherboard etc too [05:13:17] And since when does rebooting a server cause a cable to unplug itself? [05:13:18] PROBLEM - puppet last run on helium is CRITICAL puppet fail [05:13:18] It’s no wonder so many sysadmins turn to religion [05:13:18] PROBLEM - puppet last run on graphite1002 is CRITICAL puppet fail [05:13:18] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail [05:13:18] I wonder how long I wait for Faidon before I start texting people where it’s even earlier… [05:13:18] Coren? [05:13:18] well, let's give Faidon another few minutes and use the time to think about this. Which monitoring tools disclose switch failures, etc.? Racktables? [05:13:18] Racktables isn't monitoring [05:13:18] torrus has networking stuff [05:13:18] wow, I’ve never heard of torrus before. It has /lots/ of information that I don’t understand! [05:13:18] Just wondering if it drills far enough done, or it's teh bigger switches/routers [05:13:18] labstore1001 is in eqiad row C. is anything else in eqiad row C struggling? [05:13:18] PROBLEM - puppet last run on db1052 is CRITICAL puppet fail [05:13:18] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail [05:13:18] https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1567 [05:13:18] what are the puppet failures about? [05:13:18] PROBLEM - puppet last run on mw1111 is CRITICAL puppet fail [05:13:18] I don’t know, but they seem to be transient. I re-ran one and it updated some geoip stuff and then succeeded. [05:13:18] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 1 failures [05:13:18] PROBLEM - puppet last run on mw1079 is CRITICAL puppet fail [05:13:18] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [05:13:19] a few random hosts in c3 seem ok [05:13:19] to ping [05:13:19] PROBLEM - puppet last run on db2037 is CRITICAL puppet fail [05:13:19] PROBLEM - puppet last run on analytics1023 is CRITICAL puppet fail [05:13:19] PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures [05:13:19] PROBLEM - puppet last run on mw1210 is CRITICAL puppet fail [05:13:19] the puppet failures are surely related [05:13:19] it's time to page more people [05:13:19] i'd start with bblack [05:13:19] andrewbogott: what if you just restart networking etc on that host? [05:13:19] since it's a sane hour for him [05:13:19] jgage too? 
[05:13:19] mutante [05:13:19] bblack would know how to respond to this [05:13:19] ori: bblack was here briefly but then didn’t respond to my query so I think he must’ve had to go [05:13:19] PROBLEM - puppet last run on protactinium is CRITICAL puppet fail [05:13:19] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:19] PROBLEM - puppet last run on mw1200 is CRITICAL Puppet has 1 failures [05:13:19] PROBLEM - puppet last run on mw1163 is CRITICAL puppet fail [05:13:20] i'd still call him [05:13:20] Be worth ringing him [05:13:20] yeah [05:13:20] PROBLEM - puppet last run on rcs1002 is CRITICAL puppet fail [05:13:20] Reedy: when you say ‘resart networking’ can you be more specific? What exact command would you advise? [05:13:20] ifdown/ifup didn’t do much [05:13:20] /etc/init.d/networking restart [05:13:20] but that's roughyl the same? [05:13:20] RECOVERY - puppet last run on mw1171 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [05:13:20] andrewbogott: it's too severe to muck around any longer, I think [05:13:20] yeah, I’m calling people too, just multitasking [05:13:20] the puppet failures are do to /usr/local/bin/sshknowngen failing to run on the puppet master [05:13:20] well, texting [05:13:20] it's failing to run because it can't reach the database, I think [05:13:20] oh, so a second network issue? [05:13:20] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1243 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:13:20] probably the same network issue [05:13:20] something is flapping [05:13:20] RECOVERY - puppet last run on mw1087 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [05:13:20] RECOVERY - puppet last run on mw2038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw2026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:13:20] do we know which cr labstore1001 is connected to? [05:13:20] presumably the same as 1002? [05:13:20] RECOVERY - puppet last run on es1003 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] ae2-1002.cr2-eqiad.wikimedia.org [05:13:21] ok, texted brandon and Alex. 
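On the "Network is unreachable from the console even though the NICs show up" symptom, and the ifdown/ifup and networking-restart attempts just above: a minimal set of console checks, in rough order of layering, might look like the sketch below. These are generic commands, not output from labstore1001; the gateway address is a placeholder.

```
# Link layer: does the kernel see carrier on either NIC?
ip link show eth0
ip link show eth1
ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
# Network layer: does any interface actually hold the host's address,
# and is there a default route? "Network is unreachable" usually means
# the routing table is empty because the expected interface never came up.
ip addr show
ip route show
ping -c 1 <gateway-ip>   # placeholder for the row's gateway, not from this log
```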
[05:13:21] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:13:21] At least it's after 7am in Greece now [05:13:21] RECOVERY - puppet last run on mw1200 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:21] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] andrewbogott: can you try taking eth1 down and bringing eth0 up? [05:13:22] sure [05:13:22] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:13:22] based on this, it looks like eth0 is the one that carries most of the traffic: http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ge-3/0/5&view=expanded-dir-html [05:13:22] but after the reboot eth1 came to life for some reason: http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ge-3/0/11&view=expanded-dir-html [05:13:22] whoops, console just said bnx2 0000:01:00.0 eth0: NIC Copper Link is Down [05:13:22] and bond0, whatever that is (presumably not a james bond reference): http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ae3&view=expanded-dir-html [05:13:22] oh, that’s maybe right after I ifdown’d it [05:13:22] bonded ethernet for moar bandwidth [05:13:22] link aggregation [05:13:22] so now eth1 is down and eth0 is up [05:13:22] Reedy: stop being smart, it's making us look bad [05:13:22] and the bonded link is up too [05:13:22] well, so says ip info [05:13:22] * Reedy invoices the WMF [05:13:22] sounds like a dodgy cable/nic/switchport [05:13:22] RECOVERY - puppet last run on db1052 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:13:22] ok, faindon en route [05:13:22] uh… faidon [05:13:22] what does ethtool eth0 say? [05:13:22] https://dpaste.de/9Hza [05:13:22] hey [05:13:22] RECOVERY - puppet last run on db2057 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] what's up? [05:13:22] paravoid: thanks for waking up :/ [05:13:22] RECOVERY - puppet last run on db2037 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on analytics1023 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:13:22] So, my immediate issue is that I rebooted labstore1001 (long story) and it can’t connect to the network at all now. [05:13:22] RECOVERY - puppet last run on mw1111 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:13:22] ori thinks that there’s a more general networking issue happening (causing puppet runs to flap, as you see) [05:13:22] paravoid: recap: cpu system and iowait shot up at 00:50, syslog shows "nfsd: peername failed (err 107)" starting to appear around that time, err 107 means "kernel_getpeername failed on a new incoming tcp connection". andrewbogott rebooted, server booted but network link was not restored. 
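On the file-descriptor-exhaustion theory in the recap above (err 107 on new incoming connections): a rough way to gauge fd and connection pressure on an NFS server is sketched below. The commands are generic; none of this is output from labstore1001.

```
# System-wide file handle usage: allocated / unused-but-allocated / maximum.
cat /proc/sys/fs/file-nr
# How many TCP connections are sitting on the NFS port?
ss -tan | grep -c ':2049'
# Open-file limit of a userspace NFS helper (rpc.mountd is just an example;
# knfsd itself runs as kernel threads, so its sockets aren't process fds).
grep 'Max open files' "/proc/$(pgrep -o rpc.mountd)/limits"
```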
[05:13:22] RECOVERY - puppet last run on mw1079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] paravoid: shall I evacuate the labstore1001 console so you can log in? [05:13:22] torrus shows that prior to reboot, eth0 was the interface carrying traffic; post-reboot it appears to be eth1. andrewbogott reported console said "bnx2 0000:01:00.0 eth0: NIC Copper Link is Down" [05:13:22] RECOVERY - puppet last run on mw1210 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:13:22] what ip is eth1 using? [05:13:22] RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] RECOVERY - puppet last run on protactinium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] doesn't seem to be a router / switch issue because other hosts do not appear to be affected. Reedy plausibly suspects cable/nic/switchport. [05:13:22] RECOVERY - puppet last run on rcs1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] shit [05:13:22] switch is full of errors [05:13:22] random puppet failures are the result of failing to fetch geoip updates or puppetmaster failing to run /usr/local/bin/sshknowngen, perhaps because of same underlying issue? [05:13:22] aha [05:13:22] not good [05:13:23] switch for that cabinet? [05:13:23] (for later: where do you see switch errors?) [05:13:23] ssh to switch [05:13:23] Reedy: for that row... [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2) [05:13:23] do we have a spare onsite? [05:13:23] etc. [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_rebake(),482:fdb_mac_entry_mc_set() failed(-1) [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 RT-HAL,rt_entry_topo_handler,4121: l2_halp_vectors->l2_entry_rebake failed [05:13:23] Jun 7 04:05:02 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2) [05:13:23] it's likely a software error [05:13:23] fscking software engineers [05:13:23] a reboot would be likely to fix it, but we really really don't want to reboot that [05:13:23] !log disabling asw-c-eqiad interfaces: ge-3/0/5, ge-3/0/11, ae3 [05:13:23] ok [05:13:23] log spam stopped for now, although it wasn't consistent [05:13:23] doesn't help labstore1001, but I'm not worried about labstore atm :) [05:13:23] that’s fine, we can wait! [05:13:23] thanks for waking up for this [05:13:23] can I get the labstore console? [05:13:23] andrewbogott: ^ [05:13:23] paravoid: I’m clear of it, help yourself. [05:13:23] thx [05:13:23] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail [05:13:23] I don't think the puppet errors are related btw [05:13:23] ok, so you think it’s contained to just this one host? 
[05:13:23] they /could/ be, but I'm guessing we'd be seeing a lot more errors on the app side [05:13:23] oh no, there is definitely some kind of switch issue [05:13:23] but it may not be affecting everything [05:13:23] I just don't find it plausible that this manifests in puppet 502s but not site 5xxs [05:13:23] there has been a bug in the switch concerning labstore's connectivity for many months [05:13:23] I'll explain in a bit [05:13:23] RECOVERY - Host labstore1001 is UPING OK - Packet loss = 0%, RTA = 1.70 ms [05:13:23] wait what [05:13:23] lol [05:13:23] now it will go down again [05:13:23] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:24] !log reenabling asw-c-eqiad:ge-3/0/5; dismantling ae3 [05:13:24] !log removing bond0 configuration from labstore1001, back to eth0 alone [05:13:24] RECOVERY - Host labstore1001 is UPING OK - Packet loss = 0%, RTA = 0.89 ms [05:13:24] oh, that’s going to be ugly :( [05:13:24] what is? [05:13:24] I still see switch errors even with the single port [05:13:24] If we reduce NFS bandwidth by 2/3s [05:13:24] maybe I misunderstand the implication of that [05:13:24] we're not [05:13:24] 07:32 <@paravoid> there has been a bug in the switch concerning labstore's connectivity for many months [05:13:24] we never succeeded in setting up bonding [05:13:24] oh! [05:13:24] Well, that’s sort of… good news, I guess :/ [05:13:24] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:13:24] anyway, sorry to interrupt [05:13:24] there was likely some switch bug that made it to not work [05:13:24] PROBLEM - puppet last run on mw1244 is CRITICAL Puppet has 1 failures [05:13:24] not as severe of a switch bug though [05:13:24] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - dhclient process on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [05:13:24] there's a start-nfs script i see [05:13:24] that has bond0 hardcoded [05:13:24] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host [05:13:24] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [05:13:24] paravoid: it’s puppetized, shall I change it to eth0? 
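Since bond0 (the link-aggregation interface discussed earlier) is central here, and start-nfs reportedly hardcodes it, the bonding driver's own view plus a quick post-change sanity pass might look like the sketch below. Interface names follow the log; the start-nfs path is an assumption for illustration, not confirmed anywhere above.

```
# Bonding mode, active slave, and per-slave MII status - a dead eth0
# would show up here as "MII Status: down" for that slave.
cat /proc/net/bonding/bond0
# Raw carrier state per interface, straight from the kernel.
cat /sys/class/net/eth0/carrier /sys/class/net/eth1/carrier
# After dismantling the bond: is the service address now on eth0?
ip addr show eth0
# Any leftover references to the old bonded interface?
grep -n 'bond0' /usr/local/sbin/start-nfs /etc/network/interfaces   # paths assumed
```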
[05:13:24] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:13:24] RECOVERY - RAID on labstore1001 is OK optimal, 72 logical, 72 physical [05:13:24] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [05:13:24] RECOVERY - Disk space on labstore1001 is OK: DISK OK [05:13:24] RECOVERY - DPKG on labstore1001 is OK: All packages OK [05:13:24] (03PS1) 10Faidon Liambotis: Switch labstore1001 from bond0 to eth0 [puppet] - 10https://gerrit.wikimedia.org/r/216519 [05:13:24] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Switch labstore1001 from bond0 to eth0 [puppet] - 10https://gerrit.wikimedia.org/r/216519 (owner: 10Faidon Liambotis) [05:13:24] oh sorry [05:13:24] not at all :) [05:13:24] RECOVERY - configured eth on labstore1001 is OK - interfaces up [05:13:24] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 7 04:54:32 UTC 2015 (duration 54m 31s) [05:13:24] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:13:24] ok [05:13:24] let's start NFS [05:13:24] andrewbogott: can you? I'm not very familiar with all that [05:13:24] sure. All I know to do, though, is run that script. [05:13:24] But, I will run it! [05:13:24] Hm, it’s either hanging or doing lots of hard work [05:13:24] [ 4733.612740] EXT4-fs warning (device dm-9): ext4_multi_mount_protect:320: MMP interval 42 higher than expected, please wait. [05:13:24] did you already have the filesystem mounted from another labstore? [05:13:24] that's usually indicative of an attempt of mounting the same filesystem twice from two different systems [05:13:24] I don’t know. [05:13:24] (which is a very bad thing if it happens) [05:13:24] labstore1002 is up and running, but nothing was mounted [05:13:24] when I checked an hour ago, at least. [05:13:24] I certainly didn’t start it there. [05:13:24] RECOVERY - puppet last run on mw1244 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:24] Hm, iirc I cannot ctrl-c out of this hanging script. [05:13:24] What do you think, wait or kill? [05:13:24] paravoid: ^ [05:13:24] doubt you can kill it at this point [05:13:24] so all that sounds bad [05:13:24] let's bring Coren online too? [05:13:24] I haven’t been able to reach him but I can try again. [05:13:24] oh [05:13:24] I think it proceeded [05:13:24] [ 5132.432478] EXT4-fs (dm-9): ext4_orphan_cleanup: deleting unreferenced inode 474494162 [05:13:24] Does it look to you like labstore1002 is connected to the disks? I may be looking at the wrong thing. [05:13:24] no [05:13:24] it doesn't look like it to me either [05:13:24] ok [05:13:24] I already have a note about how that script message needs to be clarified (or labcontrol1002 shut down) in the outage report. [05:13:24] So you think that one of the start-nfs phases is just doing hard, honest work now? [05:13:24] probably not [05:13:24] it's stuck at mounting the fs [05:13:24] /bin/mount /srv/project <- that one? [05:13:24] yes [05:13:24] so we can't really mount from labstore1002, right? [05:13:24] yeah, it's a jessie system, I think he ran into trouble when he tried it... [05:13:24] I think that’s right. 
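On the ext4_multi_mount_protect warning quoted above: MMP is ext4's guard against the same filesystem being mounted by two hosts at once, and its superblock bookkeeping can be inspected directly. A rough sketch, reusing dm-9 from the kernel message; nothing below is actual labstore output.

```
# MMP fields live in the superblock: the feature flag, the MMP block number,
# and the update interval the mount was complaining about.
dumpe2fs -h /dev/dm-9 | grep -i mmp
# Cheap cross-check that the standby NFS host doesn't have it mounted.
ssh labstore1002 'grep -E "srv|dm-9" /proc/mounts'
```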
[05:13:24] I haven’t tried ‘known issues’ https://wikitech.wikimedia.org/wiki/Labs_NFS yet [05:13:24] although I guess those instructions aren’t much different from what I’ve done alrady [05:13:24] minus the unmount /srv [05:13:24] hey [05:13:24] it's back :) [05:14:04] inaction pays off [05:14:24] is there a "stop-nfs"? no right? [05:14:44] no, just ‘service nfs-kernel-whatever stop’ [05:15:34] so much duct tape [05:15:43] morebots, welcome! [05:15:43] I am a logbot running on tools-exec-1203. [05:15:44] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:15:44] To log a message, type !log . [05:16:01] !log we did a whole lot of things to labstore1001 while morebots was away [05:16:05] Logged the message, Master [05:16:43] I still can’t log in anywhere, which suggests that $home is still not mounting... [05:16:45] I killed everything [05:19:58] ok [05:19:59] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [05:20:00] is it expected for start-nfs to spew all kinds of "No such file or directory" errors? [05:20:00] I'm going to guess yes [05:20:00] ok, I restarted NFS [05:20:00] I don’t know if it’s expected, but it’s done it every time tonight [05:20:05] the switch hasn't emitted any new errors [05:20:14] so the bond0->eth0 at least stabilized things [05:20:17] probably :) [05:20:30] I can log into instances and see my homedir [05:21:43] paravoid: do you think the switch issue accounts for the /original/ problem we saw (IO flooded and peername failed) or just the fact that it couldn’t join the network after reboot? [05:22:10] I can't be sure but my money would be just on the latter [05:22:42] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.05862987041 secs [05:22:42] and the former… a badly behaving client? [05:22:53] the mount errors now suggest that these mountpoints did not unmount cleanly [05:23:01] before shutdown [05:23:05] you did a clean reboot, right? [05:23:20] When you say ‘clean'... [05:23:31] I stopped NFS and rebooted. I didn’t unmount before the reboot. [05:23:37] reboot does that [05:23:42] the kernel does [05:23:44] that’s what I thought. [05:23:57] So, yeah, it didn’t crash, I explicitly rebooted. [05:24:46] right, crash or powercycle was the question [05:24:54] so something may have happened on the I/O side of the machine [05:25:18] anything from kernel to RAID controller or disk shelf misbehaving, dunno [05:25:25] Here’s the timeline: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150607-LabsNFS-Outage [05:25:46] oh awesome [05:25:48] There’s space for you to add things at the bottom, but that’s not urgent. [05:26:52] yup [05:33:22] paravoid: I think we’re done for now unless you see something especially incriminating in the syslog from back at 01:00 [05:33:38] Thanks again for donating your Sunday morning :( [05:34:11] more like night :P [05:35:02] Oh, I guess it’s only 8:30? Maybe you can get back to sleep. [05:35:45] lulled by the sound of your inbox filling with ‘RECOVERY’ emails [05:36:31] ori, Reedy, thanks to both of you as well. [05:36:47] and Krenair [05:39:26] thanks paravoid [05:39:30] hope you get some rest [05:39:43] you too andrewbogott [05:40:26] I'm not at home, I'm in Paris [05:40:31] so it's 7:30 atm [05:40:49] but that's okay, I'm glad you called [05:41:17] andrewbogott: I updated the timeline now that I have it fresh [05:41:32] andrewbogott: we can do actionables etc. later [05:42:23] switch hasn't logged any other errors [05:42:26] things look stable [05:43:07] what’s in Paris? 
[05:43:09] I'm going to bed again -- call me if bad things are happening [05:43:28] I came for dotScale, but it's not technically a work trip :P [05:43:39] I’m going to bed too, so someone else will have to call both of us. [05:43:44] sleep well [05:44:11] bye! [06:31:12] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 1 failures [06:34:22] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:34:41] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:35:12] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:35:21] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:35:43] PROBLEM - puppet last run on mw2095 is CRITICAL Puppet has 1 failures [06:35:43] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 2 failures [06:36:02] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 2 failures [06:36:02] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:36:03] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:46:32] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:11] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:21] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:11] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:31] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:22] !log springle Synchronized wmf-config/db-eqiad.php: repool db1073, warm up (duration: 01m 09s) [07:03:28] Logged the message, Master [10:26:05] (03PS2) 10Yuvipanda: Tools: Do not require package python-sh [puppet] - 10https://gerrit.wikimedia.org/r/213849 (https://phabricator.wikimedia.org/T91874) (owner: 10Tim Landscheidt) [10:26:14] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Do not require package python-sh [puppet] - 10https://gerrit.wikimedia.org/r/213849 (https://phabricator.wikimedia.org/T91874) (owner: 10Tim Landscheidt) [10:47:04] (03PS2) 10Giuseppe Lavagetto: conftool: adding the cli-tool and integration tests [software/conftool] - 10https://gerrit.wikimedia.org/r/215891 [11:03:07] (03PS1) 10Yuvipanda: Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 [11:03:09] (03PS1) 10Yuvipanda: Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 [11:03:13] valhallasw: ^ [11:04:03] 6operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632#1343905 (10yuvipanda) 3NEW [11:09:44] YuviPanda: YESSSS PLEASE 
[11:10:04] I'm testing the package changes now [11:10:05] YuviPanda: or, alternatively, a toollabs branch :> [11:10:31] (03CR) 10Merlijn van Deen: [C: 032] Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 (owner: 10Yuvipanda) [11:10:33] I don't think that's gonna happen tho, unless we get our own puppetmaster [11:11:20] which might not even be the worst idea [11:11:27] (03CR) 10Merlijn van Deen: [C: 032] Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 (owner: 10Yuvipanda) [11:11:41] (03Merged) 10jenkins-bot: Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 (owner: 10Yuvipanda) [11:11:43] (03Merged) 10jenkins-bot: Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 (owner: 10Yuvipanda) [11:12:40] valhallasw: yeah, I'd avoid it if possible but you're the one who's going to be more affected by it than I am going to be, so if you think it'll be worth the trouble we can try it out [11:14:26] YuviPanda: yeah, or we can just go the '+1 from valhallasw of scfc means YuviPanda/Coren merge the patch' route, I guess. [11:14:39] valhallasw: I'm totally happy with the latter [11:14:43] sounds good [11:14:55] btw, should HBA work now? [11:15:07] valhallasw: only in hosts that have toollabs::hba included [11:15:11] so that's exec and webgrid hosts [11:15:18] ooooh. [11:15:19] we can expand them to all if you'd like [11:15:21] yes, very nice [11:15:26] yeah, my mosh would like that [11:15:39] let's do it! I'm going to make a patch [11:16:05] valhallasw: btw, I wrote the script to check if instances are on varied enough virt* hosts. everything's ok except master and shadow are on same host :) [11:16:13] cool! [11:17:07] so we should rebuild shadow, and document the process while we're at it? [11:17:22] like mailrelay-01 that I still have to finish *look of disapproval* [11:18:36] valhallasw: yup, plan is to move to tools-master-01 and -02, and have them be trusty as well [11:34:35] valhallasw: hmm, enabling it for everything might be slightly messy becaues of how security/access.conf is done [11:35:19] ? 
[11:36:01] valhallasw: toollabs::hba writes to security/access.conf and so does toollabs::infrastructure [12:22:42] I see [13:21:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0] [13:22:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [13:22:59] (03PS8) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:23:23] (03CR) 10Merlijn van Deen: "Changes vs the previous version:" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:23:33] YuviPanda: ^ if you have time [13:23:44] I should probably test-deply on beta, but ugh [13:24:01] (03CR) 10jenkins-bot: [V: 04-1] Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:24:09] :{ [13:24:41] automergefail [13:26:01] (03PS9) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:26:25] (03CR) 10Yuvipanda: Tools: Simplify and fix mail setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:26:45] valhallasw: ^ [13:26:53] I don't know enough about exim to comment on the config itself, however [13:26:56] did I forget to remove that line? odd. [13:27:12] what on earth is going on here [13:27:43] doh :< [13:28:29] yeah, I had moved it, but I messed up my commit [13:28:37] (03PS10) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:32:36] YuviPanda: sorry, you basically re-reviewed the earlier patch, but rebased. sorry :( ^ this one actually changed stuff [13:32:52] ah, heh :) no worries, looing agin [13:32:54] *looking [13:33:21] YuviPanda: I'm still not entirely happy with $is_mail_relay, but at least there's a sanity check now [13:33:32] yeah, I can't think of any alternative to that atm [13:35:04] hm, what hiera files does toolsbeta read? hieradata/labs/toolsbeta/*? (that dir doesn't exist) [13:36:41] valhallasw: wikitech, Hiera:toolsbeta [13:36:51] nothing from disk? 
[13:36:54] you can create it in hieradata/labs/toolsbeta too if you want [13:37:46] mkay [13:38:02] * valhallasw is wondering how scfc tested stuff [13:39:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:39:34] (03CR) 10Yuvipanda: "minor nit but the puppet part looks good to me - I've no idea about the exim parts tho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:41:31] YuviPanda: :D well the puppet docs also don't know [13:41:45] and I'm not sure if undef == undef is true [13:41:56] (just like nan == nan returns false) [13:42:20] YuviPanda: https://docs.puppetlabs.com/references/latest/function.html#defined [13:46:39] valhallasw: :D [13:46:51] valhallasw: undef == undef is true [13:52:40] also it's puppet, so it might just give you 'trololol' back, which then becomes true [13:53:02] anyway, will fix, and I'll add some toolsbeta hiera stuff [13:54:32] (03PS11) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:54:41] or maybe first time to screw around with puppet-compiler -_-' [14:10:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:20:05] (03PS2) 10Yuvipanda: [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:20:36] (03PS3) 10Yuvipanda: ssh: Disable agent forwarding for production [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:20:42] valhallasw: ^ I've disabled it in production only, and labs projects can individually opt in if they wish [14:21:31] (03CR) 10Yuvipanda: "I've made it production only - where I believe it can go ahead now? 
Labs projects can opt in to this change one by one over time and then " [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:21:42] (03CR) 10Merlijn van Deen: [C: 031] ssh: Disable agent forwarding for production [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:42:53] (03PS1) 10ArielGlenn: dumps: on dry run for streaming dumps, no compressors write to files [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216539 [14:46:25] (03CR) 10ArielGlenn: [C: 032] dumps: on dry run for streaming dumps, no compressors write to files [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216539 (owner: 10ArielGlenn) [15:35:25] (03PS1) 10Yuvipanda: Don't crash hard if unknown webservice type is encountered [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/216542 [15:39:26] (03CR) 10Yuvipanda: [C: 032] Don't crash hard if unknown webservice type is encountered [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/216542 (owner: 10Yuvipanda) [15:52:12] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:57:02] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 20.00% of data above the critical threshold [100000000.0] [16:04:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [16:16:25] (03PS6) 10Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - 10https://gerrit.wikimedia.org/r/216421 [16:17:05] 6operations, 6Labs: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344262 (10yuvipanda) 3NEW [16:18:17] valhallasw: fun fact: puppet functions can't distinguish between 'undef' the string and undef the value [16:18:27] and can only return 'undef' the string [16:18:46] ori: :D what happens if you return nil? (functions are in the ruby level, right? :/) [16:19:13] iirc you get 'nil' [16:19:16] though let's see [16:19:30] * valhallasw hands ori a Picard [16:20:12] no one appreciated my humor on https://gerrit.wikimedia.org/r/#/c/216028/ [16:20:37] ori: it did have me in splits, but I'm bound to not participate in that thread anymore due to a gag order :) [16:20:49] hah [16:22:39] valhallasw: empty string [16:22:47] ori: *facepalm* [16:23:24] I love how DSLs always grow into 'real' languages with huge warts [16:23:35] puppet, php, ... [16:27:07] wikitext [16:29:51] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:50:47] 6operations: plan workflow for blocked on ops patches - https://phabricator.wikimedia.org/T88315#1344300 (10yuvipanda) 5Open>3Invalid No movement, and probably too vague as well. 
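A tiny, runnable illustration of the undef semantics being poked at around 16:18-16:23 (and of the earlier "undef == undef" question): it only demonstrates the comparison side, not the function-return quirk ori describes, and assumes a Puppet-3-era parser.

```
# undef compares equal to undef; the string "undef" is a different value.
puppet apply -e '
  $a = undef
  if $a == undef   { notice("undef == undef => true") }
  if $a == "undef" { notice("only the string form would trigger this") }
'
```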
[16:52:07] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1344305 (10yuvipanda) a:5yuvipanda>3None [16:52:47] 6operations: Add updating labs/private with $::puppetmaster_autoupdate feature flag - https://phabricator.wikimedia.org/T75904#1344310 (10yuvipanda) a:5yuvipanda>3None [16:53:42] 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1344315 (10yuvipanda) a:5yuvipanda>3None [16:53:46] 6operations, 5Patch-For-Review: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#1344316 (10yuvipanda) a:5yuvipanda>3None [16:55:12] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [16:57:09] 7Puppet, 6operations: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#1344333 (10yuvipanda) a:5yuvipanda>3None [17:09:22] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [17:10:25] apergos: are you doing something on dumps-3 instance? [17:11:02] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [17:14:24] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1344368 (10yuvipanda) a:5yuvipanda>3None [17:31:32] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [17:32:52] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [17:57:16] Krinkle: does ci need NFS? We can turn it off [17:59:07] Syntax error at '}'; expected '}' [17:59:11] thank you puppet [18:02:21] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [18:09:12] * YuviPanda is back [18:18:19] YuviPanda: nope [18:19:12] not doing anything on any of the instances [18:59:32] apergos: cool. [19:00:55] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1344425 (10yuvipanda) a:5yuvipanda>3None [19:58:45] YuviPanda: What do you mean, turn it off? [20:00:15] Krinkle: it's posible to just not use NFS. then you won't get /home nor /data/project on NFS, and you can treat each instanace as a separate entity [20:00:19] improves reliability massively [20:00:45] YuviPanda: If nothing is using NFS, then it being mounted wouldn't matter [20:00:50] clearly something is depending on it, right? [20:01:11] Or is it possible for NFS to occupy an instance's connectivity proactively/passively? [20:01:17] Krinkle: possibly, but could be as simple as someone with an ssh session in. [20:02:01] YuviPanda: what do you mean? [20:02:11] Krinkle: and it also lives inside the kernel, so I won't be surprised if just having it mounted causes issues [20:03:04] YuviPanda: LDAP and puppet run indedendant right? [20:03:15] And the only thing using NFS is /home and /data ? [20:03:28] yes [20:03:42] having my home directory mounted would be convenient [20:03:46] but it's fine without [20:03:50] as long as ssh works to debug [20:04:16] you can go to 'manage projects', 'configure' and turn them off [20:04:21] we don't use public /data and we don't use project /data [20:04:23] checkboxes! 
[20:04:36] that's too easy [20:04:39] that's per-project [20:04:40] I need something to complain about [20:04:43] hehe :P [20:06:07] I'llgive it a try [20:06:13] ok! [20:07:19] YuviPanda: Hm.. can you help debug why 4 instances are unreachable? [20:07:26] For almost 24 hours now I can't ssh into 4 of them [20:07:33] just disappeared off the radar [20:07:36] Krinkle: have you tried restarting them? [20:07:42] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 14.29% of data above the critical threshold [100000000.0] [20:07:56] YuviPanda: I'd rather figure out what happened first. [20:08:00] We can handle reduced load. [20:08:04] and also #not-my-problem [20:08:26] But I'd be nice to take a peek if it's something relatively obvious [20:08:33] alright, which instances? [20:08:39] and which project? [20:08:46] (because monday they'll be restarted by someone, and it wont be fixed) [20:09:01] integration-slave-trusty-1012, trusty-1013 and 1015 unresponsive to pings or ssh [20:09:31] the other trusty slaves and precise slaves are responding [20:09:50] 1012 just got an ssh response [20:09:53] no shell yet [20:11:04] I'm using integration-puppetmaster to see whats on our /data and if we're using any of it [20:11:14] what's /data/scratch? whoah, that looks like someone's attic [20:11:22] so many different uids [20:11:34] Krinkle: it's shared across projects :) [20:11:45] like a dropbox? [20:11:52] Krinkle: I get Connection closed by Unknown, usually this means something OOM'd [20:12:03] I'm going to try looking at console [20:12:17] Krinkle: something like that, but not permanent - can be wiped at any time, has far less redundancy isn't backed up, etc [20:15:59] Krinkle: no output in 'get console output' due to wikitech flakiness, I presume. [20:16:31] I disabled shared project storage [20:16:41] home directories, what happens if I disable that [20:17:06] ssh keys are set by other means? [20:22:04] Krinkle: yes for trusty, and most probably yes for precise [20:22:16] Krinkle: any negative effects from disabling shared storage? [20:22:33] We didn't use /data/project for anything [20:22:43] We do use /home for a few things, but only for sysadmin tasks. [20:22:49] right [20:22:50] https://phabricator.wikimedia.org/T90610 [20:22:59] I'll propose to have it be phased our [20:23:01] out [20:23:17] YuviPanda: So what about those instances not responding [20:23:34] Krinkle: OOM is my guess, matches the symptoms [20:24:05] YuviPanda: Hm.. [20:24:36] According to nagf, they were actually the less heated instances [20:24:49] and no spike or upward trend in graphite [20:24:54] https://tools.wmflabs.org/nagf/?project=integration#h_integration-slave-trusty-1012_memory [20:24:59] in fact, a downward trend [20:25:06] and it's still reporting variable metrics [20:25:20] (not a flat line looping the same results) [20:26:31] seems to be correlated with a puppet failure around 06/07 midnight if you scroll own to the trusty-1012 puppet agent graph [20:26:39] it had 70+ failures and then back to 0 again [20:26:42] but ot's been down since then [20:26:54] unreachable I mean [20:27:18] disk usage also went down around the same time. 4GB disappeared [20:27:26] (viewing by week) [20:28:08] 06/06 I mean, not 06/07 [20:28:19] hmmm [20:30:13] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [20:30:34] Krinkle: console output is blank, so we both have same amount of info to go by [20:30:57] at this point it's basically: 1. reboot, 2. look at logs, 3. 
if it doesn't come back after reboot, have andrewbogott look at it [20:44:40] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Lou Reed, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1344499 (10AndyRussG) Another quick question... would it be possible to get a more detailed description of the two pipeli... [20:47:11] YuviPanda: OK. I'm rebooting one of them [20:47:21] YuviPanda: btw, how does the shared project store thing get disabled? [20:47:27] I unticked the box a while ago but nothing changes [20:47:34] puppet? [20:50:59] Jun 7 20:50:19 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 77 in file /etc/puppet/modules/monitoring/manifests/service.pp [20:50:59] Jun 7 20:50:19 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 84 in file /etc/puppet/modules/ssh/manifests/server.pp [20:51:01] Jun 7 20:50:20 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 46 in file /etc/puppet/modules/monitoring/manifests/host.pp [20:51:03] Jun 7 20:50:20 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 64 in file /etc/puppet/modules/monitoring/manifests/host.pp [20:51:14] No idea what that is [20:51:20] OK too many errors and I'm tired. [20:51:32] I'll report it on monday. I've been out of touch with this for too long. [21:16:52] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [21:31:25] Krinkle|detached: it gets disabled on the NFS server [21:32:17] 6operations, 6Labs: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344507 (10yuvipanda) Need to figure out: 1. Who all should be paged? 2. What's the paging condition? [21:42:42] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [21:57:36] YuviPanda: But remains mounted? [21:57:38] YuviPanda: btw, integraiton-slave-trusty-1015 didn't become reachable after rebootign [21:57:55] Krinkle: hmm, yes. that needs a puppet fix. [21:58:33] and it seems puppet isn't running on any instances other than precise-1012. I made a change on the puppetmaster that ensures/absent a certain file, but it still exists on all instances except that one. [21:58:35] something weird is going on [21:58:51] I made that change 6 hours ago [21:58:52] hmm [21:59:00] I have to go now, can you file a bug? [21:59:18] Sure [21:59:35] making dinner now, so bbl myself [21:59:48] have a good sunday afternoon/night there [22:03:05] Krinkle|detached: you too :) (I presume you're in the UK as well) [22:04:36] YuviPanda, Krinkle|detached: I've been using NFS on CI slaves to upload scripts on only one server and run them everywhere :/ [22:14:22] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 105.48, 100.35, 97.73 [22:25:41] YuviPanda there is a question regarding labs and bots on #wikipedia-en [22:32:36] ToAruShiroiNeko: then it should go to -labs :) [22:32:47] use the channels, not SPOFs (people) ;) [22:33:13] eh? [22:33:42] we dont have -labs-operations yet? :p [22:33:51] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [22:35:22] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 43.50 ms [22:35:34] why would we have a -labs-operations? is -labs not sufficient? 
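The "You cannot collect without storeconfigs being set" spam pasted above (20:50-20:51) is what a puppetmaster prints when a manifest tries to collect exported resources (the <<| |>> construct used by the monitoring and ssh modules named in those errors) while storeconfigs/PuppetDB is off. A generic way to confirm that on a self-hosted puppetmaster is sketched below; nothing here is taken from integration-puppetmaster itself.

```
# Is storeconfigs enabled, and with which backend?
puppet config print storeconfigs storeconfigs_backend
grep -n 'storeconfigs' /etc/puppet/puppet.conf
# Which manifests are doing the collecting that triggers the error?
grep -rn '<<|' /etc/puppet/modules/monitoring /etc/puppet/modules/ssh
```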
[22:40:01] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 104.87, 100.26, 98.10 [22:54:48] ^ what's the standard response to these? restart swift? [23:10:30] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1344562 (10faidon) a:3faidon [23:13:51] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 105.83, 100.81, 99.08 [23:18:52] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 100.72, 100.38, 99.31 [23:21:57] jgage: sadly, reboot (doing now) [23:24:11] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 101.60, 101.15, 99.80 [23:26:20] 6operations, 10ops-codfw: ms-be2008.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T101665#1344579 (10fgiunchedi) 3NEW [23:27:26] !log reboot ms-be2008 sdg failed, xfs unhappy [23:27:30] Logged the message, Master [23:39:53] (03PS1) 10Alex Monk: Fix pflwiki logo transparency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216595 [23:51:31] RECOVERY - very high load average likely xfs on ms-be2008 is OK - load average: 3.01, 0.68, 0.22 [23:51:52] RECOVERY - RAID on ms-be2008 is OK optimal, 13 logical, 13 physical [23:56:38] 6operations, 7Graphite: graphite2001 OOM and unresponsive - https://phabricator.wikimedia.org/T101572#1344593 (10fgiunchedi) correction, that's carbon-cache ``` graphite2001:~$ ps fwwaux | grep -i carbon-cache filippo 20910 0.0 0.0 11864 936 pts/0 S+ 23:52 0:00 \_ grep -i carbon-cach...