[00:10:25] (03Abandoned) 10Faidon Liambotis: Mobile Cookie Vary caching optimizations [puppet] - 10https://gerrit.wikimedia.org/r/75316 (owner: 10Mark Bergsma) [00:16:59] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [00:34:00] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [00:41:09] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 40 failures [01:01:50] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=106.50 Read Requests/Sec=88.10 Write Requests/Sec=67.30 KBytes Read/Sec=44621.60 KBytes_Written/Sec=456.40 [01:08:29] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0] [01:13:09] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=141.80 Read Requests/Sec=149.10 Write Requests/Sec=5.10 KBytes Read/Sec=75598.40 KBytes_Written/Sec=80.05 [01:29:12] 7Blocked-on-Operations, 6Labs, 10Maps, 6Scrum-of-Scrums: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1343718 (10Yurik) [01:34:09] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=49.40 Read Requests/Sec=4.40 Write Requests/Sec=102.90 KBytes Read/Sec=1985.60 KBytes_Written/Sec=3134.05 [01:39:51] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=156.84 Read Requests/Sec=146.90 Write Requests/Sec=22.80 KBytes Read/Sec=74778.00 KBytes_Written/Sec=139.50 [01:46:50] PROBLEM - puppet last run on mw1256 is CRITICAL Puppet has 1 failures [01:54:00] any ops around? [01:54:12] labstore1001 is overloaded and labs seems crazy slow [01:59:11] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:04:10] RECOVERY - puppet last run on mw1256 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:21:39] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 07m 09s) [02:22:00] Logged the message, Master [02:26:50] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-07 02:25:13+00:00 [02:26:56] Logged the message, Master [02:27:30] Is anyone here? [02:29:02] already tried that [02:37:00] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=113.90 Read Requests/Sec=108.90 Write Requests/Sec=69.80 KBytes Read/Sec=55220.80 KBytes_Written/Sec=478.50 [02:52:02] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=134.60 Read Requests/Sec=152.30 Write Requests/Sec=1.10 KBytes Read/Sec=77760.80 KBytes_Written/Sec=14.00 [02:52:23] andrewbogott: hey [02:52:41] hello! [02:57:29] ori: I don’t supposed you are an NFS expert? [02:57:29] I know almost nothing about NFS [02:57:29] whatever it is, it started at 00:50: http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Labs+NFS+cluster+eqiad [02:57:29] Me neither [02:57:29] hah, wow! [02:57:29] One of these graphs is not like the others https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring [02:57:29] I'm just looking for interesting things in /var/log with a timestamp close to that [02:57:39] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=50.40 Read Requests/Sec=0.10 Write Requests/Sec=68.10 KBytes Read/Sec=0.40 KBytes_Written/Sec=592.70 [02:58:06] Usually this kind of problem is caused by a client running amok. 
But it seems like that would clog the network before it got around to clogging IO [03:01:24] nfsd: peername failed (err 107)! <- that is what started happening at 1 [03:03:10] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=127.20 Read Requests/Sec=92.50 Write Requests/Sec=17.20 KBytes Read/Sec=46798.40 KBytes_Written/Sec=94.00 [03:04:23] err 107 means "kernel_getpeername failed on a new incoming tcp connection" [03:04:30] i think the reason it failed is file descriptor exhaustion [03:04:50] yeah [03:04:59] I can’t think of anything to do about that subtler than a reboot [05:13:00] !log service nfs-kernel-server restart on labstore1001 [05:13:02] !log rebooting labstore1001 [05:13:02] ohh. I was still looking. oh well. [05:13:02] sorry! [05:13:02] For what it’s worth, it doesn’t seem to actually be rebooting [05:13:03] yes it is :) [05:13:03] ah, there it goes [05:13:03] Krenair: thanks for fielding user inquiries :) [05:13:05] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:08] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=80.80 Read Requests/Sec=86.50 Write Requests/Sec=31.70 KBytes Read/Sec=43713.20 KBytes_Written/Sec=171.05 [05:13:08] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=85.70 Read Requests/Sec=110.30 Write Requests/Sec=23.10 KBytes Read/Sec=55206.80 KBytes_Written/Sec=152.40 [05:13:08] ori: Now I have a console on labstore1001 but I seem unable to ping it [05:13:08] or ping anything from it [05:13:08] Surely this is not true and I’m making some dumb mistake [05:13:08] check my work? [05:13:10] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=6.80 Read Requests/Sec=0.20 Write Requests/Sec=16.00 KBytes Read/Sec=2.00 KBytes_Written/Sec=168.70 [05:13:11] PROBLEM - puppet last run on ms-be1009 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw1074 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw1023 is CRITICAL Puppet has 1 failures [05:13:11] PROBLEM - puppet last run on mw2026 is CRITICAL puppet fail [05:13:11] PROBLEM - puppet last run on mw1243 is CRITICAL Puppet has 1 failures [05:13:12] PROBLEM - puppet last run on mw1087 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [05:13:13] PROBLEM - puppet last run on mw1171 is CRITICAL Puppet has 2 failures [05:13:13] PROBLEM - puppet last run on mw2038 is CRITICAL Puppet has 1 failures [05:13:15] PROBLEM - puppet last run on mw1071 is CRITICAL puppet fail [05:13:15] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures [05:13:16] RECOVERY - puppet last run on mw1074 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:13:17] PROBLEM - puppet last run on mw2188 is CRITICAL Puppet has 1 failures [05:13:17] (03CR) 10Glaisher: [C: 04-1] "See task." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/216415 (https://phabricator.wikimedia.org/T101604) (owner: 10Glaisher) [05:13:17] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 1 failures [05:13:17] andrewbogott: your description seems accurate [05:13:17] I don't have console access so that's where I get off the troubleshooting train [05:13:17] PROBLEM - puppet last run on es1003 is CRITICAL Puppet has 1 failures [05:13:17] This has to be a new problem [05:13:17] I texted Faidon [05:13:17] The console just says ‘ Network is unreachable’ if I try to ping anything [05:13:17] although it shows the nics as ‘up' [05:13:17] Switch issue? [05:13:17] /cabling [05:13:17] Probably, but it’s a big-ass coincidence [05:13:17] Could be a fucked motherboard etc too [05:13:17] And since when does rebooting a server cause a cable to unplug itself? [05:13:18] PROBLEM - puppet last run on helium is CRITICAL puppet fail [05:13:18] It’s no wonder so many sysadmins turn to religion [05:13:18] PROBLEM - puppet last run on graphite1002 is CRITICAL puppet fail [05:13:18] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail [05:13:18] I wonder how long I wait for Faidon before I start texting people where it’s even earlier… [05:13:18] Coren? [05:13:18] well, let's give Faidon another few minutes and use the time to think about this. Which monitoring tools disclose switch failures, etc.? Racktables? [05:13:18] Racktables isn't monitoring [05:13:18] torrus has networking stuff [05:13:18] wow, I’ve never heard of torrus before. It has /lots/ of information that I don’t understand! [05:13:18] Just wondering if it drills far enough done, or it's teh bigger switches/routers [05:13:18] labstore1001 is in eqiad row C. is anything else in eqiad row C struggling? [05:13:18] PROBLEM - puppet last run on db1052 is CRITICAL puppet fail [05:13:18] PROBLEM - puppet last run on db2057 is CRITICAL puppet fail [05:13:18] https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1567 [05:13:18] what are the puppet failures about? [05:13:18] PROBLEM - puppet last run on mw1111 is CRITICAL puppet fail [05:13:18] I don’t know, but they seem to be transient. I re-ran one and it updated some geoip stuff and then succeeded. [05:13:18] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 1 failures [05:13:18] PROBLEM - puppet last run on mw1079 is CRITICAL puppet fail [05:13:18] RECOVERY - puppet last run on ms-be1009 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [05:13:19] a few random hosts in c3 seem ok [05:13:19] to ping [05:13:19] PROBLEM - puppet last run on db2037 is CRITICAL puppet fail [05:13:19] PROBLEM - puppet last run on analytics1023 is CRITICAL puppet fail [05:13:19] PROBLEM - puppet last run on elastic1030 is CRITICAL Puppet has 1 failures [05:13:19] PROBLEM - puppet last run on mw1210 is CRITICAL puppet fail [05:13:19] the puppet failures are surely related [05:13:19] it's time to page more people [05:13:19] i'd start with bblack [05:13:19] andrewbogott: what if you just restart networking etc on that host? [05:13:19] since it's a sane hour for him [05:13:19] jgage too? 
[05:13:19] mutante [05:13:19] bblack would know how to respond to this [05:13:19] ori: bblack was here briefly but then didn’t respond to my query so I think he must’ve had to go [05:13:19] PROBLEM - puppet last run on protactinium is CRITICAL puppet fail [05:13:19] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:19] PROBLEM - puppet last run on mw1200 is CRITICAL Puppet has 1 failures [05:13:19] PROBLEM - puppet last run on mw1163 is CRITICAL puppet fail [05:13:20] i'd still call him [05:13:20] Be worth ringing him [05:13:20] yeah [05:13:20] PROBLEM - puppet last run on rcs1002 is CRITICAL puppet fail [05:13:20] Reedy: when you say ‘resart networking’ can you be more specific? What exact command would you advise? [05:13:20] ifdown/ifup didn’t do much [05:13:20] /etc/init.d/networking restart [05:13:20] but that's roughyl the same? [05:13:20] RECOVERY - puppet last run on mw1171 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [05:13:20] andrewbogott: it's too severe to muck around any longer, I think [05:13:20] yeah, I’m calling people too, just multitasking [05:13:20] the puppet failures are do to /usr/local/bin/sshknowngen failing to run on the puppet master [05:13:20] well, texting [05:13:20] it's failing to run because it can't reach the database, I think [05:13:20] oh, so a second network issue? [05:13:20] RECOVERY - puppet last run on mw1023 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1243 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:13:20] probably the same network issue [05:13:20] something is flapping [05:13:20] RECOVERY - puppet last run on mw1087 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw2188 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures [05:13:20] RECOVERY - puppet last run on mw2038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw2026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [05:13:20] do we know which cr labstore1001 is connected to? [05:13:20] presumably the same as 1002? [05:13:20] RECOVERY - puppet last run on es1003 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:13:20] RECOVERY - puppet last run on mw1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:20] ae2-1002.cr2-eqiad.wikimedia.org [05:13:21] ok, texted brandon and Alex. 
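On the "Network is unreachable from the console even though the NICs show up" symptom, and the ifdown/ifup and networking-restart attempts just above: a minimal set of console checks, in rough order of layering, might look like the sketch below. These are generic commands, not output from labstore1001; the gateway address is a placeholder.

```
# Link layer: does the kernel see carrier on either NIC?
ip link show eth0
ip link show eth1
ethtool eth0 | grep -E 'Speed|Duplex|Link detected'
# Network layer: does any interface actually hold the host's address,
# and is there a default route? "Network is unreachable" usually means
# the routing table is empty because the expected interface never came up.
ip addr show
ip route show
ping -c 1 <gateway-ip>   # placeholder for the row's gateway, not from this log
```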
[05:13:21] RECOVERY - puppet last run on helium is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:13:21] At least it's after 7am in Greece now [05:13:21] RECOVERY - puppet last run on mw1200 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:21] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] andrewbogott: can you try taking eth1 down and bringing eth0 up? [05:13:22] sure [05:13:22] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:13:22] based on this, it looks like eth0 is the one that carries most of the traffic: http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ge-3/0/5&view=expanded-dir-html [05:13:22] but after the reboot eth1 came to life for some reason: http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ge-3/0/11&view=expanded-dir-html [05:13:22] whoops, console just said bnx2 0000:01:00.0 eth0: NIC Copper Link is Down [05:13:22] and bond0, whatever that is (presumably not a james bond reference): http://torrus.wikimedia.org/torrus/Network?nodeid=if//asw-c-eqiad.mgmt.eqiad.wmnet//ae3&view=expanded-dir-html [05:13:22] oh, that’s maybe right after I ifdown’d it [05:13:22] bonded ethernet for moar bandwidth [05:13:22] link aggregation [05:13:22] so now eth1 is down and eth0 is up [05:13:22] Reedy: stop being smart, it's making us look bad [05:13:22] and the bonded link is up too [05:13:22] well, so says ip info [05:13:22] * Reedy invoices the WMF [05:13:22] sounds like a dodgy cable/nic/switchport [05:13:22] RECOVERY - puppet last run on db1052 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on elastic1030 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [05:13:22] ok, faindon en route [05:13:22] uh… faidon [05:13:22] what does ethtool eth0 say? [05:13:22] https://dpaste.de/9Hza [05:13:22] hey [05:13:22] RECOVERY - puppet last run on db2057 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] what's up? [05:13:22] paravoid: thanks for waking up :/ [05:13:22] RECOVERY - puppet last run on db2037 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [05:13:22] RECOVERY - puppet last run on analytics1023 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [05:13:22] So, my immediate issue is that I rebooted labstore1001 (long story) and it can’t connect to the network at all now. [05:13:22] RECOVERY - puppet last run on mw1111 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [05:13:22] ori thinks that there’s a more general networking issue happening (causing puppet runs to flap, as you see) [05:13:22] paravoid: recap: cpu system and iowait shot up at 00:50, syslog shows "nfsd: peername failed (err 107)" starting to appear around that time, err 107 means "kernel_getpeername failed on a new incoming tcp connection". andrewbogott rebooted, server booted but network link was not restored. 
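On the file-descriptor-exhaustion theory in the recap above (err 107 on new incoming connections): a rough way to gauge fd and connection pressure on an NFS server is sketched below. The commands are generic; none of this is output from labstore1001.

```
# System-wide file handle usage: allocated / unused-but-allocated / maximum.
cat /proc/sys/fs/file-nr
# How many TCP connections are sitting on the NFS port?
ss -tan | grep -c ':2049'
# Open-file limit of a userspace NFS helper (rpc.mountd is just an example;
# knfsd itself runs as kernel threads, so its sockets aren't process fds).
grep 'Max open files' "/proc/$(pgrep -o rpc.mountd)/limits"
```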
[05:13:22] RECOVERY - puppet last run on mw1079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] paravoid: shall I evacuate the labstore1001 console so you can log in? [05:13:22] torrus shows that prior to reboot, eth0 was the interface carrying traffic; post-reboot it appears to be eth1. andrewbogott reported console said "bnx2 0000:01:00.0 eth0: NIC Copper Link is Down" [05:13:22] RECOVERY - puppet last run on mw1210 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [05:13:22] what ip is eth1 using? [05:13:22] RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] RECOVERY - puppet last run on protactinium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] RECOVERY - puppet last run on mw1163 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] doesn't seem to be a router / switch issue because other hosts do not appear to be affected. Reedy plausibly suspects cable/nic/switchport. [05:13:22] RECOVERY - puppet last run on rcs1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:22] shit [05:13:22] switch is full of errors [05:13:22] random puppet failures are the result of failing to fetch geoip updates or puppetmaster failing to run /usr/local/bin/sshknowngen, perhaps because of same underlying issue? [05:13:22] aha [05:13:22] not good [05:13:23] switch for that cabinet? [05:13:23] (for later: where do you see switch errors?) [05:13:23] ssh to switch [05:13:23] Reedy: for that row... [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2) [05:13:23] do we have a spare onsite? [05:13:23] etc. [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_rebake(),482:fdb_mac_entry_mc_set() failed(-1) [05:13:23] Jun 7 04:04:55 asw-c-eqiad fpc1 RT-HAL,rt_entry_topo_handler,4121: l2_halp_vectors->l2_entry_rebake failed [05:13:23] Jun 7 04:05:02 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2) [05:13:23] it's likely a software error [05:13:23] fscking software engineers [05:13:23] a reboot would be likely to fix it, but we really really don't want to reboot that [05:13:23] !log disabling asw-c-eqiad interfaces: ge-3/0/5, ge-3/0/11, ae3 [05:13:23] ok [05:13:23] log spam stopped for now, although it wasn't consistent [05:13:23] doesn't help labstore1001, but I'm not worried about labstore atm :) [05:13:23] that’s fine, we can wait! [05:13:23] thanks for waking up for this [05:13:23] can I get the labstore console? [05:13:23] andrewbogott: ^ [05:13:23] paravoid: I’m clear of it, help yourself. [05:13:23] thx [05:13:23] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail [05:13:23] I don't think the puppet errors are related btw [05:13:23] ok, so you think it’s contained to just this one host? 
[05:13:23] they /could/ be, but I'm guessing we'd be seeing a lot more errors on the app side [05:13:23] oh no, there is definitely some kind of switch issue [05:13:23] but it may not be affecting everything [05:13:23] I just don't find it plausible that this manifests in puppet 502s but not site 5xxs [05:13:23] there has been a bug in the switch concerning labstore's connectivity for many months [05:13:23] I'll explain in a bit [05:13:23] RECOVERY - Host labstore1001 is UPING OK - Packet loss = 0%, RTA = 1.70 ms [05:13:23] wait what [05:13:23] lol [05:13:23] now it will go down again [05:13:23] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:24] !log reenabling asw-c-eqiad:ge-3/0/5; dismantling ae3 [05:13:24] !log removing bond0 configuration from labstore1001, back to eth0 alone [05:13:24] RECOVERY - Host labstore1001 is UPING OK - Packet loss = 0%, RTA = 0.89 ms [05:13:24] oh, that’s going to be ugly :( [05:13:24] what is? [05:13:24] I still see switch errors even with the single port [05:13:24] If we reduce NFS bandwidth by 2/3s [05:13:24] maybe I misunderstand the implication of that [05:13:24] we're not [05:13:24] 07:32 <@paravoid> there has been a bug in the switch concerning labstore's connectivity for many months [05:13:24] we never succeeded in setting up bonding [05:13:24] oh! [05:13:24] Well, that’s sort of… good news, I guess :/ [05:13:24] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:13:24] anyway, sorry to interrupt [05:13:24] there was likely some switch bug that made it to not work [05:13:24] PROBLEM - puppet last run on mw1244 is CRITICAL Puppet has 1 failures [05:13:24] not as severe of a switch bug though [05:13:24] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - dhclient process on labstore1001 is CRITICAL: Connection refused by host [05:13:24] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [05:13:24] there's a start-nfs script i see [05:13:24] that has bond0 hardcoded [05:13:24] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host [05:13:24] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [05:13:24] paravoid: it’s puppetized, shall I change it to eth0? 
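Since bond0 (the link-aggregation interface discussed earlier) is central here, and start-nfs reportedly hardcodes it, the bonding driver's own view plus a quick post-change sanity pass might look like the sketch below. Interface names follow the log; the start-nfs path is an assumption for illustration, not confirmed anywhere above.

```
# Bonding mode, active slave, and per-slave MII status - a dead eth0
# would show up here as "MII Status: down" for that slave.
cat /proc/net/bonding/bond0
# Raw carrier state per interface, straight from the kernel.
cat /sys/class/net/eth0/carrier /sys/class/net/eth1/carrier
# After dismantling the bond: is the service address now on eth0?
ip addr show eth0
# Any leftover references to the old bonded interface?
grep -n 'bond0' /usr/local/sbin/start-nfs /etc/network/interfaces   # paths assumed
```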
[05:13:24] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:13:24] RECOVERY - RAID on labstore1001 is OK optimal, 72 logical, 72 physical [05:13:24] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [05:13:24] RECOVERY - Disk space on labstore1001 is OK: DISK OK [05:13:24] RECOVERY - DPKG on labstore1001 is OK: All packages OK [05:13:24] (03PS1) 10Faidon Liambotis: Switch labstore1001 from bond0 to eth0 [puppet] - 10https://gerrit.wikimedia.org/r/216519 [05:13:24] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Switch labstore1001 from bond0 to eth0 [puppet] - 10https://gerrit.wikimedia.org/r/216519 (owner: 10Faidon Liambotis) [05:13:24] oh sorry [05:13:24] not at all :) [05:13:24] RECOVERY - configured eth on labstore1001 is OK - interfaces up [05:13:24] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 7 04:54:32 UTC 2015 (duration 54m 31s) [05:13:24] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [05:13:24] ok [05:13:24] let's start NFS [05:13:24] andrewbogott: can you? I'm not very familiar with all that [05:13:24] sure. All I know to do, though, is run that script. [05:13:24] But, I will run it! [05:13:24] Hm, it’s either hanging or doing lots of hard work [05:13:24] [ 4733.612740] EXT4-fs warning (device dm-9): ext4_multi_mount_protect:320: MMP interval 42 higher than expected, please wait. [05:13:24] did you already have the filesystem mounted from another labstore? [05:13:24] that's usually indicative of an attempt of mounting the same filesystem twice from two different systems [05:13:24] I don’t know. [05:13:24] (which is a very bad thing if it happens) [05:13:24] labstore1002 is up and running, but nothing was mounted [05:13:24] when I checked an hour ago, at least. [05:13:24] I certainly didn’t start it there. [05:13:24] RECOVERY - puppet last run on mw1244 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:13:24] Hm, iirc I cannot ctrl-c out of this hanging script. [05:13:24] What do you think, wait or kill? [05:13:24] paravoid: ^ [05:13:24] doubt you can kill it at this point [05:13:24] so all that sounds bad [05:13:24] let's bring Coren online too? [05:13:24] I haven’t been able to reach him but I can try again. [05:13:24] oh [05:13:24] I think it proceeded [05:13:24] [ 5132.432478] EXT4-fs (dm-9): ext4_orphan_cleanup: deleting unreferenced inode 474494162 [05:13:24] Does it look to you like labstore1002 is connected to the disks? I may be looking at the wrong thing. [05:13:24] no [05:13:24] it doesn't look like it to me either [05:13:24] ok [05:13:24] I already have a note about how that script message needs to be clarified (or labcontrol1002 shut down) in the outage report. [05:13:24] So you think that one of the start-nfs phases is just doing hard, honest work now? [05:13:24] probably not [05:13:24] it's stuck at mounting the fs [05:13:24] /bin/mount /srv/project <- that one? [05:13:24] yes [05:13:24] so we can't really mount from labstore1002, right? [05:13:24] yeah, it's a jessie system, I think he ran into trouble when he tried it... [05:13:24] I think that’s right. 
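On the ext4_multi_mount_protect warning quoted above: MMP is ext4's guard against the same filesystem being mounted by two hosts at once, and its superblock bookkeeping can be inspected directly. A rough sketch, reusing dm-9 from the kernel message; nothing below is actual labstore output.

```
# MMP fields live in the superblock: the feature flag, the MMP block number,
# and the update interval the mount was complaining about.
dumpe2fs -h /dev/dm-9 | grep -i mmp
# Cheap cross-check that the standby NFS host doesn't have it mounted.
ssh labstore1002 'grep -E "srv|dm-9" /proc/mounts'
```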
[05:13:24] I haven’t tried ‘known issues’ https://wikitech.wikimedia.org/wiki/Labs_NFS yet [05:13:24] although I guess those instructions aren’t much different from what I’ve done alrady [05:13:24] minus the unmount /srv [05:13:24] hey [05:13:24] it's back :) [05:14:04] inaction pays off [05:14:24] is there a "stop-nfs"? no right? [05:14:44] no, just ‘service nfs-kernel-whatever stop’ [05:15:34] so much duct tape [05:15:43] morebots, welcome! [05:15:43] I am a logbot running on tools-exec-1203. [05:15:44] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:15:44] To log a message, type !log . [05:16:01] !log we did a whole lot of things to labstore1001 while morebots was away [05:16:05] Logged the message, Master [05:16:43] I still can’t log in anywhere, which suggests that $home is still not mounting... [05:16:45] I killed everything [05:19:58] ok [05:19:59] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [05:20:00] is it expected for start-nfs to spew all kinds of "No such file or directory" errors? [05:20:00] I'm going to guess yes [05:20:00] ok, I restarted NFS [05:20:00] I don’t know if it’s expected, but it’s done it every time tonight [05:20:05] the switch hasn't emitted any new errors [05:20:14] so the bond0->eth0 at least stabilized things [05:20:17] probably :) [05:20:30] I can log into instances and see my homedir [05:21:43] paravoid: do you think the switch issue accounts for the /original/ problem we saw (IO flooded and peername failed) or just the fact that it couldn’t join the network after reboot? [05:22:10] I can't be sure but my money would be just on the latter [05:22:42] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset -0.05862987041 secs [05:22:42] and the former… a badly behaving client? [05:22:53] the mount errors now suggest that these mountpoints did not unmount cleanly [05:23:01] before shutdown [05:23:05] you did a clean reboot, right? [05:23:20] When you say ‘clean'... [05:23:31] I stopped NFS and rebooted. I didn’t unmount before the reboot. [05:23:37] reboot does that [05:23:42] the kernel does [05:23:44] that’s what I thought. [05:23:57] So, yeah, it didn’t crash, I explicitly rebooted. [05:24:46] right, crash or powercycle was the question [05:24:54] so something may have happened on the I/O side of the machine [05:25:18] anything from kernel to RAID controller or disk shelf misbehaving, dunno [05:25:25] Here’s the timeline: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150607-LabsNFS-Outage [05:25:46] oh awesome [05:25:48] There’s space for you to add things at the bottom, but that’s not urgent. [05:26:52] yup [05:33:22] paravoid: I think we’re done for now unless you see something especially incriminating in the syslog from back at 01:00 [05:33:38] Thanks again for donating your Sunday morning :( [05:34:11] more like night :P [05:35:02] Oh, I guess it’s only 8:30? Maybe you can get back to sleep. [05:35:45] lulled by the sound of your inbox filling with ‘RECOVERY’ emails [05:36:31] ori, Reedy, thanks to both of you as well. [05:36:47] and Krenair [05:39:26] thanks paravoid [05:39:30] hope you get some rest [05:39:43] you too andrewbogott [05:40:26] I'm not at home, I'm in Paris [05:40:31] so it's 7:30 atm [05:40:49] but that's okay, I'm glad you called [05:41:17] andrewbogott: I updated the timeline now that I have it fresh [05:41:32] andrewbogott: we can do actionables etc. later [05:42:23] switch hasn't logged any other errors [05:42:26] things look stable [05:43:07] what’s in Paris? 
[05:43:09] I'm going to bed again -- call me if bad things are happening [05:43:28] I came for dotScale, but it's not technically a work trip :P [05:43:39] I’m going to bed too, so someone else will have to call both of us. [05:43:44] sleep well [05:44:11] bye! [06:31:12] PROBLEM - puppet last run on db1021 is CRITICAL Puppet has 1 failures [06:34:22] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:34:41] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:35:12] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures [06:35:21] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:35:43] PROBLEM - puppet last run on mw2095 is CRITICAL Puppet has 1 failures [06:35:43] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 2 failures [06:36:02] PROBLEM - puppet last run on mw2056 is CRITICAL Puppet has 2 failures [06:36:02] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures [06:36:03] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures [06:46:32] RECOVERY - puppet last run on db1021 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:11] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:21] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:11] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:48:12] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:31] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:22] !log springle Synchronized wmf-config/db-eqiad.php: repool db1073, warm up (duration: 01m 09s) [07:03:28] Logged the message, Master [10:26:05] (03PS2) 10Yuvipanda: Tools: Do not require package python-sh [puppet] - 10https://gerrit.wikimedia.org/r/213849 (https://phabricator.wikimedia.org/T91874) (owner: 10Tim Landscheidt) [10:26:14] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Do not require package python-sh [puppet] - 10https://gerrit.wikimedia.org/r/213849 (https://phabricator.wikimedia.org/T91874) (owner: 10Tim Landscheidt) [10:47:04] (03PS2) 10Giuseppe Lavagetto: conftool: adding the cli-tool and integration tests [software/conftool] - 10https://gerrit.wikimedia.org/r/215891 [11:03:07] (03PS1) 10Yuvipanda: Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 [11:03:09] (03PS1) 10Yuvipanda: Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 [11:03:13] valhallasw: ^ [11:04:03] 6operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632#1343905 (10yuvipanda) 3NEW [11:09:44] YuviPanda: YESSSS PLEASE 
[11:10:04] I'm testing the package changes now [11:10:05] YuviPanda: or, alternatively, a toollabs branch :> [11:10:31] (03CR) 10Merlijn van Deen: [C: 032] Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 (owner: 10Yuvipanda) [11:10:33] I don't think that's gonna happen tho, unless we get our own puppetmaster [11:11:20] which might not even be the worst idea [11:11:27] (03CR) 10Merlijn van Deen: [C: 032] Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 (owner: 10Yuvipanda) [11:11:41] (03Merged) 10jenkins-bot: Rename to toollabs.webservice from tools.webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216527 (owner: 10Yuvipanda) [11:11:43] (03Merged) 10jenkins-bot: Explicitly handle 'stop' in update_manifest for generic webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/216528 (owner: 10Yuvipanda) [11:12:40] valhallasw: yeah, I'd avoid it if possible but you're the one who's going to be more affected by it than I am going to be, so if you think it'll be worth the trouble we can try it out [11:14:26] YuviPanda: yeah, or we can just go the '+1 from valhallasw of scfc means YuviPanda/Coren merge the patch' route, I guess. [11:14:39] valhallasw: I'm totally happy with the latter [11:14:43] sounds good [11:14:55] btw, should HBA work now? [11:15:07] valhallasw: only in hosts that have toollabs::hba included [11:15:11] so that's exec and webgrid hosts [11:15:18] ooooh. [11:15:19] we can expand them to all if you'd like [11:15:21] yes, very nice [11:15:26] yeah, my mosh would like that [11:15:39] let's do it! I'm going to make a patch [11:16:05] valhallasw: btw, I wrote the script to check if instances are on varied enough virt* hosts. everything's ok except master and shadow are on same host :) [11:16:13] cool! [11:17:07] so we should rebuild shadow, and document the process while we're at it? [11:17:22] like mailrelay-01 that I still have to finish *look of disapproval* [11:18:36] valhallasw: yup, plan is to move to tools-master-01 and -02, and have them be trusty as well [11:34:35] valhallasw: hmm, enabling it for everything might be slightly messy becaues of how security/access.conf is done [11:35:19] ? 
[11:36:01] valhallasw: toollabs::hba writes to security/access.conf and so does toollabs::infrastructure [12:22:42] I see [13:21:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0] [13:22:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [13:22:59] (03PS8) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:23:23] (03CR) 10Merlijn van Deen: "Changes vs the previous version:" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:23:33] YuviPanda: ^ if you have time [13:23:44] I should probably test-deply on beta, but ugh [13:24:01] (03CR) 10jenkins-bot: [V: 04-1] Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:24:09] :{ [13:24:41] automergefail [13:26:01] (03PS9) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:26:25] (03CR) 10Yuvipanda: Tools: Simplify and fix mail setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:26:45] valhallasw: ^ [13:26:53] I don't know enough about exim to comment on the config itself, however [13:26:56] did I forget to remove that line? odd. [13:27:12] what on earth is going on here [13:27:43] doh :< [13:28:29] yeah, I had moved it, but I messed up my commit [13:28:37] (03PS10) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:32:36] YuviPanda: sorry, you basically re-reviewed the earlier patch, but rebased. sorry :( ^ this one actually changed stuff [13:32:52] ah, heh :) no worries, looing agin [13:32:54] *looking [13:33:21] YuviPanda: I'm still not entirely happy with $is_mail_relay, but at least there's a sanity check now [13:33:32] yeah, I can't think of any alternative to that atm [13:35:04] hm, what hiera files does toolsbeta read? hieradata/labs/toolsbeta/*? (that dir doesn't exist) [13:36:41] valhallasw: wikitech, Hiera:toolsbeta [13:36:51] nothing from disk? 
[13:36:54] you can create it in hieradata/labs/toolsbeta too if you want [13:37:46] mkay [13:38:02] * valhallasw is wondering how scfc tested stuff [13:39:32] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:39:34] (03CR) 10Yuvipanda: "minor nit but the puppet part looks good to me - I've no idea about the exim parts tho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen) [13:41:31] YuviPanda: :D well the puppet docs also don't know [13:41:45] and I'm not sure if undef == undef is true [13:41:56] (just like nan == nan returns false) [13:42:20] YuviPanda: https://docs.puppetlabs.com/references/latest/function.html#defined [13:46:39] valhallasw: :D [13:46:51] valhallasw: undef == undef is true [13:52:40] also it's puppet, so it might just give you 'trololol' back, which then becomes true [13:53:02] anyway, will fix, and I'll add some toolsbeta hiera stuff [13:54:32] (03PS11) 10Merlijn van Deen: Tools: Simplify and fix mail setup [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) [13:54:41] or maybe first time to screw around with puppet-compiler -_-' [14:10:13] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:20:05] (03PS2) 10Yuvipanda: [sshd] Disable agent forwarding [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:20:36] (03PS3) 10Yuvipanda: ssh: Disable agent forwarding for production [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:20:42] valhallasw: ^ I've disabled it in production only, and labs projects can individually opt in if they wish [14:21:31] (03CR) 10Yuvipanda: "I've made it production only - where I believe it can go ahead now? 
Labs projects can opt in to this change one by one over time and then " [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:21:42] (03CR) 10Merlijn van Deen: [C: 031] ssh: Disable agent forwarding for production [puppet] - 10https://gerrit.wikimedia.org/r/199936 (owner: 10Chad) [14:42:53] (03PS1) 10ArielGlenn: dumps: on dry run for streaming dumps, no compressors write to files [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216539 [14:46:25] (03CR) 10ArielGlenn: [C: 032] dumps: on dry run for streaming dumps, no compressors write to files [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/216539 (owner: 10ArielGlenn) [15:35:25] (03PS1) 10Yuvipanda: Don't crash hard if unknown webservice type is encountered [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/216542 [15:39:26] (03CR) 10Yuvipanda: [C: 032] Don't crash hard if unknown webservice type is encountered [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/216542 (owner: 10Yuvipanda) [15:52:12] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:57:02] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 20.00% of data above the critical threshold [100000000.0] [16:04:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [16:16:25] (03PS6) 10Paladox: Add json and less highlight support to gitblit and gerrit [puppet] - 10https://gerrit.wikimedia.org/r/216421 [16:17:05] 6operations, 6Labs: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344262 (10yuvipanda) 3NEW [16:18:17] valhallasw: fun fact: puppet functions can't distinguish between 'undef' the string and undef the value [16:18:27] and can only return 'undef' the string [16:18:46] ori: :D what happens if you return nil? (functions are in the ruby level, right? :/) [16:19:13] iirc you get 'nil' [16:19:16] though let's see [16:19:30] * valhallasw hands ori a Picard [16:20:12] no one appreciated my humor on https://gerrit.wikimedia.org/r/#/c/216028/ [16:20:37] ori: it did have me in splits, but I'm bound to not participate in that thread anymore due to a gag order :) [16:20:49] hah [16:22:39] valhallasw: empty string [16:22:47] ori: *facepalm* [16:23:24] I love how DSLs always grow into 'real' languages with huge warts [16:23:35] puppet, php, ... [16:27:07] wikitext [16:29:51] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:50:47] 6operations: plan workflow for blocked on ops patches - https://phabricator.wikimedia.org/T88315#1344300 (10yuvipanda) 5Open>3Invalid No movement, and probably too vague as well. 
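A tiny, runnable illustration of the undef semantics being poked at around 16:18-16:23 (and of the earlier "undef == undef" question): it only demonstrates the comparison side, not the function-return quirk ori describes, and assumes a Puppet-3-era parser.

```
# undef compares equal to undef; the string "undef" is a different value.
puppet apply -e '
  $a = undef
  if $a == undef   { notice("undef == undef => true") }
  if $a == "undef" { notice("only the string form would trigger this") }
'
```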
[16:52:07] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1344305 (10yuvipanda) a:5yuvipanda>3None [16:52:47] 6operations: Add updating labs/private with $::puppetmaster_autoupdate feature flag - https://phabricator.wikimedia.org/T75904#1344310 (10yuvipanda) a:5yuvipanda>3None [16:53:42] 6operations, 7Shinken: Shinken hostname column is not large enough - https://phabricator.wikimedia.org/T1362#1344315 (10yuvipanda) a:5yuvipanda>3None [16:53:46] 6operations, 5Patch-For-Review: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#1344316 (10yuvipanda) a:5yuvipanda>3None [16:55:12] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [16:57:09] 7Puppet, 6operations: Puppetize ircyall & set up instance appropriately - https://phabricator.wikimedia.org/T1357#1344333 (10yuvipanda) a:5yuvipanda>3None [17:09:22] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [17:10:25] apergos: are you doing something on dumps-3 instance? [17:11:02] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [17:14:24] 7Puppet, 6operations, 10Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#1344368 (10yuvipanda) a:5yuvipanda>3None [17:31:32] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [17:32:52] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [17:57:16] Krinkle: does ci need NFS? We can turn it off [17:59:07] Syntax error at '}'; expected '}' [17:59:11] thank you puppet [18:02:21] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [18:09:12] * YuviPanda is back [18:18:19] YuviPanda: nope [18:19:12] not doing anything on any of the instances [18:59:32] apergos: cool. [19:00:55] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1344425 (10yuvipanda) a:5yuvipanda>3None [19:58:45] YuviPanda: What do you mean, turn it off? [20:00:15] Krinkle: it's posible to just not use NFS. then you won't get /home nor /data/project on NFS, and you can treat each instanace as a separate entity [20:00:19] improves reliability massively [20:00:45] YuviPanda: If nothing is using NFS, then it being mounted wouldn't matter [20:00:50] clearly something is depending on it, right? [20:01:11] Or is it possible for NFS to occupy an instance's connectivity proactively/passively? [20:01:17] Krinkle: possibly, but could be as simple as someone with an ssh session in. [20:02:01] YuviPanda: what do you mean? [20:02:11] Krinkle: and it also lives inside the kernel, so I won't be surprised if just having it mounted causes issues [20:03:04] YuviPanda: LDAP and puppet run indedendant right? [20:03:15] And the only thing using NFS is /home and /data ? [20:03:28] yes [20:03:42] having my home directory mounted would be convenient [20:03:46] but it's fine without [20:03:50] as long as ssh works to debug [20:04:16] you can go to 'manage projects', 'configure' and turn them off [20:04:21] we don't use public /data and we don't use project /data [20:04:23] checkboxes! 
[20:04:36] that's too easy [20:04:39] that's per-project [20:04:40] I need something to complain about [20:04:43] hehe :P [20:06:07] I'llgive it a try [20:06:13] ok! [20:07:19] YuviPanda: Hm.. can you help debug why 4 instances are unreachable? [20:07:26] For almost 24 hours now I can't ssh into 4 of them [20:07:33] just disappeared off the radar [20:07:36] Krinkle: have you tried restarting them? [20:07:42] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 14.29% of data above the critical threshold [100000000.0] [20:07:56] YuviPanda: I'd rather figure out what happened first. [20:08:00] We can handle reduced load. [20:08:04] and also #not-my-problem [20:08:26] But I'd be nice to take a peek if it's something relatively obvious [20:08:33] alright, which instances? [20:08:39] and which project? [20:08:46] (because monday they'll be restarted by someone, and it wont be fixed) [20:09:01] integration-slave-trusty-1012, trusty-1013 and 1015 unresponsive to pings or ssh [20:09:31] the other trusty slaves and precise slaves are responding [20:09:50] 1012 just got an ssh response [20:09:53] no shell yet [20:11:04] I'm using integration-puppetmaster to see whats on our /data and if we're using any of it [20:11:14] what's /data/scratch? whoah, that looks like someone's attic [20:11:22] so many different uids [20:11:34] Krinkle: it's shared across projects :) [20:11:45] like a dropbox? [20:11:52] Krinkle: I get Connection closed by Unknown, usually this means something OOM'd [20:12:03] I'm going to try looking at console [20:12:17] Krinkle: something like that, but not permanent - can be wiped at any time, has far less redundancy isn't backed up, etc [20:15:59] Krinkle: no output in 'get console output' due to wikitech flakiness, I presume. [20:16:31] I disabled shared project storage [20:16:41] home directories, what happens if I disable that [20:17:06] ssh keys are set by other means? [20:22:04] Krinkle: yes for trusty, and most probably yes for precise [20:22:16] Krinkle: any negative effects from disabling shared storage? [20:22:33] We didn't use /data/project for anything [20:22:43] We do use /home for a few things, but only for sysadmin tasks. [20:22:49] right [20:22:50] https://phabricator.wikimedia.org/T90610 [20:22:59] I'll propose to have it be phased our [20:23:01] out [20:23:17] YuviPanda: So what about those instances not responding [20:23:34] Krinkle: OOM is my guess, matches the symptoms [20:24:05] YuviPanda: Hm.. [20:24:36] According to nagf, they were actually the less heated instances [20:24:49] and no spike or upward trend in graphite [20:24:54] https://tools.wmflabs.org/nagf/?project=integration#h_integration-slave-trusty-1012_memory [20:24:59] in fact, a downward trend [20:25:06] and it's still reporting variable metrics [20:25:20] (not a flat line looping the same results) [20:26:31] seems to be correlated with a puppet failure around 06/07 midnight if you scroll own to the trusty-1012 puppet agent graph [20:26:39] it had 70+ failures and then back to 0 again [20:26:42] but ot's been down since then [20:26:54] unreachable I mean [20:27:18] disk usage also went down around the same time. 4GB disappeared [20:27:26] (viewing by week) [20:28:08] 06/06 I mean, not 06/07 [20:28:19] hmmm [20:30:13] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [20:30:34] Krinkle: console output is blank, so we both have same amount of info to go by [20:30:57] at this point it's basically: 1. reboot, 2. look at logs, 3. 
if it doesn't come back after reboot, have andrewbogott look at it [20:44:40] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Lou Reed, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1344499 (10AndyRussG) Another quick question... would it be possible to get a more detailed description of the two pipeli... [20:47:11] YuviPanda: OK. I'm rebooting one of them [20:47:21] YuviPanda: btw, how does the shared project store thing get disabled? [20:47:27] I unticked the box a while ago but nothing changes [20:47:34] puppet? [20:50:59] Jun 7 20:50:19 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 77 in file /etc/puppet/modules/monitoring/manifests/service.pp [20:50:59] Jun 7 20:50:19 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 84 in file /etc/puppet/modules/ssh/manifests/server.pp [20:51:01] Jun 7 20:50:20 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 46 in file /etc/puppet/modules/monitoring/manifests/host.pp [20:51:03] Jun 7 20:50:20 integration-puppetmaster puppet-master[18758]: You cannot collect without storeconfigs being set on line 64 in file /etc/puppet/modules/monitoring/manifests/host.pp [20:51:14] No idea what that is [20:51:20] OK too many errors and I'm tired. [20:51:32] I'll report it on monday. I've been out of touch with this for too long. [21:16:52] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL 24.14% of data above the critical threshold [100000000.0] [21:31:25] Krinkle|detached: it gets disabled on the NFS server [21:32:17] 6operations, 6Labs: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344507 (10yuvipanda) Need to figure out: 1. Who all should be paged? 2. What's the paging condition? [21:42:42] RECOVERY - Outgoing network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0] [21:57:36] YuviPanda: But remains mounted? [21:57:38] YuviPanda: btw, integraiton-slave-trusty-1015 didn't become reachable after rebootign [21:57:55] Krinkle: hmm, yes. that needs a puppet fix. [21:58:33] and it seems puppet isn't running on any instances other than precise-1012. I made a change on the puppetmaster that ensures/absent a certain file, but it still exists on all instances except that one. [21:58:35] something weird is going on [21:58:51] I made that change 6 hours ago [21:58:52] hmm [21:59:00] I have to go now, can you file a bug? [21:59:18] Sure [21:59:35] making dinner now, so bbl myself [21:59:48] have a good sunday afternoon/night there [22:03:05] Krinkle|detached: you too :) (I presume you're in the UK as well) [22:04:36] YuviPanda, Krinkle|detached: I've been using NFS on CI slaves to upload scripts on only one server and run them everywhere :/ [22:14:22] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 105.48, 100.35, 97.73 [22:25:41] YuviPanda there is a question regarding labs and bots on #wikipedia-en [22:32:36] ToAruShiroiNeko: then it should go to -labs :) [22:32:47] use the channels, not SPOFs (people) ;) [22:33:13] eh? [22:33:42] we dont have -labs-operations yet? :p [22:33:51] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [22:35:22] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 43.50 ms [22:35:34] why would we have a -labs-operations? is -labs not sufficient? 
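The "You cannot collect without storeconfigs being set" spam pasted above (20:50-20:51) is what a puppetmaster prints when a manifest tries to collect exported resources (the <<| |>> construct used by the monitoring and ssh modules named in those errors) while storeconfigs/PuppetDB is off. A generic way to confirm that on a self-hosted puppetmaster is sketched below; nothing here is taken from integration-puppetmaster itself.

```
# Is storeconfigs enabled, and with which backend?
puppet config print storeconfigs storeconfigs_backend
grep -n 'storeconfigs' /etc/puppet/puppet.conf
# Which manifests are doing the collecting that triggers the error?
grep -rn '<<|' /etc/puppet/modules/monitoring /etc/puppet/modules/ssh
```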
[22:40:01] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 104.87, 100.26, 98.10 [22:54:48] ^ what's the standard response to these? restart swift? [23:10:30] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1344562 (10faidon) a:3faidon [23:13:51] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 105.83, 100.81, 99.08 [23:18:52] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 100.72, 100.38, 99.31 [23:21:57] jgage: sadly, reboot (doing now) [23:24:11] PROBLEM - very high load average likely xfs on ms-be2008 is CRITICAL - load average: 101.60, 101.15, 99.80 [23:26:20] 6operations, 10ops-codfw: ms-be2008.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T101665#1344579 (10fgiunchedi) 3NEW [23:27:26] !log reboot ms-be2008 sdg failed, xfs unhappy [23:27:30] Logged the message, Master [23:39:53] (03PS1) 10Alex Monk: Fix pflwiki logo transparency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/216595 [23:51:31] RECOVERY - very high load average likely xfs on ms-be2008 is OK - load average: 3.01, 0.68, 0.22 [23:51:52] RECOVERY - RAID on ms-be2008 is OK optimal, 13 logical, 13 physical [23:56:38] 6operations, 7Graphite: graphite2001 OOM and unresponsive - https://phabricator.wikimedia.org/T101572#1344593 (10fgiunchedi) correction, that's carbon-cache ``` graphite2001:~$ ps fwwaux | grep -i carbon-cache filippo 20910 0.0 0.0 11864 936 pts/0 S+ 23:52 0:00 \_ grep -i carbon-cach...