[08:48:31] !log wikilabels 6cf3cc0 is going to staging
[08:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[11:06:49] (PS1) Lokal Profil: [WIP] Use table creation logic from common [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447782
[11:32:07] (PS1) Lokal Profil: Hotfix for missing commonscat [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447784
[11:49:20] (PS2) Lokal Profil: Hotfix for missing commonscat [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447784 (https://phabricator.wikimedia.org/T200326)
[12:23:43] !log wikilabels 528619e is going to staging
[12:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[12:33:23] !log wikilabels 528619e is going to prod
[12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[12:50:22] (PS1) Lokal Profil: [WIP] Ensure unicode encoding of query results [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447794 (https://phabricator.wikimedia.org/T200325)
[13:09:20] (PS1) Lokal Profil: Standardise SQL query representations [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447795
[13:25:41] Technical Advice IRC meeting starting in 90 minutes in channel #wikimedia-tech, hosts: @Pablo_WMDE & @CFisch_WMDE - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:04:41] (PS1) Lokal Profil: Handle missing lat, lon in monument config [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447806 (https://phabricator.wikimedia.org/T176845)
[14:06:30] (CR) jerkins-bot: [V: -1] Handle missing lat, lon in monument config [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447806 (https://phabricator.wikimedia.org/T176845) (owner: Lokal Profil)
[14:10:06] (PS2) Lokal Profil: Handle missing lat, lon in monument config [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447806 (https://phabricator.wikimedia.org/T176845)
[16:40:11] (CR) Jean-Frédéric: "Do we really want to do that? I mean, for PH we will eventually write a converter to break down coord in both lat/lon, no?" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447806 (https://phabricator.wikimedia.org/T176845) (owner: Lokal Profil)
[17:02:56] (CR) Jean-Frédéric: [C: +2] "Looks much better, thanks :+1:" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447795 (owner: Lokal Profil)
[17:05:02] (Merged) jenkins-bot: Standardise SQL query representations [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447795 (owner: Lokal Profil)
[17:06:12] (CR) jenkins-bot: Standardise SQL query representations [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447795 (owner: Lokal Profil)
[17:37:42] looks like the tools job grid is full?
[17:37:51] "All queues dropped because of overload or full"
[17:37:56] it's the first time I see this
[17:39:51] ?
[17:40:24] That doesn't sound good
[17:44:18] bstorm_: tools-bastion-03 has a load of 56
[17:44:42] Yeah, unmounting the NFS thing. I'd hoped that with all that time, nothing would be accessing it
[17:44:45] buuut now
[17:44:47] nope
[17:45:10] heard :)
[17:48:42] chasemp: load is still high after unmounting....
[17:48:47] kind of odd
[17:48:55] How long did it take to help last time
[17:49:01] bstorm_: at least on tools-bastion-03 it appears to be going down, but slowly
[17:49:03] https://tools.wmflabs.org/admin/oge/status is reporting insane load on all of the grid nodes
[17:49:09] I'll watch for another few minutes
[17:49:28] what I see is trending down, but it can take a bit, assuming it's on the right path
[17:50:31] I should have known....blasted NFS :-p
[17:52:45] Seems like it should have dropped more than that to me. It'll be a while before I can put this mount back
[17:55:19] !help is there anyone in here who could possibly approve my Toolforge membership request? I am working on the replacement for Legobot's GAN work and would like to run it on the servers wh
[17:55:19] This key already exists - remove it, if you want to change it
[17:55:32] seems about 1 per minute atm
[17:55:43] Ugh! :-/
[17:55:50] when it is ready*
[17:56:15] just want to get everything set and ready to go
[17:56:15] TheSandDoctor: I'll take a look "soon" (within the next hour)
[17:56:19] Lesson learned: always unmount no matter what I think, lol
[17:56:20] okay, thanks
[17:56:31] what happened @bstorm_?
[17:57:15] I'm working on an NFS server, and I wrongly assumed I could get away without an unmount after shifting every service away from it
[17:57:30] It turns out...this is not a good idea here
[17:57:53] what happened?
[17:57:56] :(
[18:02:46] TheSandDoctor: https://wikitech.wikimedia.org/wiki/User_talk:TheSandDoctor#Welcome_to_Toolforge.21 -- welcome to Toolforge
[18:03:12] thanks
[18:03:27] Toolforge will be unbroken soon enough.
[18:06:13] * TheSandDoctor now knows who to blame if Toolforge dies on him....@bstorm_ :P
[18:06:23] @bstorm_ *
[18:06:25] ^
[18:09:30] bstorm_: seems to be dropping, albeit slowly still :)
[18:09:49] Really slowly :(
[18:10:09] * andrewbogott also watching those graphs nervously
[18:14:05] Disabled puppet across it as well. I realized that I may have removed it from the puppet stuff, but that doesn't mean puppet unwinds fstab and is probably still trying to mount everything :-/
[18:14:13] a few are still trending up
[18:14:23] hm it might
[18:14:25] yeah
[18:14:26] so the goal now is to have 1006 mounted and 1007 unmounted? Or to have neither mounted?
[18:14:29] bc for say tools-worker-1004.tools.eqiad.wmflabs
[18:14:33] 18:13:46 up 27 days, 1:22, 0 users, load average: 79.22, 78.34, 73.92
[18:14:38] that's going /up/
[18:14:40] 1007 unmounted. 1006 can be mounted
[18:14:40] still
[18:14:53] so is tools-worker-1006.tools.eqiad.wmflabs
[18:14:58] ugh
[18:15:01] and tools-worker-1008.tools.eqiad.wmflabs
[18:15:06] The filesystem is still expanding
[18:15:07] load average: 77.53, 77.04, 72.80
[18:15:12] Yeah, the k8s nodes are not behaving well
[18:15:35] Do the containers mount the NFS? They could...
[18:15:48] bstorm_: they map to the host mount
[18:15:58] so they access it but don't mount it distinctly
[18:16:31] filesystem resize is what I'm waiting on to put it back in service
[18:16:37] I cannot really interrupt that
[18:16:59] When it finishes, hopefully that will resolve this.
[18:17:14] those k8s workers could be belly up by then depending on how long
[18:17:32] yeah
[18:17:43] tools-paws-worker-1001.tools.eqiad.wmflabs: 18:13:51 up 27 days, 1:10, 0 users, load average: 76.43, 76.54, 72.39
[18:17:45] is off too
[18:18:23] what's weird is tools-worker-1004 has no 1007 mount
[18:18:26] but load is still rising
[18:18:29] jessie and nfs ugh
[18:18:46] well it's maybe just hovering high
[18:18:56] with
[18:18:57] labstore1007.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1007.wikimedia.org nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,ro,soft,timeo=300,retrans=3 0 0
[18:19:05] still in /etc/fstab a reboot will be problematic
[18:19:25] not sure what the puppet disabled intermediary state is, maybe just purge 1007 entirely from puppet and let that roll out
[18:20:17] I nearly did already
[18:20:21] Days ago
[18:20:36] There are mentions of it in some locations related to stats servers (for rsync)
[18:20:39] but that's about it
[18:20:41] bstorm_: I was thinking have puppet remove it from fstab
[18:20:56] so reboots could happen if needed
[18:21:04] Ah.
[18:21:09] w/o that, if load really does get too high for responsiveness, you're in a bad spot
[18:21:20] definitely
[18:21:33] This is just ridiculous
[18:21:41] that's a good word for it
[18:21:59] The kernel is still retrying mounts: Jul 25 18:20:44 tools-bastion-03 kernel: [4230882.988348] nfs: server labstore1007.wikimedia.org not responding, timed out
[18:22:07] interesting
[18:22:09] fstab entries?
[18:22:14] maybe
[18:22:18] should I sed 1007 out of fstab?
[18:22:23] is that only happening on jessie/stretch I wonder?
[18:22:38] andrewbogott: that sounds like a plan, I think
[18:22:44] no, I'm on a trusty node
[18:22:46] ok, will do
[18:23:13] bstorm_: just a note, as a last-resort trick to fool things, you could add the IP for labstore1007 to eth0:fakenfs and then the client suddenly sees remotely dropped share offerings and should untangle
[18:23:18] that's slightly crazy but it can work
[18:23:35] eth0:fakens is on the instances themselves in that case
[18:23:40] eth0:fakenfs even
[18:23:42] I need sleep :)
[18:23:54] but atm I guess purge from /etc/fstab yeah and see what happens
[18:24:07] that's weird, it's still trying, unless it's a) puppet or b) fstab
[18:24:22] huh
[18:24:38] I'm running a big sed via cumin. Is puppet going to replace those lines or has it already been removed from puppet?
[18:24:52] puppet is disabled
[18:24:55] ok
[18:25:00] just in case
[18:25:09] I think that that fake ip thing is what resolved this for real last time it happened
[18:25:27] It wasn't quite this bad when labstore1007 actually was crashed out
[18:25:35] I think because we unmounted earlier
[18:25:38] nah, I've not yet had to resort to it but I have always had it in mind in case
[18:25:53] yeah, this was building for a good while I suspect, right?
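[Editor's note: the "kernel is still retrying mounts" diagnosis above was made by reading kernel messages by eye. A quick way to count stuck-NFS retries across a log is a grep like the following sketch; the sample file, its path, and its contents are illustrative (a real host would read dmesg or /var/log/kern.log), with the message format taken from the lines quoted in this log.]

```shell
# Build a small sample kernel log (illustrative; on a real host use
# `dmesg` or /var/log/kern.log instead of this scratch file).
cat > /tmp/kern.sample <<'EOF'
Jul 25 18:20:44 tools-bastion-03 kernel: [4230882.988348] nfs: server labstore1007.wikimedia.org not responding, timed out
Jul 25 18:20:50 tools-bastion-03 kernel: [4230889.000000] eth0: link becomes ready
Jul 25 18:21:44 tools-bastion-03 kernel: [4230942.988348] nfs: server labstore1007.wikimedia.org not responding, timed out
EOF

# Count retry messages for the dead server; a non-zero count means this
# client is still trying to reach labstore1007.
grep -c 'nfs: server labstore1007.wikimedia.org not responding' /tmp/kern.sample
```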
[18:26:05] but I'm unsure why things would still be trying to mount 1007
[18:26:08] that's really odd
[18:26:31] but yeah, sure enough
[18:26:32] Jul 25 18:26:23 tools-worker-1004 kernel: [2338505.645828] nfs: server labstore1007.wikimedia.org not responding, timed out
[18:26:37] it seems pretty abrupt to me
[18:26:39] https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning?orgId=1&from=1532532384722&to=1532543184723&panelId=94&fullscreen&var-labvirt=All
[18:26:39] 18:26:35 up 27 days, 1:35, 1 user, load average: 77.99, 77.85, 76.07
[18:26:46] huh
[18:29:14] bstorm_: andrewbogott removing from /etc/fstab and running mount -a on tools-worker-1004 still did not tell it to stop mounting 1007
[18:29:22] I think /this is a bug/
[18:29:32] This sucks
[18:29:47] It has been out of nfs-mount-manager for a week now
[18:29:55] so puppet really wasn't the cause, per se
[18:30:03] This is just how it handles that mount
[18:30:05] ok, for the dns hack… chasemp are you thinking you'd change that in auth dns? Or make a local change on the VMs?
[18:30:44] I think he was talking about a local interface hack
[18:30:58] ah, ok. I'm… not sure I know how to do that
[18:31:10] https://stackoverflow.com/questions/40317/force-unmount-of-nfs-mounted-directory one of the responses in there
[18:31:37] It seems looney
[18:31:47] :)
[18:31:52] heh
[18:31:54] it is looney
[18:32:11] filesystem is still busily resizing
[18:32:14] but then again so is this thing trying to mount 1007 w/ it missing from fstab and no current mount
[18:32:18] that's extra looney
[18:32:28] agreed
[18:32:47] it has to be some interaction we don't understand w/ k8s things? I'm really not sure
[18:32:49] not sure
[18:33:27] possibly. The bastion:
[18:33:27] Jul 25 18:33:13 tools-bastion-03 kernel: [4231631.859161] nfs: server labstore1007.wikimedia.org not responding, timed out
[18:33:39] So I mean, that's not a k8s worker
[18:33:43] even that trick isn't working it seems
[18:33:50] what the hell
[18:33:58] and it is ubuntu
[18:34:03] yeah
[18:34:03] trusty I mean
[18:34:23] load is slowly dropping across the grid
[18:34:35] I wonder if that's an exponential backoff
[18:35:11] I ran ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004
[18:35:12] which is, yes, looney
[18:35:19] :)
[18:35:21] but it doesn't survive a reboot and I want to see if load drops
[18:35:22] and no dice?
[18:35:37] well, maybe I spoke too soon, idk yet
[18:35:41] fair
[18:35:54] I'm trying not to top/htop myself into more pain
[18:36:29] yeah, it's def not dropping linearly
[18:36:39] load average: 78.13, 78.05, 77.03
[18:36:41] bastion is finally showing below 20
[18:36:53] and much of the grid has dropped below 30
[18:37:00] k8s is just angry, though
[18:37:17] This is so much worse than our last run on this server
[18:37:22] I'm going to attempt to reboot tools-worker-1004
[18:37:27] 👍🏻
[18:37:31] if we reboot a given worker node is it...
[18:37:36] ah, I see you had the same question :)
[18:38:04] I'm still seeing new timeout messages for that mount on the bastion
[18:38:07] which is baffling me
[18:38:16] same
[18:38:25] (CR) Lokal Profil: "I'm not sure the converter setup allows for one field to convert to two so that isn't straightforward." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447806 (https://phabricator.wikimedia.org/T176845) (owner: Lokal Profil)
[18:39:08] I'd wonder if the issue was actually unrelated, if not for the fact that load is dropping for the grid
[18:39:34] some grid nodes are under 20
[18:40:18] unrelated, but… did cumin stop working entirely on VMs?
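[Editor's note: the fstab cleanup agreed on above ("should I sed 1007 out of fstab?", later run fleet-wide via cumin/clush) can be sketched as below. The demo file path and the non-NFS line are illustrative; the labstore1007/labstore1006 entries follow the fstab line quoted earlier in this log.]

```shell
# Work on a scratch copy so the demo is safe to run anywhere; on the
# real hosts this edit was applied to /etc/fstab itself.
cat > /tmp/fstab.demo <<'EOF'
LABEL=cloudimg-rootfs / ext4 defaults 0 0
labstore1007.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1007.wikimedia.org nfs vers=4,bg,ro,soft,timeo=300,retrans=3 0 0
labstore1006.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1006.wikimedia.org nfs vers=4,bg,ro,soft,timeo=300,retrans=3 0 0
EOF

# Drop every line that references the out-of-service server, keep the rest,
# so a reboot no longer tries to mount it.
sed -i '/labstore1007/d' /tmp/fstab.demo
cat /tmp/fstab.demo
```

Note that, as the rest of the log shows, already-mounted (or kernel-cached) NFS state can keep retrying even after the fstab entry is gone, so this edit alone was not sufficient here.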
[18:40:23] I'm getting 100% failure for all my runs
[18:40:31] :-/
[18:41:15] so a reboot triggered a puppet run?
[18:41:21] or maybe not
[18:41:32] but yeah, sure seems like it
[18:42:28] diamond nfsiostat is also flaking out
[18:42:30] that's so weird
[18:43:23] I'd be looking elsewhere (because this isn't that much different than when labstore1007 was down before) if I wasn't seeing messages still that it is trying to mount it.
[18:44:17] load started rising right away on tools-worker-1004 again
[18:44:41] maddening
[18:44:56] some grid nodes are below 10 now
[18:47:16] so it was up to 7+ right away and I did the 'ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up' thing to see, and it's dropping now
[18:47:23] but puppet agent did run, I'm sure
[18:47:27] from the reboot
[18:47:43] So my thought is that it was just noticed later than last time?
[18:48:35] I wish this resize would finish, but it's a huge filesystem
[18:48:39] andrewbogott: is 1007 gone from /etc/fstab now on all tools-workers?
[18:48:46] no, because cumin doesn't work
[18:48:51] andrewbogott: try clush?
[18:48:54] Ahhh. clush
[18:48:55] yeah
[18:49:05] at least you'll get tools
[18:49:18] there are instances outside of tools, though, probably in a bad state
[18:50:15] two grid nodes are back to normal-ish load :-/
[18:50:16] so 1004 did some interesting stuff on reboot and came back going up to almost 8 for load, and then 'ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up' settled it, and now it's down to 1-2
[18:50:31] but the last attempt was
[18:50:31] Jul 25 18:35:52 tools-worker-1004 kernel: [2339074.994917] nfs: server labstore1007.wikimedia.org not responding, timed out
[18:50:33] and holding
[18:51:00] I don't entirely understand the sequence of events there
[18:51:05] filesystem resize is done!
[18:51:49] bstorm_: I'm guessing you're bringing 1007 back online? :D
[18:52:17] NFS is back up on 1007
[18:52:22] chasemp: ok, removed 1007 from fstab throughout tools
[18:52:50] andrewbogott: ack, well, maybe just in time for it to be fixed, idk; I'm going to check out tools-worker-1005
[18:52:53] see if things go right
[18:53:59] so for tools-worker-1005 all I did was enable and run puppet
[18:54:01] which /did nothing/
[18:54:19] 😐
[18:54:19] 1007 is /not/ in /etc/fstab
[18:54:21] https://www.irccloud.com/pastebin/TKpMOAeI/
[18:54:23] but
[18:54:30] load average: 19.26, 58.62, 70.63
[18:54:32] wtf is that
[18:54:34] I'm glad but
[18:54:36] wtf is that
[18:54:39] bastion is happy now
[18:54:50] those load graphs are dropping like rocks
[18:54:50] this just makes no sense
[18:55:04] yeah
[18:55:09] tools-workers are happy now
[18:55:24] I would love to understand why these hosts care that 1007 exists
[18:55:28] that's creepy
[18:55:30] very much
[18:55:43] I'm so sorry everyone
[18:56:22] running puppet on the bastion
[18:58:01] As long as that goes ok, I figure I'll enable puppet everywhere
[18:58:12] everywhere I disabled it, anyway; I was using clush
[18:58:35] doing that
[18:58:57] fwiw, spot-checking 2 tools-workers seems good
[19:00:08] The grid is happy again
[19:00:08] T198420 is maybe way more interesting than we thought
[19:00:09] T198420: Improve unmount/relink setup for dumps (labstore1006/1007) failovers - https://phabricator.wikimedia.org/T198420
[19:00:16] Agreed
[19:00:25] And I think I just bumped up its priority
[19:00:31] mentally
[19:00:37] * chasemp nods
[19:01:08] bstorm_: it's possible the fakenfs hack was too late to be useful past load 80 on a 2-core instance; it did seem to settle things on the worker after the reboot when it was only at 8 at the time I added it
[19:01:14] now that doesn't explain the why of any of that, but
[19:01:19] it's worth mentioning
[19:01:51] !log tools ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
[19:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:02:03] so it's in sal somewhere
[19:02:14] !log tools tools-worker-1004 reboot
[19:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:02:25] to clear out that bad IP and just reset things generally to see effect
[19:03:20] (PS2) Lokal Profil: [WIP] Ensure unicode encoding of query results [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447794 (https://phabricator.wikimedia.org/T200325)
[19:04:20] 👍🏻
[19:04:56] bstorm_: it's working again, thanks!
[19:05:09] * bstorm_ shakes fist at NFS
[19:07:40] tools-worker-1004 seems fine post-reboot now
[19:09:34] 😑
[19:10:38] that could be a zen face or a face in need of a drink
[19:10:43] now that I'm scrutinizing it
[19:10:59] 😐🍸
[19:11:30] Sorta both
[19:11:39] I think I'll settle for a cup of tea
[19:13:59] (CR) Lokal Profil: "if we want to externalise the wikitext even more then even the row could be turned into a method in common.py looking something like" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447782 (owner: Lokal Profil)
[19:18:21] and… cumin is working now that I don't need it.
[19:19:48] (CR) Lokal Profil: [C: -1] [WIP] Use table creation logic from common (1 comment) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447782 (owner: Lokal Profil)
[19:22:13] (CR) Lokal Profil: "If there is a solution where you can set all queries to return unicode that would probably be nicer." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/447794 (https://phabricator.wikimedia.org/T200325) (owner: Lokal Profil)
[19:24:33] Good to know. High enough load probably breaks cumin
[19:25:01] yes, I suppose it would
[19:25:10] if the process can't run, not much good is going to come out of that
[19:25:18] clush still worked 😬
[19:25:29] sweet!
[20:49:54] !help can tools have python installed/does it have it already?
[20:49:54] TheSandDoctor: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[20:50:11] Python is installed
[20:50:15] and lots of libraries for it
[20:50:19] TheSandDoctor: definitely already installed
[20:50:22] virtualenv as well
[20:50:22] TheSandDoctor: you didn't even look ;)
[20:50:54] Was suggested I ask here
[20:50:57] :P
[20:51:00] Thanks
[20:51:20] TheSandDoctor: we have pretty ok docs -- https://wikitech.wikimedia.org/wiki/Help:Toolforge
[20:51:26] not perfect for sure, but ok
[20:51:39] Thanks
[20:52:29] for bot things you will probably be most interested in https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid and https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
[21:20:04] 👍
[22:58:18] (PS1) Legoktm: Add Wikimedia-deployed filter [labs/codesearch] - https://gerrit.wikimedia.org/r/447925 (https://phabricator.wikimedia.org/T186290)
[22:59:38] (CR) Legoktm: [C: +2] Add Wikimedia-deployed filter [labs/codesearch] - https://gerrit.wikimedia.org/r/447925 (https://phabricator.wikimedia.org/T186290) (owner: Legoktm)
[23:01:06] (Merged) jenkins-bot: Add Wikimedia-deployed filter [labs/codesearch] - https://gerrit.wikimedia.org/r/447925 (https://phabricator.wikimedia.org/T186290) (owner: Legoktm)
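[Editor's note: as the replies above say, Python and virtualenv are already available on Toolforge and the linked Help pages cover the details. A minimal sketch of the usual isolated-environment workflow (the path and the --without-pip flag are chosen just to keep this demo self-contained and offline; on Toolforge you would normally omit --without-pip and pip-install your bot's dependencies):]

```shell
# Create a virtualenv and run code with its interpreter.
# --without-pip keeps the demo offline-friendly; drop it for real use.
python3 -m venv --without-pip /tmp/demo-venv
/tmp/demo-venv/bin/python -c 'import sys; print(sys.prefix)'
```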