[13:45:21] !log admin restarting nova-api and nova-conductor on cloudcontrol1003 and 1004
[13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:01:54] Wondering if anyone plans to update data on https://tools.wmflabs.org/trusty-tools/ ? Asking as I add comments like https://phabricator.wikimedia.org/T241985#5777837 pointing to that page.
[14:23:27] I’m not sure what’s there to update?
[14:23:35] the Trusty grid engine is no longer running, if I understand correctly
[14:23:55] so the list of tools *currently* running there would be empty
[14:24:23] that list just reflects the last snapshot while the grid was still alive – tools may have been migrated since then
[14:24:28] (I assume)
[14:25:35] though it looks like dykstats wasn’t, so your comment is correct
[16:31:42] bd808: BTW, I never say enough how you're amazing. Thanks for all your help (currently triggered by seeing your work on jouncebot). :-)
[16:33:08] James_F: aww thanks. I saw you were pretty prominent in andre__'s end of year stats, so props to you too. :)
[16:33:45] Oh, there are end-of-year stats? I've got a lot of e-mail to catch up on. But thanks, provisionally, before I see what that means. :-)
[16:34:25] the jouncebot stuff was mostly me being pedantic while moving it to the new k8s cluster
[16:34:50] James_F: it's "[Wikitech-l] Some Phabricator and Gerrit 2019 statistics" -- https://lists.wikimedia.org/pipermail/wikitech-l/2020-January/092916.html
[16:35:38] you are in all 4 lists :) and the most impressive to me is you being at the top of "The 10 people who reviewed the most patchsets"
[16:38:25] Oh, hah. Yeah, code review is fun and grows the community.
[16:42:11] !log tools Removed files for old S1tty that wasn't working on sge-grid-master
[16:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:42:45] !log tools failed sge-shadow-master back to the main grid master
[16:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:51:16] hm, hi. i just created a new instance in deployment-prep, and am trying to edit hiera for it in horizon
[16:51:20] getting
[16:51:21] Danger: There was an error submitting the form. Please try again.
[16:51:27] and no hiera is saved
[16:51:30] this is on
[16:51:30] https://horizon.wikimedia.org/project/instances/94eef137-56f8-42aa-93a8-b1d8bbfef4bf/
[16:53:57] ottomata: fun! what's the instance name? That deep link ended up being a 404 for me (unable to retrieve details...)
[16:54:05] deplioyment-eventstrea s-1
[16:54:12] deployment-eventstreams-1
[16:55:16] * bd808 realizes the 404 was because project is not embedded in those links
[16:56:45] ottomata: maybe just intermittent something? I was able to set a "bd808: testing hiera" key on that instance's direct puppet config
[16:57:10] HMMM maybe it's just my hiera
[16:57:15] bad syntax maybe!
[16:59:44] if that turns out to be the problem I think it would be worth a bug report to have us try and fix the messages one gets on a validation error
[17:01:24] that was it!
[17:01:27] making ticket...
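(The "bad syntax" guess above turned out to be the culprit: Horizon only shows a generic form error when the pasted hiera fails validation. A quick local check before pasting can catch this class of mistake; this is a minimal sketch that assumes python3 with PyYAML is available on the host you edit from, and `my-hiera.yaml` is a hypothetical file name.)

```
# Hypothetical scratch file holding the hiera snippet you intend to paste into Horizon.
# yaml.safe_load() raises an exception with line/column details on a syntax error.
python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1])); print("OK")' my-hiera.yaml
```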
[17:03:33] https://phabricator.wikimedia.org/T241999
[17:12:13] ok one more q... i'm getting certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
[17:12:16] when running puppet
[17:12:27] this seems familiar to me
[17:12:32] am searching for docs but can't find
[17:12:44] i don't see a cert to sign on pm03
[17:13:17] oh i have to remove the puppet agent ssl dirs?
[17:13:18] hm
[17:13:42] ottomata: https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_puppet_client
[17:13:55] gah right. i was searching wikitech for that error message
[17:13:57] there is an ugly dance that we have never been able to reliably automate
[17:14:19] ty
[17:14:23] been a while...
[17:15:46] The issue is that the instance first boots and runs puppet against the Cloud VPS shared master, and that Puppet run changes the config to point to the project's puppetmaster. Cleaning up all the things with Puppet has been tried a few times, but never made stable. Too many races to solve.
[17:16:22] yeah
[18:24:52] !log tools Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
[18:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:47:01] !log tools added mount_nfs=false to tools-k8s-haproxy puppet prefix T241908
[18:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:47:04] T241908: tools-k8s-haproxy-2 NFS problems - https://phabricator.wikimedia.org/T241908
[18:49:09] !log tools edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 T241908
[18:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:54:44] !log tools edited /etc/fstab to remove NFS and unmounted the nfs volumes on tools-k8s-haproxy-1 T241908
[18:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:54:47] T241908: tools-k8s-haproxy-2 NFS problems - https://phabricator.wikimedia.org/T241908
[19:00:05] !log tools.grid-jobs Migrated to new Kubernetes cluster as an Ingress redirect to https://tools.wmflabs.org/sge-jobs/
[19:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.grid-jobs/SAL
[19:11:09] Here's a fun one-liner to run in Toolforge or toolsbeta: `groups|tr " " "\n"|grep "$(cat /etc/wmcs-project)\."|sort` -- lists all the tools you are a co-maintainer of
[19:11:25] tack on a `|wc -l` at the end if you just want a count
[19:11:49] * bd808 is at 56 apparently
[19:43:39] Hi. How do I know if i am affected by disk issues?
[19:43:58] I tried ls /data/scratch/bothasava and it seems to be stuck
[19:44:15] my bot runs somewhat more slowly than i anticipated
[19:44:38] i am on tools-sgebastion-07
[19:49:49] Hi kotz, looks like there's some heavy user activity on tools-sgebastion-07 that's causing things to hang
[19:50:34] (again)
[19:51:14] it should recover soon
[19:52:25] there should be something in place that prevents the bastion from being hogged
[19:52:35] If it's really annoying, you can use `webservice --backend kubernetes shell` and work there, as long as you don't need grid access
[19:53:34] I hope it's not me which is hogging
[19:54:03] Are you running anything on the bastion?
[19:56:06] yes i'm debugging something.
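(For reference, the "ugly dance" of re-pointing a client at a project puppetmaster boils down to throwing away the certificates issued by the shared master and re-running the agent. This is only a rough sketch, assuming the pre-Puppet-6 tooling and the /var/lib/puppet/ssl path used on Cloud VPS instances at the time; the linked Help:Standalone_puppetmaster page is the authoritative procedure.)

```
# Run on the client instance once its puppetmaster setting points at the
# project puppetmaster (e.g. deployment-puppetmaster03).
sudo rm -rf /var/lib/puppet/ssl   # drop the certs issued by the shared master
sudo puppet agent -tv             # request a fresh cert from the project master
# If the run stops at "certificate not yet signed" (no autosigning), sign it on
# the project puppetmaster with the pre-Puppet-6 command:
#   sudo puppet cert sign <client fqdn>
```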
[19:56:16] but htop shows me at 0%
[19:57:40] Disk activity is the biggest problem, not memory or CPU use
[19:58:48] But you should run from k8s or login to tools-dev.wmflabs.org instead
[20:01:24] I tried now to login to tools-dev.wmflabs.org i
[20:01:37] I giot logged into "kotz@tools-sgebastion-08:~$"
[20:02:04] *got
[20:07:49] kotz, while there was some heavy user activity, you're directory /data/scratch/bothasava has way too many files. You have 1,117,050 text files there
[20:08:32] that's going to cause anything that looks into that directory, like `ls`, to take a long time to respond
[20:08:42] your* directory
[20:23:32] wow. Yah, ext3 will not be a fan of 1.3M files in a single directory
[20:27:29] kotz: see the different answers on how to split a directory into subdirectories: https://stackoverflow.com/questions/36511098/split-large-directory-into-subdirectories you'll want something like that in Bash or Perl or whatever
[20:27:57] so create a bunch of subdirs and move files into them to reduce the number of files in each dir
[20:29:23] or you could use tar / gzip, and since they are all text files that will make them small and you can still use zgrep/zcat to search in them
[20:31:21] alternatively something like this could work too `mkdir /data/scratch/bothasava/2018 && find . -maxdepth 1 -type f -name "BotHasava_cache_2018*txt" -exec mv -t /data/scratch/bothasava/2018/ {} \;`
[20:32:25] `mkdir /data/scratch/bothasava/2018 && find /data/scratch/bothasava -maxdepth 1 -type f -name "BotHasava_cache_2018*txt" -exec mv -t /data/scratch/bothasava/2018/ {} \;`
[20:32:36] updated find search path ^
[20:33:47] nice, yes, sorting by age is also a good one (or maybe stuff from 2018 could just be deleted with the same "find" command above but -delete at the end)
[20:43:52] 😱
[20:44:09] I'll add a monthly delete cron
[20:53:33] !log tools.admin Preparing to move webservice to new Kubernetes grid
[20:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[21:01:53] !log tools.admin Getting 503 response for https://tools.wmflabs.org/admin/; dynamicproxy seems to be confused about which cluster to route to
[21:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[21:07:36] !log tools Restarted kube2proxy on tools-proxy-05 to try and refresh the admin tool's routes
[21:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:12:23] !log tools.admin Brooke fixed routing by deleting a dangling Kubernetes Service for the admin tool on the legacy Kubernetes grid
[21:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[21:15:07] !log tools.admin kubectl scale --replicas=2 $(kubectl get deployment -o name)
[21:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL
[21:15:35] bstorm_: ^ that magic is pretty cool. 2 admin tool pods running now :)
[21:15:49] :)
[21:15:53] Nice
[21:17:35] redundancy!
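(The "magic" in that !log entry is just command substitution feeding the deployment name into kubectl scale. A sketch of what it expands to, assuming the tool's deployment happens to be named "admin"; the real name is whatever `kubectl get deployment -o name` prints in the tool's namespace.)

```
# Step by step, with an assumed deployment name:
kubectl get deployment -o name                      # -> deployment.apps/admin (for example)
kubectl scale --replicas=2 deployment.apps/admin    # run two pods behind the tool's service
```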
[21:18:00] yeah, and a future chance for blue/green deploys
[21:22:44] !log tools.meetbot Migrating webservice to new Kubernetes grid (bot still runs on grid engine)
[21:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL
[21:26:04] * bd808 resists urge to figure out how to move meetbot itself to k8s right now
[21:42:29] !log toolsbeta disabled rpcbind on toolsbeta-sgebastion-04 to test some things
[21:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[22:01:19] !log tools.sal Migrating to new Kubernetes cluster
[22:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[22:31:05] !log tools.sal Update to 9298e99; new features: 'firehose' view, show project for each log entry
[22:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[22:32:04] All the SAL messages in a single view -- https://tools.wmflabs.org/sal/__all__
[22:57:12] !log tools Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
[22:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:58:43] !log tools Depooling tools-sgewebgrid-lighttpd-090[2-9]
[22:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:01:29] !log tools Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
[23:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:13:18] !log tools Repooled tools-sgeexec-0922 because I don't know why it was depooled
[23:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:34:18] !log tools Decommissioned tools-sgewebgrid-lighttpd-09{0[1-9],10}
[23:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:34:36] * bd808 makes up new glob syntax with wild abandon
[23:36:30] !log tools Shutdown tools-sgewebgrid-lighttpd-09{0[1-9],10}
[23:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:40:09] !log tools Deleted tools-sgewebgrid-lighttpd-09{0[1-9],10}
[23:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:59:03] I see Yellowstone has still not blown up bd808 :-)
[23:59:13] For which I'm happy
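(A footnote on the improvised `09{0[1-9],10}` shorthand in the decommissioning entries above: it mixes bash brace expansion with a glob character class, so a shell would not expand it into host names on its own. The same ten hosts can be generated with plain brace expansion; this is only an illustrative sketch using the host-name pattern from the log, assuming bash 4 or later for the zero-padded sequence.)

```
# Brace expansion {01..10} yields 01 through 10 with zero padding (bash 4+),
# producing tools-sgewebgrid-lighttpd-0901 ... -0910.
for n in 09{01..10}; do
  echo "tools-sgewebgrid-lighttpd-${n}"
done
```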