[00:00:44] !log tools Draining tools-worker-1002 to reboot for NFS problems
[00:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:02:38] !log tools Rebooting tools-worker-1002
[00:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:08:48] !log tools Uncordoned tools-worker-1002.tools.eqiad.wmflabs
[00:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:11:42] !log tools Uncordoned tools-worker-1009.tools.eqiad.wmflabs
[00:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:29:08] !log tools Drained tools-worker-1009 for reboot (NFS flakey)
[00:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:43:52] bd808, So i know I have a wikitech account, where can I find out what groups I'm a member of?
[00:44:19] Krenair, let me know what information you need and how to proceed please.
[00:44:45] EdTadros__, Bryan is busy I'll look at this
[00:44:55] Krenair, thanks.
[00:44:59] what is your LDAP username?
[00:45:10] or developer account username or 'uid' or 'cn' or whatever...
[00:47:06] haha....Edward Tadros is how I log into wikitech and gerrit
[00:47:10] ok
[00:47:56] doesn't look like you're in any groups whatsoever
[00:48:13] does look like you have a wikimedia staff email though
[00:48:36] well that's something
[00:49:16] can you make an edit somewhere on a prod wiki from your ETadros (WMF) account saying EdTadros__ on IRC is you and you'd like deployment-prep access?
[00:49:47] https://en.wikipedia.org/wiki/User:ETadros_(WMF)/sandbox would do
[00:52:05] Krenair, done.
[00:52:09] ok great
[00:52:29] out of curiosity are you a product analytics person?
[00:52:56] i'm on the QTE team and I do QA for the Web Team and I'm just rolling off as the QA person for SDC.
[00:53:12] !log tools.hat-collector stopped the start job in this tool to see if it was the source of a large IO spike
[00:53:14] I need to read some event logs for a task.
[00:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.hat-collector/SAL
[00:53:55] EdTadros__, okay you should be able to SSH in shortly
[00:54:15] # ldapsearch -x member=uid=edtadros,ou=people,dc=wikimedia,dc=org -LLL dn
[00:54:15] dn: cn=project-bastion,ou=groups,dc=wikimedia,dc=org
[00:54:15] dn: cn=project-deployment-prep,ou=groups,dc=wikimedia,dc=org
[00:54:17] now it should work
[00:54:24] Krenair , thanks...checking.
[00:54:59] !log tools.hat-collector restarted the start job because that wasn't it
[00:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.hat-collector/SAL
[01:09:20] EdTadros__, all ok?
[01:13:40] Krenair, can't tell. I get a hostname cannot be resolved, which I expected because I'm not sure how my machine could resolve that hostname. I feel like there's a step missing.
[01:13:55] EdTadros__, yeah so you should have some SSH config
[01:14:31] that looks something like this: https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Set_default_configuration
[01:14:39] Krenair, makes sense.
[01:30:35] I can't start a k8s continuous job on toolforge, tool is anticompositebot, error is "error validating "/data/project/anticompositebot/AntiCompositeBot/harvcheck_deployment.yaml": error validating data: forbidden: User "anticompositebot" cannot get path "/swaggerapi/apis/apps/v1"; if you choose to ignore these errors, turn validation off with --validate=false", yaml file is https://github.com/AntiCompositeNumber/AntiCompositeBot/blob/master/harvcheck_deployment.yaml
[01:31:13] https://github.com/AntiCompositeNumber/AntiCompositeBot/blob/master/harvcheck_deployment.yaml could someone tell me what I'm doing wrong?
I've already used `kubectl config use-context toolforge` and aliased to the new kubectl
[01:36:47] and what command are you running to get this AntiComposite ?
[01:37:06] kubectl create --validate=true -f /data/project/anticompositebot/AntiCompositeBot/harvcheck_deployment.yaml
[01:44:16] hmmmmmmmmmm
[01:45:52] AntiComposite, it does look like a permissions problem
[01:46:08] I picked a random client cert/key pair on the k8s control host and it authenticates and gets that URL just fine
[01:46:27] so I can do e.g. root@tools-k8s-control-1:/etc/kubernetes/pki# curl --cert apiserver-kubelet-client.crt --key apiserver-kubelet-client.key -k https://tools-k8s-haproxy-1.tools.eqiad.wmflabs:6443/swaggerapi/apis/apps/v1
[01:46:47] if I run `kubectl get deployments` I get this error: error: group map[storage.k8s.io:0xc820404380 :0xc82034dc00 authentication.k8s.io:0xc82034dce0 componentconfig:0xc820404150 extensions:0xc820404230 policy:0xc8204042a0 certificates.k8s.io:0xc8204040e0 rbac.authorization.k8s.io:0xc820404310 federation:0xc82034d730 apps:0xc82034dc70 authorization.k8s.io:0xc82034de30 autoscaling:0xc82034dea0 batch:0xc820404070] is already registered
[01:47:04] but if I try something like
[01:47:10] tools.anticompositebot@tools-sgebastion-07:~$ curl --cert .toolskube/client.crt --key .toolskube/client.key -k https://tools-k8s-haproxy-1.tools.eqiad.wmflabs:6443/swaggerapi/apis/apps/v1
[01:47:19] then I get an error like you saw
[01:47:47] oh, ew
[01:49:24] oh hang on this is the wrong kubectl version
[01:50:50] AntiComposite, try running that kubectl create with /usr/bin/kubectl instead of just kubectl
[01:51:23] created
[01:51:31] so looks like the alias didn't alias right
[01:51:37] okay so whatever was done to alias to the new kubectl...
didn't quite work
[01:53:21] I ran alias again and it worked
[01:56:18] The alias in $HOME/.profile will only work if your tool doesn't have a .bash_profile or similar shell init script that clobbers .profile handling
[01:57:31] we will fix all this confusion really soon by getting rid of /usr/local/bin/kubectl. That is blocked on finishing the migration of everything to the 2020 Kubernetes cluster
[01:58:10] speaking of which, are we done with migrations for the day?
[01:58:46] yeah. I have related things to do still but I'm not going to be moving any more tools
[01:59:28] shall we unquiet stashbot?
[01:59:39] Krenair: I figured out that leaving stashbot with /mode +q is easier
[01:59:44] ok
[01:59:45] stashbot: hello
[01:59:45] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[02:00:04] as long as it is voiced that overrides +Q
[02:00:07] +q
[02:00:11] ah
[02:00:13] cool
[02:00:26] yup, that was the problem, a darn .bash_profile for shared pywikibot
[02:00:57] AntiComposite: you can put your alias line in .bash_profile
[02:01:05] yeah, will do
[02:01:24] * AntiComposite now has to figure out why the pod is immediately crashing and not giving any logs either
[02:01:33] shell startup scripts have E_TOOMANY options
[02:02:20] AntiComposite: if it goes into CrashLoopBackOff state, you should still be able to use `kubectl logs ` to see why
[02:02:29] it returns nothing
[02:03:45] failed to start container "harvcheck": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"/data/project/anticompositebot/AntiCompositeBot/harvcheck_start.sh\": permission denied"
[02:03:57] oh.
[02:03:59] duh.
[02:04:03] that's from `/usr/bin/kubectl describe pod anticompositebot.harvcheck-6b7577dccd-khsng`
[02:04:18] it would help if I flipped the executable bit.
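A minimal sketch related to the "permission denied" failure above: before creating a deployment, a tool maintainer could check that the container's start script has an execute bit set. The helper names and the throwaway `start.sh` file are hypothetical stand-ins and not part of any Toolforge tooling; only Python's standard `os`, `stat`, and `tempfile` modules are used.

```python
import os
import stat
import tempfile


def is_executable(path):
    """Return True if any execute bit (user, group, or other) is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def make_executable(path):
    """Add the user execute bit, the equivalent of `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


def demo():
    """Create a throwaway script and report its state before and after the fix."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a start script such as harvcheck_start.sh.
        script = os.path.join(tmp, "start.sh")
        with open(script, "w") as handle:
            handle.write("#!/bin/sh\necho hello\n")
        os.chmod(script, 0o644)  # readable and writable, but not executable
        before = is_executable(script)
        make_executable(script)
        after = is_executable(script)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Running a check like this against the real start script before `kubectl create` would surface the missing bit without waiting for the pod to crash.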
[02:04:46] :) easy miss
[02:05:18] * Krenair -> zzz
[02:10:47] it would also help if I remembered to actually install the dependencies in the venv...
[02:13:01] yay, it's working now, thanks for the help
[02:15:28] yw
[14:33:39] !log tools Drained tools-worker-10{39,38,37} yesterday but did not !log
[14:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:33:57] !log tools Drained tools-worker-1036
[14:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:34:37] !log tools Drained tools-worker-1035
[14:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:35:16] !log tools Drained tools-worker-1034
[14:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:37:20] !log tools Drained tools-worker-1033
[14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:41:45] !log tools Drained tools-worker-1032
[14:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:44:34] !log tools Uncordoned tools-worker-1009.tools.eqiad.wmflabs
[14:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:54:02] !log tools Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
[14:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:00:52] !log tools Drained tools-worker-1031
[15:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:06:27] !log tools Uncordoned tools-worker-10[16-20]. Was over optimistic about repacking legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
[15:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:07:16] !log tools Drained tools-worker-1030
[15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:09:05] !log tools Drained tools-worker-1028 (there is no tools-worker-1029)
[15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:11:23] !log tools Drained tools-worker-1027
[15:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:39:03] !log tools Drained tools-worker-1026
[15:39:30] !log tools Drained tools-worker-1025
[15:39:46] stashbot: ?
[15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:39:59] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[15:44:39] !log tools Drained tools-worker-1023 (there is no tools-worker-1024)
[15:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:50:55] Hi, need some help. Trying to upload files on Toolforge (to /data/project/twltools), but getting 'Permission denied'. As instructed here https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with_PuTTY_and_WinSCP I can see I don't have the required write permission, so I proceed to chmod -R g+w /data/project/twltools, and that doesn't work either (Operation not permitted)
[15:51:21] !log tools Drained tools-worker-1022
[15:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:52:18] Aaron-: that sounds un-fun. Let me peek at the directory permissions
[15:52:25] thanks!
[15:53:25] are you on your personal account or your tool account when trying to do that?
[15:54:01] tool account
[15:54:24] the file I wanted to upload is currently in my user home directory, so it's now a matter of moving to the project
[15:55:33] Aaron-: the permissions on /data/project/twltools looks ok and most of the files inside have g+w perms.
[15:56:00] Maybe we should back up and have you explain exactly what commands are giving you the error message
[15:56:12] sure
[15:57:15] tried scp'ing the file directly to /data/project/twltools and got a permission denied
[15:57:26] then I uploaded the file to my user home directory
[15:58:25] after which I tried to mv the file to /data/project/twltools
[15:58:38] wait
[15:58:45] it worked now?
[15:58:48] hmm
[15:59:32] yeah, it worked. Moved the file from the home directory to the project
[15:59:42] so sorry to have wasted your time guys
[16:00:08] !log toolsbeta increase CloudVPS quota instance count for new elasticsearch servers
[16:00:09] Aaron-: no worries. Scp into a tool is a big pain
[16:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:00:54] yeah. Anyways, thanks for the help. Bye!
[16:01:02] Aaron-: once you have moved the files, `become $tool; take $files` will set the permissions so that the tool owns them
[16:01:19] * bd808 grumbles about drive-bys
[16:02:38] !log tools Drained tools-worker-1021
[16:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:03:55] !log toolsbeta create 3 new VMs toolsbeta-elastic7-0[1,2,3]
[16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[16:14:43] !log tools Decommissioning tools-worker-10[21-40]
[16:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:28:21] bd808: excessive account creation: https://wikitech.wikimedia.org/wiki/Special:Log/newusers
[16:28:41] ugh
[16:33:37] need to set up a blacklist / abuse filter on this case I think
[16:35:49] * bd808 crosses fingers that he blocked the correct range
[16:36:41] no such luck
[16:36:49] it's still going
[16:38:24] yeah. they are hopping proxies
[16:42:46] are you f'n kidding me right now, they're still at it
[16:47:40] think it's time for https://wikitech.wikimedia.org/wiki/MediaWiki:Titleblacklist
[16:53:22] did a purge on the page after you edited it - hope that helps it to engage faster
[17:08:13] Something happening with tool-labs?
[17:08:32] (I got a warning about my crontab not running on time :-p)
[17:08:40] account creation spam
[17:08:53] That and NFS server just got a bad firewall rule
[17:09:00] That doesn't relate to crontab not running, DSquirrelGM
[17:09:02] the NFS issue is being fixed
[17:09:06] Oh, ok
[17:11:03] that'll certainly break things
[17:13:13] bd808: on prodwikis we usually globally block all the proxy ranges we find (and any IP using same AS)
[17:14:26] revi: *nod* I blocked a couple of ranges but the attacker was moving faster than my CU skills
[17:14:47] For the moment I have used TitleBlacklist to stop all account creation
[17:15:19] (see -staff for something I don't want to be logged), and AFAIK there is no automatic block scripts
[17:17:45] And yes, English Wikipedia always have a solution
[17:17:45] https://en.wikipedia.org/wiki/User:Timotheus_Canens/massblock.js
[17:17:54] NFS should be ok now
[17:18:25] Untested tho
[17:21:12] Maybe not entirely...we'll get it sorted out
[17:21:39] (I meant the script is not tested)
[17:24:56] revi: thanks for the suggestion. I'm more likely to go with pywikibot over user scripts. They are just too hard to test/debug
[17:25:03] thanks
[17:25:11] (in advance)
[17:25:18] * revi hopes he shares the script with me
[17:25:57] revi: I totally will!
[17:27:04] xD
[17:27:22] now I probably need to make it work on my macbook automation apps and use it on mobile
[17:27:23] * revi dreams
[17:37:08] Ok! NFS is working for sure now
[17:50:00] * bd808 tries to remember what he was doing before the wikitech excitement
[17:56:20] !log tools Deleted instances tools-worker-10[21-40]
[17:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:19:11] something about decommissioning worker nodes or something?
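The `!log` entries above use a bracket shorthand such as `tools-worker-10[21-40]` for runs of instances. A small sketch of how that shorthand could be expanded into individual hostnames, assuming `[a-b]` means an inclusive numeric range; the function name `expand_hosts` is hypothetical, and the comma and brace forms also seen in the log (`0[1,2,3]`, `10{39,38,37}`) are deliberately not handled.

```python
import re

# Matches one "[start-end]" numeric range anywhere in the pattern.
_RANGE = re.compile(r"^(?P<prefix>.*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>.*)$")


def expand_hosts(pattern):
    """Expand 'name[a-b]' into a list of names, keeping zero padding of 'a'."""
    match = _RANGE.match(pattern)
    if match is None:
        # No range shorthand present: the pattern is already a single hostname.
        return [pattern]
    start, end = match["start"], match["end"]
    width = len(start)
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_hosts("tools-worker-10[16-20]"):
        print(host)
```

Under that assumption, `tools-worker-10[21-40]` covers twenty names, although the log itself notes that some numbers in the series (1024, 1029) never existed as instances.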
[18:20:46] !log tools Building tools-k8s-worker-[36-55]
[18:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:39:38] `crontab -e` on revibot-ii doesn't load at all for me
[18:39:40] is it just me?
[18:39:46] (i mean, toolforge)
[18:40:50] does it time out or something?
[18:44:04] confirmed
[18:44:18] tools-sgecron-01.tools.eqiad.wmflabs is not responsive to ssh afaict
[18:46:56] zhuyifei1999_: not responsive with root key either. bstorm_, andrewbogott ^ can one of you figure out why it is dead?
[18:47:05] looking at nagf we had a load skyrocket around 15:15 with increase in memory use as well
[18:47:19] so it isn't some nfs maintenance or something
[18:47:21] It was during the NFS issue. I'll reboot it with openstack
[18:47:38] nfs issue shouldn't increase ram right?
[18:47:58] It can do a lot of things depending on what was running when the filesystem vanished, no?
[18:48:25] doesn't processes just starts to D-sleep?
[18:48:44] hmm
[18:48:59] That's what it should do
[18:49:03] cron host will start more and more processes that D-sleeps
[18:49:04] But if it is a loop or something leaky
[18:49:10] yeah that makes sense
[18:49:22] yeah, leaky like crond :)
[18:49:54] :)
[18:52:33] It's up now
[18:52:55] :)
[18:52:55] And apparently crons are working again according to icinga
[18:53:53] !log tools hard rebooted a rather stuck tools-sgecron-01
[18:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:56:59] yay
[18:57:27] records broken :( https://usercontent.irccloud-cdn.com/file/FR9Cw9oI/image.png
[19:04:43] aww
[19:05:36] Well, in theory, the last couple changes make NFS more stable and available going forward.
🤞🏻
[19:09:03] \o/
[19:10:53] * zhuyifei1999_ remembers in 2015 days nfs fails every few months
[19:10:57] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage the horror
[19:18:58] !log toolsbeta upgraded toollabs-webservice to 0.64 on stretch-toolsbeta for testing
[19:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[20:19:55] !log tools update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 T236606
[20:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:20:02] T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster - https://phabricator.wikimedia.org/T236606
[20:57:00] !log tools upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
[20:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:03:13] !log tools add reindex service account to elasticsearch for data migration T236606
[21:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:03:19] T236606: Rebuild Toolforge elasticsearch cluster with Stretch or Buster - https://phabricator.wikimedia.org/T236606
[21:14:21] !log tools.sammour switched the sammour webservice to use the toollabs-jdk8-web:testing image to see if that fixes the problem with the service
[21:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sammour/SAL
[21:28:53] bd808 - want to open account creation back up yet, see if they're still trying to attempt it?
[21:29:27] All their existing accounts are now blocked at least
[21:31:58] at least I found out about it within a few minutes or so of it starting and reported it here - could have been a lot worse if it had been hours
[21:32:44] once it hit 4 or 5 I knew it wasn't accidental
[22:13:55] Reedy: y'all got them all blocked? magic!
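The manual heuristic described above ("once it hit 4 or 5 I knew it wasn't accidental") could be sketched as a sliding-window counter over account creation timestamps. This is only an illustration: the class name, the threshold of 4, and the 10-minute window are assumptions, nothing here reflects how wikitech actually monitors account creation, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class CreationSpikeDetector:
    """Flag a spike when enough account creations fall inside a time window."""

    def __init__(self, threshold=4, window_seconds=600.0):
        self.threshold = threshold            # assumed alert level, taken from "4 or 5"
        self.window_seconds = window_seconds  # assumed window length, not from the log
        self._events = deque()

    def record(self, timestamp):
        """Record one creation at `timestamp` (seconds); return True on a spike."""
        self._events.append(timestamp)
        # Drop events that are older than the window relative to the newest one.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold


if __name__ == "__main__":
    detector = CreationSpikeDetector()
    for moment in (0, 100, 200, 300):
        print(moment, detector.record(moment))
```

A real alert would need a source of creation events (for example the new users log) and a notification channel, both of which are outside this sketch.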
[22:14:14] Yeah, mix of an sql query and importing a gadget
[22:14:33] <3 I owe you something nice :)
[22:16:23] DSquirrelGM: do you want to be the alert system for an hour or two if I open it back up? We obviously do not have any automated monitoring right now for account creation rate spikes. :/
[22:17:00] I'm game to try it, but I also need to be head down on some other things for a few more hours today
[22:17:32] I'll keep an eye out as much as I can
[22:18:12] cool. Let's see what happens
[22:18:34] and thank you for caring and helping DSquirrelGM
[22:20:13] * bd808 opened the gate
[22:42:30] saw one user creation so far and it's not even related
[22:42:43] so that's progress
[22:44:00] yeah. I spot checked that one in the LDAP backend and it looks non-spammy
[22:44:10] thanks for checking :)
[23:03:29] !log tools.vcat redeploying with new webservice code to test if this fixes things
[23:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.vcat/SAL
[23:21:10] Is xtools.wmflabs.org down?
[23:21:59] Not for me
[23:22:09] Most all of the tools hosted there in the Rfx toolbox aren’t responding https://en.wikipedia.org/wiki/Template:RfA_toolbox
[23:22:55] * bd808 now knows that musikanimal stalks "xtools" in this channel
[23:23:18] Hehe I do! Every channel
[23:23:49] All those links are working for me phuzion
[23:23:49] I used to stalk "vagrant". I feel your concern and pain
[23:23:58] !log tools pushed toollabs-webservice version 0.64 to all toolforge repos
[23:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:24:24] But someone reported problems with accessing XTools a few days ago, even though all monitors say it's up and running
[23:24:28] musikanimal: must have been a transient issue. Seems to be working now.
[23:24:38] Might be some general instability, idk
[23:24:38] https://stats.uptimerobot.com/BN16RUOP5/779220479
[23:25:13] Great!
[23:26:30] If it happens again do say something, and test some other tools too to make sure it's not just XTools
[23:27:45] !log tools installed toollabs-webservice 0.64 on the bastions
[23:27:46] the only thing xtools shares with other tools at this point is the dynamicproxy ingress right musikanimal?
[23:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:28:54] Hmm I'm not sure, hehe! I think the security group stuff is the same as our tools, like https://wsexport.wmflabs.org
[23:29:33] Nothing has changed with XTools in over a month or so, code-wise or in horizon