[08:55:12] !log tools adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
[08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:55:17] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[09:53:48] !log tools Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs (T267082)
[09:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:53:52] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[10:05:51] !log tools installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
[10:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:07:17] !log tools adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node (T267082)
[10:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:07:20] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[10:21:01] !log tools aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
[10:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:21:15] !log tools published jobutils & misctools 1.42
[10:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:21:41] !log tools published jobutils & misctools 1.42 (T278748)
[10:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:21:45] T278748: Toolforge: introduce support for selecting grid queue release - https://phabricator.wikimedia.org/T278748
[10:31:34] !log tools Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs (T267082)
[10:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:31:39] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[11:39:48] !log tools cleaning up aptly: old package versions, old repos (jessie, trusty, precise), etc.
[11:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:45:51] !log tools upgrading jobutils & misctools to 1.42 everywhere
[11:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:59:37] !log tools Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
[12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:59:42] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[13:11:11] !log toolsbeta Removing etcd member toolsbeta-test-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number (T267082)
[13:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[13:11:15] T267082: Rebuild Toolforge servers that should not have NFS mounted (and with affinity) - https://phabricator.wikimedia.org/T267082
[14:50:18] !log videocuttool bump quota to 26 core, 36G of RAM (T278605)
[14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Videocuttool/SAL
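
The etcd membership changes above (adding replacement members with the cookbook, then removing old ones so the cluster keeps an odd member count and can always reach quorum) can be verified from any etcd node. A minimal sketch with etcdctl; the endpoint and certificate paths are placeholders, not the actual Toolforge paths:

$ export ETCDCTL_API=3
$ # list current members and check cluster-wide health after each add/remove
$ etcdctl --endpoints=https://<etcd-node>:2379 \
    --cacert=<ca.pem> --cert=<client.pem> --key=<client-key.pem> \
    member list -w table
$ etcdctl --endpoints=https://<etcd-node>:2379 \
    --cacert=<ca.pem> --cert=<client.pem> --key=<client-key.pem> \
    endpoint health --cluster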
[14:50:21] T278605: Request increased quota for VideoCutTool Cloud VPS project - https://phabricator.wikimedia.org/T278605
[15:16:41] !log tools cleared queue state since a few had "errored" for failed jobs.
[15:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:17:16] Urbanecm: it actually tries to upload all with pywikibot; if that fails it will go to ssu
[16:17:38] not sure if it is due to an old pywikibot install or something else.
[16:21:38] https://github.com/toolforge/video2commons/blob/master/video2commons/backend/upload/__init__.py#L50
[16:42:25] thanks chicocvenancio, that helps
[16:43:37] I think we could check for other automated ways to upload files between 1GB and 4GB and have it try that before ssu
[16:43:55] My availability might only be the hackathon, though
[16:55:19] if there is a physical hackathon chicocvenancio :/
[16:55:40] I'm considering the remote hackathon as well
[16:55:58] chicocvenancio: could it capture details (the full response, perhaps) in a log or something? Maybe we could use it to find why it fails
[17:00:18] sure, I'm a bit afraid to break things right now, so I wouldn't be able to do it alone, but it should be possible
[18:47:10] !log wikidocumentaries deleting log files on hupu.wikidocumentaries.eqiad1.wikimedia.cloud; puppet is failing due to a full disk
[18:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidocumentaries/SAL
[18:47:57] !log wikidocumentaries it wasn't enough :(
[18:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidocumentaries/SAL
[22:19:08] hi everyone, I'm trying to set up wmcz-stats-turnilo.wmcloud.org as a web proxy, but it resolves to 172.16.6.36? That looks like an internal IP?
[22:20:02] it does
[22:21:33] Reedy: What I mean is that I don't think a public web proxy should resolve to an internal IP from outside of the WM network
[22:22:26] Urbanecm: from a WMCS machine or externally?
[22:22:31] Majavah: externally
[22:22:58] dev.wiki.local.wmftest.net uses this to its benefit ;)
[22:23:02] uhhh, yeah that sounds like a bug, maybe related to the openstack updates
[22:23:29] Reedy: again, I don't think that's how web proxies are supposed to work :)
[22:23:48] Why not?
[22:23:56] if you were inside the network, you want that resolution
[22:24:04] Reedy: but I'm _outside_ of the network
[22:24:58] * Reedy relocates Urbanecm inside the DC
[22:25:22] plus, it wouldn't work even if I was inside the network - the IP is the instance's IP, not the proxy's IP
[22:25:31] wmcs does split-brain dns at least for floating addresses, but not sure if it was used with web proxies, and this is definitely not the wanted result here
[22:26:20] Urbanecm: it doesn't resolve at all for me. Have you already made the proxy using Horizon?
[22:26:27] bd808: yes
[22:27:14] google seems to have it already https://www.irccloud.com/pastebin/iX8HaseD/
[22:27:16] other *.wmcloud.org things are working, so this sounds like a bug in creating new proxies
[22:27:19] wmcz-stats-turnilo.wmcloud.org resolves on https://cachecheck.opendns.com/ too
[22:27:54] alex@alex-laptop:~$ dig wmcz-stats-turnilo.wmcloud.org @ns1.openstack.eqiad1.wikimediacloud.org +short
[22:27:54] 172.16.6.36
[22:27:54] alex@alex-laptop:~$ dig wmcz-stats-turnilo.wmcloud.org @ns0.openstack.eqiad1.wikimediacloud.org +short
[22:27:54] 172.16.6.36
[22:28:06] yeah, that's a bug and very likely related to the OpenStack updates that andrewbogott and others rolled out today
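
The bad record is easiest to see next to a known-good one queried against the same authoritative server. A minimal sketch; codesearch.wmcloud.org is just a convenient existing proxy (it is confirmed working later in the discussion), and a healthy proxy record should return a public proxy address rather than a 172.16.x.x instance address:

$ dig +short codesearch.wmcloud.org @ns0.openstack.eqiad1.wikimediacloud.org          # existing proxy: public address expected
$ dig +short wmcz-stats-turnilo.wmcloud.org @ns0.openstack.eqiad1.wikimediacloud.org  # new proxy: wrongly returns 172.16.6.36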
[22:28:39] Write it up in Phab please :)
[22:28:41] filed as T279486
[22:28:41] T279486: Web proxies are resolved to internal IPs outside of WMCS network - https://phabricator.wikimedia.org/T279486
[22:28:54] interestingly it is the private IP of what I would assume to be the instance behind it
[22:29:03] $ dig -x 172.16.6.36 @ns0.openstack.eqiad1.wikimediacloud.org +short
[22:29:03] turnilo02.wmcz-stats.eqiad1.wikimedia.cloud.
[22:29:30] Krenair: affirmative, it's not even the proxy's private IP
[22:29:33] yeah
[22:32:13] Is that happening with existing proxies or just new ones?
[22:32:33] just new ones I think andrewbogott
[22:32:33] andrewbogott: https://codesearch.wmcloud.org/ works for me
[22:32:50] Great, I will look later on then
[22:33:15] w-beta.wmflabs.org works too
[22:33:20] so yeah just new I think
[22:33:42] note that's a wmcloud.org one - but dunno if that would actually make any difference
[22:34:47] It's kind of like the custom dashboard is suddenly picking the wrong value to set up the DNS
[22:34:56] Krenair: while you're around I'd appreciate your feedback on https://phabricator.wikimedia.org/T276650, since I believe you're behind the current deployment-docker-* architecture
[22:35:07] uh oh
[22:35:18] hmm
[22:35:32] I don't think I can claim credit for that
[22:35:44] I think I took an existing pattern and applied it in a bunch of places :)
[22:36:29] I remember proposing this idea (creating a beta k8s cluster) years ago and it being discouraged, because it's beta and who is going to take care of it
[22:37:39] that's my biggest worry as well
[22:38:05] The real point to hit on is that MediaWiki is supposedly moving to containers on k8s and if deployment-prep does not get a system to match up with that then who knows what will happen.
[22:38:52] have you spoken to WMF SRE Service Operations?
[22:39:10] legoktm replied on the task :)
[22:39:16] not yet
[22:39:20] he is service ops now
[22:40:04] * legoktm puts on his official serviceops hat
[22:40:54] there was some discussion about it somewhere when I filed that task, but I don't think any prod k8s maintainers were around then
[22:41:48] my understanding of the current status is that there is agreement/understanding that we need some kind of k8s cluster to replace the current beta cluster environment, but no real work has been done towards it
[22:42:22] I think that's because what prod MW k8s is supposed to look like is still not set in stone
[22:42:54] are we expecting that prod MW k8s will be the same k8s as prod uses for existing microservices?
[22:43:36] the plan is to have it in the same k8s cluster yes
[22:43:54] (there's a separate new ML k8s cluster that's...different)
[22:43:55] well, prod currently runs those services on k8s right?
[22:44:13] yes
[22:44:16] yes
[22:44:26] so it would make sense for beta to at least mirror that, right?
[22:45:06] yeah
[22:45:38] sure, then it comes to the question of who's going to do the work for that, and a feeling that beta needs an overhaul anyway when MW goes to k8s, and just waiting for that
[22:45:47] :\
[22:46:10] I do worry about how long it feels like this can has been collectively kicked down the road for
[22:46:46] even if the reasons always make some sense in isolation
[22:47:47] I mean it's only been since 2013 that no SREs help with deployment-prep
[22:47:51] I don't disagree with you
[22:48:09] I don't think that's entirely fair
[22:48:39] * bd808 waits to find out that there was a lot of support that he never heard of
[22:48:52] it's not entirely wrong either
[22:49:05] there have been some SREs who have helped with some aspects of deployment-prep
[22:49:58] not to the extent some of us would like
[22:50:11] I don't blame any individuals, but collectively there has been much more hostility to asks for help than help in my personal experience
[22:50:29] yeah probably
[22:51:48] Majavah: I suspect at some point we should have a think about which would be lower maintenance, the status quo or a prod-like k8s cluster
[22:52:06] and also what impact that would have on quota, if it's significantly more would WMCS be willing to support that, etc.
[22:52:53] as beta is not really used to test deployment, it's used to test software (if at all), staying at the status quo might not be a bad option
[22:53:37] as someone who has never set up a k8s cluster before, I don't think that's the hard part. I think the difficulty has always been in getting master deployed, and then maintaining all the various config, keeping the variations in sync with prod
[22:53:38] quota-wise I'm not even sure how much more we would need for k8s, since we'll likely be able to have more than one service per node
[22:54:11] Urbanecm: do you know how the MediaWiki code gets into deployment-prep? It's done with scap which is 100% the same tooling used in prod. I built that all out when I ported scap from bash/php to python
[22:54:35] bd808: I know. I'm talking about the hypothetical future when prod is fully on k8s, incl MW
[22:54:41] Majavah: we would but there's probably some extra overhead to provide a control plane. not sure how much
[22:55:43] control plane + etcd for data store + proxies for lb basically
[22:55:51] it's pretty minimal. for PAWS we are running the etcd nodes and k8s control nodes collapsed together. So it's like 3 small instances
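
For a sense of scale: a stacked control plane like the one described for PAWS (etcd colocated with the control nodes behind a load-balanced API endpoint) really does come down to a handful of small instances. A minimal sketch using kubeadm; the endpoint name is hypothetical, and this is not how PAWS, Toolforge, or production are actually built (those are puppet-managed), just an illustration of the footprint:

$ # first control-plane node; the endpoint would point at the haproxy instances
$ sudo kubeadm init --control-plane-endpoint "k8s-api.deployment-prep.example:6443" \
    --upload-certs --pod-network-cidr 192.168.0.0/16
$ # on the other two control-plane nodes, join with the token/cert values printed by the init run
$ sudo kubeadm join k8s-api.deployment-prep.example:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> --control-plane --certificate-key <key>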
[22:56:58] the k8s clusters that Toolforge and PAWS use are not built the same way that the production k8s clusters are
[22:57:13] a simple cluster shouldn't take that much, but of course it depends on how prod-like we want to have it
[22:57:37] it would be possible I think to adapt the PAWS setup into deployment-prep, but I don't know what would need to be added on to make things semi-prod like
[22:58:23] but I really have never understood the arguments against SREs testing in deployment-prep like the rest of us mortals
[22:58:37] given there's 7 deployment-docker-* instances it sounds like overall it could be a reduction in terms of quota usage
[22:59:29] the one I used to hear was how different deployment-prep was from prod, making it useless, the end result being to leave deployment-prep even less like prod, a bit self-fulfilling
[22:59:58] I think we also have a few dockerized services on instances that don't have "docker" in their names
[23:00:08] oh yeah probably true
[23:01:52] the current instances need to have the docker image versions manually changed in hiera, so any updates on the update process would be an improvement
[23:02:09] https://openstack-browser.toolforge.org/project/paws shows what is going on in paws right now. 3 control+etcd nodes, 2 haproxy nodes, 2 ingress nodes (which are really just isolated workload workers), and then a worker pool
[23:03:47] I suspect to make informed decisions someone might need to sit down and figure out how easy it is to take prod's k8s setup and apply the equivalent inside labs, and figure out what ongoing maintenance is needed, what is needed to keep up with prod, etc.
[23:05:05] 2am is definitely not the best time to make decisions for this
[23:05:17] yeah
[23:07:20] I think if beta is only 80% the same as prod, it's still incredibly valuable
[23:07:38] obviously 99% is an impossible goal but that's really not where the most value is gained
[23:10:19] imo having the same up-to-date services that prod has adds a significant amount of value
[23:11:19] the current setup for that is flawed, given that most services are running fairly old versions and new ones aren't always deployed when they are to prod
[23:14:24] also I'm still waiting for the requirements for https://phabricator.wikimedia.org/T278390 to be handled; experience with the toolforge cluster would probably be helpful if we end up making one in deployment-prep
[23:19:21] Hi. I've just created my first VPS instance (https://horizon.wikimedia.org/project/instances/43d3e9f2-b7d5-4a51-92b6-6a636839d66f/) but I can't log into it.
[23:19:55] I can ssh into bastion.wmcloud.org, but when I try to ssh into my instance, it times out.
[23:20:03] ssh roysmith@spi-tools-host-1.spi-tools.eqiad1.wikimedia.cloud
[23:20:14] hangs for about 30 seconds, then:
[23:20:19] ssh: connect to host spi-tools-host-1.spi-tools.eqiad1.wikimedia.cloud port 22: Connection timed out
[23:21:49] RoySmith: in Horizon you can click on the instance and see console output in the browser. if it's not done running puppet yet then you can't log in yet, but it will work in a couple minutes
[23:22:28] it needs to run a couple things before it's configured to accept logins from LDAP users
[23:22:44] easiest is to have a coffee and try again
[23:22:53] OK, I'll try again in 20 minutes, thanks.
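
While waiting, the failure mode can be narrowed down from the bastion itself. A minimal sketch: if the TCP connection times out (as the error above shows) the problem is in front of the instance — security groups or network policy — rather than sshd, puppet, or LDAP on it:

$ # probe only the TCP port; a timeout here points at filtering, "connection refused" would point at sshd
$ nc -vz -w 5 spi-tools-host-1.spi-tools.eqiad1.wikimedia.cloud 22
$ # verbose ssh shows whether the TCP connect ever succeeds before key/auth negotiation starts
$ ssh -v roysmith@spi-tools-host-1.spi-tools.eqiad1.wikimedia.cloud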
[23:22:58] np
[23:25:52] mutante: I once had a VM have role::mediawiki::appserver on the first boot, wasn't fun having to wait for the initial scap pull before being able to have its proper certs signed etc :/
[23:26:03] yeah
[23:26:15] I avoid letting instances have roles during first boot
[23:26:33] I've swapped prefixes out for per-instance roles in the past over it
[23:26:44] some roles just break first boot horribly
[23:26:56] I think that one was from some prefix that I didn't realise
[23:27:56] Majavah: recently I was upgrading mediawiki appservers in production, got asked how long it takes per host.. the answer was like "5 min to install OS, 90 minutes for first puppet run" and that's not very far off
[23:28:30] things break especially with cloud vps project puppetmasters, where the first run is with a different puppetmaster and secrets than the rest
[23:28:43] also in production I would never want the role on the host before the OS is installed. always applying role(insetup) first and other things later
[23:28:52] it's always better in the end
[23:29:42] so I confirm all that for Cloud VPS as well, first just bring the host up and make sure you can log in, later apply the role
[23:29:52] yep, but prefix puppet having roles will lead to surprises
[23:30:34] yea, I dunno if prefix puppet is helping more than it hurts
[23:31:12] RoySmith, did you delete and then re-create that instance of yours?
[23:31:15] could go to one of the ends of the scale.. either always hiera based on individual instance names.. or never
[23:31:42] hmm no I think I've just been fooled by puppet backdating certs by a day or something
[23:32:40] I can't explain why but when I make new instances it "feels" like sometimes I can log in almost instantly and other times it takes longer, separate from the role part
[23:33:28] but then I also have root access so I would ssh root@ and get in early
[23:33:36] probably worth taking a look in the console output to see what's up at this point
[23:33:47] it's not even responding to SSH at all, root@ isn't gonna do the trick
[23:34:05] in that case.. reboot it, delete it, try again
[23:34:25] I don't think I have special permissions that would let me see other projects' console logs anymore
[23:39:48] does the default network policy allow ssh coming from the bastion?
[23:40:00] i had the same problem, and added a rule in the default network policy
[23:40:38] Krenair, yes I did. I re-read the instructions and discovered I had not followed the suggested instance naming scheme, so I deleted it and recreated a new one.
[23:40:41] Hope that's OK.
[23:40:48] if a new project's default policy does not allow ssh from the bastion project instances then that's a bug :)
[23:40:52] interesting
[23:41:01] bd808: yet another one? :_)
[23:41:14] should probably work but I wouldn't count on being able to give two instances the same name, even if you deleted the first one
[23:41:27] what does the console output for your instance say?
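
Both of the checks suggested above can also be done from the command line. A minimal sketch, assuming OpenStack CLI credentials for the spi-tools project are already loaded in the environment:

$ openstack console log show spi-tools-host-1 | tail -n 40   # the same output Horizon shows in the browser console
$ openstack security group rule list default                 # should include a tcp/22 rule letting the bastion hosts in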
[23:41:28] wasn't sure, thought it was a feature
[23:42:14] I think we have mostly fixed the problems that reusing instance names used to cause, but also we did an OpenStack version upgrade and a Horizon version upgrade ~8 hours ago so there may be new regressions in pretty much anything
[23:42:27] The end of the log is:
[23:42:27] Last login: Tue Apr 6 23:11:09 UTC 2021 on ttyS0
[23:42:28] Linux spi-tools-host-1 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
[23:42:28] Debian GNU/Linux 10 (buster)
[23:42:29] The last Puppet run was at Tue Apr 6 23:11:59 UTC 2021 (0 minutes ago).
[23:42:29] Last puppet commit: (85b6cda290) Razzi - superset: Temporarily rollback victorops-analytics contact_group
[23:42:30] root@spi-tools-host-1:~#
[23:42:34] ok that's promising
[23:42:38] that looks like it should work
[23:42:55] and yet for some reason SSH is blocked from a bastion
[23:43:08] Krenair: I vote for the security policy issue I mentioned above
[23:43:15] could be security groups, could be iptables, could be sshd not running
[23:43:19] bit early to say
[23:43:47] * bd808 tries to get set up to look at RoySmith's problem in Horizon
[23:43:48] Well, I'm not in any real rush. I'll have more coffee and give puppet more time.
[23:43:55] I'm saying that because I was debugging the same issue a few hours ago, and that fixed it :)
[23:43:59] puppet won't be doing anything at this point
[23:44:01] but obv could be just a coincidence
[23:45:06] RoySmith: there is a bug in the security group rules (the one Urbanecm mentioned)
[23:45:15] ah
[23:45:30] let me take some notes first and then manually fix this for you
[23:45:36] well, not allowing anybody to log in is certainly a secure configuration :-)
[23:45:56] haha
[23:46:18] bd808: when we're on Victoria, can we add security policies to openstack-browser?
[23:47:09] Majavah: *shrug* can we not now? I never thought of making that very public
[23:47:20] there used to be a permissions issue around it IIRC
[23:47:48] ah, little while ago though: https://phabricator.wikimedia.org/T199272
[23:53:33] https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/portals/+/refs/heads/master/data/
[23:53:33] > WMF Labs Tools
[23:53:33] I haven't seen anyone butcher the name of Toolforge that badly
[23:55:03] RoySmith: security group updated. Does ssh work for you now? T279491
[23:55:04] T279491: [Regression] default security group for new project missing ssh rules for bastion - https://phabricator.wikimedia.org/T279491
[23:56:05] yup, I can connect now. I still can't log in because I've got some ssh key problems to sort out, but that's straightforward. Thanks for your help.
[23:57:15] cool. I got in with my "mortal" key after adding myself to the project so hopefully it will all just work once you sort out your key issue
[23:59:27] yup, I'm in.
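
For reference, the manual fix bd808 describes amounts to adding the missing ingress rule to the project's default security group. A minimal sketch with the OpenStack CLI; the source CIDR is illustrative only, and the real rule may reference the bastion project's security group rather than an address range:

$ # adds an ingress rule for tcp/22 to the project's "default" group
$ openstack security group rule create --protocol tcp --dst-port 22 --remote-ip 172.16.0.0/21 default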