[12:28:25] !log tools puppet broken in toolforge due to a refactor. Will be fixed in a bit
[12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:59:10] !log tools the puppet issue has been solved by reverting the code
[12:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:01:43] !log deployment-prep a puppet refactor for the aptly module may have caused some puppet issues. Should be solved now
[13:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[13:02:05] !log mwv-apt a puppet refactor for the aptly module may have caused some puppet issues. Should be solved now
[13:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwv-apt/SAL
[13:05:25] !log tools T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
[13:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:05:29] T207970: toolforge: add misctools and jobutils packages to stretch - https://phabricator.wikimedia.org/T207970
[13:22:18] !log tools T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
[13:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:22:22] T207970: toolforge: add misctools and jobutils packages to stretch - https://phabricator.wikimedia.org/T207970
[13:29:19] !log tools Changed active mail relay to tools-mail-02 (T209356)
[13:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:29:23] T209356: Point Toolforge to tools-mail-02 - https://phabricator.wikimedia.org/T209356
[13:32:53] !log tools pointed mail.tools.wmflabs.org to new IP 208.80.155.158
[13:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:28:07] https://discourse-mediawiki.wmflabs.org is down :(
[15:28:40] hm, proxy
[15:28:52] discourse project
[15:29:13] admins are: Austin, ebernhardson, tgr|away, qgil, and samwilson
[15:29:20] not sure who Austin is
[15:48:33] Krenair: fixed apparently
[15:53:08] !log wikilabels ran "sudo service uwsgi-wikilabels-web restart on wikilabels-02"
[15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL
[16:02:20] Hello, I'm trying to migrate from Trusty to Stretch. My current VM has 16GB RAM and my max allocation in Horizon is 20GB, so in order to copy it over I would need to request a temporary quota increase, is that correct?
[16:03:49] notconfusing: yes
[16:04:17] Great, I'll submit a phabricator ticket now.
[16:04:55] Hey folks. I'm having issues connecting to labsdb1004 from Wiki Labels. https://phabricator.wikimedia.org/T209381
[16:04:56] "psycopg2.OperationalError: FATAL: no pg_hba.conf entry for host "172.16.4.244", user "u_wikilabels", database "u_wikilabels", SSL off"
[16:05:08] Did anything change with labsdb1004 recently?
[16:05:55] Maybe wikilabels was migrated and now we need to add a new entry to allow our web client to connect?
[16:08:38] halfak: hmmm... pinging from tools-bastion-02 says "Packet filtered". Looks like that is coming from the network switch layer? -- "From ae2-1118.cr2-eqiad.wikimedia.org (10.64.20.3) icmp_seq=1 Packet filtered"
[16:10:04] pg_hba.conf includes IP space, so the issue is probably that 172.16.0.0/12 space is not listed there
[16:10:05] halfak: wikilabels was migrated on the 9th. You should have something in your inbox about that :)
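
For context on the error at [16:04:56]: pg_hba.conf rules match a database, a user, and a client address range, and the FATAL error means no existing rule covers the client's new 172.16.x.x address. A minimal sketch of the kind of entry being discussed, assuming the eqiad1-r instance range 172.16.0.0/21 and md5 password auth (the actual range, auth method, and SSL requirement on labsdb1004 may differ):

    # TYPE  DATABASE      USER          ADDRESS          METHOD
    host    u_wikilabels  u_wikilabels  172.16.0.0/21    md5
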
[16:10:10] that's all I know though
[16:10:24] packet filtered is router-related indeed, and that's probably ICMPs being blocked
[16:10:42] Right. I've been traveling and my team is offline at the moment. Not sure if this has been a problem for a long time or not.
[16:11:14] paravoid: I'm getting that ping response from a host in 10.68.16/21
[16:11:26] ICMP being blocked isn't necessarily a problem
[16:12:18] in a meeting, so I haven't looked into any configs or anything, just making educated guesses :)
[16:13:12] IMHO it's imperative to figure out next steps (or address) https://phabricator.wikimedia.org/T209011 before we start adding 172.16 ACLs to a bunch of different configs
[16:13:28] (like pg_hba.conf which is likely the problem here!)
[16:15:19] Hmm. I'd like to have the broken service restored ASAP. Would I find that pg_hba.conf managed by puppet?
[16:17:23] * halfak digs around puppet.
[16:22:27] bd808, who is the best person to talk to about labsdb1004?
[16:23:02] halfak: in WMCS team meeting right now. I will bring up your ticket and see if we can get somebody to help look into it
[16:23:35] Great. Thank you. I'll keep digging in the meantime.
[16:37:10] o/ bstorm_
[16:37:18] Let me know if I can help. Thanks for taking us on :)
[16:37:18] hm ideally ICMP wouldn't be blocked in this situation but I agree it's probably not that
[16:37:26] a quick grep shows modules/postgresql/manifests/user.pp is involved in all this but it's probably not the source of the problem
[16:38:25] no obvious suspects when looking at the stuff that includes postgresql::user
[16:38:37] wouldn't be surprised if those users are setup in the DB and unpuppetised
[16:39:05] there is stuff like
[16:39:13] modules/role/manifests/osm/master.pp postgresql::user { 'osm@labs':
[16:39:19] cidr => '10.68.16.0/21',
[16:39:35] same with a few others in there actually
[16:39:47] nothing for u_wikilabels however
[16:42:40] * halfak digs and confirms.
[16:45:33] Just looking for the right place. wikilabels might be locally configured
[16:49:16] ugh :(
[16:52:39] It's not :)
[16:52:41] Found it
[16:52:52] paravoid: ^
[16:53:42] oh ok
[16:53:45] where was it
[16:55:26] It's controlled by the postgres role, I believe. Just finding the correct way to add the cidr.
[16:57:14] I could be wrong...
[16:59:18] I really think we're kicking this can down the road by adding 172.16 space in a bunch of different places
[17:05:57] I can't think of another quick way to unblock this right now
[17:09:10] I'm trying a local change to see what, if anything, in puppet reverts it
[17:09:26] *sigh*
[17:09:27] awight, ^
[17:09:28] It's local
[17:09:29] paravoid: would you prefer if we add a floating IP so connections are from a public addr?
[17:09:37] halfak: it might be fixed
[17:09:46] hi!
[17:10:01] I went into meeting mode. awight can you kick wikilabels to see if it is fixed?
[17:10:34] This DB is an unusual quirk in the environment, but I think I'll at least make a subtask to look into puppetizing the values.
[17:10:35] paravoid: your concerns about long term solutions vs short term fixes are noted. We are at a place right now where we either make short term fixes or we stop moving things to eqiad1-r and actually roll back things there instead. arturo is going to take some time to work on a straw dog proposal for the long term fixes to discuss with you and others so we can figure out a timeline.
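
The grep results around [16:39:13] show how such grants are puppetized elsewhere, and at [17:10:34] bstorm_ mentions a subtask to puppetize the wikilabels values. A minimal sketch of what that could look like, modeled on the osm@labs example quoted above; the resource title and the user/database/password parameters are illustrative assumptions (only the cidr idea is taken from the log), and in this incident the grant was actually applied locally on labsdb1004 rather than via puppet:

    postgresql::user { 'u_wikilabels@wikilabels':
        user     => 'u_wikilabels',
        password => $wikilabels_db_password,  # hypothetical secret lookup
        database => 'u_wikilabels',
        cidr     => '172.16.0.0/21',          # new eqiad1-r instance range
    }
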
[17:10:36] I'm trying to catch up, so is the current situation that the wikilabels server and postgresql are on different subnets?
[17:10:49] halfak: Yes, I can kick the service
[17:11:20] bd808: I'm actually proposing both short-term and long-term fixes :)
[17:11:30] awight: no, not really. It's just that the wikilabels server is on a new subnet that I had to inform the postgresql server about
[17:11:38] the short-term fix to me would be to have an explicit whitelist of IPs (including potentially text-lb) that we don't NAT against
[17:12:04] rather than just having a blanket "all of production", pierce a bunch of holes in a bunch of places, then have to track these down in a few months again
[17:12:12] postgresql requires that information no matter what
[17:13:07] also where I am coming from is that I've been asking for those changes for a few years, was told that it was too complicated and that we'll address them with Neutron because it'd be easier that way
[17:13:45] so the idea that they're too complicated to do together with Neutron and we should do them separately is... backwards of what I've been told before :)
[17:14:22] but I'd be amenable to that if we figure out a plan soon and commit to some dates to complete rather than just push it off to the "long-term"
[17:15:33] awight: it looks like it is working again
[17:15:35] paravoid: there was a lot of discussion between Chase and Arzel as far as I know, but not apparently the exact issues you are now raising. And I agree that we should not just say "later" and instead have a more concrete plan
[17:15:56] \o/ bstorm_ wonderful, thanks for the note
[17:16:13] this goes back to conversations back in 2015 and 2016 :)
[17:19:06] paravoid: We hope to be moving these servers into the cloud
[17:19:18] Which doesn't address wider questions
[17:19:29] but that is a long term plan for these
[17:19:46] The virts are just having some HP raid fun.
[17:19:49] I like that idea :)
[17:20:07] these == user maintained databases exposed to Cloud clients
[17:20:11] yeah
[17:20:24] I think in the short/immediate-term we should document all of these edge cases as such and have an explicit whitelist of hosts we don't NAT to
[17:20:25] To be general as well as specific
[17:20:44] that would also allow us to identify those kind of issues before projects are migrated rather than after when users complain :)
[17:21:02] That isn't the case for labsdb* hosts?
[17:21:03] and would also help with not having more of these cases creep in without us realizing it
[17:22:02] I may be misunderstanding that of course :) I haven't dug into some of those discussions.
[17:22:35] it's complicated stuff, took a while to grasp and I don't think I'm there 100% yet
[17:22:50] https://phabricator.wikimedia.org/T174596 was resolved today
[17:22:56] and we have now this:
[17:22:57] hieradata/eqiad/profile/openstack/eqiad1/neutron.yaml:profile::openstack::eqiad1::neutron::dmz_cidr: '172.16.0.0/21:91.198.174.0/24,172.16.0.0/21:198.35.26.0/23,172.16.0.0/21:10.0.0.0/8,172.16.0.0/21:208.80.152.0/22,172.16.0.0/21:103.102.166.0/24,172.16.0.0/21:172.16.0.0/21'
[17:23:21] this basically says... don't NAT 172.16 space towards any destination in prod
[17:24:03] anyway, people are waiting for me, ttyl :)
[17:33:30] we have a meeting on 29th Nov to talk about neutron & networking stuff. Was focused on the transport renumbering, but we can probably convert the meeting into a more top-level discussion on where we want to go with CloudVPS
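
To unpack the dmz_cidr value quoted at [17:22:57]: it is a comma-separated list of source:destination pairs, and traffic matching a pair is exempted from source NAT, so the destination sees the instance's internal 172.16.x.x address rather than the routing IP. Split out one pair per line, with labels for the destination ranges that are my reading of the discussion above rather than an authoritative description of the setup:

    172.16.0.0/21 -> 91.198.174.0/24    # esams public range (assumed)
    172.16.0.0/21 -> 198.35.26.0/23     # ulsfo public range (assumed)
    172.16.0.0/21 -> 10.0.0.0/8         # production private address space
    172.16.0.0/21 -> 208.80.152.0/22    # eqiad/codfw public range (assumed)
    172.16.0.0/21 -> 103.102.166.0/24   # eqsin public range (assumed)
    172.16.0.0/21 -> 172.16.0.0/21      # instance-to-instance traffic
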
[17:40:37] !log tools remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
[17:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:40:42] T207970: toolforge: add misctools and jobutils packages to stretch - https://phabricator.wikimedia.org/T207970
[18:11:59] reading all of this :) I think part of the complication is 'should instance IP be preserved to addresses in public VLANs for wikimedia infrastructure?' has never been officially decided to my knowledge. Separate from the question 'should instance IP be preserved to private VLANs for wikimedia infrastructure?' which was talked about here https://phabricator.wikimedia.org/T174596#4530767
[18:12:37] !log integration moving integration-slave-docker-1034 to eqiad1-r
[18:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL
[18:14:50] !log integration moving integration-slave-docker-1038 to eqiad1-r
[18:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL
[18:18:29] paravoid: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_ideal_model edits welcome so we are all on the same track
[18:24:36] andrewbogott hi, im wondering if we could prioritise https://phabricator.wikimedia.org/T41785 please? Since it's causing gerrit to throw an exception every time someone tries to add an email
[18:26:01] GCI students are trying to use it.
[18:26:18] im getting
[18:26:19] com.google.gerrit.common.errors.EmailException: Mail Error: Server localhost rejected recipient moltenshard@gmail.com: 550 Administrative prohibition
[18:26:24] oh
[18:26:32] paladox, alternative solution would be to create a repository on production gerrit
[18:26:34] (maybe simpler)
[18:26:47] meh, wrong paste "com.google.gerrit.common.errors.EmailException: Mail Error: Server localhost rejected recipient xxx: 550 Administrative prohibition"
[18:26:54] Urbanecm hmm?
[18:27:23] paladox, I meant to create a test repository on production gerrit and tell GCI students to use this test repo on prod gerrit
[18:27:45] ah yeh, you will need to ask releng (as im not sure if test repos are allowed)
[18:32:07] yeah...
[18:45:45] !log git applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/472720 (and restarting gerrit on gerrit-test3)
[18:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[19:01:28] Okay! Cloud Services survey distributed.
[19:01:40] zhuyifei1999_: you asked for a reminder when the survey went out. check your email :)
[19:05:34] ok :)
[19:21:23] !log shinken moving shinken-02 to eqiad1-r
[19:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL
[19:32:10] !log iiifls1aio deleting project and instances — project is abandoned.
[19:32:11] andrewbogott: Unknown project "iiifls1aio"
[20:12:12] harej: sweet!
[21:43:50] !log deployment-prep moving deployment-elastic05 to a new labvirt to clear out labvirt1016
[21:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:50:01] !log deployment-prep moving deployment-dumps-puppetmaster02 to a new labvirt
[21:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:55:01] !log deployment-prep moving deployment-webperf12 to a new labvirt
[21:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:59:57] !log deployment-prep moving deployment-deploy02 to another labvirt
[22:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[22:19:34] !log deployment-prep moving deployment-urldownloader02 to labvirt1012
[22:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[22:26:31] andrewbogott, how's elastic05 doing?
[22:28:08] Krenair: there was a network outage and I lost track of everything, give me a minute...
[22:28:19] fun
[22:28:41] was just wondering if all those mentioned above were still in flight or...
[22:28:58] search still appears to work on beta enwiki so
[22:29:14] Looks like 05 is done moving
[22:29:21] deploy02 and elastic07 still in progress
[22:31:23] so just elastic07 to go?
[22:31:31] oh right that was on your list of in progress
[22:31:36] * Krenair did not successfully read.
[22:41:24] bd808: i already booked a grants office hours session with chris :D
[22:41:39] harej: oh good!