[00:01:01] I know there is something simple that a.ndrewbogott does to increase logging verbosity when helping people figure out auth problems with horizon...
[00:02:50] dunno about horizon, we used to have some config available to turn up LDAP logging on wikitech though, I could be confusing it with that
[00:07:02] there is a LOGGING thing in ./modules/openstack/templates/ocata/horizon/local_settings.py.erb but I wouldn't describe it as simple
[00:07:35] heh, yeah. Looking at the source now to see if there is any useful log that may be being discarded
[00:07:59] it's "simple" in that it's just a Python logging config
[00:19:40] ebernhardson: I figured out how to turn the logging volume up. When you get a chance, try logging in again and hopefully I will see some specific error messages this time.
[00:19:47] bd808: alright, sec
[00:20:16] bd808: done
[00:20:40] * bd808 shakes fist at puppet
[00:20:57] it undid my hack sooner than I'd hoped... 1 minute please :)
[00:22:01] lol, I had the same problem with logstash today
[00:23:17] ebernhardson: ok, please try again
[00:23:43] bd808: done
[00:24:46] hmmm.. not a lot more help -- "2019-02-27 00:23:38.941528 The request you have made requires authentication. (HTTP 401)"
[00:25:11] so ... it's a step further away, apparently in the keystone service.
[00:28:30] and the log on the keystone side is equally worthless :/
[00:31:58] I could poke andrew tomorrow?
[00:32:52] ebernhardson: I think that may be your best bet. Maybe create a phab task and I'll make sure he sees it?
[00:32:57] alright
[00:33:13] * bd808 also will ask for a runbook for debugging these in the future
[00:34:33] created https://phabricator.wikimedia.org/T217216
[00:49:28] are we having issues with the login bastions? I'm running an svn command that seems to be dragging
[00:51:04] nothing that I'm aware of, Betacommand. Which bastion are you on?
[00:51:20] NFS is still a bit slow everywhere unfortunately
[00:51:29] looks like 07
[00:51:55] * bd808 takes a look
[00:54:27] seems to be better now
[00:54:57] it feels like it is going back and forth, but I'm not seeing any particular thing to blame it on
[00:55:32] * bd808 sees a big NFS copy happening now in nethogs
[00:55:39] trying to get things back in sync with my data repos
[00:56:29] Betacommand: you can try using stretch-dev.tools.wmflabs.org instead and see if anything is better for you there
[00:56:46] * bd808 has sneakily not told everyone about that bastion
[00:59:36] bd808: All I have left is an svn up command when I'm done on my local machine, so this was just making sure it wasn't the tip of the iceberg
[01:07:33] bd808: Krenair: so. I probably won't get this fix in until tomorrow. But I have rebooted the bot in the meantime. I will patch it quickly and clear out all of the large files.
[01:08:46] Log sizes are not growing so fast as to pose a threat to the NFS for at least 24 hours.
[01:10:07] Cyberpower678: can you hot patch the running code?
[01:10:38] bd808: not really. It's a composer-provided package. If I overwrite it, I run into issues.
[01:10:44] At least I did last time.
[01:10:55] issues of ... it being hard to update later?
[01:11:33] once the files are on disk the PHP runtime doesn't care where they came from
[01:11:36] bd808: I think composer errored out because the file was modified. It may have been a composer bug, but I had to wipe out the vendor directory and reinstall.
[01:13:12] sure, but that's what composer is good at (recreating state from its composer.lock file)
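The LOGGING thing mentioned at [00:07:02] is a standard Django logging dict, which is why "turning the logging volume up" amounts to editing handler and logger levels. Below is a minimal sketch of the kind of override that could go in a Horizon local_settings.py; it is not the contents of the actual puppet template, and the logger names are assumptions about what Horizon's auth path logs through.

    # Hedged sketch, not the real local_settings.py.erb: the logger names
    # ('keystoneclient', 'openstack_auth') are assumptions about where
    # Horizon's auth chatter goes. StreamHandler output ends up in the
    # web server's error log.
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'handlers': {
            'console': {'class': 'logging.StreamHandler'},
        },
        'loggers': {
            'keystoneclient': {'handlers': ['console'], 'level': 'DEBUG'},
            'openstack_auth': {'handlers': ['console'], 'level': 'DEBUG'},
            'django': {'handlers': ['console'], 'level': 'INFO'},
        },
    }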
[01:26:47] !log tools Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs (T217152)
[01:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[01:26:51] T217152: Monitor and scale in the Trusty grid - https://phabricator.wikimedia.org/T217152
[01:29:34] !log tools Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs (T217152)
[01:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[01:35:52] bd808: I am getting permission denied on some of my own files :/
[01:35:58] !log tools Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon (T217152)
[01:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[01:36:01] T217152: Monitor and scale in the Trusty grid - https://phabricator.wikimedia.org/T217152
[01:36:22] Betacommand: scp problems again or something else?
[01:36:32] I have 5 files in tspywiki that I own instead of the tool, and I cannot change that
[01:36:47] `take` doesn't work?
[01:36:52] take
[01:36:57] was trying chown
[01:37:47] we have a local utility called take that uses sudo tricks to re-parent files.
[01:37:55] ah
[01:37:56] `take filename`
[03:36:35] bd808: the change should be live now
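The `take` utility mentioned at [01:37:47] is a local Toolforge helper whose actual implementation is not shown in this log. Conceptually it re-parents a file tree to the invoking user; a hypothetical Python sketch follows, assuming it is run via sudo so the caller's identity is available in SUDO_UID/SUDO_GID. This is an illustration of the idea, not the real tool.

    #!/usr/bin/env python3
    # Hypothetical take-like helper, NOT the real Toolforge `take`.
    # Assumes sudo exposes the invoking user's ids in the environment.
    import os
    import sys

    def take(path, uid, gid):
        """Recursively chown path and everything under it to uid:gid."""
        os.chown(path, uid, gid)
        for root, dirs, files in os.walk(path):
            for name in dirs + files:
                os.chown(os.path.join(root, name), uid, gid)

    if __name__ == '__main__':
        uid = int(os.environ.get('SUDO_UID', os.getuid()))
        gid = int(os.environ.get('SUDO_GID', os.getgid()))
        for target in sys.argv[1:]:
            take(target, uid, gid)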
[09:51:55] I'm noticing that https://tools.wmflabs.org/openrefine-wikidata/ is quite slow. Are there any known issues at the moment? NFS maybe?
[10:30:06] pintoch: it does seem slow, I'm checking
[10:39:36] pintoch: NFS/toolsdb seem fine. It seems more like a CPU usage thing; I see 4-5 uwsgi/python processes for openrefine consuming all the CPU on a Kubernetes node
[10:41:31] gtirloni: thanks a lot for looking into that!
[10:42:19] np. If you want, I can restart the tool.. it does seem to be doing some work though, but I don't know what the normal workload is
[10:42:22] I wonder why these processes are consuming all the CPU; I did not make any changes recently
[10:42:48] let me check the logs
[10:42:49] I restarted the webservice earlier today to try to solve the issue
[10:45:12] other tools running on the same server seem fine, I'll look around a bit, one moment
[10:45:37] gtirloni: ah, it seems that the service is getting large queries from some user
[10:46:05] JSON deserialization fails with "ValueError: Expecting property name enclosed in double quotes: line 7579633 column 1 (char 181772288)"
[10:47:22] ah no, it's data coming from the query service that is causing the issue
[10:48:39] tools.wmflabs.org 24.134.105.189 - - [27/Feb/2019:10:48:02 +0000] "POST /openrefine-wikidata/en/api HTTP/1.1" 200 220 "-" "Java/1.8.0_191"
[10:48:50] could this be it?
[10:49:03] every few seconds
[10:50:32] that looks normal to me
[10:50:44] however, I am wondering if there isn't a problem with the Wikidata Query Service
[10:51:41] https://tinyurl.com/yyj92xnb does not render correctly for me
[10:54:36] so today, openrefine-wikidata failed to answer requests and timed out 415 times so far, but I can't find a pattern in the requests that failed (yet) -- even some GET requests for .png files timed out
[10:56:07] (forget about that, my query is wrong)
[10:57:20] I'm monitoring the processes, let's see if something comes up
[10:58:53] gtirloni: thank you so much! Now I'm pretty convinced this is an issue with my code - the environment seems to be fine :)
[10:59:15] I'll try a few things and let you know if I get stuck
[11:00:33] pintoch: no worries, let us know if we can help. Just so you can see what I'm seeing: https://phabricator.wikimedia.org/P8136 (user 53287 is openrefine)
[12:27:28] so, I have tried to speed things up a bit in the code, but that does not solve the problem
[12:36:23] pintoch, gtirloni: I've seen other users complaining of OpenRefine slowness today on Telegram groups
[12:37:01] chicocvenancio: that's useful to know
[12:37:10] yes, it's impacting the user experience quite a bit
[12:38:25] gtirloni: I have ruled out the possible sources of slowdown in the code. I am still unsure about why even static files are slow to load
[12:39:34] pintoch: yeah, that's a good point
[12:40:05] I'll take a second look in a moment
[12:41:08] thank you! there is no rush
[12:42:48] for what it's worth, I have monitoring tools which measure the time spent doing actual work in the tool (executing "reconciliation queries"), and that statistic is normal today (around 0.6 sec per query, which is not unusual)
[12:43:48] so that suggests that the slowdown is somewhere between the server and the uwsgi process, I guess (which is consistent with the slowness of HTTP requests which are not computationally expensive)
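The kind of instrumentation pintoch describes at [12:42:48] -- timing only the in-handler work so it can be compared against end-to-end response times -- can be approximated with a small decorator. This is an illustrative sketch, not the openrefine-wikidata code; the function and logger names are made up.

    # Illustrative sketch of per-query work timing; names are hypothetical.
    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger('reconcile.timing')

    def timed(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # elapsed time covers only the handler's own work
                log.info('%s took %.3fs', func.__name__, time.monotonic() - start)
        return wrapper

    @timed
    def reconcile_query(query):
        ...  # the actual reconciliation work would happen here

If this number stays near 0.6 s per query while wall-clock responses are much slower, the delay has to be upstream of the application code (for example requests queueing in front of busy uwsgi workers), which is the conclusion reached above.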
[14:36:19] bd808: fyi cloudvirt1018 is spitting icinga alerts (see -operations)
[14:43:55] I silenced it -- I don't know why it decided to alert today, but it's a work in progress
[14:43:59] it's not hosting any VMs
[15:01:29] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @bmansurov & @Thiemo_WMDE - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[15:40:19] !log tools moving tools-worker-1002, 1005, 1028 to eqiad1-r
[15:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:51:10] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @bmansurov & @Thiemo_WMDE - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[16:20:10] !log tools regenerating k8s creds for tools.whichsub & tools.permission-denied-test T176027
[16:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:20:14] T176027: namespaces "wdq_checker" not found error when trying to start webservice shell - https://phabricator.wikimedia.org/T176027
[19:34:05] !log tools uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
[19:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:37:21] chicocvenancio, gtirloni: the service seems to be back up to normal speed - no clue why, but I guess it's good news
[19:37:53] what is eqiad1-r?
[19:44:01] pintoch: Great, could you note that on the task?
[20:02:57] Zppix: the new network region based on Neutron (iirc)
[20:18:23] question.. is there a firewall between eqiad and eqiad1 that would prevent my hosts from communicating with each other?
[20:22:31] afaik, it's just that the usual 'no firewall within a project' behavior doesn't apply. They need explicit security group settings
[20:25:25] zhuyifei1999_: ah, so if I configure security groups, then I can make inter-cluster communication work.. I'll try that....
[20:27:11] why not just move entirely to eqiad1-r?
[20:27:59] zhuyifei1999_: heh, been working on it since early December ;)
[20:28:25] ok, like, too-many-hosts-to-move?
[20:28:27] too many hosts, doing too many things that I don't understand. One step at a time :)
[20:28:35] I see
[20:41:25] !log tools restarting nginx on tools-checker-01
[20:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:47:26] seems to work...
[20:47:56] but it seems security groups on the old region no longer work or something...
[20:56:02] bd808: any idea why I'm getting horizon errors trying to configure security groups for the old eqiad region?
[20:56:12] I try to add a rule and it doesn't show up in the group.
[20:57:24] if I try to add it again, it says: Security group rule already exists.
[20:58:04] lol 'modified instance "unknown instance"'
[21:04:14] thedj: yuck. I don't know if we have an open bug about security group edits in the old region failing or if that is a new fun thing just for your project. I'll see if I can spot any issues. Sometimes it's a hidden quota problem
[21:04:48] bd808: I'm trying to open port 4949 in the munin group
[21:05:39] for traffic from 172.16.5.154
[21:06:07] connection towards 10.68.16.103 and 10.68.17.110
[21:07:01] works the other way around. Always fun ;)
[21:07:55] thedj: hmmm.. I think it worked for me. Check out https://horizon.wikimedia.org/project/security_groups/983/ and see if that's mostly what you wanted
[21:09:01] yup, exactly.. weird....
[21:09:21] maybe I need to log out and back in or something
[21:09:48] anyway, I think I now have all services running (although not clustered yet) and, more importantly, I think I now understand them. I hope to switch everything from old eqiad to new tomorrow
[21:26:15] thedj: magic! Thanks for all your work to keep the maps project running
[21:50:12] cool, all munin now running bidirectionally
[22:00:22] PS: not sure who stopped the maps-tiles2 instance on the 21st, or why, but I have restarted it, as I still needed to glean config info from it.
[22:03:19] thedj, that is a Trusty instance in the old main/eqiad deployment
[22:03:59] maybe someone decided to find out if anyone still needs it
[22:04:09] if you're capable of starting it, you can probably look at the action log in Horizon to find out exactly who and when
[22:05:06] doesn't really matter.
[22:05:11] well
[22:05:18] whoever it is might like to know that you started it again
[22:05:23] novaadmin, so...
[22:05:28] oh
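For reference, the munin rule thedj was fighting with around [20:56]-[21:07] (TCP 4949 from 172.16.5.154) can also be created outside Horizon, e.g. with the openstacksdk Python library. A hedged sketch follows; the cloud profile name is an assumption, and the actual fix in this incident was made by bd808 through Horizon.

    # Hypothetical sketch with openstacksdk; the 'cloud' profile name is
    # an assumption, and this is not the command that was actually run.
    import openstack

    conn = openstack.connect(cloud='my-cloud')  # profile from clouds.yaml
    group = conn.network.find_security_group('munin')
    conn.network.create_security_group_rule(
        security_group_id=group.id,
        direction='ingress',
        ether_type='IPv4',
        protocol='tcp',
        port_range_min=4949,
        port_range_max=4949,
        remote_ip_prefix='172.16.5.154/32',  # munin master, per [21:05:39]
    )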