[00:06:21] dbrant, well the semi-user-facing aspect that changed recently was moving from use of the old letsencrypt puppetisation, where we generated a huge list of SANs for all beta domains and did http-01 verification on them, to acme-chief with a designate DNS integration script that gets wildcard certs with dns-01 verification
[00:07:46] I guess it's possible that it does not appreciate something new in the cert, but I doubt it
[00:08:51] I assume you haven't changed java versions or any SSL settings recently
[00:09:11] nope
[00:09:29] and the folks at the Commons App project are reporting a similar issue
[00:12:33] maybe we could take it to -traffic if we frame this in the right way
[00:13:18] beta is testing getting the unified cert via acme-chief
[00:13:47] as part of this we've discovered that certain java installs do not trust the site anymore
[00:14:39] or at least we assume it to be caused by that
[00:18:09] at least we assume it to be caused by this
[00:22:26] sounds good. I've got a phab task going, too.
[00:22:31] https://phabricator.wikimedia.org/T221171
[00:22:55] but also that we still don't fully understand what's going on, because the same client seems able to talk fine to a prod misc site which also uses certs issued this way, and some clients complain about different things - android seemed upset about revocation checking
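
(Editor's note: for context on the dns-01 verification discussed above, the client proves control of a domain by publishing a SHA-256 digest of the ACME key authorization in a TXT record. Below is a minimal Python sketch of that computation, following RFC 8555 section 8.4; the function name and the publication comment are illustrative, not acme-chief's actual code.)

    import base64
    import hashlib

    def dns01_txt_value(token: str, account_thumbprint: str) -> str:
        # RFC 8555 section 8.4: the TXT record at _acme-challenge.<domain>
        # holds the unpadded base64url-encoded SHA-256 digest of the key
        # authorization "<token>.<account key thumbprint>".
        key_authorization = f"{token}.{account_thumbprint}".encode("ascii")
        digest = hashlib.sha256(key_authorization).digest()
        return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

    # A designate integration script would publish this value via the DNS
    # API before asking the CA to validate (an assumption about the flow).
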
[08:56:58] !log tools T221205 disable puppet in all tools-sgewebgrid-* nodes
[08:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:57:02] T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes - https://phabricator.wikimedia.org/T221205
[09:00:09] !log tools T221205 add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
[09:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:19] !log tools T221205 start deploying sssd to sgewebgrid nodes
[09:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:12:23] T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes - https://phabricator.wikimedia.org/T221205
[09:45:43] !log tools T221205 tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
[09:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:45:47] T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes - https://phabricator.wikimedia.org/T221205
[09:52:22] !log tools T221205 tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
[09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:52:25] T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes - https://phabricator.wikimedia.org/T221205
[11:24:20] !log tools disable puppet in bastions to deploy sssd
[11:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:30:38] !log tools deploy sssd to bastions
[11:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:31:20] !log tools reboot bastions for sssd deployment
[11:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:38:17] !log tools deploy sssd to tools-sge-services-03/04 (includes reboot)
[11:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:48:22] When I type 'become tool_name' I get feedback saying "sudo: a password is required". Is there any particular reason? Because I used the command a few minutes ago
[11:48:36] Eugene233: let me check
[11:48:47] I just made a change in the bastions that may be related
[11:49:37] Eugene233:
[11:49:39] https://www.irccloud.com/pastebin/o4BmuObD/
[11:49:54] which exact command are you typing?
[11:51:56] arturo: become isa
[11:53:16] Eugene233: I see this
[11:53:19] eugene23@tools-sgebastion-07:~$ become isa
[11:53:19] You are not a member of the group tools.isa.
[11:53:19] Any existing member of the tool's group can add you to that.
[11:54:37] @arturo: seriously?
[11:55:48] It worked for me a couple of minutes ago
[11:56:29] @arturo: I can still see myself as a maintainer of the tool.
[11:57:00] I’m also getting an error for e.g. `become wdvd` even though `groups` tells me I’m in tools.wdvd
[11:57:16] you are right
[11:57:45] also
[11:57:48] aborrero@tools-sgebastion-07:~$ getent group tools.isa
[11:57:48] tools.isa:*:54010:eugene233,navino
[11:58:58] !log tools T221205 sssd was deployed successfully to all webgrid nodes
[11:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:59:02] T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes - https://phabricator.wikimedia.org/T221205
[11:59:04] I will revert my change, Lucas_WMDE Eugene233
[12:05:43] and I just created T221225 as well
[12:05:44] T221225: Toolforge: deploying sssd to bastions - https://phabricator.wikimedia.org/T221225
[12:08:28] it’s working again, yay
[12:08:53] !log tools T221225 rebooting bastions to clean up sssd. We are back to nscd/nslcd until we figure out what's wrong here
[12:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:10:08] huh, okay
[12:10:18] I thought you were already done fixing the issue
[12:10:41] Lucas_WMDE: the reboot should be the last step, just to make sure we are fine and sssd is gone :-)
[12:10:47] ah, okay :)
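
(Editor's note: the `become` failure above is an NSS lookup problem on the bastion after the sssd switch: sudo stopped seeing the tool-group membership even though `getent group` still listed the user. A quick way to check what NSS currently returns for a tool group, as a Python sketch; the user and tool names are taken from the log.)

    import grp

    def is_tool_member(user: str, tool: str) -> bool:
        # grp.getgrnam() resolves the group through NSS (nscd/nslcd or sssd,
        # whichever /etc/nsswitch.conf selects) and raises KeyError if the
        # group cannot be resolved at all.
        group = grp.getgrnam(f"tools.{tool}")
        return user in group.gr_mem

    # Matches the getent output above: tools.isa:*:54010:eugene233,navino
    print(is_tool_member("eugene233", "isa"))
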
[13:40:35] !log wikidata-dev wikidata-shex `git pull` in /srv/mediawiki-vagrant, then `vagrant reload` (T221231)
[13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev/SAL
[13:40:39] T221231: wikidata-shex demo system is down - https://phabricator.wikimedia.org/T221231
[13:41:05] !log wikidata-dev wikidata-shex vagrant provision (T221231)
[13:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev/SAL
[13:47:38] !help I tried to update vagrant in a Cloud VPS instance and now it says “no wiki found” – “sorry, we were not able to work out what wiki you were trying to view”
[13:47:38] Lucas_WMDE: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[13:51:25] Lucas_WMDE sudo service apache2 restart
[13:51:40] I've had that issue before (and doing ^^ fixed it)
[13:52:11] !log wikidata-dev wikidata-shex sudo systemctl restart apache2
[13:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-dev/SAL
[13:52:17] that did the trick, thank you <3
[13:52:27] you're welcome :)
[14:01:06] Technical Advice IRC meeting starting in 60 minutes in channel #wikimedia-tech, hosts: @amir1 & @subbu - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[14:50:51] Technical Advice IRC meeting starting in 10 minutes in channel #wikimedia-tech, hosts: @chiborg & @milimetric - all questions welcome, more info: https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
[17:17:24] !log deployment-prep cherry-picking https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504580/ to move off of the soon-to-be-shut-down DNS recursors
[17:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[17:49:34] Can someone help me figure out why my webservice was off?
[17:50:15] Cyberpower678: I have around 10 mins of time. what's up?
[17:56:23] zhuyifei1999_: I received a phab ticket just a moment ago claiming the tool was coming back with 503 No Webservice
[17:56:45] The webservice was indeed not active, but when I tried to load it, I got a 404 as well.
[17:56:53] which tool?
[17:57:18] Restarting the webservice fixed things, but iabot's webservice has never once gone down without coming back up on its own.
[17:57:48] Actually I think IABot's webservice doesn't go down at all.
[17:58:06] So what happened here?
[17:58:50] tools.iabot?
[17:59:55] zhuyifei1999_: yes
[18:01:27] Cyberpower678: it would be easier to help if you had concrete questions like "what does X in my error logs mean?" If you have already done a `webservice restart` there probably are not a lot of clues about what was malfunctioning before that
[18:01:50] (sorry gtg now)
[18:02:14] also being clear about the backend (grid engine or kubernetes) is important. They have completely different failure modes/causes
[18:02:28] bd808: Well, I don't know where a webservice failure is logged. It would help to know that first. ;-)
[18:03:09] that 100% depends on the failure. First place to look is always $HOME/error.log
[18:03:25] Then $HOME/service.log if it exists
[18:03:37] then it depends on Kubernetes vs grid engine
[18:04:17] bd808: here
[18:04:20] 2019-04-17 10:38:52: (server.c.1751) [note] graceful shutdown started
[18:04:20] 2019-04-17 10:38:52: (server.c.1828) server stopped by UID = 53156 PID = 31052
[18:04:20] Traceback (most recent call last):
[18:04:22] File "/usr/bin/webservice-runner", line 30, in <module>
[18:04:24] webservice.run(port)
[18:04:26] File "/usr/lib/python2.7/dist-packages/toollabs/webservice/services/lighttpdwebservice.py", line 654, in run
[18:04:28] with open(config_path, 'w') as f:
[18:04:30] IOError: [Errno 13] Permission denied: '/var/run/lighttpd/iabot'
[18:04:40] Looks like someone stopped it. But it wasn't me.
[18:07:07] Webservices can be stopped and restarted by system maintenance. When we do work on an exec node in either the grid engine or kubernetes we "drain" off the running jobs by killing them and then letting the proper scheduler restart things on an active node
[18:07:27] Mine wasn't restarted, however
[18:07:37] It's been down for 7 hours apparently.
[18:07:37] that stack trace at the bottom is a sign of an NFS hiccup
[18:10:00] bd808: so any ideas why the scheduler neglected to schedule mine for a restart?
[18:10:11] Cyberpower678: you still have never specified if this is Kubernetes or grid engine?
[18:10:20] No idea.
[18:10:25] seriously?
[18:10:37] Whatever "webservice start" spawns it on
[18:10:46] I'm on stretch
[18:10:58] today, that's grid engine
[18:11:16] "soon" it will default to kubernetes instead
[18:11:53] Cool. "webservice start" doesn't output much other than that it started the webservice.
[18:13:36] here's what I think happened: ar.turo was working on grid engine nodes earlier today -- https://tools.wmflabs.org/sal/tools -- the node your job was running on was depooled. then when the scheduler tried to start the job on a new node it failed because of an NFS error (that may have been transient).
[18:13:55] bd808: How can I tell what's on Kubernetes and what's on the grid engine?
[18:14:31] Open question is whether the "watchdog" service for the grid tried to restart it again as intended, or whether that process broke somehow
[18:15:08] bd808: is the watchdog logging somewhere?
[18:15:50] Cyberpower678: $HOME/service.log
[18:16:15] which does not seem to have any recent entries for tools.iabot
[18:16:42] bd808: the last entry is 2019-04-06T00:52:24.621536 No running webservice job found, attempting to start it
[18:17:23] for the grid vs kubernetes question, $HOME/service.manifest should have a "backend: ..." entry that tells you
[18:18:09] :-)
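
(Editor's note: a small illustration of the backend check bd808 describes, reading the `backend:` entry out of `$HOME/service.manifest`. A sketch only, under the assumption that the manifest is a simple key: value text file; the real tooling may parse it as YAML.)

    import pathlib

    def webservice_backend(home: str = str(pathlib.Path.home())) -> str:
        # service.manifest is maintained by the webservice tooling; it
        # contains a line like "backend: gridengine" or "backend: kubernetes".
        manifest = pathlib.Path(home) / "service.manifest"
        for line in manifest.read_text().splitlines():
            if line.strip().startswith("backend:"):
                return line.split(":", 1)[1].strip()
        return "unknown (no backend entry found)"

    print(webservice_backend())
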
[20:47:58] hi! my webservice dies sometimes because there is a bug in a php script. how do I restart it automatically when this occurs?
[20:52:41] Sveta: good question! I would hope that if you are running your webservice with --backend=kubernetes it would restart automatically. This actually should be the case with --backend=gridengine too, but the kubernetes backend is a bit more robust in restarting things.
[20:53:22] Both systems would only restart it, however, if the bug produces a hard crash of the php process.
[20:53:44] The best answer is "fix the bug", but I know that's not always easy to do
[20:54:58] Sveta: can you tell me more about how you know it needs to be restarted? Is there something that you can see in the $HOME/error.log, or does it just stop responding to requests, or something else?
[21:51:35] hey, I'm having a problem with one of the staging servers (readingwebstaging). I run `vagrant git-update` and most of the composer update ends with "Killed"
[21:52:09] I tried to run composer with `php -d memory_limit=-1`, but that didn't help
[21:53:28] I tried many things, nothing worked, and honestly I don't know what to do next
[22:04:14] pmiazga: hmmm... OOMKiller is stopping the composer update? Or something else?
[22:05:17] no idea
[22:05:20] it just returns "Killed"
[22:05:39] readingwebstaging.reading-web-staging.eqiad.wmflabs
[22:06:11] I log in as mwvagrant, go to the vagrant dir, and call `vagrant git-update`
[22:07:01] *nod* I'm looking in /var/log/syslog. It looks like cgroup limits are the issue
[22:07:18] "Memory cgroup out of memory: Kill process 31747 (php7.0) score 121 or sacrifice child"
[22:08:17] the cgroup there is probably the LXC container that mediawiki-vagrant is running things inside of
[22:09:05] yup, that vagrant runs in lxc
[22:09:13] how can I fix it?
[22:09:37] the php in vagrant has memory_limit set to -1 (in cli)
[22:09:51] that's what made me think wtf
[22:11:09] `/usr/local/bin/mwvagrant config --get vagrant_ram` says that only 1.5G is allocated to the LXC container. I think you could try bumping that up to 2.5G. The host has 4G available
[22:11:18] s/available/total/
[22:12:01] so something like: vagrant config vagrant_ram 2560; vagrant reload
[22:21:41] bd808 - thx, I'll check that
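
(Editor's note: the "Killed" output here comes from the kernel's cgroup OOM killer, not from PHP itself, which is why `php -d memory_limit=-1` made no difference. A hedged sketch of confirming the container's memory ceiling from inside it; this assumes cgroup v1 paths, which fit hosts of this era, and would need /sys/fs/cgroup/memory.max under cgroup v2.)

    def cgroup_memory_limit_bytes() -> int:
        # cgroup v1 exposes the container's hard memory ceiling here;
        # processes are OOM-killed once the group's usage crosses it.
        with open("/sys/fs/cgroup/memory/memory.limit_in_bytes") as f:
            return int(f.read().strip())

    limit = cgroup_memory_limit_bytes()
    print(f"container memory limit: {limit / 2**20:.0f} MiB")
    # PHP's memory_limit=-1 only lifts PHP's internal cap; the kernel limit
    # above still applies, hence bumping vagrant_ram in the host config.
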