[02:25:24] !log paws removed webproxies and created new A records pointing directly to paws-proxy-02
[02:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[02:25:39] !log paws activated TLS termination using Let's Encrypt on paws-proxy-02
[02:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[02:26:55] ^^ temporary changes, I just wanted the A record not to be a web proxy so I had to do this. It'll help later when we change the ingress setup (and it simplifies the current setup)
[02:27:17] webproxy addresses can't point to multiple IPs, which is required for HA
[02:28:12] gtirloni: how will tls be handled
[02:28:14] ?
[02:28:22] chicocvenancio: it's termined at paws-proxy-02 by nginx/LE
[02:28:27] *terminated
[02:28:34] Cool, no problem
[02:29:01] certbot has a cronjob that checks for the LE cert renewal every 12 hours, so we should be fine.. in any case, it's all temporary :)
[02:29:06] I do need it pointing to paws-deploy-hook as well
[02:29:19] ah ok, let me add that
[02:30:39] paws-deploy-hook.tools.wmflabs.org
[02:31:26] I can change that in the script if you need to drop the .tools or something
[02:31:42] oh, I didn't change that one
[02:31:57] I only removed the webproxies in the `paws` project, so that one is fine.. it's untouched
[02:32:08] It's broken now
[02:32:40] weird, let me check if there's anything special about that
[02:32:46] I needed to add a web proxy or change the dns record and didn't bother to create the task
[02:33:10] It's probably broken since we moved paws to the new region
[02:34:01] I only noticed it later and since it's only used for code deploys I just manually went in and git cloned the single time it was needed after that
[02:34:22] ah, interesting. it's an A record in the tools.wmflabs.org zone that pointed to a webproxy in the `paws` project
[02:34:32] But if you want me to create a task and do it later it's not a problem
[02:35:04] Huh, yeah, this setup predates me maintaining paws, so I don't know why it's like that
[02:36:26] hmm actually, it points to the k8s master.. so yeah, I didn't change anything related to that. pretty weird. I'll point it back to the master and see what's wrong
[02:36:43] chicocvenancio: no worries, should be easy to fix
[02:38:24] It points to the master? I guess we can create an ingress for it. I thought it was a nodeport as well
[02:38:52] * chicocvenancio is sleepy, not the best time to think about this
[02:39:56] chicocvenancio: I can create a task and check it tomorrow, sorry about the late changes... I was in the flow :) yeah, something that should be listening on the master is not
[02:41:02] Thanks, a task seems better now I can't reason about it
[02:44:29] T218380
[02:44:30] T218380: paws-deploy-hook is unreachable - https://phabricator.wikimedia.org/T218380
[02:44:31] g'night :)
[10:23:32] !log toolsbeta create VM `arturo-bastion-sssd-test` (T218126)
[10:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[10:23:35] T218126: LDAP: try how sssd works with our servers - https://phabricator.wikimedia.org/T218126
[11:07:40] anyone here who can restart wikibugs?
[11:08:43] paladox: does this link work for you? https://tools.wmflabs.org/?tool=wikibugs
[11:12:04] mutante: I did that once, long time ago
[11:12:10] I might be able to
[11:12:33] er
[11:12:38] $ ssh login.tools.wmflabs.org
[11:12:38] Connection closed by 185.15.56.48 port 22
[11:12:40] Apparently not.
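The temporary proxy setup described above (nginx terminating TLS with a Let's Encrypt certificate, certbot renewing it from cron every 12 hours) would look roughly like the following sketch. The domain, sudo usage and exact certbot invocation on paws-proxy-02 are assumptions, not taken from the log:

  # minimal sketch, assuming nginx and certbot are already installed on the proxy host
  sudo certbot certonly --nginx -d paws.wmflabs.org    # hypothetical domain; obtain the LE certificate
  sudo nginx -t && sudo systemctl reload nginx         # pick the certificate up in the TLS-terminating vhost
  # certbot's renewal check, as a crontab entry run twice a day:
  # 17 */12 * * * root certbot -q renew --post-hook "systemctl reload nginx"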
[11:14:03] and of course, not the other one: denied: host "tools-sgebastion-08.tools.eqiad.wmflabs" is not an admin host
[11:14:14] arturo, I have to go but ^ sounds bad
[11:14:31] ack
[11:15:24] works for me
[11:15:58] I can ssh to both
[11:19:00] arturo, I just tried again and it works now
[11:19:06] arturo, must've been the temp LDAP thing?
[11:19:14] dunno if that admin host thing is okay or not
[11:19:51] actually -07 has that issue too
[11:20:11] maybe this is caught up with the trusty migration
[11:21:58] !log tools.wikibugs run `tools.wikibugs@tools-sgebastion-07:~/wikibugs2$ python3 manage.py start_jobs` as described in https://www.mediawiki.org/wiki/Wikibugswith after deleting the grid jobs
[11:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[11:22:18] o/ wikibugs
[11:22:39] oh you did it at the same time
[11:22:43] ok
[11:22:45] thanks:)
[11:22:54] was wondering why qstat had started saying nothing
[11:23:24] sorry Krenair, I should have coordinated with you. I just went ahead and did it, bc I thought you were unable to ssh to the bastion
[11:23:35] np
[12:02:45] mutante: yes
[12:03:37] paladox: thanks, i had problems loading it earlier but now it wfm
[12:03:56] It seems http://tools.wmflabs.org/pageviews gives a "502 Bad Gateway"...
[12:04:05] Could someone check that as well?
[12:06:50] !help ^
[12:06:50] mutante: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[12:07:11] we are in an interview right now
[12:11:33] yep, please give us 30-45min
[12:57:46] WHOIS throwing bad gateways as well (https://tools.wmflabs.org/whois/gateway.py)
[12:59:15] * gtirloni checking both
[13:01:27] thanks gtirloni
[13:02:41] seems to be up now gtirloni
[13:03:19] Eugene233 Kb03: I have restarted both of your tools. They were reporting as 'running' but their grid job wasn't present. We had some issues with NFS that have caused the grid master to miscommunicate job statuses.. this could be related. All I did was run `webservice restart` on them. Sorry about this.
[13:04:03] gtirloni: Thanks very much.
[13:04:07] ^
[13:04:41] :)
[13:52:53] bd808: The process for AnomieBOT job 387055 seems to be stuck, similar to the ones a few days back that you killed for me. I didn't try killing this one myself in case someone wants to take a closer look. At a quick poke it looks like it may be some NFS thing related to /mnt/nfs/labstore-secondary-tools-project/anomiebot/botlogs/bot-5.log, since trying to tail that file on tools-sgeexec-0904 hangs (but the tail works on tools-sgebastion-07).
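gtirloni's fix for the 502s above amounts to restarting each tool's web service as the tool user from a Toolforge bastion. A rough sketch, with `pageviews` (one of the affected tools in the log) standing in for whichever tool is broken:

  become pageviews      # switch to the tool account
  webservice status     # may still claim 'running' even if the grid job is gone
  qstat                 # check whether the lighttpd web service job actually exists on the grid
  webservice restart    # resubmit the web service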
[14:31:37] !log tools-sgebastion-07 - generating locales for user request in T130532
[14:31:38] mutante: Unknown project "tools-sgebastion-07"
[14:31:38] T130532: Offer Korean Locales "ko_KR.euckr" and "ko_KR.utf8" on Tool Labs - https://phabricator.wikimedia.org/T130532
[14:32:22] !log tools tools-sgebastion-07 - generating locales for user request in T130532
[14:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:40:35] !log tools tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - T130532
[14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:40:43] T130532: Offer Korean Locales "ko_KR.euckr" and "ko_KR.utf8" on Tool Labs - https://phabricator.wikimedia.org/T130532
[15:14:18] !log tools.anomiebot Force killed job 387055; victim of NFS hiccups on exec node
[15:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[15:14:30] anomie: ^
[15:14:34] bd808: Thanks
[15:58:46] !log tools rebooted tools-clushmaster-02
[15:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:00:38] !log admin increased nscd cache size (T217280)
[16:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:00:41] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280
[17:47:45] Hello! My webservice (started with --backend=kubernetes) was down from 10:20 (UTC) until now (17:43 UTC). Is there a ready script which can be used as a kind of watchdog to restart the webservice automagically?
[17:48:23] Wurgl: why did it die?
[17:48:33] Status 502
[17:48:47] 502 normally means an app error
[17:49:16] (or some infra died under the app)
[17:49:44] errorlog shows nothing
[17:49:57] Just a few warnings
[17:50:03] 2019-03-15 08:55:24: (mod_fastcgi.c.2702) FastCGI-stderr: PHP 3. wd_getpersons_wikipedia() /data/project/persondata/public_html/inc/person/wikidata.inc.php:74
[17:50:08] 2019-03-15 17:42:39: (log.c.164) server started
[17:50:13] nothing between
[17:50:40] kubectl get deployment showed one line with data:
[17:50:45] persondata 1 1 1 1 1d
[17:51:34] Wurgl: In theory you could write a health check for your app and edit the deployment to restart the pod if/when it fails. Not quite simple to do it though (or supported by the cloud services team as far as I know)
[17:51:51] !log git Prune old logs in /var/log on puppet-paladox
[17:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[17:52:12] Wurgl: in general k8s is pretty self-healing, it must be hitting the error in a consistent manner to keep the 502
[17:52:36] those logs should automatically be removed after a period.
[17:53:15] paladox: ls -a was showing at least 104 total things in that dir, most of them old logs
[17:53:27] nothing in the error.log, nothing in my mailbox. So we can just guess or wait for the god of root who might see something interesting
[17:53:31] yeh
[17:54:07] Wurgl: well, the application has to properly log errors for them to show up :)
[17:58:55] I will try a 10liner with php-curl and check the status when retrieving an image (or a small text file) … maybe every 30 minutes, and run it with jlocal
[18:01:38] Wurgl: which tool?
[18:02:42] persondata
[18:03:01] Kubernetes is self-healing for services that fully crash, but it does not currently have any content related health checks.
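Wurgl's watchdog idea above (a small cron-driven check, run with jlocal, that restarts the webservice when the tool stops answering) could look roughly like this shell sketch. Wurgl mentions a php-curl ten-liner; the probe URL, file names and the 30-minute schedule here are assumptions:

  #!/bin/bash
  # watchdog.sh - restart the tool's webservice if it stops returning HTTP 200
  URL="https://tools.wmflabs.org/persondata/"    # hypothetical probe URL for the tool
  CODE=$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$URL")
  if [ "$CODE" != "200" ]; then
      echo "$(date -u) got HTTP $CODE, restarting webservice" >> "$HOME/watchdog.log"
      webservice --backend=kubernetes restart
  fi
  # crontab entry for the tool account, every 30 minutes, run locally via jlocal:
  # */30 * * * * jlocal /data/project/persondata/watchdog.sh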
[18:05:05] Wurgl: `kubectl get pods` shows a pod (your webservice) stuck in Terminating state. This is probably the glitch that caused the ingress proxy to lose track of your service.
[18:07:49] persondata-676273626-1bveb 1/1 Running 0 23h <-- this was *BEFORE* webservice stop and then webservice --backend=kubernetes start
[18:08:01] Just this single line
[18:09:12] Wurgl: ok. that's the pod that stuck in terminating mode now (even after I tried to force delete it)
[18:10:31] Wurgl: your $HOME/error.log there is full of notices and a few fatals
[18:10:47] Yes I know
[18:11:03] I am trying to reduce them
[18:11:45] Very old code, not written by me
[18:13:11] !log tools.persondata Force deleted pod stuck in terminating state with `kubectl --namespace=persondata delete po/persondata-676273626-1bveb --grace-period=0`
[18:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.persondata/SAL
[19:02:18] Hey. I keep getting groups: cannot find name for group ID 50062. Still not fixed?
[19:06:59] hauskatze: no, still not fixed. We are working on it... T217280
[19:07:00] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280
[19:07:52] the warning at ssh login is pretty harmless. the real interruptions are when it keeps your ssh session from starting at all or causes a grid job launch to fail
[19:09:06] I could login okay so no troubles then, thanks.
[20:08:52] why are we forcing old versions of mysql client onto instances, causing dependency conflicts and uninstalling our entire database?
[20:17:41] Izhidez: can you provide a bit more context?
[20:18:36] This puppet change ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/495757/4/modules/role/manifests/labs/instance.pp#8 ) forced the installation of the mysql-client-5.5 package, which forced the removal of the mariadb-server package
[20:19:29] That is better context! Thank you.
[20:19:47] I would also say that was unintended and a bug we need to figure out how to correct
[20:20:19] I've temporarily disabled puppet on the affected instance, reinstalled mariadb-server, and I'm now rebuilding that instance with the puppet role::mariadb; hopefully that will let puppet see the conflict.
[20:20:30] Krenair: ^ unintended fallout from adding ::profile::openstack::main::clientpackages to labs::instance
[20:21:26] Oresrian: have you made a phabricator task for this problem yet?
[20:21:44] no, I've been focused on getting stuff back up
[20:22:58] *nod* I have added notes on T218009 and reopened it
[20:22:59] T218009: Puppet failure emails sent to non-admin members of tools project causing user confusion - https://phabricator.wikimedia.org/T218009
[20:25:04] Oresrian: what operating system (Jessie? Stretch?) is the instance that you are fixing running?
[20:25:20] yeah, accounts-db3 is the instance
[20:25:26] I *think* it's jessie
[20:25:50] sorry had to walk away
[20:25:54] yup: Description: Debian GNU/Linux 8.6 (jessie)
[20:28:01] bd808: I've just added some log snippets there as well
[20:28:14] thanks!
[20:28:44] deb dependencies can be a real tangle, especially across multiple distro releases
[20:29:28] yeah, though I thought there were metapackages that things like this could rely on which don't force specific versions
[20:30:32] there usually are but not all packages reference meta packages.
I haven't looked yet, but the packages we are using on Jessie may even be backports from some other distro entirely which can make things even worse
[20:32:31] Izhidez: I can see the frustration in your first comment here, but I would nicely ask that you try to remember that you are talking to people and that starting with an accusatory tone will put you at a disadvantage.
[20:33:10] Fair enough, apologies.
[20:34:14] thanks :)
[20:41:51] !log utrs modify crontab to include correct data removal file
[20:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Utrs/SAL
[21:08:39] !log tools cleared error state on several queues T217280
[21:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:08:43] T217280: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280
[21:10:14] Izhidez, if that is the same problem as Oresrian then this is unintentional and I apologise
[21:10:16] It is not clear why openstack client packages would do anything with mysql packages
[21:10:58] I think we should look into why openstack::clientpackages::mitaka::trusty included mysql packages
[21:15:51] hold on a sec
[21:16:06] jessie, not trusty
[21:17:03] hm https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475990/2/modules/openstack/manifests/clientpackages/mitaka/jessie.pp
[21:20:36] Krenair: hmmm... I wonder why that landed there?
[21:20:48] am digging
[21:20:55] blames go through https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475326/ :/
[21:22:00] hm was it introduced there?
[21:23:08] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet ((fa879d575f...))$ git show e87f8d269dd | grep mysql
[21:23:08] + 'virtual-mysql-client',
[21:23:08] + # Why? because we switched to 'virtual-mysql-client', which is a more
[21:23:08] + 'mysql-client-5.5',
[21:23:09] + 'mysql-common',
[21:23:11] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet ((fa879d575f...))$
[21:23:13] odd
[21:26:22] first patch set of that commit added virtual-mysql-client
[21:26:30] unclear why
[21:26:54] I could understand this in the serverpackages, not clientpackages
[21:29:43] bd808, I guess we wait for art.uro and ask him?
[21:32:34] Looking at https://tools.wmflabs.org/admin/oge/status and wondering how many tools will be dead soon …
[21:35:52] Wurgl: 484 -- https://tools.wmflabs.org/trusty-tools/
[21:37:00] 482 … I switched over 5 days ago :-)
[21:37:05] :)
[21:38:22] Wurgl: if you move a job to Kubernetes, here's a hack to get you to drop off of the board: on the Stretch job grid run `jsub -N <jobname> /bin/true`.
[21:39:05] that will make an accounting record for the job name in the Stretch grid and the reporting tool will see that and remove you from the things still needing migration list
[21:39:33] I think, I will survive those 3 mails
[21:40:58] I used that switch to rewrite a lot of scripts, so I have fewer jobs to run, they run faster, they use less database space (7GB vs. 17GB) and they run with PHP 7 :-)
[21:41:03] So all is fine
[21:42:38] nice :)
[21:43:17] bd808, perhaps we could have the reporting tool check k8s too?
[21:44:23] Krenair: I totally would if that was a feasible process today. We do not currently have a Kubernetes equivalent of novaobserver rights. Only root on the K8s master can see all the things.
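For the mariadb-server removal that Oresrian and Krenair are tracking down earlier in this exchange, a few standard apt checks usually show which run removed the package and what a pinned client package would drag out with it. A rough sketch on the affected Jessie instance (the package names come from the log, everything else is generic):

  # see which apt run removed mariadb-server and what was installed alongside it
  grep -B2 -A4 'mariadb-server' /var/log/apt/history.log
  # check which repo and version each side of the conflict comes from
  apt-cache policy mysql-client-5.5 mariadb-server
  # dry-run the install puppet forced, to see the removals it implies (-s = simulate)
  apt-get -s install mysql-client-5.5
  # list the real packages that provide the virtual-mysql-client metapackage
  apt-cache showpkg virtual-mysql-client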
[21:44:37] darn
[21:45:01] there is a ticket for the needed role and I think it is something we will be able to fix next quarter
[21:45:26] but I don't want to hack it into the current Kubernetes authz setup
[21:45:54] one thing that was being considered before novaobserver became a thing was a regular public dump of data from the control host
[21:46:09] totally a hack around the problem though
[21:46:26] *nod*
[21:46:42] I guess I could do that... but at this point I think I'll just leave it be
[21:47:12] root cron to dump data to a file that ends up somewhere folks could read it is possible though for sure
[21:48:36] yeah plus I was one of the people who wanted an observer role done properly IIRC :D
[21:49:05] can't really turn around now and go for the "meh it'll do"
[22:07:30] anyway the openstack client packages mysql thing... I guess we should advise people having problems to revert the change locally and disable puppet until we can ask art.uro about the reasoning behind that?
[22:07:56] it was a while ago so if he can't remember maybe we should test it without the mysql packages mixed in there
[22:11:18] so I'm finally getting around to moving my tools off Trusty.
[22:11:56] the instructions mention trusty-login.tools.wmflabs.org, but that does not resolve.
[22:12:08] is it instead trusty.tools.wmflabs.org?
[22:13:52] abartov, login-trusty.tools.wmflabs.org I think
[22:14:06] based on a quick glance at https://tools.wmflabs.org/openstack-browser/project/tools
[22:14:40] Krenair: ah, of course! :)
[22:14:45] where are these instructions?
[22:14:48] would like to fix them
[22:15:32] I tested and it seems this hostname gets you to the right instance, tools-bastion-03
[22:15:47] that's the old trusty one
[22:15:52] abartov, ^
[22:16:40] Krenair: yup. I mentally transposed the name, the instructions are fine.
[22:16:56] next: what do I need to know about "Son of Grid Engine"?
[22:17:08] cool
[22:17:15] Would my old cron line invoking jsub still work? (No Web component for this tool. It's a daily e-mailer.)
[22:18:12] AIUI it's a successor to SGE/OGE?
[22:18:55] abartov: yes. the cli commands are identical. The only real difference you need to look out for is newer versions of the language runtimes (e.g. php 7.2) causing problems with your scripts
[22:19:16] bd808: excellent, thanks. All my tools are Ruby, anyway. :-p
[22:19:40] sounds like the grid itself works largely the same, it's just the fact that your jobs get run on debian stretch with newer software that you need to worry about then?
[22:19:50] heh. I think the ruby version jumped quite a bit too, but hopefully not in any really breaking way
[22:20:58] Krenair: yeah, "Son of Grid Engine" is a fork of the Sun Grid Engine code and has mostly focused on bug fixes and a few stability enhancements.
[22:21:20] typical grid operations should be the same for everyone
[23:02:48] It looks like KrinkleBot (Commons) is stuck again. Rather than doing the typical turn off and on again cycle, perhaps there's something I can do to figure out what happened?
[23:02:52] qstat
[23:02:53] 853973 0.25432 fileprotec tools.krinkl r 03/15/2019 10:20:21 task@tools-sgeexec-0915.tools. 1
[23:03:03] It's been not doing anything for about 20 hours.
[23:03:39] .out has no new lines appended; .err shows crontab is still trying to restart it, but refusing to as it is already running.
[23:11:10] NFS issues?
[23:11:20] Might need someone to force kill it for you
[23:12:14] started today though?
[23:12:27] has there been such issues today?
[23:13:00] I've no idea
[23:25:10] Krenair: It starts every 15 min normally
[23:25:38] I can probably kill it. I just want to know why/how it got stuck, and if I can do something to preven tit.
[23:25:41] prevent it*
[23:25:46] yeah I don't know, sorry
[23:25:50] like, why does whatever is failing not make the process exit.
[23:27:50] z.huyifei1999_ loves to strace these sorts of things
[23:33:32] qacct -j 853973
[23:33:33] error: job id 853973 not found
[23:33:39] Hm.. maybe it doesn't work the way I remember.
[23:34:14] Krinkle: I'm on tools-sgeexec-0915 and the job state looks weird
[23:34:22] * Krinkle likes weird
[23:34:26] OK. Show me :)
[23:34:40] tools.krinklebot 24308 24313 Ds /usr/lib/gridengine/sge_shepherd -bg
[23:34:45] with no children
[23:35:17] Ds is a session leader with hung io
[23:35:40] I think you got bitten by an NFS hiccup at job start?
[23:36:31] output is from `ps axwo user:20,ppid,pid,stat,cmd | grep krinklebot`
[23:38:00] Krinkle: I killed the hung sge_shepherd
[23:38:20] so the job should have dropped out of qstat output
[23:38:53] Okay :)
[23:39:12] No juicy story this time. Feeling dissapointed.
[23:39:31] disappointed
[23:40:03] well.. the juicy story is that the exec node your job landed on (tools-sgeexec-0915) is in a crap state
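bd808's diagnosis of the stuck KrinkleBot job above boils down to finding the job's shepherd process on the exec node and checking its state. A rough sketch of the same steps; the job id, host and pid come from the log, while the qdel cleanup step is an assumption about the usual first attempt:

  qstat -j 853973                 # from a bastion, as the tool user: shows the exec node the job is on
  # on the exec node (admin access), find the job's shepherd process and its state:
  ps axwo user:20,ppid,pid,stat,cmd | grep krinklebot
  # STAT 'D' = uninterruptible sleep, typically hung NFS I/O; 's' = session leader
  qdel 853973                     # normal cleanup attempt as the tool user (admins can add -f to force)
  kill -9 24308                   # admin fallback: kill the hung sge_shepherd directly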