[00:08:20] Labs, Beta-Cluster-Infrastructure, Horizon: Can't remove role::logstash from deployment-logstash2 because the class has been removed from ops/puppet.git - https://phabricator.wikimedia.org/T152472#2849441 (bd808)
[00:27:13] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:36:50] !log tools Updated toollabs-webservice to 0.31 on rest of cluster (T147350)
[00:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:36:55] T147350: Change Python hashbang to `#! /usr/bin/env python -E -s` for user-facing tools - https://phabricator.wikimedia.org/T147350
[00:43:38] Labs, Beta-Cluster-Infrastructure, Horizon: Can't remove role::logstash from deployment-logstash2 because the class has been removed from ops/puppet.git - https://phabricator.wikimedia.org/T152472#2849441 (scfc) I think the problem is caused by the old Puppet configuration in LDAP: ``` scfc@tools-ba...
[00:48:25] Labs, Beta-Cluster-Infrastructure, Horizon: Can't remove role::logstash from deployment-logstash2 because the class has been removed from ops/puppet.git - https://phabricator.wikimedia.org/T152472#2849515 (bd808) I //think// that the Puppet side has been changed to ignore the old LDAP data now. My h...
[00:49:26] Labs, Tool-Labs, Patch-For-Review, User-bd808: Change Python hashbang to `#! /usr/bin/env python -E -s` for user-facing tools - https://phabricator.wikimedia.org/T147350#2849518 (bd808) Open>Resolved Packages have been built and deployed.
[02:28:25] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 1.65 ms
[02:30:06] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[02:32:48] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[02:34:12] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[03:04:11] Labs, Tool-Labs: Warnings/errors in /var/lib/gridengine/spool/qmaster/messages - https://phabricator.wikimedia.org/T152477#2849650 (scfc)
[03:10:31] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 167.94 ms
[03:43:46] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[04:49:25] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2849738 (NDKilla) @Dzahn you might be able to check now. I'm hoping there's no underlying issues with our database migration (besides one I've noticed). All of the Miraheze wikis should be back to no...
[05:09:03] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Wugapodes was created, changed by Wugapodes link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Wugapodes edit summary: Created page with "{{Tools Access Request |Justification=Running automated tasks per BRFA on enwiki |Completed=false |User Name=Wugapodes }}"
[06:50:30] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:55:43] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:57:01] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[07:03:32] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[07:20:30] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:32:00] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:35:42] RECOVERY - Puppet run on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:38:30] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:53:00] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[09:33:04] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:18:57] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2850495 (Dzahn) allthetropes fails because the API URL redirects, "bus" fails because of an error 503 "backend fetch failed". ---- A(1/1) - allthetropes.miraheze.org - calling API: https://allthetr...
[14:20:22] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2850502 (Dzahn) https://bus.miraheze.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&maxlag=5 ``` "servedby": "mw2", "error": { "code": "readapidenied", "info":...
[16:08:51] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2850859 (Dzahn) so re: allthetropes. yes, the redirect is the problem, using https://allthetropes.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=php&maxlag=5 would work, BUT so far...
[16:09:22] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2850861 (Dzahn) re: bus.miraheze.org - this just tells me i'm not allowed to get info from API, per configuration
[17:30:37] Labs, Labs-Infrastructure, Tool-Labs, DBA, Wikimedia-Developer-Summit (2017): Labsdbs for WMF tools and contributors: get more data, faster - https://phabricator.wikimedia.org/T149624#2758290 (bd808) Soliciting SQL queries to fix/optimize on labs-l: https://lists.wikimedia.org/pipermail/labs-...
[17:37:21] Labs, Labs-Infrastructure, Operations, Wikimedia-Incident: labservices1001 down - https://phabricator.wikimedia.org/T152340#2845101 (greg) This is now back up, yes? :) Incident report filed at https://wikitech.wikimedia.org/wiki/Incident_documentation/20161204-labservices1001 Closable? Follow-ups...
[17:53:26] Labs, Discovery, Patch-For-Review: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#2851152 (scfc) Resolved>Open Reopening this because [[https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/manifests/site.pp;5e38ec5ba7c75...
[18:04:26] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2851202 (NDKilla) @Dzahn Sites (at their configured URIs) have been accessible sporadically. As soon as @Southparkfan came back online, he reinstalled mysql on our old db server and started to migra...
[18:11:15] Labs-project-Wikistats: allthetropes is not updating on wikistats - https://phabricator.wikimedia.org/T146712#2851223 (Dzahn) @NDKilla sounds good. thanks. Currently the way to purge deleted wikis would be to ask me to drop it from DB, can you provide a list of wikis to be deleted?
[18:29:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[18:31:55] Labs, Beta-Cluster-Infrastructure, Horizon, Patch-For-Review: Can't remove role::logstash from deployment-logstash2 because the class has been removed from ops/puppet.git - https://phabricator.wikimedia.org/T152472#2851317 (Andrew) a:Andrew
[18:56:16] Labs, Labs-Infrastructure, Operations, Wikimedia-Incident: labservices1001 down - https://phabricator.wikimedia.org/T152340#2851417 (fgiunchedi) @greg yeah now back up! this task is one of the followups though, I'll clarify it a bit
[18:57:09] Labs, Labs-Infrastructure, Operations, Wikimedia-Incident: labservices1001 down, suspected overheating - https://phabricator.wikimedia.org/T152340#2851418 (fgiunchedi)
[19:04:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:14:39] Labs, Labs-Infrastructure, Horizon: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2724268 (scfc) I believe @Andrew's work at https://gerrit.wikimedia.org/r/#/c/325595 might cover this issue as well so T152472 may be related/parent/sub of this...
[19:29:49] gilles mentioned performance team wanted to try grafana 4 on labmon, objections?
[19:30:01] Labs: ldap userkeys broken on labtest - https://phabricator.wikimedia.org/T152518#2851586 (Andrew)
[19:31:51] upgrade to grafana 4 on labmon godog?
[19:32:08] +1 I love it
[19:32:32] there are a lot of tools devs that use it, or generally so probably best to let ppl know
[19:32:34] chasemp: correct, that'd be T152473
[19:32:35] T152473: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473
[19:32:39] everytime we break it there is static
[19:34:47] hehe fair enough, what would be the best way to do that?
[19:34:54] email to labs-announce list
[19:45:32] Labs: Simple logrotate service for users of Tools as stopgap before central logging - https://phabricator.wikimedia.org/T152235#2851723 (chasemp) @scfc thanks, I had not seen that but yes very similar. I'm avoiding `logrotate` for this reason (though their is a potential solution there w/ setfacl but I'd rat...
[19:46:10] chasemp: thanks I'll post there too
[19:59:42] Labs: Simple logrotate service for users of Tools as stopgap before central logging - https://phabricator.wikimedia.org/T152235#2842545 (Betacommand) I know jsub already has an option to log to a specific directory (Ive been using this for a while to avoid sge logs flooding my $HOME perhaps setting a global...
[20:00:35] Labs: Simple logrotate service for users of Tools as stopgap before central logging - https://phabricator.wikimedia.org/T152235#2851879 (chasemp) Of note some tools are already using the internal logs dir scheme > ls -ald /srv/tools/shared/tools/project/**/logs | wc > 76 684 7228
[20:10:00] I have one user who is unable to log in to an instance: Permission denied (publickey). I believed we have double checked the username, ssh config and the key. Is there anything else that could be the cause?
[20:12:10] Nikerabbit: what instance and what user?
[20:12:57] chasemp: urasiili to mora.wmwcourse.eqiad.wmflabs
[20:14:33] Nikerabbit: it says the key they are presenting is not the key expected
[20:14:34] AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1
[20:14:42] pming you the public side of the pair I see
[20:38:52] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Brynda1231 was created, changed by Brynda1231 link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Brynda1231 edit summary: Created page with "{{Tools Access Request |Justification=For good. |Completed=false |User Name=Brynda1231 }}"
[21:08:26] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms
[21:23:25] I've got a jstart job that seems to always quit after 30-50 minutes and does *not* auto restart
[21:23:28] not output in .err or .out
[21:23:33] It's no longer in qstat
[21:23:39] and yet it ain't coming back
[21:23:45] What can I do to figure out what's wrong?
[21:24:07] should I call jstart from cron instead? I figured jstart's jsub -continuous would essentially make that obsolete
[21:25:07] hi Krinkle
[21:25:17] what tool is this? what's the command you're executing?
[21:25:19] YuviPanda: Heya
[21:25:25] YuviPanda: perflogbot
[21:25:33] new tool?
[21:25:37] see start.sh and job.sh
[21:25:50] ok
[21:25:58] I run start.sh, then 30min later it stops (I'm not ruling out a bug in my code, just wanna know how to figure it out)
[21:27:11] Krinkle: hmm, did you try without the -continous to jstart?
that's my first vague theory, since jstart doesn't need -continuous
[21:27:18] (jsub does, jstart is basically jsub + that)
[21:27:38] YuviPanda: I only added that earlier today as an atempt to make it work :)
[21:27:41] Krinkle: I see there's output from your shell script but not from your code
[21:27:42] I'm fine with removing it again, though
[21:27:52] Yeah, it's not outputting by default
[21:28:05] right
[21:28:14] it's just an irc bot that monitors something and reports to #wikimedia-perf-bots when something's up
[21:28:23] Krinkle: we're also running an ancient node version
[21:28:31] Krinkle: mind if I try to run it with k8s?
[21:28:37] YuviPanda: go ahead :)
[21:28:41] Krinkle: ok
[21:28:47] I also added -N perflogbot in case that makes a difference or helps it remember what is what
[21:28:50] but made no difference
[21:28:55] previously it defaulted to "job" .err/out
[21:29:26] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[21:30:21] Krinkle: ok, am going to log my steps here
[21:30:53] Krinkle: damn, npm install fails because it requires some C++ building
[21:31:14] > #include
[21:31:39] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[21:31:54] libicu-dev needed
[21:32:01] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[21:33:17] Krinkle: lololol it actually succeeded
[21:33:20] that wasn't a fatal error
[21:33:22] lol
[21:33:22] damn npm
[21:33:27] it should be an optional dep
[21:33:32] not sure why it tries that hard though
[21:33:35] or maybe it is
[21:33:36] it's used by the npm 'irc' module
[21:33:40] I can't tell from the output
[21:33:45] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[21:34:08] Krinkle: https://phabricator.wikimedia.org/P4579
[21:34:13] Krinkle: does that look ok to you?
[21:35:00] YuviPanda: Yeah, it's fine
[21:35:02] "npm WARN optional dep failed, continuing node-icu-charset-detector@0.2.0"
[21:35:07] It's confusing.
[21:35:48] Labs: Simple logrotate service for users of Tools as stopgap before central logging - https://phabricator.wikimedia.org/T152235#2852405 (chasemp) In an effort to create good defaults without stepping on existing toes at the moment I'm excluding anyone who is currently using a 'logs' directory intra tool home...
[21:37:42] Krinkle: :D ok
[21:47:40] Labs, Labs-Infrastructure, Horizon, Patch-For-Review: Add a textbox to puppet roles config to add arbitrary roles - https://phabricator.wikimedia.org/T148481#2852430 (Andrew) Open>Resolved a:Andrew
[21:48:52] Labs, Tool-Labs: Warnings/errors in /var/lib/gridengine/spool/qmaster/messages - https://phabricator.wikimedia.org/T152477#2852434 (valhallasw) The only thing I can find is {T122638}, but that has no clear solution. A reboot of the master might solve it (but do we dare to do so?). The ghost job on t-w-l-...
[21:52:08] Krinkle: is it running now?
[21:52:19] YuviPanda: yep, it joined.
[21:52:25] Krinkle: \o/
[21:52:42] Krinkle: I guess now we watch to see if it's dying in 30min?
[21:52:48] Right.
[21:52:51] Or maybe 1 hour
[21:52:58] Krinkle: you can restart it by killing the pod now
[21:52:58] if you need
[21:53:08] kstat?
[21:53:09] Krinkle: there's a perflogbot.yaml
[21:53:15] kubectl get pod
[21:53:22] kubectl get pods
[21:53:23] right
[21:53:25] thx
[21:53:46] Krinkle: you can add 'source ~/.kube/completion' to your bashrc for the tool
[21:53:47] YuviPanda: How do I long-term stop it, and then later start it?
[21:53:54] will provide autocomplete for everything, including pod names
[21:54:07] Krinkle: kubectl delete deployment perflogbot
[21:54:14] Krinkle: to start it again, kubectl apply -f perflogbot.yaml
[21:54:40] OK. I'll see how this goes and then update scripts/docs accordingly.
[21:54:42] Thanks :)
[21:55:03] YuviPanda: So this uses the k8s built-in monitoring and autorestart?
[21:55:12] based on replicas:1
[21:55:13] Krinkle: yes
[21:55:21] if the process dies, k8s restarts it
[21:55:45] in general, I trust this far more than gridengine
[21:55:49] Labs, Tool-Labs: tools.suggestbot web requests fail after a period of time - https://phabricator.wikimedia.org/T133090#2852456 (scfc) Open>Resolved `/data/project/suggestbot/service.log` shows no restarts after: ``` 2016-12-05T19:32:01.791443 No running webservice job found, attempting to start...
[21:56:19] chasemp: bd808 so I think expecting users to run straight off kubectl only is not viable, because we need additional settings to mount the NFS homedirs at least.
[21:56:24] and that gets ugly
[21:56:33] yeah.
[21:56:50] bd808: needs at least a simple wrapper script, but that's the same as the jsub path with a different name (so maybe not as bad)
[21:57:00] but similar pathway
[21:57:08] Krinkle: the control script I wrote for stashbot on k8s might give you some management ideas -- https://phabricator.wikimedia.org/diffusion/LTST/browse/master/bin/stashbot.sh
[21:57:24] I feel ok having Krinkle deal with kubectl because he has dealt with it in the past, but idk if that'll work for random users
[21:57:28] the big part was writing the yaml file
[21:57:51] *nod* I had to read a lot of stuff to write -- https://phabricator.wikimedia.org/diffusion/LTST/browse/master/etc/deployment.yaml
[22:05:29] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Wugapodes was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=1092412 edit summary:
[22:12:04] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:22:52] Labs, Beta-Cluster-Infrastructure, Horizon, Patch-For-Review: Can't remove role::logstash from deployment-logstash2 because the class has been removed from
ops/puppet.git - https://phabricator.wikimedia.org/T152472#2852578 (bd808) Open>Resolved ``` deployment-logstash2.deployment-prep:~ b...
[23:00:26] !log upgrade grafana on labmon1001 - T152473
[23:00:26] Unknown project "upgrade"
[23:00:27] T152473: Upgrade labmon1001 Grafana to 4.0.1 - https://phabricator.wikimedia.org/T152473
[23:00:40] wah wah, forgot !log is different here
[23:00:59] godog: yeah, just do it in -operations :D no way to do it without a project here
[23:01:33] YuviPanda: aye, did that in -operations instead
[23:02:36] https://zippy.gfycat.com/AcceptableSelfishElephantseal.gif relevant?