[00:01:20] Platonides: I'm looking for a folder in the file system that is mounted via nfs [00:03:28] something like /public/dumps/public (but that folder does not seem to exist anymore) [00:58:45] (03PS2) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [02:31:08] RECOVERY - Puppet run on tools-web-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [05:26:04] 06Labs, 10Labs-Sprint-100, 10Tool-Labs: Deploy new unified webservice code - https://phabricator.wikimedia.org/T98440#2241953 (10yuvipanda) Testing the different command invocations people use for it... ``` +-----------------------------------------------------------------+-------+ | event_commandline... [05:31:31] physikerwelt: hey! It is disabled by default on new projects, you can file a ticket in phabricator to have it enabled for you [05:32:46] 06Labs, 10Labs-Sprint-100, 10Tool-Labs: Deploy new unified webservice code - https://phabricator.wikimedia.org/T98440#2241956 (10yuvipanda) Without tools-services (aka webservicemonitor): ``` mysql:research@analytics-store.eqiad.wmnet [log]> select event_commandline, COUNT(*) as count FROM CommandInvocation... [05:33:42] 06Labs, 10Labs-Sprint-100, 10Tool-Labs: Deploy new unified webservice code - https://phabricator.wikimedia.org/T98440#2241958 (10yuvipanda) Some of it is weird - for example, tools.jembot was responsible for 19487 of the webservice restarts in non-services hosts... [06:11:59] 06Labs, 10Tool-Labs, 06Commons, 10pywikibot-catimages, and 3 others: Pywikibot : Fix Commons scripts broken by toolserver.org to labs migration - https://phabricator.wikimedia.org/T78462#2242012 (10jayvdb) [06:46:23] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:47:57] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [06:48:23] ^ is fine, will go away soon etc [06:49:37] 06Labs, 10Labs-Sprint-100, 10Tool-Labs, 13Patch-For-Review: Deploy new unified webservice code - https://phabricator.wikimedia.org/T98440#2242077 (10yuvipanda) Just sent out an announcement email saying I'm doing this today \o/ [06:59:04] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [06:59:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [07:01:00] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [08:20:29] (03CR) 10Lokal Profil: [C: 04-1] "Everything looks fine with the one exception being my in-line comment. Not caused by this patch but since you're already changing that line" (031 comment) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/285174 (owner: 10Jean-Frédéric) [08:26:53] 06Labs, 10Wikimedia-Site-requests, 10wikitech.wikimedia.org, 13Patch-For-Review: Enable math extension on wikitech - https://phabricator.wikimedia.org/T126338#2242189 (10fgiunchedi) looks like this isn't blocked on operations anymore @Dereckson ? [09:17:35] 06Labs, 10DBA, 13Patch-For-Review: Move labs pdns database off of m5-master - https://phabricator.wikimedia.org/T128737#2242263 (10jcrespo) @Andrew: @ holmium: ``` ps aux | grep mysql mysql 1259 0.0 0.3 492676 50004 ? Ssl Mar17 33:59 /usr/sbin/mysqld ``` What do we do with the existing runnin...
[09:50:00] 06Labs, 10DBA, 13Patch-For-Review: Move labs pdns database off of m5-master - https://phabricator.wikimedia.org/T128737#2242323 (10jcrespo) Holmium and the others will page on stop, so make sure to downtime if you stop a service. [10:06:29] (03PS1) 10Yuvipanda: Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 [10:06:56] (03CR) 10jenkins-bot: [V: 04-1] Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 (owner: 10Yuvipanda) [10:08:01] (03PS2) 10Yuvipanda: Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 [10:08:57] (03CR) 10jenkins-bot: [V: 04-1] Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 (owner: 10Yuvipanda) [10:11:48] (03PS3) 10Yuvipanda: Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 [10:15:12] (03CR) 10Yuvipanda: [C: 032] Remove old old webservice from package [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285616 (owner: 10Yuvipanda) [10:55:41] physikerwelt: /shared/dumps [11:03:29] (03PS3) 10Jean-Frédéric: Add unit tests to test_update_database [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/283108 [11:03:31] (03PS2) 10Jean-Frédéric: Rename processText to processPage and remove processTextfile [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/285174 [11:04:35] (03CR) 10Jean-Frédéric: "Comments addressed." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/285174 (owner: 10Jean-Frédéric) [11:37:05] 06Labs, 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2242535 (10hashar) deployment-cache-text04 had the issue again. So at f... [11:41:38] 06Labs, 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 07Puppet: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2242540 (10hashar) p:05Triage>03Normal https://gerrit.wikimedia.org... [11:45:53] (03CR) 10Lokal Profil: [C: 032] "Thanks!" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/285174 (owner: 10Jean-Frédéric) [11:48:53] (03Merged) 10jenkins-bot: Rename processText to processPage and remove processTextfile [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/285174 (owner: 10Jean-Frédéric) [11:58:57] 06Labs, 10Labs-Sprint-100, 10Tool-Labs, 13Patch-For-Review: Deploy new unified webservice code - https://phabricator.wikimedia.org/T98440#2242583 (10yuvipanda) Ok, so new webservice code is deployed and available in /usr/bin/webservice now! webservicemonitor will use it for version: 2 manifests. Things le... [12:02:22] the generic webservice nodes should post a puppet recovery soon [12:26:25] RECOVERY - Puppet run on tools-webgrid-generic-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [13:43:47] The author of https://tools.wmflabs.org/fr-wikiversity/ and https://tools.wmflabs.org/fr-wikiversity-ns/ (which should hold the same code) said on https://www.mediawiki.org/wiki/Git/New_repositories/Requests (last entry) [13:43:51] that he wants a repo for that code and one of the two tools deleted.
[13:43:55] I created the repo for him, but obviously I cannot delete the tool at https://tools.wmflabs.org/fr-wikiversity/ [13:43:59] Apparently, he has to ask an admin: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Can_I_delete_a_tool.3F [13:44:03] But given that this poor soul already went through two Phab tickets (T132988 and T133297) before he requested on mediawiki, I'd prefer to not tell him "Go ask a Tool Labs admin" :-) [13:44:03] T133297: Gerrit access for fr-wikiversity-ns - https://phabricator.wikimedia.org/T133297 [13:44:03] T132988: fr-wikiversity new tool on wmflabs - https://phabricator.wikimedia.org/T132988 [13:44:17] YuviPanda: Could you please check if https://tools.wmflabs.org/fr-wikiversity/ and https://tools.wmflabs.org/fr-wikiversity-ns/ agree and delete the first one if they agree? [13:48:28] qchris_: I don't understand why the tool should be deleted. Shouldn't it just become a redirect to the new name? [13:49:40] Both tools exist already. [13:49:47] And the author asked to remove one of them. [13:50:28] The way I understood it, they are not yet in use. [13:50:45] And they only have some initial commit in them. [13:51:07] But hey, if a redirect is easier, I guess that should also do the trick for him. [13:51:51] I think we are not good at tool removal in general atm and it's kind of an issue [13:52:30] *\O/* [13:52:33] So I tell him he should just stop using the one name (and leave it deprecated) and instead just use the new name? [13:52:42] (That's a fine solution too) [13:52:53] (And it reminds me of how we 'delete' repos in gerrit) [13:52:59] can he make a task for the removal and then at least it's on the radar? [13:53:13] I have a few of those in the back of my brain for when we make a real process for it [13:53:19] I think there is a delete-tool script in theory, but afaik scfc is the only one who has handled these cases [13:53:22] but otherwise ok w/ me [13:54:13] Ok, I'll file the ticket for the user. Do you have a Phab ticket under which I should create it? [13:54:53] qchris_: I don't actually but #labs and #toollabs low priority if you don't mind [13:55:03] thanks man [13:55:07] I've taken the liberty of changing https://wikitech.wikimedia.org/w/index.php?title=Help%3ATool_Labs&type=revision&diff=465613&oldid=286415 to say 'No, we don't remove tools, but these are things you can do' [13:55:08] good to see you around! [13:56:34] chasemp: Thanks. Will do. [13:57:16] Glad to have had a reason to chime in #wikimedia-labs again :-) [13:57:23] Thanks valhallasw`cloud. [14:00:31] valhallasw`cloud: where is that delete script at, do you know? [14:00:42] chasemp: rmtool in misctools [14:00:50] gotcha thanks [14:02:25] 06Labs, 10Tool-Labs: Tools that should get deleted - https://phabricator.wikimedia.org/T133777#2242852 (10QChris) [14:02:38] 06Labs, 10Tool-Labs: Tools that should get deleted - https://phabricator.wikimedia.org/T133777#2242864 (10QChris) p:05Triage>03Low [14:09:16] 06Labs, 10Tool-Labs: `fr-wikiversity` Tool should get deleted - https://phabricator.wikimedia.org/T133778#2242872 (10QChris) [14:36:08] 06Labs, 10wikitech.wikimedia.org: WikiPage::something error encountered while adding two users to Tools project at the same time - https://phabricator.wikimedia.org/T133742#2241219 (10Zppix) if you can reproduce the error and tell us what it is I may take a look [14:59:02] bd808: hmm, I just set up a mwv at paws-base-01.paws.eqiad.wmflabs, with proxy at https://pawsbase.wmflabs.org, except the proxy sends me to paws-base-01.wmflabs.org which fails...
[15:21:29] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Get a real (letsencrypt) cert for labtestwikitech.wikimedia.org - https://phabricator.wikimedia.org/T133167#2243064 (10Krenair) a:03Krenair [15:45:25] YuviPanda: you need to override the default canonical name for the wiki in hiera. Documented'ish at https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs#Run_a_wikifarm_on_a_single_labs-vagrant_host [15:46:38] The bit you probably want is "role::mediawiki::hostname: pawsbase.wmflabs.org" [15:47:18] * bd808 looks for a place in the docs to mention that [15:54:24] YuviPanda: documented at https://wikitech.wikimedia.org/w/index.php?title=Help:MediaWiki-Vagrant_in_Labs&diff=466112&oldid=332604; please help make the wording more clear if you can [15:56:29] on frwikisource replica I get "MIN(rc_timestamp) --> 20150512201401" I thought the recentchanges table contains only the last 30 days, was it changed to a longer time? [15:57:18] ermm, it's 90 days now [15:57:21] nvm [15:58:45] 06Labs, 10Tool-Labs: Tools that should get deleted [Tracking] - https://phabricator.wikimedia.org/T133777#2243175 (10TerraCodes) [16:10:00] bd808: \o/ <3 thank you [16:11:00] YuviPanda: I changed mw-vagrant to use pretty hostnames by default and that was a mostly unintended side effect [16:11:32] I'm actually surprised that a bunch of people haven't complained about their Labs instances breaking because of it [16:12:10] * bd808 assumes this means that people don't update the /srv/mediawiki-vagrant clone very often (or ever) [16:12:37] yeah I bet it's never [16:17:49] 06Labs: Cleanup proxies that point to nonexistent instances - https://phabricator.wikimedia.org/T132231#2192111 (10AlexMonk-WMF) There are at least 14 pointing to pmtpa... [16:21:34] oO [16:22:19] valhallasw`cloud: ^ Maybe the 5 minute flood limit again? (This time I didn't do anything ;)) [16:25:18] let me see if we got running bulks [16:26:01] https://phabricator.wikimedia.org/daemon/bulk/view/497/ [16:36:44] 06Labs, 10Tool-Labs: [Tracking] Tools that should get deleted - https://phabricator.wikimedia.org/T133777#2243885 (10Luke081515) [16:41:52] 06Labs, 10Horizon, 13Patch-For-Review: Switch dynamicproxy to point back to IP rather than domain names - https://phabricator.wikimedia.org/T133554#2243928 (10AlexMonk-WMF) @yuvipanda, @chasemp: How's this? It should cover all the existing weird data apart from non-existent hosts which are {T132231} ```lang=... [16:47:03] bd808, it's a pain to update. You have to recursively clone, which takes a long time. [16:47:16] (at least on my dev machine anyway) [16:47:28] 06Labs, 10DBA, 13Patch-For-Review: Move labs pdns database off of m5-master - https://phabricator.wikimedia.org/T128737#2243947 (10Andrew) > What do we do with the existing (unpuppetized?) already running mysql instance? I'm pretty sure that that's unused leftovers from a previous implementation. Can you v... [16:48:29] tom29739: we've made some advances on the time to update clones, but yes the disk speed issues between the VM and the host cause problems there for many.
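An aside on the hiera override bd808 points to at 15:45: for a Labs project the value lives on the project's Hiera page on wikitech. A minimal sketch, assuming the paws project and the pawsbase.wmflabs.org proxy from the conversation above:

```
# Hiera (YAML) for the project/prefix hosting the MediaWiki-Vagrant instance.
# The key is the one bd808 quotes at 15:46; the value is the web proxy name
# that should become the wiki's canonical hostname.
role::mediawiki::hostname: pawsbase.wmflabs.org
```

Without the override, MediaWiki-Vagrant now derives a "pretty" hostname from the instance name (paws-base-01.wmflabs.org here), which is why the proxy was bouncing to a name that doesn't resolve.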
[16:49:16] I wish I had a universal solution for that, but it seems that every tweak we make helps someone at the expense of someone else [16:49:40] For my dev machine it's the network, but that would affect it too :) [16:58:35] 06Labs, 10DBA, 13Patch-For-Review: Move labs pdns database off of m5-master - https://phabricator.wikimedia.org/T128737#2244005 (10jcrespo) Actually, I see 4 connections from localhost (although using the public interface) from `/usr/sbin/pdns_server-instance --daemon --guardian=yes` [17:07:08] watroles worked yesterday, but today it's "not being served" [17:07:24] e.g. http://tools.wmflabs.org/watroles/role/role::puppet::sel [17:07:35] or is there work on toollabs? [17:19:27] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244127 (10Dzahn) [17:20:55] PROBLEM - Host tools-worker-1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:41] mutante: it looks like it may be broken due to testing by YuviPanda [17:27:15] `webservice status` says "v2 service.manifest detected, please use webservice-new" and webservice-new is not found [17:28:39] bd808: ah, ok. well, i made a ticket and added yuvi [17:28:54] and i know he was up super late.. so he will see it when he comes back [17:30:00] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244127 (10bd808) ``` tools.watroles@tools-bastion-02:~$ webservice status Traceback (most recent call last): File "/usr/local/bin/webservice", line 301, in main() File "/usr/local/bin/webservice", line 249, in... [17:30:09] maybe you could paste that .. reading my mind :) thanks [17:31:08] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244159 (10bd808) Possibly related to {T98440} work being done by @yuvipanda? [17:36:46] mutante: it's alive! https://tools.wmflabs.org/watroles/role/role::puppet::self [17:36:59] I did things [17:36:59] also things are not quite right with the new runner [17:37:09] YuviPanda: I did too :) [17:37:14] the new webservice command is in /usr/bin/webservice, without https://gerrit.wikimedia.org/r/#/c/285656/ there yet. [17:37:37] it was also trying to run lighttpd instead of uwsgi - I could blame one of my stop / starts when testing earlier maybe [17:37:45] bd808: oooh, what did you see / do? [17:37:56] `/usr/bin/webservice --release trusty uwsgi-python restart` [17:38:18] oh hey, thanks. but can't confirm right now [17:38:23] hmm, it's down again? [17:38:29] the job itself is gone [17:38:32] yea [17:38:43] there's no 'web' field in 'service.manifest' either [17:38:46] No webservice [17:38:55] bd808: are we stepping on each other's toes doing things to this? :D [17:38:59] "SIGINT/SIGQUIT received...killing workers..." [17:39:03] maybe [17:39:07] * bd808 will lay off [17:39:22] hmm [17:39:30] ok, let me try [17:40:05] interesting, it didn't write out a service.manifest file [17:40:37] it's there now [17:41:05] nfs lag? [17:41:21] no probably not :D [17:41:53] I mean, I touched the service.manifest code earlier, so it's more likely to be that [17:42:19] I suspect that [17:42:41] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244211 (10bd808) Looks like it may be a path thing.
``` $ which webservice /usr/local/bin/webservice $ /usr/bin/webservice status Your webservice is not running ``` See https://gerrit.wikimedia.org/r/#/c/285656/1 [17:43:56] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244217 (10bd808) 05Open>03Resolved a:03yuvipanda @yuvipanda got it restarted. Follow up should be on T98440 which was the root cause. [17:51:42] bd808: I think after looking through this, my conclusion is: 1. I brought it back as a lighttpd while earlier testing, and 2. we stepped on each others' toes when doing stuff now. [17:52:01] now I should fix those arrows and merge that patch [17:52:22] *nod* plus the wrong binaries in path that need fixing with that patch [17:53:05] the wrong binaries thing is only a problem for places you tested webservice-new though right? [17:53:08] 06Labs, 06Operations: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244253 (10chasemp) p:05Triage>03High [17:54:43] bd808: yeah. [17:56:40] 06Labs, 06Operations: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244256 (10chasemp) A small addendum in case someone else runs into it. I was initially confused by the difference in behavior here: ```dig blah @labs-ns0.wikimedia.org ; <<>> DiG 9.8.3-P1 <<>> blah @labs-ns0.w... [17:58:56] 06Labs, 06Operations: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244262 (10chasemp) [18:03:39] physikerwelt: there is /data/scratch/dumps, but it's not what we expected… [18:16:44] 06Labs, 06Operations: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244402 (10chasemp) In the short term maybe it makes sense just to switch to `/usr/lib/nagios/plugins/check_dig` which seems semi sane, although a check built around http://www.dnspython.org/examples.html would be... [18:43:14] 06Labs, 10Labs-Infrastructure, 06Operations: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244507 (10Krenair) ```lang=irc Krenair: It's a hack, but I tend to put those things in sink plugins, since sink is already in charge of cleaning up... [18:49:33] bd808: heh, looks like I've run into random CI issues now.. [18:58:47] 06Labs, 06Operations, 10Traffic: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244564 (10BBlack) Sticking the #Traffic tag on because this affects monitoring of the production DNS authservers too, and that check_dns utility is awful to be relying on for monitoring something so... [19:01:20] 06Labs, 10Tool-Labs: watroles tool down - https://phabricator.wikimedia.org/T133789#2244573 (10Dzahn) a:05yuvipanda>03None thank you both for the quick response. confirmed working again and i can check what i wanted to check [19:03:45] bd808: valhallasw`cloud chasemp the symlink is merged now, so new webservice code is live completely! \o/ [19:03:59] I'm going to slowly and carefully stagger a webservice restarting set now [19:04:06] sweet [19:06:47] bd808: valhallasw`cloud chasemp /usr/local/bin/webservice is still symlinked because I'm not sure if any code is directly specifying a path to that. I'll remove them later once I verify. So if you ran into differences between webservice-new and webservice in the last 24-48h, those differences no longer exist. just a fyi
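A side note on chasemp's T133791 comment above (18:16): check_dig ships with the stock nagios plugins, so an interim probe against the labs nameservers might look roughly like the following. This is an untested sketch; the server name comes from the ticket comments, the record to look up is an assumption, and the flags are the standard check_dig ones (-H for the DNS server to query, -l for the record to look up):

```
# Ask labs-ns0 for a record that should always exist; non-resolution => CRITICAL
/usr/lib/nagios/plugins/check_dig -H labs-ns0.wikimedia.org -l tools.wmflabs.org
```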
[19:06:57] * YuviPanda waits for puppet to run across the fleet [19:18:34] testing jsub2 is a pain :/ [19:19:07] the code is full of deep validation of file paths (which we want) [19:19:24] but that makes running test commands a real bother [19:23:02] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [19:24:41] looking [19:26:48] if i gave you a list of just IP addresses, could you easily run a salt command on that? [19:26:57] (they are instances) [19:29:03] mutante: salt is useless in labs mostly. I recommend xargs + ssh (which is what I use) + root@ [19:29:55] YuviPanda: ok, can do. killing ganglia on machines that don't appear to use the puppetmaster [19:30:24] most are stopped by godog's change, now getting the remnants [19:32:12] !log integration integration-raita "Could not find class role::ci::raita" puppet error. manually stopping ganglia-monitor [19:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL, Master [19:32:26] What's the difference between lighttpd and lighttpd-plain in webservice on tool labs? [19:35:11] tom29739: hey! lighttpd-plain has no PHP support built in [19:35:58] So it serves like static stuff, like plain HTML files, and css files, and js files, etc? [19:36:24] yeah, and you can customize it with a .lighttpd.conf file to do other things too. [19:45:26] !log catgraph - gptest1.catgraph many puppet errors due to failed mysql-server-5.5 install, broken dpkg/puppet [19:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph/SAL, Master [19:46:42] !log catgraph - gptest1.catgraph manually stopping ganglia (T115330) [19:46:43] T115330: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330 [19:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph/SAL, Master [19:48:02] RECOVERY - Puppet run on tools-checker-01 is OK: OK: Less than 1.00% above the threshold [0.0] [19:58:13] !log contint-cloud IP address 10.68.16.66 has 26 names, 25 are in contintcloud, one is sm-puppetmaster-trusty2.servermon.eqiad.wmflabs. [19:58:13] contint-cloud is not a valid project. [19:58:23] !log contint IP address 10.68.16.66 has 26 names, 25 are in contintcloud, one is sm-puppetmaster-trusty2.servermon.eqiad.wmflabs. [19:58:23] contint is not a valid project. [19:58:32] oh really [19:58:43] !log integration IP address 10.68.16.66 has 26 names, 25 are in contintcloud, one is sm-puppetmaster-trusty2.servermon.eqiad.wmflabs. [19:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL, Master [19:59:01] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:12] !log language language-dev puppet fail / broken dpkg, dpkg was interrupted [20:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Language/SAL, Master [20:01:33] !log language language-dev dpkg --configure -a to fix borked dpkg (manually interrupted dist-upgrade?), manually removing ganglia (T115330)
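A minimal sketch of the xargs + ssh pattern YuviPanda recommends at 19:29, as applied to mutante's ganglia cleanup. The ips.txt filename is hypothetical (one instance IP per line), and the remote command is just the ganglia-monitor stop being logged above:

```
# One ssh per address, as root; -I{} substitutes each IP into the command.
xargs -I{} ssh root@{} 'service ganglia-monitor stop' < ips.txt
```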
[20:01:34] T115330: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330 [20:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Language/SAL, Master [20:06:32] !log language language-dev running puppet, still fails due to issue with MW-singlenode and gitclone (but hey the kernels got installed) [20:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Language/SAL, Master [20:08:29] 06Labs: Duplicate entries in labs internal dns - https://phabricator.wikimedia.org/T126518#2244760 (10hashar) 05Resolved>03Open It is back around :( ``` lang=irc [21:50:04] dzahn@bastion-restricted-01:~$ host 10.68.16.66 [21:50:04] ;; Truncated, retrying in TCP mode. [21:50:04] 06Labs: Duplicate entries in labs internal dns - https://phabricator.wikimedia.org/T126518#2244766 (10hashar) I have no idea whether forward entries leak as well. Maybe we could dump the `contintcloud.eqiad.wmflabs` zone and see how many entries there are? Should be less than 20, the quota for that tenant. I... [20:12:15] 06Labs: Duplicate entries in labs internal dns - https://phabricator.wikimedia.org/T126518#2244767 (10Krenair) [20:12:17] 06Labs, 10Labs-Infrastructure, 06Operations: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244768 (10Krenair) [20:15:48] Hey folks. I'm making a dashboard on grafana and now that it's come time to save I get "Access denied" [20:15:54] http://grafana.wmflabs.org/ FYI [20:16:06] Any insight into what I need to do in order to save my dashboard? [20:16:15] 06Labs, 10Labs-Infrastructure, 06Operations: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244776 (10hashar) From T126518 It is back around :( ``` lang=irc [21:50:04] dzahn@bastion-restricted-01:~$ host 10.68.16.66 [21:50:04] ;; Tr... [20:17:43] halfak: no clue what that one is for maybe a dev instance? [20:18:01] https://grafana-admin.wikimedia.org/ has the labs graphite as a datasource if it can help [20:18:17] I assumed it was the wmflabs installation of grafana. Seems to work nicely with graphite.wmflabs.org metrics. [20:18:26] eg https://grafana-admin.wikimedia.org/dashboard/db/labs-project-board [20:18:45] yuvi might know [20:19:05] YuviPanda, ^ [20:20:39] The good news is that I figured out how to export my dashboard, so if it doesn't save, I can presumably import this file and get it back. [20:21:00] halfak: I think that was a partially set up grafana with that intention, there is some guest/guest type login, and ppl figured out how to make dashboards on the normal grafana w/ labs graphite as a source [20:21:12] halfak: i've no idea where grafana.wmflabs.org is from or who maintains it, unfortunately. use grafana-admin.wikimedia.org to create yours, yeah [20:21:18] but this is mostly percolated knowledge from watching this come up in chat [20:21:42] YuviPanda, I can't seem to access wmflabs metrics for ores on grafana.wikimedia.org [20:21:42] need to find it and kill it to avoid confusion [20:21:45] is there a trick? [20:22:22] When I import my JSON, I just get blank graphs. [20:25:08] halfak: I think you have to set the 'data source' separately [20:25:36] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0] [20:27:04] Found it!
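Picking up the lighttpd-plain question from 19:32: a hedged sketch of running a static-only tool with it, reusing the command shape quoted at 17:37 (`/usr/bin/webservice --release trusty uwsgi-python restart`). The exact flags accepted may differ between the old and new webservice implementations:

```
# Start the PHP-less lighttpd variant for the current tool:
webservice --release trusty lighttpd-plain start
# Optional per-tool tweaks (rewrites, mime types, ...) go in ~/.lighttpd.conf,
# which gets merged into the generated config, per valhallasw`cloud at 19:36.
```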
[20:44:55] 06Labs, 10Labs-Infrastructure, 06Operations: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244874 (10Andrew) there are relatively many ldap connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs... [20:45:43] YuviPanda, https://grafana-admin.wikimedia.org/dashboard/db/ores [20:45:47] 06Labs, 10Labs-Infrastructure, 06Operations: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244890 (10Andrew) It would also be useful to know if we are leaking A records that correspond to the leaked PTR records. [20:46:13] Looks like we get a lot of memory back right after a deployment. [20:48:07] !log ores deployed ores-wikimedia-config:6453fe5 [20:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [20:53:58] Is there a good way to make a grafana dashboard publicly available? [20:55:17] (03PS3) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [20:57:09] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [20:58:02] halfak: they are all public at https://grafana.wikimedia.org/ [20:58:30] Gotcha. I have to drop "-admin" from the subdomain. [20:58:33] halfak: e.g. https://grafana.wikimedia.org/dashboard/db/ores [20:58:38] Thanks :) [20:59:34] the weird domain name split thing is a workaround for grafana's lack of authn/authz support [21:00:28] dashboard looks great halfak, what is scored errored? things we tried to score but couldn't for x reason? [21:01:02] chasemp, yeah. That's right. This often happens when a key bit of data is missing. E.g. it was deleted or suppressed. [21:01:08] But it sometimes happens due to timeouts. [21:01:14] gotcha [21:01:15] Oh! I should have a panel for timeout events. :) [21:03:37] (03CR) 10BryanDavis: "It looks like adding a setup.py makes cowbuilder grumpy -- "Can't exec "pyversions": No such file or directory at /usr/share/perl5/Debian" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [21:04:37] Hmm... I wonder how I get the graphs to behave as though there are zeros [21:05:44] Yay! Found it! [21:05:50] halfak: iirc there is a field to say 0 is legit or no data etc [21:05:51] :) [21:06:12] "null as 0" which is wall hidden [21:06:19] *well [21:07:11] Weird. It looks like, periodically, we get a bunch of errored scorings. [21:07:21] So it'll be fun to look into those! [21:13:19] (03PS4) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:14:07] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [21:17:33] (03PS5) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:18:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [21:19:53] * bd808 hangs his head [21:20:24] been there brother! 
[21:20:31] once I put up 4 in a row that failed [21:20:36] so yeah that was great [21:27:31] (03PS6) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:28:44] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [21:35:25] (03PS7) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:38:06] chasemp: anytime I touch debian/* assume that I'm just cut-n-pasting things from other repos or a google search :) [21:41:53] right me as well [21:48:03] (03PS8) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:49:06] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [21:56:37] (03PS9) 10BryanDavis: [WIP] Rewrite jsub in python [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) [21:57:16] (03CR) 10BryanDavis: "PS9 undoes the attempt in PS8 to use the existing test harness. The output from Jenkins is mostly useless." [labs/toollabs] - 10https://gerrit.wikimedia.org/r/285435 (https://phabricator.wikimedia.org/T132475) (owner: 10BryanDavis) [22:57:31] any folks around that might be able to troubleshoot some openstack/nova/nodepool things? [23:09:51] thcipriani: you probably need andrewbogott but I haven't seen him talking on irc this afternoon [23:10:15] chase might be able to help though if he's still at a keyboard [23:11:49] thcipriani: can you tell what part of nodepool is breaking? comm with nova or something else? [23:12:37] it seems to be communication with nova. [23:13:02] Krenair: do you have magic skills in debugging nova communication problems? [23:13:09] although, I can see connections in netstat to labnet1002 8774 [23:13:32] * bd808 is mostly useless other than pinging people for nova things [23:13:58] but there are a bunch of instances that are 'delete' state or ERROR state depending on if you ask nodepool or nova that can't be deleted, there are timeouts in the nodepool logs. [23:16:55] I'm probably not much use in real labs [23:17:28] what are the timeout errors exactly? [23:20:49] Exception: Timeout waiting for server b440cdb9-741c-41c1-a300-8daf5617debd deletion in wmflabs-eqiad [23:21:19] Krenair: this is https://wikitech.wikimedia.org/wiki/Nodepool stuff [23:21:22] it looks a lot like this ticket: https://phabricator.wikimedia.org/T122731 [23:23:00] and that ticket mentions some debugging tips: https://wikitech.wikimedia.org/wiki/Nodepool#Nodepoold_diagnostic [23:24:22] where does nodepool actually run? [23:24:36] labnodepool1001 [23:24:48] So nowhere we can actually get to [23:24:54] Great... [23:25:09] what? [23:25:25] that seems like a smart access control decision [23:25:31] we can get there. contint-admins have access [23:25:44] I'm not in that group [23:25:54] ah I am [23:25:57] Did you try booting/deleting an instance somewhere else? [23:26:21] ah but I don't have sudo permission [23:26:59] that's what you can do: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml#L133-L156 [23:27:32] contint-roots is just hashar. 
[23:27:40] according to `nova list` on silver, those instances have status ERROR, and some are Power State NOSTATE, some are Running [23:27:42] Might want to expand that pool a bit. ;-) [23:28:17] This is something a root needs to look at, nothing I can help with. [23:28:50] AllocationRequest for 58.0 of ci-jessie-wikimedia [23:28:56] Krenair: yeah, tried to manually delete, just loops connecting to labnet1002.eqiad.wmnet until Exception: Timeout waiting for server f59c2a4b-ce46-40e5-9a1d-7c1ef4b9368f deletion in wmflabs-eqiad [23:30:43] from labnodepool1001 `nova list` shows 18 ERROR state ones [23:31:02] * bd808 realized he had the rights to poke at this [23:32:22] yeah but if the instances are in ERROR state I don't think you can do much just with nodepool access [23:32:57] it's hung up in "waitForServerDeletion" [23:33:17] should we get a root to try the "restart nova-compute" option? [23:33:29] yes. [23:33:58] if they can confirm that's what's stuck but can't figure out why it's stuck, yes [23:34:00] I started trying to build an instance from wikitech when Krenair suggested it, it seems hung-up too. [23:34:01] At an appt, afk for a bit, headed back but will be a bit. Try mutante? [23:34:24] *nod* [23:34:40] yeah, and the instance I started from wikitech just entered an ERROR state. [23:35:32] mutante: around? we are having problems with labnodepool1001 and nova. There are some notes on old bugs that restarting nova-compute has fixed some errors like this before. [23:36:32] who else can we bug? [23:36:51] I've already tried poking the other labs ops, now moved on to other roots [23:37:03] alright, so failing a nova-compute restart, can we move the nodepool jobs back to regular jenkins-slaves? [23:37:32] that's a hashar level question [23:38:04] yeah, that's what I figured. /me flails [23:38:23] although, legoktm might know things about that too [23:38:57] legoktm: is it possible to move the jenkins jobs off of nodepool and back to "normal" instances? [23:39:11] he's currently on super shaky wifi :( [23:39:14] Texted andrew, best I can do atm except for releng to wake up hashar [23:39:17] uhh, some of them [23:40:34] Andrew is going to hop on [23:41:12] yo, what's happening? [23:41:32] we have a CI problem [23:41:33] Do y'all know anything beyond 'instance creation doesn't work'? [23:41:46] andrewbogott: nodepool isn't able to delete or create instances [23:41:50] 'a CI problem' is even less specific! [23:41:56] ok, looking... [23:42:24] how long has it been broken? [23:42:26] instance deletion is also broken [23:42:48] thcipriani: do you know when it started? [23:43:14] andrewbogott: I've been poking at it for about 1hr 15mins [23:43:28] hour and a half? [23:43:43] oh? I was here back then and Yuvi originally said that CI was broken but then I thought it was resolved [23:43:45] Apr 27 23:18:23 nodepool jobs not running? [23:43:51] (GMT+1) [23:43:55] in -releng [23:44:10] I'm on a super slow network but working on it [23:44:15] o [23:44:16] n [23:44:19] i [23:44:20] t [23:44:33] ^ graphic illustration of the latency of my network [23:45:38] nodepool.log has lots of "Exception: Timeout waiting for server b440cdb9-741c-41c1-a300-8daf5617debd deletion in wmflabs-eqiad" at the end of big python stack traces [23:46:28] sounds like a symptom of instance deletion being broken to me [23:46:46] Is creation not also broken?
[23:46:58] it is [23:47:31] I think it's not creating new instances because all the slots are filled with instances that are supposed to be deleted [23:47:43] agreed [23:47:57] I mean, is it broken in projects other than CI? [23:47:57] those instances are in ERROR state. Some have Task State "deleting", some are "-". Some have Power State "NOSTATE", some are "Running" [23:48:14] I started trying to build an instance from wikitech when Krenair suggested it, it seems hung-up too. [23:48:15] andrewbogott: apparently node creation also broken in wikitech according to thcipriani [23:48:16] yeah, and the instance I started from wikitech just entered an ERROR state. [23:48:39] thcipriani, what project was your instance in? [23:48:55] Krenair: staging [23:49:01] that still exists? ok [23:49:19] * twentyafterfour goes back to preparing the phabricator update which is due very soon. Will wait for this to be fixed before breaking phab though in case phab is needed [23:49:33] yeah, but doesn't list any instances: https://wikitech.wikimedia.org/wiki/Nova_Resource:Staging [23:49:47] thcipriani: That [23:49:52] *is normal [23:49:54] | ID | Name | Status | Task State | Power State | Networks | [23:49:57] | 23cbdaf9-c187-42c3-944e-e7bc973b2932 | test | ERROR | spawning | NOSTATE | | [23:50:11] it's a dynamic list, but not always dynamic ;) [23:50:15] ^ that is what I saw, roughly when I spawned it. [23:50:33] that's what "OS_TENANT_NAME=staging nova list" currently shows [23:51:48] hm… so, is everything fixed now, by chance? [23:52:44] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [23:52:47] looks like there are still some ERROR instances but I see a new active one [23:52:56] Woo. [23:53:02] the ERROR ones aren't likely to revive [23:53:09] I do see a rake-jessie build. [23:53:13] but nodepool should clean 'em up in a few minutes [23:53:14] Yay! [23:53:30] So here was my debugging process: [23:53:44] andrewbogott, will nodepool's deletion work given that they're in error state? [23:53:48] Last night I was in a session about openstack troubleshooting with some redhat engineers [23:54:06] and they said "It's pretty much always rabbit" and then they had a cute gif of a rabbit on the slide [23:54:07] ah, we have action at the RC of wikitech [23:54:11] so I just now tried restarting rabbit [23:54:15] deletion works [23:54:21] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [23:54:24] which apparently fixed it [23:54:24] andrewbogott: thank you! [23:54:36] I mean at least we have lots of "instance info" updates :D [23:54:46] yeah, works [23:54:56] gate-and-submit runs again [23:55:02] new contintcloud instances have appeared and are working [23:55:55] the first tests at the test queue are running too [23:56:03] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:56:21] ok, I also (from that same session) have some ideas about how to tune rabbit and maybe prevent this. Although we are nowhere close to the usage levels that should have overloaded things [23:56:26] I issued `openstack server delete` commands for all of the ERROR instances. Not sure if it was needed but they seem to be going away [23:56:53] is rabbit actually RabbitMQ or something else?
[23:57:13] yeah, rabbitmq [23:57:16] diagnosis: https://i.imgur.com/pK1Gn.gif [23:57:34] looks about right :) [23:58:35] Everything in `openstack server list` looks healthy now [23:58:47] thanks andrewbogott [23:59:07] Sure. Sorry it fell over :( [23:59:56] CI/nodepool does a good job of 'monitoring' the instance creation project, but that's probably not the right long-term monitoring solution
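For the archive, the whole recovery reduces to a short sequence. A hedged recap: the rabbitmq-server service name and the idea that it runs on the nova controller are assumptions (the log never names where rabbit lives), while the nova/openstack commands and the example UUID are taken from the log itself:

```
# 1. "It's pretty much always rabbit" -- restart the message queue:
sudo service rabbitmq-server restart

# 2. List the CI tenant's instances; the stuck ones show status ERROR:
OS_TENANT_NAME=contintcloud nova list

# 3. Clean up each stuck instance by ID, as bd808 did:
openstack server delete b440cdb9-741c-41c1-a300-8daf5617debd
```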