[00:00:01] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:00:08] It's this: URI ldap://undef:389 ldap://undef:389
[00:00:10] Labs: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2291428 (Luke081515)
[00:00:10] prod upgraded exim and changed the Puppet config
[00:00:18] fun
[00:00:28] I should be able to fix the URI thing pretty quickly
[00:00:45] you "just" need to `apt-get install exim` I think to fix
[00:00:58] or maybe exim4?
[00:01:41] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:01:52] andrewbogott: ah that would do it
[00:03:13] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:04:53] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[00:04:57] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:05:50] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2291489 (Luke081515)
[00:05:52] Labs: Create labs project arcanist - https://phabricator.wikimedia.org/T130507#2291487 (Luke081515) Open>declined Closing this as declined, because we don't have an "obsolete" status. With the arcanist installer for windows, which works, we don't need this solution.
[00:07:09] bd808: do you happen to know if the package name is exim4-base on all distros or if it's different on jessie vs. trusty?
[00:07:22] (that's for sure the name on jessie)
[00:07:30] exim4 on trusty I think
[00:07:34] but...maybe I'm wrong
[00:07:36] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:08:02] chasemp: this should fix logins: https://gerrit.wikimedia.org/r/#/c/288555/1
[00:08:44] give it a whirl
[00:08:49] I'm having no luck w/ exim over salt
[00:09:22] 'no luck'?
[00:09:34] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:09:39] not sure what package name is right, but it won't upgrade w/o sudo it seems
[00:09:42] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:09:43] and sudo fails w/o ldap
[00:10:06] doesn't salt already run as root?
[00:10:22] Well, anyway, ldap should be recovering now as puppet runs
[00:10:26] so I thought could also be puppet stepping on me
[00:10:27] idk
[00:10:29] E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
[00:10:30] E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
[00:10:35] is normal need sudo type stuff
[00:10:41] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[00:10:41] for apt
[00:10:49] hm, that sounds more like an apt-get or a puppet run is in progress
[00:11:04] could be but it's consistent and has lasted awhile
[00:12:28] exim4-base is in jessie and trusty too
[00:12:33] chasemp: my canaries are better now, at least
[00:12:57] maybe the dependencies changed but the package name is there in both..
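The per-instance fix being discussed above can be sketched roughly as follows. This is a hedged reconstruction, not the exact command run that night: it assumes `exim4-base` is the right package on both jessie and trusty (as concluded in the log) and wraps the steps in a function with a `--dry-run` mode, since the real thing needs root and a free dpkg lock.

```shell
# Hypothetical sketch of the per-instance exim fix discussed above.
# Assumes a Debian-family host; the package name (per the log) is
# exim4-base on both jessie and trusty.
PKG="exim4-base"

fix_exim() {
    # Dry-run mode: just print what would be executed.
    if [ "$1" = "--dry-run" ]; then
        echo "apt-get install -y $PKG"
        return 0
    fi
    # Wait out any concurrent apt/dpkg run (the lock errors quoted above).
    while fuser /var/lib/dpkg/lock >/dev/null 2>&1; do
        sleep 5
    done
    apt-get update && apt-get install -y "$PKG"
}

fix_exim --dry-run
```

The dry-run path makes the sketch safe to paste on a workstation; on an actual broken instance you would run it as root without the flag.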
[00:13:04] I think we should just wait it out, trying to force puppet runs will probably just overwhelm the puppetmaster
[00:13:58] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[00:14:07] so checks are failing as expected
[00:14:07] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=tools
[00:14:10] why no notice in -ops?
[00:14:24] kaldari, tom29739, the issue is fixed in puppet now, your access should go back to normal in 15 minutes max
[00:14:36] thanks!
[00:14:40] I can push things along if you have a particular instance you covet
[00:14:49] bd808: That command worked, I'll add that to the wiki page for others, thank you for your time <3
[00:15:22] chasemp: maybe because icinga-wm is a zombie, there were 2 processes for some reason and i killed one
[00:15:32] CZauX: awesome! Have fun testing
[00:15:40] mutante: I see it now actually
[00:15:55] icinga-wm_PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 0.043 second response time
[00:16:07] ok, good. i was about to say usually it's stable-ish
[00:16:08] i.e. grid is down
[00:16:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:17:37] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:18:34] chasemp: if you don't object, i'm going to salt a puppet run on tools-*
[00:18:43] give it a try
[00:18:52] I was just going to spot check a few
[00:19:30] recovery on the icinga checks
[00:19:52] except the ldap one ironically
[00:23:22] so was the exim thing a race condition?
[00:23:30] because places I have not explicitly upgraded now seem fine
[00:23:43] seems like it failed w/ one puppet run and then corrected itself next round or something
[00:23:52] or did someone fix it in a way I missed
[00:24:04] I fixed it by hand on two or three instances
[00:24:08] but that's it
[00:24:13] I'd expect it to still be failing here and there
[00:24:32] The exim thing has been happening for ~ a week
[00:24:40] so I wouldn't expect anything new on that front
[00:24:42] I didn't realize
[00:24:45] apart from whatever you're doing with salt or clush
[00:25:19] well this is strong fodder for those tools checks to page I think
[00:25:26] apt-get install exim4-base on * via salt sounds like it would have fixed it.. besides the possible overload
[00:25:28] we would probably have known right away this was going down
[00:25:31] yeah
[00:25:49] 18:46 was broken
[00:26:00] 19:19 was fixed
[00:26:05] according to icinga
[00:26:22] I'll put a few of those in paging mode tomorrow
[00:26:30] yeah, seems worthwhile
[00:26:43] Sorry if my bug made you late for dinner :(
[00:27:11] I was out searching for a lost cat for awhile and came back to everything broken and launching my finally done i/o profile on labstore1005
[00:27:16] so it was a happy coincidence
[00:27:20] I was already around
[00:27:53] find the cat?
[00:28:13] yes, the cat had been closed up in a dresser drawer the whole time
[00:28:21] not a joke
[00:28:23] Oh, sad cat :(
[00:28:52] woah what happened now
[00:29:01] * YuviPanda reads backscroll
[00:29:15] YuviPanda: I just broke ldap with a puppet refactor
[00:29:20] :) glad to hear the cat is back. are you putting GPS on it now :)
[00:29:39] we need to but no :)
[00:30:22] ok :D
[00:30:27] got some gridengine failure notices
[00:30:31] so I'm going to investigate that too
[00:31:00] chasemp: Are things looking fixed enough for me to go eat dinner?
[00:31:23] I think yes -- YuviPanda if you happen to see more fallout give a call please -- I'm off to deal w/ kids and bedtimes
[00:31:37] chasemp: yup, I'll call if needed
[00:31:57] sudo works ok
[00:32:01] am looking at the grid queues now
[00:32:24] andrewbogott: can confirm the gridengine failures are also because of LDAP
[00:32:33] it put some queues in error state, fixing now
[00:32:37] 'stress test' :)
[00:33:03] (I was halfway into debugging DNS when I realized it couldn't be that -- true story)
[00:33:05] ok later
[00:34:01] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:34:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:34:53] * andrewbogott checks on ^
[00:37:03] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:37:15] well, those hosts are totally fine and I don't know why they are complaining
[00:37:35] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[00:37:35] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[00:38:09] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[00:40:26] !log tools cleared all queues that were in error state
[00:40:31] andrewbogott: the alerts are lagged
[00:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[00:40:35] andrewbogott: because graphite etc
[00:40:42] andrewbogott: I think the fallout has been handled now
[00:40:53] ok, thanks for cleaning up
[00:40:58] I tested that patch so many times on the puppet compiler
[00:41:04] but of course didn't test it on an actual labs host :(
[00:41:27] * andrewbogott -> eats
[00:42:41] :D
[00:46:28] So we now know the consequences of an ldap outage I guess?
[00:47:12] uri ldap://undef:389 ldap://undef:389
[00:47:14] ahh
[00:47:35] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:53:37] YuviPanda: any idea if our hiera setup will support my making a labs.yaml for instances in .codfw.labtest?
[00:53:57] andrewbogott: let me look
[00:54:22] andrewbogott: we already seem to have one?
[00:54:26] andrewbogott: ./modules/puppetmaster/files/labtest.hiera.yaml
[00:56:12] YuviPanda: ok, I'm trying to understand that...
[00:56:24] "%{::site}/labtest-instances"
[00:56:35] seems like that will let me specify hiera for specific instances but not in general, right?
[00:56:59] isn't labtest its own $::realm?
[00:57:13] hmm
[00:57:20] labtest hosts are their own realm
[00:57:24] but I don't think labs instances are
[00:57:33] *labtest instances
[00:57:39] I think %{::site} will expand to codfw
[00:57:44] so codfw/labtest-instances
[00:57:45] ah, right, they'll just have realm labs
[00:57:53] should be equivalent of labs.yaml for them all I think
[00:57:55] so for labtest, realm is labs and site is codfw
[00:58:02] whereas for labs, realm is labs but site is eqiad
[00:58:08] Oh, you're totally right
[00:58:14] I just need to twiddle hieradata/codfw/labtest-instances.yaml
[00:58:15] I bet
[00:58:43] what did you do to try to fix the ldap issue andrewbogott?
[00:58:47] everything is still very broken
[00:58:50] Krenair: where?
[00:58:53] Krenair: in deployment-prep?
[00:59:02] yes, and I've got at least one report from another project
[00:59:03] Krenair: just this: https://gerrit.wikimedia.org/r/#/c/288555/
[00:59:07] Krenair: puppet should run, I guess
[00:59:15] Puppet is failing because of ldap
[00:59:31] maybe the masters can't update themselves?
[00:59:35] ah, did this just completely fuck over self hosted puppetmasters then?
[00:59:45] is wikidata-query using a self hosted puppetmaster?
[01:00:16] yeah, can confirm, k8s puppet hosts are all failing
[01:00:21] the commit is on deployment-puppetmaster, but puppet does this:
[01:00:22] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:00:22] Warning: Error 400 on SERVER: LDAP Search failed
[01:00:27] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node deployment-puppetmaster.deployment-prep.eqiad.wmflabs: LDAP Search failed
[01:00:27] Warning: Not using cache on failed catalog
[01:00:27] Error: Could not retrieve catalog; skipping run
[01:00:31] login is broken, sudo is broken
[01:00:37] Krenair: yeah, you've to manually fix the ldap config on puppetmaster
[01:00:43] doing
[01:00:51] andrewbogott: ^ all self hosted puppetmasters might be broken now
[01:01:54] Krenair: ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:02:01] is the correct one, if you needed
[01:02:03] I did /etc/ldap.conf and /etc/ldap.yaml
[01:02:05] Hm...
[01:02:08] it also needs passwords YuviPanda
[01:02:09] why do they fail to update themselves?
[01:02:14] But it still fails
[01:02:40] that doesn't seem to fix it
[01:02:44] andrewbogott: they can't run puppet
[01:02:48] what host are you working on?
[01:02:49] andrewbogott: since puppet depends on LDAP
[01:02:59] andrewbogott: am working on tools-puppetmaster-01, which serves all the k8s hosts
[01:03:09] deployment-puppetmaster
[01:03:11] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:03:13] Warning: Error 400 on SERVER: LDAP Search failed
[01:03:26] wdqs-puppetmaster is also probably in the same mess, based on what SMalyshev says in -releng
[01:03:30] YuviPanda: mind if I tinker for a moment?
[01:03:46] Probably you need to update settings in /etc/puppet/puppet.conf
[01:04:10] yep, that's broken as well
[01:04:22] andrewbogott: ah, I see.
[01:04:28] andrewbogott: yes, please go ahead.
[01:04:48] what's the correct ldapserver syntax in there?
[01:05:00] I tried this:
[01:05:01] ldappassword = Eche0ieng8UaNoo
[01:05:01] ldapserver = ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:05:02] no luck
[01:05:17] (public password, before anyone panics, although you should all recognise it at this point)
[01:05:41] I pronounce it as 'Echoing Eightuano!'
[01:05:44] in puppetmaster:
[01:06:05] ldapserver = ldap-labs.eqiad.wikimedia.org
[01:06:13] um, that's in /etc/puppet/puppet.conf
[01:06:16] and in /etc/ldap.conf
[01:06:41] uri ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:06:54] yeah I know the server is in both places, I needed the syntax
[01:06:59] did that work?
[01:07:01] yes
[01:07:04] still broken. maybe a puppetmaster restart needed?
[01:07:06] tools-puppetmaster-01
[01:07:11] what should I do next?
[01:07:18] yes, you have to restart puppetmaster after changing puppet.conf
[01:08:01] It works
[01:08:03] ok, I'll reboot the puppetmaster and see if that allows me to log in
[01:08:12] It won't SMalyshev
[01:08:17] SMalyshev: wait, what host are you working on?
[01:08:22] So update /etc/puppet/puppet.conf and service puppetmaster restart
[01:08:29] he'll be on wdqs-puppetmaster
[01:08:36] But he can't log in
[01:08:41] And he didn't have a root key set up
[01:08:46] So he's blocked on ops roots
[01:08:46] yep, looking
[01:08:48] andrewbogott: wdqs-puppetmaster is the puppetmaster for wikidata-query project
[01:09:14] yeah I rarely really need root and usually never ssh as root
[01:09:37] I usually never ssh as root, but it's required in labs when instance login/sudo breaks
[01:09:45] right :)
[01:09:49] as is the case here
[01:09:50] SMalyshev: you should be back in business in a minute or two
[01:09:58] andrewbogott: ok, cool!
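The manual recovery just walked through (fix the bogus `undef` LDAP server in the config, then restart the puppetmaster) can be sketched as below. This is a hedged reconstruction demonstrated on a throwaway temp file, not run against a real `/etc/puppet/puppet.conf`; the `sed` substitution and the `%` delimiter match the command used later in the log.

```shell
# Hedged sketch of the self-hosted puppetmaster LDAP repair discussed above.
# Demonstrated on a temp copy so it is safe to run anywhere.
LDAP_SERVER="ldap-labs.eqiad.wikimedia.org"

conf=$(mktemp)
cat > "$conf" <<'EOF'
[master]
ldapserver = undef
EOF

# Same substitution used interactively in the log, with % as the s/// delimiter
# so the hostname's dots need no escaping.
sed -i "s%undef%${LDAP_SERVER}%g" "$conf"
grep ldapserver "$conf"

# On an actual puppetmaster you would then restart the service, since
# puppet.conf is only read at startup:
#   service puppetmaster restart
rm -f "$conf"
```

On clients, the same substitution was later broadcast over salt; the restart step only applies to the master itself.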
[01:10:11] so there's this in Hiera:Deployment-prep on wikitech:
[01:10:13] "passwords::root::extra_keys":
[01:10:13] alex: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdYrWghgUYZZsdh8tw4mRfpsNWCW56ns6tKei3oSHEUNOXnxhAQwnwwkc1LeGYDdPglX3EAbtj0LAUf1muaC99mFYXw+e+7XjZCJYZ5cRrkvDXA/JvxaBvhOid8BZGImeoSqX2ZZ0HQSNRQfzVbs+7yAu0nMB3WSup9eoMnC3OfBsbPvQZ/WDD/OKIwuv8hnYyi8QmQgPngERqIf12ireGIXNn++IMncqu9Z6+skQHiuvkMsWlPjeYtg8gfIS8h8JY1QApVUyqkHuE4Fr/1n/NagsDu1WvE4zuEzQ3TmXZOtikPqSfsLDD5+gqxcaH1iiM+nEeG/RWD9/WyHnVUsIH krenair@gmail.com ubuntu laptop
[01:11:48] * andrewbogott does sudo salt '*' cmd.run 'grep undef /etc/puppet/puppet.conf'
[01:12:13] Didn't fix deployment-salt andrewbogott
[01:12:22] where do puppet clients get their ldap config?
[01:13:24] Krenair: I'll look, hang on
[01:14:39] Krenair: I think you're looking on the wrong host, the puppetmaster is deployment-puppetmaster.deployment-prep.eqiad.wmflabs isn't it?
[01:15:18] andrewbogott: so puppetmaster login now works, but wdq-beta still broken
[01:15:29] Yes it is
[01:15:40] SMalyshev: wdq-beta will need to run puppet to update
[01:15:43] now that the master is updasted
[01:15:44] dated
[01:15:53] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: stack level too deep
[01:15:55] Which has working puppet itself now
[01:15:59] ... ah, no, it broke again
[01:16:16] In fact the first time I ran puppet it broke itself
[01:16:51] in /etc/nslcd.conf, /etc/ldap.yaml, /etc/ldap.conf
[01:17:16] Krenair: I'm on deployment-puppetmaster, should be fixed in a moment.
[01:18:02] andrewbogott: ok, seems to work now
[01:18:23] still unhappy about some exim thing, but that happened before
[01:18:41] and I don't care about exim anyway
[01:18:58] You mean the normal labs exim puppet error?
[01:19:08] Error: Could not start Service[exim4]: Execution of '/etc/init.d/exim4 start' returned 1:
[01:19:08] Error: /Stage[main]/Exim4/Service[exim4]/ensure: change from stopped to running failed: Could not start Service[exim4]: Execution of '/etc/init.d/exim4 start' returned 1:
[01:19:12] Krenair: what do you think about
[01:19:14] sudo salt '*' cmd.run 'sed -i %undef%/etc/puppet/puppet.conf%g /etc/puppet/puppet.conf'
[01:19:15] dunno if it's normal :)
[01:19:24] oops, typo, hang on
[01:19:43] SMalyshev, all jessie instances have the issue, IIRC the fix is to upgrade the exim4 package
[01:20:02] apt-get install exim4-base
[01:20:05] Krenair: I mean, of course
[01:20:10] sudo salt '*' cmd.run 'sed -i %undef%ldap-labs.eqiad.wikimedia.org%g' /etc/puppet/puppet.conf
[01:20:11] Krenair: ok, will try that, though as I said I'm not too worried about it. the login seems to be fixed
[01:20:12] I'll try it on a test case
[01:20:36] Will it work without fixing the password andrewbogott ?
[01:20:49] SMalyshev: install "exim4-base" package and run puppet again
[01:20:59] mutante: yeah will try that
[01:21:09] andrewbogott, also that can't be right
[01:21:19] your command ends before the filename
[01:21:26] Krenair: I haven't seen any examples of the password being changed
[01:21:34] they all get changed to undef andrewbogott
[01:21:43] mutante: nope, puppet still hates exim.
[01:21:56] # The DN to bind with for normal lookups.
[01:21:56] binddn cn=proxyagent,ou=profile,dc=wikimedia,dc=org
[01:21:57] bindpw undef
[01:22:09] same in the ldap and puppet configs
[01:22:14] andrewbogott: hmm, tools-puppetmaster-01 is still failing btw
[01:22:21] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:22:23] Warning: Error 400 on SERVER: Could not retrieve facts for tools-puppetmaster-01.tools.eqiad.wmflabs: Could not autoload puppet/indirector/facts/active_record: cannot load such file -- active_record/deprecated_finders
[01:22:24] SMalyshev: hmm, so i opened this ticket about it https://phabricator.wikimedia.org/T135033
[01:22:57] andrewbogott: hmm, I restarted puppetmaster again, maybe that'll help
[01:23:04] looking better now
[01:23:06] so far
[01:23:08] apparently a new image is in the works but blocked by another ticket
[01:23:22] but not sure which one that is
[01:23:34] mutante: thanks...
[01:23:42] Krenair: the instance I'm looking at is test-prometheus5
[01:23:53] in its puppet.conf it had an 'undef' ldap servername
[01:23:56] but a valid password
[01:23:59] and my salt command worked fine
[01:25:08] mutante, https://phabricator.wikimedia.org/T133551
[01:25:54] the current blocker for the new image is https://phabricator.wikimedia.org/T134944
[01:26:04] but anyway that won't make much of a difference
[01:26:37] why not
[01:26:37] <andrewbogott> sudo salt '*' cmd.run 'sed -i %undef%ldap-labs.eqiad.wikimedia.org%g' /etc/puppet/puppet.conf
[01:26:53] the filename is after the command you run on each instance
[01:27:01] Labs: confirm that new base labs base image is adequate for kubernetes &c. - https://phabricator.wikimedia.org/T134944#2291610 (Dzahn)
[01:27:08] the filename should be part of the command
[01:27:22] yes, sorry, I had to do some quote-mark tidying
[01:27:23] it is now sudo salt "*" cmd.run 'sed -i "s%undef%ldap-labs.eqiad.wikimedia.org%g" /etc/puppet/puppet.conf'
[01:27:43] which I'm running right now :)
[01:28:02] instances seem to be being unbroken
[01:28:08] Labs, Mail: failed exim service on labs instances - https://phabricator.wikimedia.org/T135033#2286378 (Dzahn) 18:31 < andrewbogott> the current blocker for the new image is https://phabricator.wikimedia.org/T134944 18:31 < andrewbogott> but anyway that won't make much of a difference
[01:28:20] I checked deployment-mediawiki01 though
[01:28:32] /etc/ldap.yaml
[01:28:33] blank password
[01:28:43] /etc/ldap.conf - bindpw undef
[01:28:52] /etc/ldap/ldap.conf - BINDPW undef
[01:28:58] * andrewbogott nods
[01:29:09] but if puppet has the password it can populate all of those
[01:29:40] It should but it's not
[01:29:57] ok...
[01:29:59] what host is this?
[01:30:00] The puppet diffs aren't showing it being changed
[01:30:07] sorry, you just said
[01:30:08] login still broken
[01:30:09] I'll look
[01:30:10] deployment-mediawiki01
[01:30:24] in fact puppet includes errors like these:
[01:30:24] Error: Could not find group wikidev
[01:30:24] Error: /Stage[main]/Mediawiki/File[/var/log/mediawiki]/group: change from 500 to wikidev failed: Could not find group wikidev
[01:30:25] Error: Could not find group ops
[01:32:21] root@deployment-mediawiki01:~# ldaplist -l group ops
[01:32:21] Password incorrect.
[01:33:05] however I can do ldapsearch -x dc:dn:=deployment-mediawiki01.deployment-prep.eqiad.wmflabs
[01:33:54] so… who updates /var/lib/git/operations/puppet on a self-hosted master?
[01:33:59] autoupdater
[01:34:00] Is it a cron, or is it puppet itself?
[01:34:47] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type motd::script at /etc/puppet/modules/base/manifests/puppet.pp:106 on node deployment-puppetmaster.deployment-prep.eqiad.wmflabs
[01:34:50] what the heck is that?
[01:35:14] It's a very good question, one I have a ticket open for. You're running via salt?
[01:35:18] …and it's gone
[01:35:22] Yep.
[01:35:35] ok, will ignore for now :)
[01:35:41] Salt?
[01:35:47] nope
[01:35:49] Huh.
[01:35:51] I'm logged in
[01:36:01] So, what is 'autoupdater' in this context?
[01:36:09] trying to find it
[01:36:37] login on deployment-puppetmaster is still broken
[01:36:42] modules/beta/manifests/autoupdater.pp:# == Class: beta::autoupdater
[01:36:52] # Puppet Name: rebase_operations_puppet
[01:36:52] */10 * * * * /usr/local/bin/git-sync-upstream >>/var/log/git-sync-upstream.log 2>&1
[01:37:14] modules/puppetmaster/manifests/gitsync.pp
[01:39:07] andrewbogott, ^
[01:41:36] ok, so it should be updated per puppet run
[01:42:40] looks more like every ten minutes by cron
[01:42:51] the cron is puppetised at modules/puppetmaster/manifests/gitsync.pp
[01:42:56] but it is still a cron
[01:43:32] is there a way I could mount /public/dumps on my labs instance?
[01:43:47] I can see it when i login to tools but not on labs instance...
[01:44:40] Krenair: /var/lib/git/labs/private had local changes that prevented rebasing
[01:44:47] so that's bad, I don't know yet if it was the cause
[01:45:17] you needed a labs/private update?
[01:45:32] who knows? This was very out of date
[01:45:55] SMalyshev, https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
[01:46:05] SMalyshev, "You can request them by filing a task on Phabricator "
[01:47:02] andrewbogott, root@deployment-puppetmaster:/var/lib/git/operations/puppet# git log --oneline -n 20 | grep ldap
[01:47:03] 18626de Define labsldapconfig for labs instances
[01:48:44] SMalyshev: assign it to me and I'll take care of it?
[01:49:14] created T135205
[01:49:14] T135205: Access to /public/dumps for wikidata-query project - https://phabricator.wikimedia.org/T135205
[01:49:25] YuviPanda: thanks!
[01:49:40] Krenair: that was it, I think
[01:50:07] yeah, all good now
[01:50:48] SMalyshev: should work on next puppet run
[01:51:02] YuviPanda: coo, thanks!
[01:51:05] np
[01:54:30] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Write a k8s admission controller to enforce that all containers running come from our private repository - https://phabricator.wikimedia.org/T133515#2291665 (yuvipanda) When attempting to use gcr.io containers... ``` root@tools-k8s-master-01:/h...
[01:54:35] So… apart from perverse cases like the above, I think this is resolved everywhere
[01:54:38] andrewbogott, looks better...
[01:55:12] root@deployment-salt:~ # grep undef /etc/nslcd.conf
[01:55:12] bindpw undef
[01:56:25] puppet run fixed that
[01:56:46] Yeah, I'm pretty sure things are all catching up
[01:56:50] somehow you didn't catch deployment-ms-be02, andrewbogott
[01:56:52] except I will force yet another puppetmaster restart in 15-20
[01:56:54] did you use salt that time? :)
[01:57:00] Krenair: is that a puppetmaster?
[01:57:02] no
[01:57:10] So it'll just have to catch up on its own time
[01:57:16] oh I see
[01:57:58] I don't think it makes sense to force puppet runs
[01:57:58] so it's ldap config fixes on the puppetmasters + up to date labs/private + puppet runs on puppet clients?
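The stale /var/lib/git/labs/private checkout above (local changes blocking the git-sync-upstream rebase) suggests a quick health check for a self-hosted puppetmaster. The sketch below is hypothetical, not part of the actual tooling: it just flags a checkout whose working tree is not clean, demonstrated on a throwaway repo rather than the real paths.

```shell
# Hedged sketch: detect a checkout that the rebase cron can no longer
# fast-forward because of local changes (the labs/private situation above).
# check_repo and the demo repo are illustrative, not real infrastructure.
check_repo() {
    repo="$1"
    if [ -n "$(git -C "$repo" status --porcelain)" ]; then
        echo "DIRTY: $repo has local changes; rebase may be stuck"
        return 1
    fi
    echo "CLEAN: $repo"
}

# Throwaway demo repo with an uncommitted (untracked) file:
demo=$(mktemp -d)
git -C "$demo" init -q
touch "$demo/localhack"
check_repo "$demo" || true
rm -rf "$demo"
```

Run periodically (or from the sync cron itself), a check like this could feed the graphite/angry-email alerting discussed below instead of silently giving up.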
[01:58:03] since the puppetmaster is already running as fast as it can
[01:58:22] yeah, that sounds right
[01:59:56] Labs, Labs-Kubernetes, Tool-Labs: Goal: Allow using k8s instead of GridEngine as a backend for webservices (Tracking) - https://phabricator.wikimedia.org/T129309#2291678 (yuvipanda)
[01:59:59] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Write a k8s admission controller to enforce that all containers running come from our private repository - https://phabricator.wikimedia.org/T133515#2291675 (yuvipanda) Open>Resolved Aaaand, works otherwise! \o/ I call this done now.
[02:01:08] Krenair: I'm pretty sure that any instance that is updating puppet without private (or private without puppet) is hanging by a thread
[02:01:31] deployment-prep would have broken even without the mistake that caused the login failures elsewhere
[02:01:47] I hadn't noticed that labs/private was not up to date
[02:02:02] Yeah, I don't know how we would notice things like that, wholesale :(
[02:02:17] the cron that does the rebases
[02:02:32] when rebase is not possible, can't it report a bad number to graphite, else a good one?
[02:02:55] hm, maybe
[02:03:00] it does it already I think? or something like that
[02:03:04] Or send an angry email :)
[02:03:10] we have a cherry pick counter
[02:03:23] angry mail might work better but we can't have it sending every 10 minutes
[02:03:29] Or, honestly, it could just clobber local changes
[02:03:39] every 10 minutes?
[02:03:50] And just display "SORRY NOT SORRY" in red when it does it
[02:03:59] What use is a local puppetmaster if you're going to sabotage it every 10 minutes like that?
[02:04:16] Oh, I was thinking just 'private'
[02:04:19] Think I'll make an ops-puppet local commit to stop that happening
[02:04:29] it used to clobber changes
[02:04:32] it was unpopular
[02:04:37] and it doesn't anymore
[02:04:37] yeah, ok
[02:04:39] shockingly
[02:04:42] well
[02:04:47] More popular: leaving messes for Andrew to clean up :)
[02:04:48] it now autostashes them
[02:04:59] YuviPanda: it does? That wasn't what I saw just now
[02:05:17] andrewbogott: it can easily get into a state where it gives up
[02:05:19] You didn't have to clean that up, other deployment-prep roots could've
[02:05:25] andrewbogott: and that only accounts for dirty states
[02:05:29] Krenair: only for deployment-prep
[02:05:36] And any other project with their own roots
[02:05:42] I didn't because I wasn't aware of it
[02:05:43] Krenair: most other self hosted puppetmasters don't have their own roots setup
[02:06:13] and mostly we can't tell 'too bad you did not do X, now your instance is hosed start from scratch' without at least trying
[02:06:16] mostly because andrewbogott is too nice...
[02:06:17] Anyway, yeah, a nag is better than a clobber
[02:06:19] Well I'll give you that it's more popular to leave messes for labs roots to clean up :p
[02:06:41] …anyway...
[02:06:47] Krenair: it also led to this narrative 'those labs people are breaking things all the time', which was very annoying.
[02:06:52] There aren't any more currently broken things that we know of, right?
[02:07:01] oooh, I've hit grumpy o'clock! I should stop saying things :)
[02:07:26] I hit grumpy o'clock a while ago and should've shut up
[02:07:47] YuviPanda: didn't you just get off of a 16-hour flight and then work for 14 hours?
[02:08:15] andrewbogott: no, I got off a 26h flight then worked for a few hours and then slept and worked again
[02:08:26] ah, so you slept a bit mid-day, that's good
[02:08:26] and omg It's 10PM and I've basically been working since 10AM :|
[02:08:32] andrewbogott: no, I slept at night.
[02:08:35] are all puppetmasters fixed?
[02:08:36] i'm on east coast time now!
[02:08:43] Krenair: everything that I know of is fixed
[02:08:56] how did you list all puppetmasters?
[02:09:21] Krenair: I didn't, i just ran salt "*" with commands that were no-ops on non-puppetmasters
[02:09:31] okay, how did you list all saltmasters?
[02:09:52] Hm...
[02:09:58] are there saltmasters that are their own saltmaster?
[02:10:13] I assumed that most saltmasters in labs were themselves managed by labcontrol1001
[02:10:52] Yes
[02:10:59] * andrewbogott curses
[02:11:00] I have one called sm-puppetmaster-trusty2
[02:11:11] it's badly named because it started as a puppetmaster and sort of... grew
[02:11:18] There's no end to the number of ways that users can avoid my benevolence :)
[02:11:31] but it's working just fine
[02:12:01] it has a saltmaster for the purposes of trebuchet, and a puppetmaster until one of my puppet changes (splitting trebuchet from the normal deployment server manifest) gets merged
[02:12:18] So the vulnerable set is: Instances that are puppetmasters and which are also not managed by the central saltmaster
[02:12:27] and also managed to time things badly
[02:12:30] is that a big set?
[02:12:42] I don't know the contents of that set
[02:12:54] other than deployment-puppetmaster which you already fixed
[02:15:19] I'm guessing that the answer is '0'
[02:15:27] but I also sent a list to labs-l inviting complaints
[02:15:29] https://wikitech.wikimedia.org/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=salt_master&ns666=1&searchToken=f4wgovzk2a78bw4c73a834ppe
[02:16:42] So other than the two already discussed here: integration-saltmaster, st-puppetmaster, stashbot-deploy, and puppetmaster
[02:17:04] (puppetmaster.logstash...)
[02:17:31] My root key doesn't work on puppetmaster.logstash
[02:19:20] bd808 should know about that one
[02:19:21] it doesn't exist
[02:19:32] there are no instances in the project 'logstash'
[02:19:42] neither should stashbot-deploy
[02:19:44] it's gone
[02:19:55] puppetmaster is probably also gone, since it was in etcd
[02:20:01] that project has no instances?
[02:20:50] logstash looks empty to me
[02:21:01] project logstash, I mean
[02:21:26] st-puppetmaster does not exist
[02:21:55] oh were you using wikitech and stuff
[02:21:57] that always lies
[02:21:59] so that leaves us with integration-saltmaster
[02:21:59] liiieess
[02:22:03] use http://tools.wmflabs.org/watroles/role/role::puppet::self
[02:22:10] but with whatever role you were looking for instead
[02:23:00] http://tools.wmflabs.org/watroles/role/role::salt::masters::labs::project_master
[02:25:06] some people are still doing it with puppetVar
[02:25:10] instead of hiera
[04:06:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:42:09] Labs, Tool-Labs, Community-Tech-Tool-Labs, Diffusion, User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291753 (bd808) I've been doing some testing on https://phab-03.wmflabs.org/. I can create repos via the AP...
[04:49:50] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:34:59] Labs, Tool-Labs, Community-Tech-Tool-Labs, Diffusion, User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291759 (bd808) There doesn't seem to be a conduit api for creating/editing/searching policies. We will nee...
[07:24:13] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291840 (10mmodell) @bd808: I can deploy newer upstream code for you. With regard to policies, I've had cons...
[07:43:08] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:13:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:14:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:19:53] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292072 (10hashar)
[10:22:13] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292089 (10hashar) At least it is reachable via salt though it is duplicated: ``` salt -v 'integration-slave-trusty-1004*...
[10:26:40] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292106 (10hashar) ``` $ id jenkins-deploy id: jenkins-deploy: no such user ``` And from syslog: ``` May 13 10:25:01...
[10:35:02] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292119 (10hashar) salt -v 'integration-slave-trusty-1004*' cmd.run 'sed -i -e "s/undef/ldap-labs.eqiad.wikimedia.org/g"...
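The fix hashar pushed above rewrites the literal placeholder `undef` in the LDAP client config into a real server name, so `ldap://undef:389` becomes a resolvable URI. The substitution itself can be demonstrated locally on a sample line (the real config file path is truncated in the log, so an inline string stands in for it here):

```shell
# Demonstrate the sed substitution hashar pushed fleet-wide via
# `salt ... cmd.run`, applied to a sample line rather than the real
# (truncated) config file path.
result=$(echo 'uri ldap://undef:389 ldap://undef:389' \
  | sed -e 's/undef/ldap-labs.eqiad.wikimedia.org/g')
echo "$result"
# → uri ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.eqiad.wikimedia.org:389
```

On the broken instances this is what turned "unable to look up current user in the passwd file" back into working LDAP lookups, once nslcd/nscd picked up the corrected URI.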
[10:38:13] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292125 (10hashar) Ran the exact same thing for the integration-puppetmaster.
[10:44:19] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292132 (10hashar) 05Open>03Resolved a:03hashar Puppet agent managed to run on the integration-puppetmaster and it...
[11:59:41] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:47:43] more excitement after I left huh, so anything self puppetmaster had extra work
[12:48:25] yeah
[12:48:29] all sorted I think tho
[12:49:26] chasemp: k8s deploy of the registry enforcer stuff went well as well, and I added a section on images to https://wikitech.wikimedia.org/wiki/Tools_Kubernetes
[13:36:20] 06Labs, 10Tool-Labs: jsub's -once should clear jobs in E state and run things - https://phabricator.wikimedia.org/T135229#2292452 (10yuvipanda)
[13:37:08] 06Labs, 10Tool-Labs: jsub's -once should clear jobs in E state and run things - https://phabricator.wikimedia.org/T135229#2292452 (10chasemp) Example of cron stuck: 6269345 0.30000 cron-tools tools.toolsc Eqw 05/13/2016 00:20:06 1 .error keeps saying: [Fri May 13 12:50:...
[14:15:30] andrewbogott or bd808 maybe you can take a look at https://phabricator.wikimedia.org/T135195 later?
Thanks :)
[14:23:41] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292672 (10Dzahn)
[14:36:13] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer
[14:45:45] (03PS1) 10Andrew Bogott: Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615
[14:46:35] (03CR) 10Andrew Bogott: [C: 032 V: 032] Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott)
[14:47:47] (03CR) 10Rush: "thank you for continuing the lt- convention :)" [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott)
[15:02:07] does anyone have an idea how long it will take to fix https://phabricator.wikimedia.org/T130529 ? (just curious, otherwise I have to create a work-around in the meantime)
[15:03:22] Steinsplitter: I'm pretty sure it is going to be resolved as declined with 'use a virtualenv'
[15:03:44] actually, unless there's a python package for it already
[15:03:46] let me check
[15:03:57] there is
[15:04:10] I'll add it now then, Steinsplitter
[15:04:20] thx :) <3
[15:05:44] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (10debt) Hi! Checking on the progress of this ticket... We're waiting for this to be looked at / fixed before we can update the wikipedia.org portal st...
[15:06:13] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292790 (10debt)
[15:06:59] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tools bastions are often unreliable - https://phabricator.wikimedia.org/T131541#2292792 (10chasemp)
[15:10:35] Steinsplitter: merged.
should be installed in ~20mins
[15:11:13] thx 😃)
[15:12:40] np :)
[15:13:56] !log codereview Created project (T135195)
[15:13:56] T135195: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195
[15:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview/SAL, Master
[15:14:42] 06Labs, 07Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2292805 (10bd808)
[15:14:44] 06Labs, 15User-bd808: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2292802 (10bd808) 05Open>03Resolved a:03bd808 Changed the name to "codereview": https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview
[15:50:39] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2292901 (10bd808) >>! In T133252#2291840, @mmodell wrote: > @bd808: I can deploy newer upstream code for you....
[15:50:50] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292902 (10Krenair) >>! In T135029#2292786, @debt wrote: > Checking on the progress of this ticket... task status is stalled, we're waiting for ops
[15:57:18] (03CR) 10Krinkle: "Fixes T104917" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/287869 (owner: 10BryanDavis)
[15:57:45] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 15User-bd808: tools.wmflabs.org landing page should not dump all tool accounts - https://phabricator.wikimedia.org/T104917#2292926 (10Krinkle) 05Open>03Resolved a:03bd808 Fixed in 058bc2b9ae07edbcffb714a4aa24f6ab1ef23919.
[16:30:23] 06Labs, 10Tool-Labs, 06Operations: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2292997 (10chasemp) >>! In T134798#2283060, @yuvipanda wrote: > We only need toolserver.org and www.toolserver.org I think. >>!
In T134798#2283509, @Dzahn wrote: > So an existing...
[16:35:47] 06Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2285835 (10chasemp) We do not have this data from 9 months ago. AFAIK there is no ability to restore.
[16:43:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:46:49] 06Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2293049 (10Mjbmr) Alright then, Thanks.
[17:00:38] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2293077 (10mmodell)
[17:01:00] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2226182 (10mmodell) > Should we fork a subtask to track working on something like that? done.
[17:05:50] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2293099 (10bd808)
[19:34:04] Hi. I'm trying to log in to the beta cluster for the first time from a new computer
[19:34:18] milimetric provided a config example that I duplicated: https://www.irccloud.com/pastebin/6LBanNMq/
[19:35:03] but the connection is closed by the server, presumably because of an auth issue
[19:35:15] the public key has been added to https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal
[19:36:16] The log from the connection is at http://pastebin.com/i81wC2DV
[19:36:49] does anyone have any pointers?
[19:39:22] strainu: if that is a literal copy of your ssh config
[19:39:27] strainu, are you using the correct username?
[19:39:28] it still has milimetric's name in it
[19:39:41] also, let us know if you can ssh to bastion-eqiad.wmflabs.org directly
[19:40:16] chasemp: tom29739 , I'm using the correct username; that's the pastebin from milimetric
[19:41:02] strainu, looking through that connection log, I see this: 'debug3: Could not load "/home/andrei/.ssh/wmf_rsa" as a RSA1 public key'
[19:41:26] Is that key there, and not corrupt and everything?
[19:42:30] yeah, actually it appears that message is ok: http://stackoverflow.com/questions/12449626/trying-to-use-rsa-keys-to-ssh-into-ec2-getting-incorrect-rsa1-identifier-pe/15563793#15563793
[19:44:06] strainu: first step is to try to get on the bastion itself
[19:44:24] chasemp: I cannot connect to the bastion http://pastebin.com/q0P7DLcD
[19:46:09] Failed publickey for strainu
[19:46:55] I pm'd you the pub key in labs
[19:50:09] yep, that's in wmf_rsa.pub
[19:50:09] huh you're not a part of project-bastion
[19:50:16] I thought that happened by default
[19:53:34] I think it's supposed to...
[19:53:37] strainu: ok try now
[19:54:27] I'm in, thanks for the help chasemp and tom29739
[19:54:30] yw
[19:54:36] np.
[19:54:41] andrewbogott: any guesses as to why strainu would not be a member of project-bastion?
[19:54:47] * chasemp off to a meeting tho
[19:55:23] chasemp: nope, unless their account is very old
[19:55:29] it is
[19:55:34] it's from 2012
[19:55:49] ah
[19:55:55] welcome back :)
[20:54:25] 06Labs, 15User-bd808: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2291428 (10Legoktm) Wonderful! {T76245} is related.
[21:08:18] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2294002 (10mmodell) @bd808: I merged `upstream/master` on `phab-03` so that you can make more progress withou...
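For reference, the setup being debugged above — reaching labs instances via the bastion — is usually expressed as an ssh client config like the following. This is a sketch only: the username and key path are placeholders (the key filename matches strainu's `wmf_rsa` from the log), and only the bastion hostname comes from the conversation itself.

```
# Hypothetical ~/.ssh/config sketch; substitute your own shell username.
Host bastion-eqiad.wmflabs.org
    User your-shell-username
    IdentityFile ~/.ssh/wmf_rsa

Host *.eqiad.wmflabs
    User your-shell-username
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -W %h:%p bastion-eqiad.wmflabs.org
```

As the log shows, the config alone is not enough: the public key must be registered on wikitech and the account must be a member of project-bastion before the bastion will accept the connection.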
[21:38:34] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2294108 (10Dzahn) @volans could you take a look maybe?
[21:41:18] bd808: thanks for the project creation :)
[21:42:22] !log toolserver-legacy added Dzahn as project member + admin
[21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolserver-legacy/SAL, Master
[21:43:12] 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294113 (10yuvipanda) a:05yuvipanda>03Dzahn nope, Dzahn is awesomer :)
[22:23:23] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Install python-requests-oauthlib on labs - https://phabricator.wikimedia.org/T130529#2294263 (10Billinghurst) many thanks @yuvipanda
[22:37:33] !log codereview Setup instance cr1 to start with setting up at wiki cluster, see T135288
[22:37:33] T135288: Create one testinstance at the codereview-cluster - https://phabricator.wikimedia.org/T135288
[22:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview/SAL, Master
[23:32:53] 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294401 (10Dzahn) which instance/host is the one that actually runs it?