[00:00:01] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:00:08] It's this: URI ldap://undef:389 ldap://undef:389
[00:00:10] Labs: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2291428 (Luke081515)
[00:00:10] prod upgraded exim and changed the Puppet config
[00:00:18] fun
[00:00:28] I should be able to fix the URI thing pretty quickly
[00:00:45] you "just" need to `apt-get install exim` I think to fix
[00:00:58] or maybe exim4?
[00:01:41] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:01:52] andrewbogott: ah that would do it
[00:03:13] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:04:53] PROBLEM - Puppet run on tools-worker-1007 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [0.0]
[00:04:57] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:05:50] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2291489 (Luke081515)
[00:05:52] Labs: Create labs project arcanist - https://phabricator.wikimedia.org/T130507#2291487 (Luke081515) Open>declined Closing this as declined, because we don't have an "obsolete" status. With the arcanist installer for windows, which works, we don't need this solution.
[00:07:09] bd808: do you happen to know if the package name is exim4-base on all distros or if it's different on jessie vs. trusty?
[00:07:22] (that's for sure the name on jessie)
[00:07:30] exim4 on trusty I think
[00:07:34] but...maybe I'm wrong
[00:07:36] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:08:02] chasemp: this should fix logins: https://gerrit.wikimedia.org/r/#/c/288555/1
[00:08:44] give it a whirl
[00:08:49] I'm having no luck w/ exim over salt
[00:09:22] 'no luck'?
[00:09:34] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:09:39] not sure what package name is right, but it won't upgrade w/o sudo it seems
[00:09:42] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[00:09:43] and sudo fails w/o ldap
[00:10:06] doesn't salt already run as root?
[00:10:22] Well, anyway, ldap should be recovering now as puppet runs
[00:10:26] so I thought could also be puppet stepping on me
[00:10:27] idk
[00:10:29] E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
[00:10:30] E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
[00:10:35] is normal need sudo type stuff
[00:10:41] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[00:10:41] for apt
[00:10:49] hm, that sounds more like an apt-get or a puppet run is in progress
[00:11:04] could be but it's consistent and has lasted awhile
[00:12:28] exim4-base is in jessie and trusty too
[00:12:33] chasemp: my canaries are better now, at least
[00:12:57] maybe the dependencies changed but the package name is there in both..
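The per-instance fix being discussed above can be sketched roughly as follows. This is a hedged reconstruction, not the exact command run that night: it assumes `exim4-base` is the right package on both jessie and trusty (as concluded in the log) and wraps the steps in a function with a `--dry-run` mode, since the real thing needs root and a free dpkg lock.

```shell
# Hypothetical sketch of the per-instance exim fix discussed above.
# Assumes a Debian-family host; the package name (per the log) is
# exim4-base on both jessie and trusty.
PKG="exim4-base"

fix_exim() {
    # Dry-run mode: just print what would be executed.
    if [ "$1" = "--dry-run" ]; then
        echo "apt-get install -y $PKG"
        return 0
    fi
    # Wait out any concurrent apt/dpkg run (the lock errors quoted above).
    while fuser /var/lib/dpkg/lock >/dev/null 2>&1; do
        sleep 5
    done
    apt-get update && apt-get install -y "$PKG"
}

fix_exim --dry-run
```

The dry-run path makes the sketch safe to paste on a workstation; on an actual broken instance you would run it as root without the flag.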
[00:13:04] I think we should just wait it out, trying to force puppet runs will probably just overwhelm the puppetmaster
[00:13:58] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[00:14:07] so checks are failing as expected
[00:14:07] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=tools
[00:14:10] why no notice in -ops?
[00:14:24] kaldari, tom29739, the issue is fixed in puppet now, your access should go back to normal in 15 minutes max
[00:14:36] thanks!
[00:14:40] I can push things along if you have a particular instance you covet
[00:14:49] bd808: That command worked, I'll add that to the wiki page for others, thank you for your time <3
[00:15:22] chasemp: maybe because icinga-wm is a zombie, there were 2 processes for some reason and i killed one
[00:15:32] CZauX: awesome! Have fun testing
[00:15:40] mutante: I see it now actually
[00:15:55] icinga-wm_PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 0.043 second response time
[00:16:07] ok, good. i was about to say usually it's stable-ish
[00:16:08] i.e. grid is down
[00:16:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[00:17:37] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:18:34] chasemp: if you don't object, i'm going to salt a puppet run on tools-*
[00:18:43] give it a try
[00:18:52] I was just going to spot check a few
[00:19:30] recovery on the icinga checks
[00:19:52] except the ldap one ironically
[00:23:22] so was the exim thing a race condition?
[00:23:30] because places I have not explicitly upgraded now seem fine
[00:23:43] seems like it failed w/ one puppet run and then corrected itself next round or something
[00:23:52] or did someone fix it in a way I missed
[00:24:04] I fixed it by hand on two or three instances
[00:24:08] but that's it
[00:24:13] I'd expect it to still be failing here and there
[00:24:32] The exim thing has been happening for ~ a week
[00:24:40] so I wouldn't expect anything new on that front
[00:24:42] I didn't realize
[00:24:45] apart from whatever you're doing with salt or clush
[00:25:19] well this is strong fodder for those tools checks to page I think
[00:25:26] apt-get install exim4-base on * via salt sounds like it would have fixed it.. besides the possible overload
[00:25:28] we would probably have known right away this was going down
[00:25:31] yeah
[00:25:49] 18:46 was broken
[00:26:00] 19:19 was fixed
[00:26:05] according to icinga
[00:26:22] I'll put a few of those in paging mode tomorrow
[00:26:30] yeah, seems worthwhile
[00:26:43] Sorry if my bug made you late for dinner :(
[00:27:11] I was out searching for a lost cat for awhile and came back to everything broken and launching my finally done i/o profile on labstore1005
[00:27:16] so it was a happy coincidence
[00:27:20] I was already around
[00:27:53] find the cat?
[00:28:13] yes, the cat had been closed up in a dresser drawer the whole time
[00:28:21] not a joke
[00:28:23] Oh, sad cat :(
[00:28:52] woah what happened now
[00:29:01] * YuviPanda reads backscroll
[00:29:15] YuviPanda: I just broke ldap with a puppet refactor
[00:29:20] :) glad to hear the cat is back. are you putting GPS on it now :)
[00:29:39] we need to but no :)
[00:30:22] ok :D
[00:30:27] got some gridengine failure notices
[00:30:31] so I'm going to investigate that too
[00:31:00] chasemp: Are things looking fixed enough for me to go eat dinner?
[00:31:23] I think yes -- YuviPanda if you happen to see more fallout give a call please -- I'm off to deal w/ kids and bedtimes
[00:31:37] chasemp: yup, I'll call if needed
[00:31:57] sudo works ok
[00:32:01] am looking at the grid queues now
[00:32:24] andrewbogott: can confirm the gridengine failures are also because of LDAP
[00:32:33] it put some queues in error state, fixing now
[00:32:37] 'stress test' :)
[00:33:03] (I was halfway into debugging DNS when I realized it couldn't be that -- true story)
[00:33:05] ok later
[00:34:01] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:34:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:34:53] * andrewbogott checks on ^
[00:37:03] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:37:15] well, those hosts are totally fine and I don't know why they are complaining
[00:37:35] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[00:37:35] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[00:38:09] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[00:40:26] !log tools cleared all queues that were in error state
[00:40:31] andrewbogott: the alerts are lagged
[00:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[00:40:35] andrewbogott: because graphite etc
[00:40:42] andrewbogott: I think the fallout has been handled now
[00:40:53] ok, thanks for cleaning up
[00:40:58] I tested that patch so many times on the puppet compiler
[00:41:04] but of course didn't test it on an actual labs host :(
[00:41:27] * andrewbogott -> eats
[00:42:41] :D
[00:46:28] So we now know the consequences of an ldap outage I guess?
[00:47:12] uri ldap://undef:389 ldap://undef:389
[00:47:14] ahh
[00:47:35] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:53:37] YuviPanda: any idea if our hiera setup will support my making a labs.yaml for instances in .codfw.labtest?
[00:53:57] andrewbogott: let me look
[00:54:22] andrewbogott: we already seem to have one?
[00:54:26] andrewbogott: ./modules/puppetmaster/files/labtest.hiera.yaml
[00:56:12] YuviPanda: ok, I'm trying to understand that...
[00:56:24] "%{::site}/labtest-instances"
[00:56:35] seems like that will let me specify hiera for specific instances but not in general, right?
[00:56:59] isn't labtest its own $::realm?
[00:57:13] hmm
[00:57:20] labtest hosts are their own realm
[00:57:24] but I don't think labs instances are
[00:57:33] *labtest instances
[00:57:39] I think %{::site} will expand to codfw
[00:57:44] so codfw/labtest-instances
[00:57:45] ah, right, they'll just have realm labs
[00:57:53] should be equivalent of labs.yaml for them all I think
[00:57:55] so for labtest, realm is labs and site is codfw
[00:58:02] whereas for labs, realm is labs but site is eqiad
[00:58:08] Oh, you're totally right
[00:58:14] I just need to twiddle hieradata/codfw/labtest-instances.yaml
[00:58:15] I bet
[00:58:43] what did you do to try to fix the ldap issue andrewbogott?
[00:58:47] everything is still very broken
[00:58:50] Krenair: where?
[00:58:53] Krenair: in deployment-prep?
[00:59:02] yes, and I've got at least one report from another project
[00:59:03] Krenair: just this: https://gerrit.wikimedia.org/r/#/c/288555/
[00:59:07] Krenair: puppet should run, I guess
[00:59:15] Puppet is failing because of ldap
[00:59:31] maybe the masters can't update themselves?
[00:59:35] ah, did this just completely fuck over self hosted puppetmasters then?
[00:59:45] is wikidata-query using a self hosted puppetmaster?
[01:00:16] yeah, can confirm, k8s puppet hosts are all failing
[01:00:21] the commit is on deployment-puppetmaster, but puppet does this:
[01:00:22] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:00:22] Warning: Error 400 on SERVER: LDAP Search failed
[01:00:27] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node deployment-puppetmaster.deployment-prep.eqiad.wmflabs: LDAP Search failed
[01:00:27] Warning: Not using cache on failed catalog
[01:00:27] Error: Could not retrieve catalog; skipping run
[01:00:31] login is broken, sudo is broken
[01:00:37] Krenair: yeah, you've to manually fix the ldap config on puppetmaster
[01:00:43] doing
[01:00:51] andrewbogott: ^ all self hosted puppetmasters might be broken now
[01:01:54] Krenair: ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:02:01] is the correct one, if you needed
[01:02:03] I did /etc/ldap.conf and /etc/ldap.yaml
[01:02:05] Hm...
[01:02:08] it also needs passwords YuviPanda
[01:02:09] why do they fail to update themselves?
[01:02:14] But it still fails
[01:02:40] that doesn't seem to fix it
[01:02:44] andrewbogott: they can't run puppet
[01:02:48] what host are you working on?
[01:02:49] andrewbogott: since puppet depends on LDAP
[01:02:59] andrewbogott: am working on tools-puppetmaster-01, which serves all the k8s hosts
[01:03:09] deployment-puppetmaster
[01:03:11] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:03:13] Warning: Error 400 on SERVER: LDAP Search failed
[01:03:26] wdqs-puppetmaster is also probably in the same mess, based on what SMalyshev says in -releng
[01:03:30] YuviPanda: mind if I tinker for a moment?
[01:03:46] Probably you need to update settings in /etc/puppet/puppet.conf
[01:04:10] yep, that's broken as well
[01:04:22] andrewbogott: ah, I see.
[01:04:28] andrewbogott: yes, please go ahead.
[01:04:48] what's the correct ldapserver syntax in there?
[01:05:00] I tried this:
[01:05:01] ldappassword = Eche0ieng8UaNoo
[01:05:01] ldapserver = ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:05:02] no luck
[01:05:17] (public password, before anyone panics, although you should all recognise it at this point)
[01:05:41] I pronounce it as 'Echoing Eightuano!'
[01:05:44] in puppetmaster:
[01:06:05] ldapserver = ldap-labs.eqiad.wikimedia.org
[01:06:13] um, that's in /etc/puppet/puppet.conf
[01:06:16] and in /etc/ldap.conf
[01:06:41] uri ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389
[01:06:54] yeah I know the server is in both places, I needed the syntax
[01:06:59] did that work?
[01:07:01] yes
[01:07:04] still broken. maybe a puppetmaster restart needed?
[01:07:06] tools-puppetmaster-01
[01:07:11] what should I do next?
[01:07:18] yes, you have to restart puppetmaster after changing puppet.conf
[01:08:01] It works
[01:08:03] ok, I'll reboot the puppetmaster and see if that allows me to log in
[01:08:12] It won't SMalyshev
[01:08:17] SMalyshev: wait, what host are you working on?
[01:08:22] So update /etc/puppet/puppet.conf and service puppetmaster restart
[01:08:29] he'll be on wdqs-puppetmaster
[01:08:36] But he can't log in
[01:08:41] And he didn't have a root key set up
[01:08:46] So he's blocked on ops roots
[01:08:46] yep, looking
[01:08:48] andrewbogott: wdqs-puppetmaster is the puppetmaster for wikidata-query project
[01:09:14] yeah I rarely really need root and usually never ssh as root
[01:09:37] I usually never ssh as root, but it's required in labs when instance login/sudo breaks
[01:09:45] right :)
[01:09:49] as is the case here
[01:09:50] SMalyshev: you should be back in business in a minute or two
[01:09:58] andrewbogott: ok, cool!
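The manual recovery just walked through (fix the bogus `undef` LDAP server in the config, then restart the puppetmaster) can be sketched as below. This is a hedged reconstruction demonstrated on a throwaway temp file, not run against a real `/etc/puppet/puppet.conf`; the `sed` substitution and the `%` delimiter match the command used later in the log.

```shell
# Hedged sketch of the self-hosted puppetmaster LDAP repair discussed above.
# Demonstrated on a temp copy so it is safe to run anywhere.
LDAP_SERVER="ldap-labs.eqiad.wikimedia.org"

conf=$(mktemp)
cat > "$conf" <<'EOF'
[master]
ldapserver = undef
EOF

# Same substitution used interactively in the log, with % as the s/// delimiter
# so the hostname's dots need no escaping.
sed -i "s%undef%${LDAP_SERVER}%g" "$conf"
grep ldapserver "$conf"

# On an actual puppetmaster you would then restart the service, since
# puppet.conf is only read at startup:
#   service puppetmaster restart
rm -f "$conf"
```

On clients, the same substitution was later broadcast over salt; the restart step only applies to the master itself.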
[01:10:11] so there's this in Hiera:Deployment-prep on wikitech:
[01:10:13] "passwords::root::extra_keys":
[01:10:13] alex: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDdYrWghgUYZZsdh8tw4mRfpsNWCW56ns6tKei3oSHEUNOXnxhAQwnwwkc1LeGYDdPglX3EAbtj0LAUf1muaC99mFYXw+e+7XjZCJYZ5cRrkvDXA/JvxaBvhOid8BZGImeoSqX2ZZ0HQSNRQfzVbs+7yAu0nMB3WSup9eoMnC3OfBsbPvQZ/WDD/OKIwuv8hnYyi8QmQgPngERqIf12ireGIXNn++IMncqu9Z6+skQHiuvkMsWlPjeYtg8gfIS8h8JY1QApVUyqkHuE4Fr/1n/NagsDu1WvE4zuEzQ3TmXZOtikPqSfsLDD5+gqxcaH1iiM+nEeG/RWD9/WyHnVUsIH krenair@gmail.com ubuntu laptop
[01:11:48] * andrewbogott does sudo salt '*' cmd.run 'grep undef /etc/puppet/puppet.conf'
[01:12:13] Didn't fix deployment-salt andrewbogott
[01:12:22] where do puppet clients get their ldap config?
[01:13:24] Krenair: I'll look, hang on
[01:14:39] Krenair: I think you're looking on the wrong host, the puppetmaster is deployment-puppetmaster.deployment-prep.eqiad.wmflabs isn't it?
[01:15:18] andrewbogott: so puppetmaster login now works, but wdq-beta still broken
[01:15:29] Yes it is
[01:15:40] SMalyshev: wdq-beta will need to run puppet to update
[01:15:43] now that the master is updasted
[01:15:44] dated
[01:15:53] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: stack level too deep
[01:15:55] Which has working puppet itself now
[01:15:59] ... ah, no, it broke again
[01:16:16] In fact the first time I ran puppet it broke itself
[01:16:51] in /etc/nslcd.conf, /etc/ldap.yaml, /etc/ldap.conf
[01:17:16] Krenair: I'm on deployment-puppetmaster, should be fixed in a moment.
[01:18:02] andrewbogott: ok, seems to work now
[01:18:23] still unhappy about some exim thing, but that happened before
[01:18:41] and I don't care about exim anyway
[01:18:58] You mean the normal labs exim puppet error?
[01:19:08] Error: Could not start Service[exim4]: Execution of '/etc/init.d/exim4 start' returned 1:
[01:19:08] Error: /Stage[main]/Exim4/Service[exim4]/ensure: change from stopped to running failed: Could not start Service[exim4]: Execution of '/etc/init.d/exim4 start' returned 1:
[01:19:12] Krenair: what do you think about
[01:19:14] sudo salt '*' cmd.run 'sed -i %undef%/etc/puppet/puppet.conf%g /etc/puppet/puppet.conf'
[01:19:15] dunno if it's normal :)
[01:19:24] oops, typo, hang on
[01:19:43] SMalyshev, all jessie instances have the issue, IIRC the fix is to upgrade the exim4 package
[01:20:02] apt-get install exim4-base
[01:20:05] Krenair: I mean, of course
[01:20:10] sudo salt '*' cmd.run 'sed -i %undef%ldap-labs.eqiad.wikimedia.org%g' /etc/puppet/puppet.conf
[01:20:11] Krenair: ok, will try that, though as I said I'm not too worried about it. the login seems to be fixed
[01:20:12] I'll try it on a test case
[01:20:36] Will it work without fixing the password andrewbogott ?
[01:20:49] SMalyshev: install "exim4-base" package and run puppet again
[01:20:59] mutante: yeah will try that
[01:21:09] andrewbogott, also that can't be right
[01:21:19] your command ends before the filename
[01:21:26] Krenair: I haven't seen any examples of the password being changed
[01:21:34] they all get changed to undef andrewbogott
[01:21:43] mutante: nope, puppet still hates exim.
[01:21:56] # The DN to bind with for normal lookups.
[01:21:56] binddn cn=proxyagent,ou=profile,dc=wikimedia,dc=org
[01:21:57] bindpw undef
[01:22:09] same in the ldap and puppet configs
[01:22:14] andrewbogott: hmm, tools-puppetmaster-01 is still failing btw
[01:22:21] Warning: Unable to fetch my node definition, but the agent run will continue:
[01:22:23] Warning: Error 400 on SERVER: Could not retrieve facts for tools-puppetmaster-01.tools.eqiad.wmflabs: Could not autoload puppet/indirector/facts/active_record: cannot load such file -- active_record/deprecated_finders
[01:22:24] SMalyshev: hmm, so i opened this ticket about it https://phabricator.wikimedia.org/T135033
[01:22:57] andrewbogott: hmm, I restarted puppetmaster again, maybe that'll help
[01:23:04] looking better now
[01:23:06] so far
[01:23:08] apparently a new image is in the works but blocked by another ticket
[01:23:22] but not sure which one that is
[01:23:34] mutante: thanks...
[01:23:42] Krenair: the instance I'm looking at is test-prometheus5
[01:23:53] in its puppet.conf it had an 'undef' ldap servername
[01:23:56] but a valid password
[01:23:59] and my salt command worked fine
[01:25:08] mutante, https://phabricator.wikimedia.org/T133551
[01:25:54] the current blocker for the new image is https://phabricator.wikimedia.org/T134944
[01:26:04] but anyway that won't make much of a difference
[01:26:37] why not
[01:26:37] <andrewbogott> sudo salt '*' cmd.run 'sed -i %undef%ldap-labs.eqiad.wikimedia.org%g' /etc/puppet/puppet.conf
[01:26:53] the filename is after the command you run on each instance
[01:27:01] Labs: confirm that new base labs base image is adequate for kubernetes &c. - https://phabricator.wikimedia.org/T134944#2291610 (Dzahn)
[01:27:08] the filename should be part of the command
[01:27:22] yes, sorry, I had to do some quote-mark tidying
[01:27:23] it is now sudo salt "*" cmd.run 'sed -i "s%undef%ldap-labs.eqiad.wikimedia.org%g" /etc/puppet/puppet.conf'
[01:27:43] which I'm running right now :)
[01:28:02] instances seem to be being unbroken
[01:28:08] Labs, Mail: failed exim service on labs instances - https://phabricator.wikimedia.org/T135033#2286378 (Dzahn) 18:31 < andrewbogott> the current blocker for the new image is https://phabricator.wikimedia.org/T134944 18:31 < andrewbogott> but anyway that won't make much of a difference
[01:28:20] I checked deployment-mediawiki01 though
[01:28:32] /etc/ldap.yaml
[01:28:33] blank password
[01:28:43] /etc/ldap.conf - bindpw undef
[01:28:52] /etc/ldap/ldap.conf - BINDPW undef
[01:28:58] * andrewbogott nods
[01:29:09] but if puppet has the password it can populate all of those
[01:29:40] It should but it's not
[01:29:57] ok...
[01:29:59] what host is this?
[01:30:00] The puppet diffs aren't showing it being changed
[01:30:07] sorry, you just said
[01:30:08] login still broken
[01:30:09] I'll look
[01:30:10] deployment-mediawiki01
[01:30:24] in fact puppet includes errors like these:
[01:30:24] Error: Could not find group wikidev
[01:30:24] Error: /Stage[main]/Mediawiki/File[/var/log/mediawiki]/group: change from 500 to wikidev failed: Could not find group wikidev
[01:30:25] Error: Could not find group ops
[01:32:21] root@deployment-mediawiki01:~# ldaplist -l group ops
[01:32:21] Password incorrect.
[01:33:05] however I can do ldapsearch -x dc:dn:=deployment-mediawiki01.deployment-prep.eqiad.wmflabs
[01:33:54] so… who updates /var/lib/git/operations/puppet on a self-hosted master?
[01:33:59] autoupdater
[01:34:00] Is it a cron, or is it puppet itself?
[01:34:47] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type motd::script at /etc/puppet/modules/base/manifests/puppet.pp:106 on node deployment-puppetmaster.deployment-prep.eqiad.wmflabs
[01:34:50] what the heck is that?
[01:35:14] It's a very good question, one I have a ticket open for. You're running via salt?
[01:35:18] …and it's gone
[01:35:22] Yep.
[01:35:35] ok, will ignore for now :)
[01:35:41] Salt?
[01:35:47] nope
[01:35:49] Huh.
[01:35:51] I'm logged in
[01:36:01] So, what is 'autoupdater' in this context?
[01:36:09] trying to find it
[01:36:37] login on deployment-puppetmaster is still broken
[01:36:42] modules/beta/manifests/autoupdater.pp:# == Class: beta::autoupdater
[01:36:52] # Puppet Name: rebase_operations_puppet
[01:36:52] */10 * * * * /usr/local/bin/git-sync-upstream >>/var/log/git-sync-upstream.log 2>&1
[01:37:14] modules/puppetmaster/manifests/gitsync.pp
[01:39:07] andrewbogott, ^
[01:41:36] ok, so it should be updated per puppet run
[01:42:40] looks more like every ten minutes by cron
[01:42:51] the cron is puppetised at modules/puppetmaster/manifests/gitsync.pp
[01:42:56] but it is still a cron
[01:43:32] is there a way I could mount /public/dumps on my labs instance?
[01:43:47] I can see it when i login to tools but not on labs instance...
[01:44:40] Krenair: /var/lib/git/labs/private had local changes that prevented rebasing
[01:44:47] so that's bad, I don't know yet if it was the cause
[01:45:17] you needed a labs/private update?
[01:45:32] who knows? This was very out of date
[01:45:55] SMalyshev, https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps
[01:46:05] SMalyshev, "You can request them by filing a task on Phabricator "
[01:47:02] andrewbogott, root@deployment-puppetmaster:/var/lib/git/operations/puppet# git log --oneline -n 20 | grep ldap
[01:47:03] 18626de Define labsldapconfig for labs instances
[01:48:44] SMalyshev: assign it to me and I'll take care of it?
[01:49:14] created T135205
[01:49:14] T135205: Access to /public/dumps for wikidata-query project - https://phabricator.wikimedia.org/T135205
[01:49:25] YuviPanda: thanks!
[01:49:40] Krenair: that was it, I think
[01:50:07] yeah, all good now
[01:50:48] SMalyshev: should work on next puppet run
[01:51:02] YuviPanda: coo, thanks!
[01:51:05] np
[01:54:30] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Write a k8s admission controller to enforce that all containers running come from our private repository - https://phabricator.wikimedia.org/T133515#2291665 (yuvipanda) When attempting to use gcr.io containers... ``` root@tools-k8s-master-01:/h...
[01:54:35] So… apart from perverse cases like the above, I think this is resolved everywhere
[01:54:38] andrewbogott, looks better...
[01:55:12] root@deployment-salt:~ # grep undef /etc/nslcd.conf
[01:55:12] bindpw undef
[01:56:25] puppet run fixed that
[01:56:46] Yeah, I'm pretty sure things are all catching up
[01:56:50] somehow you didn't catch deployment-ms-be02, andrewbogott
[01:56:52] except I will force yet another puppetmaster restart in 15-20
[01:56:54] did you use salt that time? :)
[01:57:00] Krenair: is that a puppetmaster?
[01:57:02] no
[01:57:10] So it'll just have to catch up on its own time
[01:57:16] oh I see
[01:57:58] I don't think it makes sense to force puppet runs
[01:57:58] so it's ldap config fixes on the puppetmasters + up to date labs/private + puppet runs on puppet clients?
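The stale /var/lib/git/labs/private checkout above (local changes blocking the git-sync-upstream rebase) suggests a quick health check for a self-hosted puppetmaster. The sketch below is hypothetical, not part of the actual tooling: it just flags a checkout whose working tree is not clean, demonstrated on a throwaway repo rather than the real paths.

```shell
# Hedged sketch: detect a checkout that the rebase cron can no longer
# fast-forward because of local changes (the labs/private situation above).
# check_repo and the demo repo are illustrative, not real infrastructure.
check_repo() {
    repo="$1"
    if [ -n "$(git -C "$repo" status --porcelain)" ]; then
        echo "DIRTY: $repo has local changes; rebase may be stuck"
        return 1
    fi
    echo "CLEAN: $repo"
}

# Throwaway demo repo with an uncommitted (untracked) file:
demo=$(mktemp -d)
git -C "$demo" init -q
touch "$demo/localhack"
check_repo "$demo" || true
rm -rf "$demo"
```

Run periodically (or from the sync cron itself), a check like this could feed the graphite/angry-email alerting discussed below instead of silently giving up.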
[01:58:03] since the puppetmaster is already running as fast as it can
[01:58:22] yeah, that sounds right
[01:59:56] Labs, Labs-Kubernetes, Tool-Labs: Goal: Allow using k8s instead of GridEngine as a backend for webservices (Tracking) - https://phabricator.wikimedia.org/T129309#2291678 (yuvipanda)
[01:59:59] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Write a k8s admission controller to enforce that all containers running come from our private repository - https://phabricator.wikimedia.org/T133515#2291675 (yuvipanda) Open>Resolved Aaaand, works otherwise! \o/ I call this done now.
[02:01:08] Krenair: I'm pretty sure that any instance that is updating puppet without private (or private without puppet) is hanging by a thread
[02:01:31] deployment-prep would have broken even without the mistake that caused the login failures elsewhere
[02:01:47] I hadn't noticed that labs/private was not up to date
[02:02:02] Yeah, I don't know how we would notice things like that, wholesale :(
[02:02:17] the cron that does the rebases
[02:02:32] when rebase is not possible, can't it report a bad number to graphite, else a good one?
[02:02:55] hm, maybe
[02:03:00] it does it already I think? or something like that
[02:03:04] Or send an angry email :)
[02:03:10] we have a cherry pick counter
[02:03:23] angry mail might work better but we can't have it sending every 10 minutes
[02:03:29] Or, honestly, it could just clobber local changes
[02:03:39] every 10 minutes?
[02:03:50] And just display "SORRY NOT SORRY" in red when it does it
[02:03:59] What use is a local puppetmaster if you're going to sabotage it every 10 minutes like that?
[02:04:16] Oh, I was thinking just 'private'
[02:04:19] Think I'll make an ops-puppet local commit to stop that happening
[02:04:29] it used to clobber changes
[02:04:32] it was unpopular
[02:04:37] and it doesn't anymore
[02:04:37] yeah, ok
[02:04:39] shockingly
[02:04:42] well
[02:04:47] More popular: leaving messes for Andrew to clean up :)
[02:04:48] it now autostashes them
[02:04:59] YuviPanda: it does? That wasn't what I saw just now
[02:05:17] andrewbogott: it can easily get into a state where it gives up
[02:05:19] You didn't have to clean that up, other deployment-prep roots could've
[02:05:25] andrewbogott: and that only accounts for dirty states
[02:05:29] Krenair: only for deployment-prep
[02:05:36] And any other project with their own roots
[02:05:42] I didn't because I wasn't aware of it
[02:05:43] Krenair: most other self hosted puppetmasters don't have their own roots setup
[02:06:13] and mostly we can't tell 'too bad you did not do X, now your instance is hosed start from scratch' without at least trying
[02:06:16] mostly because andrewbogott is too nice...
[02:06:17] Anyway, yeah, a nag is better than a clobber
[02:06:19] Well I'll give you that it's more popular to leave messes for labs roots to clean up :p
[02:06:41] …anyway...
[02:06:47] Krenair: it also led to this narrative 'those labs people are breaking things all the time', which was very annoying.
[02:06:52] There aren't any more currently broken things that we know of, right?
[02:07:01] oooh, I've hit grumpy o'clock! I should stop saying things :)
[02:07:26] I hit grumpy o'clock a while ago and should've shut up
[02:07:47] YuviPanda: didn't you just get off of a 16-hour flight and then work for 14 hours?
[02:08:15] andrewbogott: no, I got off a 26h flight then worked for a few hours and then slept and worked again
[02:08:26] ah, so you slept a bit mid-day, that's good
[02:08:26] and omg It's 10PM and I've basically been working since 10AM :|
[02:08:32] andrewbogott: no, I slept at night.
[02:08:35] are all puppetmasters fixed?
[02:08:36] i'm on east coast time now!
[02:08:43] Krenair: everything that I know of is fixed
[02:08:56] how did you list all puppetmasters?
[02:09:21] Krenair: I didn't, i just ran salt "*" with commands that were no-ops on non-puppetmasters
[02:09:31] okay, how did you list all saltmasters?
[02:09:52] Hm...
[02:09:58] are there saltmasters that are their own saltmaster?
[02:10:13] I assumed that most saltmasters in labs were themselves managed by labcontrol1001
[02:10:52] Yes
[02:10:59] * andrewbogott curses
[02:11:00] I have one called sm-puppetmaster-trusty2
[02:11:11] it's badly named because it started as a puppetmaster and sort of... grew
[02:11:18] There's no end to the number of ways that users can avoid my benevolence :)
[02:11:31] but it's working just fine
[02:12:01] it has a saltmaster for the purposes of trebuchet, and a puppetmaster until one of my puppet changes (splitting trebuchet from the normal deployment server manifest) gets merged
[02:12:18] So the vulnerable set is: Instances that are puppetmasters and which are also not managed by the central saltmaster
[02:12:27] and also managed to time things badly
[02:12:30] is that a big set?
[02:12:42] I don't know the contents of that set
[02:12:54] other than deployment-puppetmaster which you already fixed
[02:15:19] I'm guessing that the answer is '0'
[02:15:27] but I also sent a list to labs-l inviting complaints
[02:15:29] https://wikitech.wikimedia.org/w/index.php?title=Special:Search&profile=advanced&profile=advanced&fulltext=Search&search=salt_master&ns666=1&searchToken=f4wgovzk2a78bw4c73a834ppe
[02:16:42] So other than the two already discussed here: integration-saltmaster, st-puppetmaster, stashbot-deploy, and puppetmaster
[02:17:04] (puppetmaster.logstash...)
[02:17:31] My root key doesn't work on puppetmaster.logstash
[02:19:20] bd808 should know about that one
[02:19:21] it doesn't exist
[02:19:32] there are no instances in the project 'logstash'
[02:19:42] neither should stashbot-deploy
[02:19:44] it's gone
[02:19:55] puppetmaster is probably also gone, since it was in etcd
[02:20:01] that project has no instances?
[02:20:50] logstash looks empty to me
[02:21:01] project logstash, I mean
[02:21:26] st-puppetmaster does not exist
[02:21:55] oh were you using wikitech and stuff
[02:21:57] that always lies
[02:21:59] so that leaves us with integration-saltmaster
[02:21:59] liiieess
[02:22:03] use http://tools.wmflabs.org/watroles/role/role::puppet::self
[02:22:10] but with whatever role you were looking for instead
[02:23:00] http://tools.wmflabs.org/watroles/role/role::salt::masters::labs::project_master
[02:25:06] some people are still doing it with puppetVar
[02:25:10] instead of hiera
[04:06:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:42:09] Labs, Tool-Labs, Community-Tech-Tool-Labs, Diffusion, User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291753 (bd808) I've been doing some testing on https://phab-03.wmflabs.org/. I can create repos via the AP...
[04:49:50] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:34:59] Labs, Tool-Labs, Community-Tech-Tool-Labs, Diffusion, User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291759 (bd808) There doesn't seem to be a conduit api for creating/editing/searching policies. We will nee...
[07:24:13] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2291840 (10mmodell) @bd808: I can deploy newer upstream code for you. With regard to policies, I've had cons...
[07:43:08] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:13:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:14:47] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:19:53] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292072 (10hashar)
[10:22:13] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292089 (10hashar) At least it is reachable via salt though it is duplicated: ``` salt -v 'integration-slave-trusty-1004*...
[10:26:40] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292106 (10hashar) ``` $ id jenkins-deploy id: jenkins-deploy: no such user ``` And from syslog: ``` May 13 10:25:01...
[10:35:02] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292119 (10hashar) salt -v 'integration-slave-trusty-1004*' cmd.run 'sed -i -e "s/undef/ldap-labs.eqiad.wikimedia.org/g"...
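The fix hashar pushed above rewrites the literal placeholder `undef` in the LDAP client config into a real server name, so `ldap://undef:389` becomes a resolvable URI. The substitution itself can be demonstrated locally on a sample line (the real config file path is truncated in the log, so an inline string stands in for it here):

```shell
# Demonstrate the sed substitution hashar pushed fleet-wide via
# `salt ... cmd.run`, applied to a sample line rather than the real
# (truncated) config file path.
result=$(echo 'uri ldap://undef:389 ldap://undef:389' \
  | sed -e 's/undef/ldap-labs.eqiad.wikimedia.org/g')
echo "$result"
# → uri ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.eqiad.wikimedia.org:389
```

On the broken instances this is what turned "unable to look up current user in the passwd file" back into working LDAP lookups, once nslcd/nscd picked up the corrected URI.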
[10:38:13] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292125 (10hashar) Ran the exact same thing for the integration-puppetmaster.
[10:44:19] 06Labs, 10Labs-Infrastructure, 10Continuous-Integration-Infrastructure: 'fatal: unable to look up current user in the passwd file: no such user - https://phabricator.wikimedia.org/T135217#2292132 (10hashar) 05Open>03Resolved a:03hashar Puppet agent managed to run on the integration-puppetmaster and it...
[11:59:41] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:47:43] more excitement after I left huh, so anything self puppetmaster had extra work
[12:48:25] yeah
[12:48:29] all sorted I think tho
[12:49:26] chasemp: k8s deploy of the registry enforcer stuff went well as well, and I added a section on images to https://wikitech.wikimedia.org/wiki/Tools_Kubernetes
[13:36:20] 06Labs, 10Tool-Labs: jsub's -once should clear jobs in E state and run things - https://phabricator.wikimedia.org/T135229#2292452 (10yuvipanda)
[13:37:08] 06Labs, 10Tool-Labs: jsub's -once should clear jobs in E state and run things - https://phabricator.wikimedia.org/T135229#2292452 (10chasemp) Example of cron stuck: 6269345 0.30000 cron-tools tools.toolsc Eqw 05/13/2016 00:20:06 1 .error keeps saying: [Fri May 13 12:50:...
[14:15:30] andrewbogott or bd808 maybe you can take a look at https://phabricator.wikimedia.org/T135195 later?
Thanks :)
[14:23:41] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292672 (10Dzahn)
[14:36:13] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer
[14:45:45] (03PS1) 10Andrew Bogott: Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615
[14:46:35] (03CR) 10Andrew Bogott: [C: 032 V: 032] Added labtest-specific ldap passwords [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott)
[14:47:47] (03CR) 10Rush: "thank you for continuing the lt- convention :)" [labs/private] - 10https://gerrit.wikimedia.org/r/288615 (owner: 10Andrew Bogott)
[15:02:07] does anyone have an idea how long it will take to fix https://phabricator.wikimedia.org/T130529 ? (just curious, otherwise I have to create a work-around in the meantime)
[15:03:22] Steinsplitter: I'm pretty sure it is going to be resolved as declined with 'use a virtualenv'
[15:03:44] actually, unless there's a python package for it already
[15:03:46] let me check
[15:03:57] there is
[15:04:10] I'll add it now then, Steinsplitter
[15:04:20] thx :) <3
[15:05:44] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (10debt) Hi! Checking on the progress of this ticket... We're waiting for this to be looked at / fixed before we can update the wikipedia.org portal st...
[15:06:13] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292790 (10debt)
[15:06:59] 06Labs, 10Tool-Labs, 13Patch-For-Review: Tools bastions are often unreliable - https://phabricator.wikimedia.org/T131541#2292792 (10chasemp)
[15:10:35] Steinsplitter: merged.
should be installed in ~20mins
[15:11:13] thx 😃)
[15:12:40] np :)
[15:13:56] !log codereview Created project (T135195)
[15:13:56] T135195: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195
[15:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview/SAL, Master
[15:14:42] 06Labs, 07Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2292805 (10bd808)
[15:14:44] 06Labs, 15User-bd808: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2292802 (10bd808) 05Open>03Resolved a:03bd808 Changed the name to "codereview": https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview
[15:50:39] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2292901 (10bd808) >>! In T133252#2291840, @mmodell wrote: > @bd808: I can deploy newer upstream code for you....
[15:50:50] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2292902 (10Krenair) >>! In T135029#2292786, @debt wrote: > Checking on the progress of this ticket... task status is stalled, we're waiting for ops
[15:57:18] (03CR) 10Krinkle: "Fixes T104917" [labs/toollabs] - 10https://gerrit.wikimedia.org/r/287869 (owner: 10BryanDavis)
[15:57:45] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 15User-bd808: tools.wmflabs.org landing page should not dump all tool accounts - https://phabricator.wikimedia.org/T104917#2292926 (10Krinkle) 05Open>03Resolved a:03bd808 Fixed in 058bc2b9ae07edbcffb714a4aa24f6ab1ef23919.
[16:30:23] 06Labs, 10Tool-Labs, 06Operations: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2292997 (10chasemp) >>! In T134798#2283060, @yuvipanda wrote: > We only need toolserver.org and www.toolserver.org I think. >>!
In T134798#2283509, @Dzahn wrote: > So an existing...
[16:35:47] 06Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2285835 (10chasemp) We do not have this data from 9 months ago. AFAIK there is no ability to restore.
[16:43:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:46:49] 06Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2293049 (10Mjbmr) Alright then, Thanks.
[17:00:38] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2293077 (10mmodell)
[17:01:00] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2226182 (10mmodell) > Should we fork a subtask to track working on something like that? done.
[17:05:50] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2293099 (10bd808)
[19:34:04] Hi. I'm trying to log in to the beta cluster for the first time from a new computer
[19:34:18] milimetric provided a config example that I duplicated: https://www.irccloud.com/pastebin/6LBanNMq/
[19:35:03] but the connection is closed by the server, presumably because of an auth issue
[19:35:15] the public key has been added to https://wikitech.wikimedia.org/wiki/Special:Preferences#mw-prefsection-personal
[19:36:16] The log from the connection is at http://pastebin.com/i81wC2DV
[19:36:49] does anyone have any pointers?
[19:39:22] strainu: if that is a literal copy of your ssh config
[19:39:27] strainu, are you using the correct username?
[19:39:28] it still has milimetric's name in it
[19:39:41] also, let us know if you can ssh to bastion-eqiad.wmflabs.org directly
[19:40:16] chasemp: tom29739 , I'm using the correct username; that's the pastebin from milimetric
[19:41:02] strainu, looking through that connection log, I see this: 'debug3: Could not load "/home/andrei/.ssh/wmf_rsa" as a RSA1 public key'
[19:41:26] Is that key there, and not corrupt and everything?
[19:42:30] yeah, actually it appears that message is ok: http://stackoverflow.com/questions/12449626/trying-to-use-rsa-keys-to-ssh-into-ec2-getting-incorrect-rsa1-identifier-pe/15563793#15563793
[19:44:06] strainu: first step is to try to get on the bastion itself
[19:44:24] chasemp: I cannot connect to the bastion http://pastebin.com/q0P7DLcD
[19:46:09] Failed publickey for strainu
[19:46:55] I pm'd you the pub key in labs
[19:50:09] yep, that's in wmf_rsa.pub
[19:50:09] huh you're not a part of project-bastion
[19:50:16] I thought that happened by default
[19:53:34] I think it's supposed to...
[19:53:37] strainu: ok try now
[19:54:27] I'm in, thanks for the help chasemp and tom29739
[19:54:30] yw
[19:54:36] np.
[19:54:41] andrewbogott: any guesses as to why strainu would not be a member of project-bastion?
[19:54:47] * chasemp off to a meeting tho
[19:55:23] chasemp: nope, unless their account is very old
[19:55:29] it is
[19:55:34] it's from 2012
[19:55:49] ah
[19:55:55] welcome back :)
[20:54:25] 06Labs, 15User-bd808: Create project "code-testcluster" - https://phabricator.wikimedia.org/T135195#2291428 (10Legoktm) Wonderful! {T76245} is related.
[21:08:18] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2294002 (10mmodell) @bd808: I merged `upstream/master` on `phab-03` so that you can make more progress withou...
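For reference, the setup being debugged above — reaching labs instances via the bastion — is usually expressed as an ssh client config like the following. This is a sketch only: the username and key path are placeholders (the key filename matches strainu's `wmf_rsa` from the log), and only the bastion hostname comes from the conversation itself.

```
# Hypothetical ~/.ssh/config sketch; substitute your own shell username.
Host bastion-eqiad.wmflabs.org
    User your-shell-username
    IdentityFile ~/.ssh/wmf_rsa

Host *.eqiad.wmflabs
    User your-shell-username
    IdentityFile ~/.ssh/wmf_rsa
    ProxyCommand ssh -W %h:%p bastion-eqiad.wmflabs.org
```

As the log shows, the config alone is not enough: the public key must be registered on wikitech and the account must be a member of project-bastion before the bastion will accept the connection.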
[21:38:34] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2294108 (10Dzahn) @volans could you take a look maybe?
[21:41:18] bd808: thanks for the project creation :)
[21:42:22] !log toolserver-legacy added Dzahn as project member + admin
[21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolserver-legacy/SAL, Master
[21:43:12] 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294113 (10yuvipanda) a:05yuvipanda>03Dzahn nope, Dzahn is awesomer :)
[22:23:23] 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Install python-requests-oauthlib on labs - https://phabricator.wikimedia.org/T130529#2294263 (10Billinghurst) many thanks @yuvipanda
[22:37:33] !log codereview Setup instance cr1 to start with setting up at wiki cluster, see T135288
[22:37:33] T135288: Create one testinstance at the codereview-cluster - https://phabricator.wikimedia.org/T135288
[22:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Codereview/SAL, Master
[23:32:53] 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: toolserver.org certificate to expire 2016-06-30 - https://phabricator.wikimedia.org/T134798#2294401 (10Dzahn) which instance/host is the one that actually runs it?