[00:22:41] PROBLEM - Puppet errors on tools-worker-1022 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[00:36:01] PROBLEM - Puppet errors on tools-exec-1428 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[00:57:41] RECOVERY - Puppet errors on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:11:03] RECOVERY - Puppet errors on tools-exec-1428 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:34:42] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Create an XTools logo - https://phabricator.wikimedia.org/T167345#3376098 (Ricordisamoa) >>! In T167345#3375455, @MusikAnimal wrote: > Perhaps @Ricordisamoa is able to make it look just as fancy with an uppercase T? With uppercase T: {F8517584} Do you thin...
[02:12:50] PROBLEM - Puppet errors on tools-exec-1418 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[02:23:07] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[02:47:53] RECOVERY - Puppet errors on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:03:07] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:47:51] PROBLEM - Puppet errors on tools-exec-1417 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[03:52:11] !halp
[03:52:15] !help
[03:52:15] dschwen: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[03:52:21] This is not good:
[03:52:25] dschwen@fastcci-worker1:~$ cat bin/*
[03:52:25] cat: bin/restart_fastcci.sh: Input/output error
[03:52:26] cat: bin/update_database.sh: Input/output error
[03:56:32] dschwen: is this on an NFS mount again?
[03:56:43] yeah
[03:56:47] the home mount
[03:57:02] nfs-tools-project.svc.eqiad.wmnet:/project/fastcci/home
[03:57:54] :/ ok. The best advice I can give at the moment is to try rebooting the instance
[03:59:11] If you can figure out how to make everything work without using NFS your life will be better in the long term.
[03:59:43] It didn't sound like you were using nfs for massive storage
[04:00:18] even /home is nfs
[04:00:34] what I was trying to access are small shell scripts
[04:00:50] I can put my database into /tmp (that's "local")
[04:00:57] whatever that means on the VMs
[04:01:31] If the instance is having nfs communication problems the file size won't matter much
[04:01:36] rebooting
[04:03:05] Local on the vms means in the sparse disk image that is stored on the server that hosts and runs the vm
[04:04:06] ugh, rats
[04:04:18] VM has trouble coming back up
[04:04:33] "Permission denied (publickey)."
[04:04:45] which indicates my home dir being screwed up
[04:05:03] (my authorized_keys file is probably toast or inaccessible)
[04:05:31] Actually that sounds like networking. Authorized keys should be read from LDAP
[04:06:10] Let me find my laptop...
[04:09:40] veeery slowly logging in now
[04:10:04] and I have a prompt
[04:10:40] looks like my service on that instance started
[04:11:14] is that really still a Precise instance or did it get an in-place upgrade?
[04:11:55] oh... it's xenial?
[04:12:39] in place
[04:12:53] RECOVERY - Puppet errors on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:13:29] Puppet is all messed up because our roles don't know how to deal with xenial hosts
[04:13:42] that could lead to lots of random problems
[04:15:19] the things that are failing in the puppet run look to be related to our data collection tools and not nfs at least
[04:16:53] dschwen: would this compile and run on Debian Jessie or Stretch instead of xenial?
[04:19:54] I'll set up a stretch instance and test it tomorrow
[04:20:06] how does that sound?
[04:20:19] I think I can tear down one of the other instances
[04:20:27] quota is tight
[04:20:35] ok, gotta go hit the sack
[04:20:36] fastcci-puppetmaster.fastcci.eqiad.wmflabs won't even let me in with my root ssh key
[04:20:43] muhahaha
[04:20:54] sorry :-(
[04:21:05] anyhow, will work on it tomorrow
[04:21:11] thanks!
[04:21:25] ok. if you need a temporary quota bump to build an instance let me know with a phab task
[04:21:30] thx
[04:21:55] btw, if xenial is a problem, then I have a lot more work to do
[04:22:15] Wikiminiatlas is on an in-place upgraded Xenial instance as well
[04:22:27] wma1 in the maps project
[04:22:30] running a distro we support will make it easier for us to help you
[04:22:34] sure
[04:22:44] ok, cu tomorrow (maybe)
[04:22:48] o/
[08:18:36] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[08:58:35] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:13:44] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1422 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[12:37:33] PROBLEM - Puppet errors on tools-exec-1404 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[12:43:44] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1422 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:12:34] RECOVERY - Puppet errors on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:23:39] PROBLEM - Puppet errors on tools-worker-1022 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:54:34] (Draft1) Paladox: Add groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361209
[13:54:35] (PS2) Paladox: Add groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361209
[13:54:49] (CR) Paladox: [V: 2 C: 2] Add groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361209 (owner: Paladox)
[13:56:40] (Draft1) Paladox: Fix authentication.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361211
[13:56:42] (PS2) Paladox: Fix authentication.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361211
[13:56:45] (CR) Paladox: [V: 2 C: 2] Fix authentication.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361211 (owner: Paladox)
[13:58:40] RECOVERY - Puppet errors on tools-worker-1022 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:25:05] (Draft1) Paladox: Update groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361212
[14:25:07] (PS2) Paladox: Update groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361212
[14:25:10] (CR) Paladox: [V: 2 C: 2] Update groups.ini [labs/icinga2] - https://gerrit.wikimedia.org/r/361212 (owner: Paladox)
[14:55:33] (Draft1) Paladox: Add mail-host and mail-service scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361214
[14:55:35] (PS2) Paladox: Add mail-host and mail-service scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361214
[14:55:38] (CR) Paladox: [V: 2 C: 2] Add mail-host and mail-service scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361214 (owner: Paladox)
[14:56:34] (Draft1) Paladox: Fix syntax error [labs/icinga2] - https://gerrit.wikimedia.org/r/361216
[14:56:36] (PS2) Paladox: Fix syntax error [labs/icinga2] - https://gerrit.wikimedia.org/r/361216
[14:56:38] (CR) Paladox: [V: 2 C: 2] Fix syntax error [labs/icinga2] - https://gerrit.wikimedia.org/r/361216 (owner: Paladox)
[14:57:59] (Draft1) Paladox: Fix location for shell scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361217
[14:58:01] (PS2) Paladox: Fix location for shell scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361217
[14:58:04] (CR) Paladox: [V: 2 C: 2] Fix location for shell scripts [labs/icinga2] - https://gerrit.wikimedia.org/r/361217 (owner: Paladox)
[15:08:45] (Draft1) Paladox: Add ores notification script [labs/icinga2] - https://gerrit.wikimedia.org/r/361218
[15:08:47] (PS2) Paladox: Add ores notification script [labs/icinga2] - https://gerrit.wikimedia.org/r/361218
[15:08:50] (CR) Paladox: [V: 2 C: 2] Add ores notification script [labs/icinga2] - https://gerrit.wikimedia.org/r/361218 (owner: Paladox)
[15:10:01] (Draft1) Paladox: Fix path for irc-ores.log [labs/icinga2] - https://gerrit.wikimedia.org/r/361219
[15:10:03] (PS2) Paladox: Fix path for irc-ores.log [labs/icinga2] - https://gerrit.wikimedia.org/r/361219
[15:10:05] (CR) Paladox: [V: 2 C: 2] Fix path for irc-ores.log [labs/icinga2] - https://gerrit.wikimedia.org/r/361219 (owner: Paladox)
[15:24:32] (Draft1) Paladox: Remove [] [labs/icinga2] - https://gerrit.wikimedia.org/r/361220
[15:24:34] (PS2) Paladox: Remove [] [labs/icinga2] - https://gerrit.wikimedia.org/r/361220
[15:24:36] (CR) Paladox: [V: 2 C: 2] Remove [] [labs/icinga2] - https://gerrit.wikimedia.org/r/361220 (owner: Paladox)
[15:39:18] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1421 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[16:01:44] !log tools Created and provisioned elasticsearch password for tools.wmde-uca-test (T167971)
[16:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:01:49] T167971: Elasticsearch credential request for wmde-uca-test - https://phabricator.wikimedia.org/T167971
[16:07:26] RECOVERY - Puppet staleness on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[16:10:51] Labs, Tool-Labs, User-bd808: Elasticsearch credential request for wmde-uca-test - https://phabricator.wikimedia.org/T167971#3376373 (bd808) Open>Resolved a:bd808 Credentials have been created and can be found in /data/project/wmde-uca-test/.elasticsearch.ini. > When the write request is...
[16:10:55] RECOVERY - Puppet staleness on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [3600.0]
[16:19:18] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:25:09] (Draft1) Paladox: Move somethings to puppet and puppicise some things [labs/icinga2] - https://gerrit.wikimedia.org/r/361221
[16:25:11] (PS2) Paladox: Move somethings to puppet and puppicise some things [labs/icinga2] - https://gerrit.wikimedia.org/r/361221
[16:25:38] (CR) Paladox: [V: 2 C: 2] Move somethings to puppet and puppicise some things [labs/icinga2] - https://gerrit.wikimedia.org/r/361221 (owner: Paladox)
[16:31:57] (Draft1) Paladox: Install some modules for icingaweb2 [labs/icinga2] - https://gerrit.wikimedia.org/r/361224
[16:31:59] (PS2) Paladox: Install some modules for icingaweb2 [labs/icinga2] - https://gerrit.wikimedia.org/r/361224
[16:32:03] (CR) Paladox: [V: 2 C: 2] Install some modules for icingaweb2 [labs/icinga2] - https://gerrit.wikimedia.org/r/361224 (owner: Paladox)
[16:42:02] (Draft1) Paladox: Fix ido-mysql variable did not work properly [labs/icinga2] - https://gerrit.wikimedia.org/r/361226
[16:42:04] (PS2) Paladox: Fix ido-mysql variable did not work properly [labs/icinga2] - https://gerrit.wikimedia.org/r/361226
[16:42:06] (CR) Paladox: [V: 2 C: 2] Fix ido-mysql variable did not work properly [labs/icinga2] - https://gerrit.wikimedia.org/r/361226 (owner: Paladox)
[16:47:34] (Draft1) Paladox: Fix permission on ido-mysql file [labs/icinga2] - https://gerrit.wikimedia.org/r/361227
[16:47:36] (PS2) Paladox: Fix permission on ido-mysql file [labs/icinga2] - https://gerrit.wikimedia.org/r/361227
[16:47:38] (CR) Paladox: [V: 2 C: 2] Fix permission on ido-mysql file [labs/icinga2] - https://gerrit.wikimedia.org/r/361227 (owner: Paladox)
[16:52:23] (Draft1) Paladox: Fix host name in notifications [labs/icinga2] - https://gerrit.wikimedia.org/r/361228
[16:52:25] (PS2) Paladox: Fix host name in notifications [labs/icinga2] - https://gerrit.wikimedia.org/r/361228
[16:52:27] (CR) Paladox: [V: 2 C: 2] Fix host name in notifications [labs/icinga2] - https://gerrit.wikimedia.org/r/361228 (owner: Paladox)
[17:15:36] Hi bd808, I created a stretch instance. I'm still testing and reworking some of this stuff. I have automated server setup (with a script, not with puppet - yet(?))
[17:15:43] seems to work quite well
[17:15:54] nice
[17:18:58] I'll change my setup from one medium sized instance to two or three small ones
[17:19:09] then I'll add load balancing to my front end script
[17:19:33] that way the downtime on database update will be unnoticeable for the users
[17:19:37] is that complexity actually needed?
[17:19:52] well, is anything we do actually needed?
[17:19:58] heh
[17:20:19] the Image Search tool is enabled for all users on commons (IPs, too)
[17:20:32] it is used quite a bit, and I'd like the user experience to be nice
[17:20:44] it should "just work" and be FAST
[17:20:54] (that's in the name, so I better deliver)
[17:21:31] right now database update adds about two to three minutes every two hours where the tool is quite sluggish
[17:21:54] I'd love to update the database more often, but right now that would reduce service quality too much
[17:22:00] that's because of warming up the memmap?
[17:22:06] exactly!
[17:22:21] I used to just load the DB into memory
[17:22:28] higher upfront cost
[17:22:44] but I think memmap is better
[17:22:56] I'll let the OS make decisions on how to manage memory
[17:23:03] I think it knows better than me
[17:23:44] moving the data to the instance local storage really should help with speed to load the data
[17:25:58] Yeah
[17:26:07] just copy it to /tmp ?
[17:26:44] that should work, yes
[17:27:18] is it just one large file?
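(Editor's sketch, not part of the log.) The sluggish period after each database update is attributed above to warming up the memory map: pages of a freshly mapped file are only read from disk when first touched. One way to pay that cost up front, before the new file serves requests, is to touch every page once. The function name `warm_mmap` is hypothetical and nothing here is taken from the FastCCI code; it only illustrates the idea under the assumption that the data lives in an ordinary file on local disk.

```python
import mmap
import os


def warm_mmap(path: str, step: int = mmap.PAGESIZE) -> int:
    """Map `path` read-only and read one byte per page.

    Reading a byte forces the OS to load that page, so later lookups
    are less likely to stall on disk. Returns the number of pages touched.
    """
    touched = 0
    with open(path, "rb") as fh:
        size = os.fstat(fh.fileno()).st_size
        if size == 0:
            return 0  # an empty file cannot be memory-mapped
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, size, step):
                _ = mm[offset]  # one byte is enough to fault the page in
                touched += 1
    return touched
```

The OS may still evict pages later under memory pressure, so this only shortens the initial warm-up; it does not pin anything in RAM.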
[17:27:30] it currently is ~1.3GB (depending on the size of commons - so it's growing slowly but steadily)
[17:27:33] two files
[17:27:40] data and index
[17:27:47] (sort of)
[17:28:23] ok. I was just trying to think of ways to avoid NFS completely. Like rsync between the VMs
[17:28:43] Ok, I may have to keep the Xenial instance up for a few more days until I have everything nailed down
[17:28:49] I see...
[17:28:56] that could work, too
[17:29:05] good idea!
[17:29:39] ok, gotta run, thanks!
[17:29:50] o/
[18:42:33] cloud-services-team, DBA, Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3376445 (Halfak) Hey folks! I've been traveling and just getting caught up. It looks like we ought to make a quick announcement for Wiki labels. I can get a notice out as soon as w...
[20:30:08] bd808, coming back to the rsync idea. I've hit a little snag. I'd like to set up passwordless authentication with a public key that has read-only access to only the database directory
[20:30:21] but authorized_keys are only pulled from LDAP
[20:30:29] not from ~/.ssh/authorized_keys
[20:31:37] keys can be put in /etc/ssh/userkeys
[20:32:24] Ahh, ok
[20:32:27] there are a couple of "service users" in LDAP that you can use instead of a privileged account too
[20:33:43] deploy-service is one and mwdeploy is another
[20:35:49] to use /etc/ssh/userkeys you make a file there, named after the user, that is a normal authorized_keys file
[21:17:32] could it be that puppet resets /etc/ssh/userkeys?
[21:30:51] dschwen: oh, yes I think it does. If you have a local puppetmaster you can add things that way.
[21:32:40] there may be keys already in labs/private.git too. let me see if I can confirm that
[21:34:59] dschwen: I think all of these keys are passwordless -- https://github.com/wikimedia/labs-private/tree/master/modules/secret/secrets/keyholder
[21:35:23] the deploy_service one is for sure
[22:10:32] wait, what?
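(Editor's sketch, not part of the log.) The plan discussed above is to keep the database on instance-local storage and to refresh it without a visible outage. A common pattern for that, assumed here rather than taken from the log, is to copy the new file next to the live one and then rename it into place, since a rename within one filesystem is atomic on POSIX systems. The name `stage_and_swap` and the `.new` suffix are hypothetical.

```python
import os
import shutil


def stage_and_swap(source: str, live: str) -> None:
    """Copy `source` next to `live`, then rename the copy over `live`.

    The staging file sits in the same directory as `live`, so the final
    rename stays on one filesystem and readers never see a partial file.
    """
    staging = live + ".new"  # hypothetical naming convention
    shutil.copyfile(source, staging)
    os.replace(staging, live)  # replaces `live` in a single step
```

A process that already has the old file open or mapped keeps seeing the old contents until it reopens the path, so the service still has to be told to reload after the swap.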
[22:10:57] there are keypairs that just let me jump to arbitrary instances?!
[22:12:24] I am confused
[22:17:38] Also, is there some docu on setting up your own puppetmaster? I did that a few years back and have no recollection. Also the nova interface is gone (I knew how to set roles there)
[22:19:59] dschwen hi, everything is done through horizon now
[22:20:08] http://horizon.wikimedia.org
[22:20:20] that's where you will find the replacement for nova interface.
[22:20:42] also puppet master docs are at https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster
[22:52:52] ah "Standalone" was the keyword I needed
[22:53:02] I was searching for "local"...
[22:55:51] can I use stretch for the standalone puppet master?
[22:56:06] docs say use Jessie, but that might just be outdated
[22:57:10] !help
[22:57:10] dschwen: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[22:58:05] how do I set the role in horizon?
[23:07:05] I only see the project puppet stuff in the side bar accordion
[23:07:29] looks scary. It says "These puppet settings will affect all VMs in the fastcci project"
[23:07:48] I don't want that, I just want to spin up a standalone puppet server
[23:18:14] this should exist according to https://phabricator.wikimedia.org/T91990
[23:18:20] but where is that tab?!
[23:19:35] dschwen hi, you click on the instance name, then there will be a tab there with the word puppet on it, click that. Then you click on the tab to the far right
[23:19:42] that should allow you to select your roles
[23:19:43] ooohh
[23:19:56] * paladox has to go now.
[23:19:57] weird, I thought I'd have to set the roles upon creation
[23:20:03] thanks paladox
[23:20:12] you're welcome :)
[23:25:26] dschwen: did you find the right screen now?
[23:25:32] yep
[23:25:41] We need to put some screen shots up in the help for all of that
[23:25:41] I need to read up on this
[23:25:41] a
[23:25:43] lot
[23:25:58] like conceptually, how do I add custom puppet "rules"
[23:26:09] while retaining the wikimedia stuff
[23:26:33] I never figured out how to use the magic deploy_service account
[23:26:38] you can do that by adding patches to /var/lib/git/operations/puppet on your puppetmaster
[23:27:05] the help page is a complete stub :/ -- https://wikitech.wikimedia.org/wiki/Help:Puppet
[23:28:08] Yeah, but then I either need to get my stuff merged upstream
[23:28:20] or I will have to constantly manually rebase against upstream
[23:28:32] to avoid getting completely out of date
[23:29:07] there is a feature to keep up to date with upstream already baked in. your local patches will be rebased automatically
[23:29:14] Are we still using Gerrit?
[23:29:19] Oh, automatic rebase
[23:29:20] that only breaks when there is a merge conflict
[23:29:34] ah, that's what that git sync interval in the config is?
[23:29:34] * bd808 obviously needs to write down what he knows
[23:29:52] yes. I think the default is 30 minutes or so
[23:30:00] You linked to a github repo
[23:30:08] further above
[23:30:23] yes, github is a mirror.
[23:30:32] gerrit is... well, not quite my usual workflow
[23:30:48] we do PRs to github at work
[23:30:54] the master repo is https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet
[23:31:11] *nod* gerrit takes some getting used to
[23:31:31] although to be fair so does github
[23:31:36] ok, thanks again! I'm sure I'll have further questions down the road :-)
[23:31:56] (yeah, but github has an arguably much larger userbase already)
[23:32:11] and it's non-free sadly
[23:32:17] as in libre
[23:34:23] yeah
[23:34:40] well, there is gitlab (which we also use at work - self hosted for internal stuff)
[23:34:53] anyhow, I guess I should have used Jessie for the puppetmaster
[23:35:02] getting errors on my puppet runs
[23:35:20] the geoipupdate package does not exist on stretch apparently...
[23:35:33] oh, yeah that is likely
[23:36:13] that's a package we host locally and production only has 3 or 4 stretch hosts so far
[23:36:42] jessie is probably the most stable base image at the moment