[00:19:35] is there anything obvious that would make a host keep falling into ssh unreachable status?
[00:19:55] `wdqs2007` became unreachable, I powercycled it from the mgmt console, and then it became unreachable again an hour later
[00:20:49] The fact that powercycling restores ssh reachability temporarily is weird to me...my mental model is probably overly simplistic, but if it were some kind of networking hardware issue I'd think that powercycling wouldn't help
[00:34:13] ryankemper: ferm/some firewall accidentally blocking port 22?
[00:34:30] can you ping it or get in via cumin?
[00:35:50] actually I think cumin also uses ssh, so probably not that
[00:36:36] depending on the reset type, it can take the network port down and back up, which can fix (temporarily) some networking issues
[00:46:29] ryankemper: I think the hard disk is broken and/or there are more hardware failures
[00:46:42] [11509.130995] print_req_error: I/O error, dev sdh, sector 344831321
[00:46:49] [11539.519042] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen [11539.526192] ata10.00: failed command: WRITE DMA EXT
[00:46:55] this is when i would give it back to dcops
[00:47:30] can't login on mgmt as root.. instead a screen full of that kind of stuff
[00:48:06] would recommend pasting all that on a ticket for dcops to open a warranty case
[00:49:51] you can put {P15622} in a comment on the racking ticket or so: https://phabricator.wikimedia.org/P15622
[00:55:28] thanks all
[00:56:10] I'd already made a hardware ticket (https://phabricator.wikimedia.org/T281437) but then realized I wasn't sure if it was a hardware failure, so that makes it more clear, which is great
[00:59:00] mutante: what command did you use to find that output?
[01:10:41] ryankemper: in this case.. nothing..
I just connected to the mgmt console and tried to login as root and all that was in the console output
[01:11:49] mutante: weird, I ssh'd into `wdqs2007.mgmt.codfw.wmnet` and fed it the ipmi mgmt pw from pws
[01:11:51] ryankemper: but in general, do run "racadm getsel" on the DRAC shell
[01:11:56] when reporting hw issues
[01:12:12] yeah I ran `getsel` and it was a bunch of PSU messages from months ago and little else iirc
[01:13:29] ryankemper: it's behaving chaotically.. guess it's about to die, heh
[01:14:23] that stuff should be in /var/log/syslog though
[01:15:07] it also already has status: failed in netbox and should be easily under warranty
[01:15:44] mutante: status failed is from me today, not before :P
[01:16:03] (it's part of the checklist for dc-ops' hw troubleshooting request template)
[01:19:06] gotcha, then you got the right template
[07:53:17] volans: thanks for the prospector comment, totally escaped me :]
[07:53:24] guess I am not familiar with that tool
[07:54:04] hashar: anytime :) I didn't dig much into the venvs though. Let me know if that's needed
[07:56:35] <_joe_> hashar: I think the most important thing to understand about prospector is: if you're using it, you will need anti-headache meds
[07:56:59] <_joe_> because somehow it manages to make the code that passed it invalid with every patch release
[07:57:14] <_joe_> and sometimes the requirements are also in conflict with each other
[07:57:17] <_joe_> real fun!
[07:57:38] <_joe_> sorry for the disrespect for your pet volans :P
[07:58:00] * dcaro agrees with _joe_ xd
[07:58:05] tss tss
[07:58:19] just append `| volans`
[07:58:29] debmonitor is reporting stale data, where can I get help for that?
[07:59:17] (not really much of an issue for me right now, used cumin, but might be interesting to get it sorted out)
[07:59:46] dcaro: stale compared to what operation
[07:59:46] ?
[07:59:59] compared to what apt says on the machine since yesterday evening
[08:00:25] https://debmonitor.wikimedia.org/packages/ceph-common shows 8 non-updated machines, but those have been running the new version since yesterday
[08:00:47] (tried forcing apt update on them too to see if there was an error running the hook, but it looked ok)
[08:00:50] how was it installed? is puppet running on those hosts?
[08:01:24] it was installed with apt directly (manual upgrade). let me check if puppet is running (but it should)
[08:02:22] so, every apt operation is reported via apt-hook in real time. To also cover dpkg -i operations or any failure in the apt-hook (given that it is non-blocking on failure) there is a systemd timer that reports the full status every 24h to fix any discrepancy.
[08:02:38] for sure it was run several times after the package was installed on cloudvirt1016 (we had issues with the upgrade process)
[08:03:07] there were some changes by jbond42 in the last few days on the debmonitor server/clients for the certificates, so I'm wondering if it happened during that rollout, which might have had some temporary failure
[08:03:14] puppet is enabled and running on all (but 1040, but that one is set aside for investigation)
[08:03:22] can you try to start the debmonitor-client service in systemd?
[08:03:26] it should report the full status
[08:03:34] *force to report
[08:03:53] it should anyway auto-sync by itself within 24h from the last discrepancy
[08:04:14] (randomly splayed across the day, so each host would run that at a different time)
[08:04:20] that service was dead yep, not active on 1016
[08:04:33] that's normal, it's a timer, runs once and dies
[08:04:35] shouldn't puppet start it?
[08:04:50] the service is normally down, the timer starts it and it exits
[08:04:59] I see
[08:05:54] that updated the data yep, is it ok if I run it on all the cloudvirt hosts?
[08:06:28] sure, maybe just batch it like -b 10
[08:06:31] or something like that
[08:06:39] it's not that heavy but does a bunch of queries on the db side
[08:06:41] ack
[08:06:42] just to be gentle
[08:08:12] I think I messed up, the -b 10 does the systemctl start, but that does not wait for it to finish running... so I started them all
[08:08:36] ah right, no worries
[08:08:48] didn't think about it either, sorry
[08:09:21] data updated, thanks!
[08:09:43] jbond42: do you think this could have been caused by the certificate rollout? ^^^
[08:09:49] np
[08:11:37] dcaro: FYI this was your run's effect, nothing bad ;)
[08:11:37] https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=16&orgId=1&var-server=db1107&var-port=9104&from=now-1h&to=now
[08:16:46] the debmonitor hook seems to work fine, though. just updated a few hosts and all appeared in real time
[08:20:01] arturo: to close yesterday's loop, I've run the script, it now prints a warning that says:
[08:20:04] Skipping assigning existing IP 185.15.56.237/30 with role vip to ens2f1np1.1107. The IP might have the wrong netmask (expected /32 or /128 for VIP-like IPs)
[09:03:11] cool volans
[09:03:27] thanks
[09:04:29] thanks for the report, there is still one case to be handled but that's harder to fix as there is missing information. I'll chat with Arz.hel on what we want to do there.
[09:50:43] volans: dcaro: was just catching up. it depends on the host; i did notice that the (i think) cloudvirt1040-1-45 hosts were alerting in icinga yesterday due to not being able to install some ceph packages. if that was preventing puppet from running it's quite likely that the debmonitor-client certificates had not been created by puppet, so when installing the packages manually the hook would have failed
[09:50:48] with some ssl error.
[09:54:52] BTW... has anybody been successful using the vim-puppet indent plugin on buster?
[09:59:33] vgutierrez: i use rodjek/vim-puppet and vim-syntastic/syntastic but i also add the following, as our style guide suggests 4 spaces as opposed to 2: `autocmd BufRead */git/puppet/**/*.pp set sw=4 ts=4 et`
[09:59:46] seems to work fine
[10:00:01] jbond42: yeah... that's what I'm using, but for some reason the => autoindent won't work here
[10:02:06] vgutierrez: tbh my vim is a lot of copy pasting and hacking until things work, so possibly there is something else somewhere that helps. my full vimrc is https://github.com/b4ldr/profile/blob/master/.vimrc if it helps
[10:02:27] * _joe_ grins
[10:03:04] * vgutierrez checking
[10:03:14] jbond42: => autoindent works for you? ;P
[10:03:54] vgutierrez: i think so, let me double check (i know it's something i had some trouble with at some point)
[10:04:50] hmm you're using VimPlug, I'm using the debian package version of the plugins + vim-addon-manager
[10:05:13] vgutierrez: goddammit!!!! no it doesn't. i think it honours what's already in a file. however when i had the following it definitely did work
[10:05:27] autocmd BufRead */git/puppet/** set sw=4 ts=4 et
[10:05:38] however that messed up ruby tab spacing
[10:12:52] vgutierrez: i'm now going mad and can't get it to work, and wonder if i ever did :S.
[10:12:56] * jbond42 successfully sniped
[10:13:25] * vgutierrez sorry for undermining SRE team productivity...
[10:14:09] :)
[10:15:43] <_joe_> vgutierrez: you provided us non-cultists some laughter though
[10:16:00] vim isn't that bad :)
[10:17:42] <_joe_> sure :)
[10:27:30] vgutierrez: i have also added `autocmd BufNewFile */git/puppet/**/*.pp set sw=4 ts=4 et` and it's now working on current and new files
[10:34:40] hmmm
[10:35:44] yeah... sw=4 ts=4 et is my default for every file basically
[10:36:45] at least right now
[12:05:41] heads up all, i plan to switch the debmonitor service SSL cert to use the new pki system. it's possible that we may see a couple of failed debmonitor submissions while the change propagates.
they can be safely ignored but feel free to ping me if you see anything unusual
[12:15:28] ignore this, i have reverted
[13:11:40] hi all, i have now merged the change to manage /etc/services (https://gerrit.wikimedia.org/r/c/operations/puppet/+/670918). i have run a couple of tests and things look good to me, but please keep an eye out and ping if anything weird is seen.
[13:12:14] for now this is just managing the standard services and not adding the services from the service catalogue, which is scheduled in a later CR (https://gerrit.wikimedia.org/r/c/operations/puppet/+/673105)
[13:12:59] XioNoX: fyi ^^
[13:13:16] jbond42: that's awesome, thx!
[13:13:43] next week we can look at adding it to your use case
[13:14:48] but guessing you will want to loop over the results of netbase::services() in some template file
[13:21:56] exactly!
[14:28:14] godog: I'm trying to get switches network data in a grafana dashboard, but I'm not sure if the data is there or not. according to this (https://phabricator.wikimedia.org/T229542) it might be in graphite, do you know if it is there? and if so, what series names it is under?
[14:29:35] dcaro: about to jump in a meeting, iirc we ended up disabling librenms -> graphite
[14:29:46] will ping you after the meeting
[14:30:29] ack, no rush
[14:30:31] thanks!
[14:59:10] jbond, have a minute to talk me through a question about hiera search behavior?
[14:59:23] specifically wondering about things under role/
[15:00:09] um....
that should be jbond42 ^
[15:00:56] andrewbogott: sure, give me a few mins (and jbond also pings me)
[15:01:11] sorry for the double ping then :(
[15:01:31] andrewbogott: worth checking https://wikitech.wikimedia.org/wiki/Puppet_Hiera#Role-based_lookup first, i gave it an update recently (just grabbing a drink brb)
[15:05:05] ok andrewbogott what's the Q
[15:06:18] I haven't fully digested that doc yet, but my first question was just going to be: the yaml in 'role/common/labs/openstack/nova' is definitely leftover and unused, right?
[15:07:02] And then the next question was going to be... if I want to apply a hiera setting for every host that uses role::wmcs::openstack::eqiad1::control...
[15:07:19] I'd make role/wmcs/openstack/eqiad1/control.yaml, correct?
[15:07:34] And then will that pick up /every/ setting in that file, or only those with select prefixes?
[15:07:54] (I realize these are very basic questions but I'm right on the tail of totally failing to make this work yesterday)
[15:10:54] I guess that should be role/common/openstack/eqiad1/control.yaml
[15:11:34] ' the yaml in 'role/common/labs/openstack/nova' is definitely leftover' -- yes it looks like it; there is no longer a role matching that path so it won't get used
[15:12:03] * andrewbogott deletes
[15:12:13] role/wmcs/openstack/eqiad1/control.yaml -> role/common/wmcs/openstack/eqiad1/control.yaml, otherwise yes
[15:12:54] 'And then will that pick up /every/ setting in that file' -- assuming there is a lookup for the specific key "somewhere" in the manifest then yes
[15:13:18] jbond42: great. I'll make a patch and then will ping you again when it doesn't do what I expect :)
[15:13:20] thank you!
[15:13:23] feel free to throw up a PS and i can also add comments there
[15:13:26] :) ack
[15:13:49] dcaro: I was mistaken, librenms data is still pushed to graphite, you'll find it there!
under the 'librenms' hierarchy
[15:14:20] jbond42: for starters, https://gerrit.wikimedia.org/r/c/operations/puppet/+/683667
[15:14:22] godog: awesome!
[15:14:25] thanks!
[15:15:36] dcaro: np! bear in mind that eventually graphite is going away, not soon but eventually
[15:16:38] godog: I'm subscribed to the task, to keep an eye :)
[15:19:54] FYI all the few alerts received from systemd relating to debmonitor-client are due to those machines not having puppet disabled and can be ignored
[15:20:10] dcaro: haha! sweet
[16:15:31] Anyone around to rubber-stamp this super-simple `site.pp` patch? https://gerrit.wikimedia.org/r/c/operations/puppet/+/683679/1/manifests/site.pp (switches one node over from `wdqs-internal` -> `wdqs` in codfw)
[16:17:56] ryankemper: merging
[16:18:06] jbond42: ty
[16:18:23] np, and merged
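
The role-based hiera exchange between 15:07 and 15:13 comes down to a naming rule: role::wmcs::openstack::eqiad1::control is looked up in role/common/wmcs/openstack/eqiad1/control.yaml. Below is a minimal Python sketch of that rule, not the actual Puppet lookup code. The function name `role_hiera_path` and the `layer` parameter are hypothetical, and only the `common` layer mentioned in the conversation is assumed; the hierarchy documented at Puppet_Hiera#Role-based_lookup may contain other layers.

```python
def role_hiera_path(role_class: str, layer: str = "common") -> str:
    """Map a Puppet role class name to its role-based hiera YAML path.

    Hypothetical helper for illustration only: it assumes the
    role/<layer>/<path>.yaml layout described in the chat.
    """
    prefix = "role::"
    if not role_class.startswith(prefix):
        raise ValueError(f"not a role class: {role_class!r}")
    # Drop the leading "role::" and turn the remaining "::" separators into "/".
    relative = role_class[len(prefix):].replace("::", "/")
    return f"role/{layer}/{relative}.yaml"


print(role_hiera_path("role::wmcs::openstack::eqiad1::control"))
# role/common/wmcs/openstack/eqiad1/control.yaml
```

As noted at 15:12:54, a key placed in that file only takes effect if something in the manifest actually looks it up.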