[05:40:18] apergos, elukey: does this ring a bell? T254124
[05:40:18] T254124: labstore1006/1007: issue copying mediawiki_history_dumps files from Hadoop HDFS - https://phabricator.wikimedia.org/T254124
[05:41:02] hdfs:///wmf/data/archive/mediawiki/history/2020-05 does not exist ? uh
[05:41:20] maybe see on the analytics side if it's been generated or not, dunn
[05:41:22] o
[05:42:09] other files are being copied I suppose or there would be a lot more tickets, so it shouldn't be something as broad as kerb issues
[05:46:52] that makes sense
[05:47:13] the thing is, this paged for us
[05:51:43] weird
[05:51:54] arturo: hey! Yes there is a patch to fix this but wasn't deployed in time, really sorry. Did you say that it "paged" you?
[05:52:06] the alarms are all directed to us
[05:52:46] yes, icinga systemd unit failure, I think we (WMCS) are the main contact people for these 2 servers? elukey
[05:53:24] (well, currently, icinga --> VO --> page)
[05:54:55] arturo: okok, but just to clarify - do you guys get paged on the phone for anything that alerts in icinga for your hosts?
[05:55:14] or with page you mean icinga alerting? (just to understand)
[05:55:26] if the former we didn't know :(
[05:56:27] elukey: right now, I believe we have our VO account as one of the icinga contacts for all of our alerts. So, icinga alerts that trigger emails should end paging. We should probably review this, we already found a couple cases where this was overkill
[05:56:36] and this might be another example
[05:56:39] :-P
[05:57:45] elukey: anyway, if this is a non-urgent thing, I can just resolve the VO incident now...
[05:58:11] arturo: yes definitely, I'll fix the timers later on in the morning I promise (other fires ongoing, lovely morning)
[05:58:18] really sorry for the page
[05:58:26] np! thanks!!
[06:31:50] so the peek::cron class is spamming like crazy
[06:32:29] see https://gerrit.wikimedia.org/r/c/operations/puppet/+/601170
[06:32:37] please review and I will deploy
[06:32:47] it is configured to run every minute on the 1st of the month
[06:32:54] I want to disable it, it only does reports
[06:33:01] then created T254127
[06:33:01] T254127: peek is incorrectly configured to run every minute every 1st of the month, creating large amounts of cronspam - https://phabricator.wikimedia.org/T254127
[06:34:19] jynus: looks good, I think we can also just > /dev/null before 2>&1, the script seems to be sending email reports (I doubt that the aim was to report via cron's email)
[06:34:43] jenkins is not happy though, ensure should be first
[06:35:29] see amend
[06:35:41] I don't think this should fail silently
[06:35:55] given that it seems it should only run monthly, I think it is ok to disable it
[06:36:15] then assign re-enabling it to rush to redirect/schedule correctly
[06:36:49] it should, however, output to security only I think
[06:37:01] I am not saying it should fail silently, but since it generates reports maybe it was ok for the moment to just redirect to /dev/null
[06:37:11] so whatever report the security team needs is computed
[06:37:13] and not absented
[06:37:25] oh, I agree it should be sent to the team
[06:37:30] but that is up to them
[06:37:46] I just want to stop it for now
[06:39:05] if you just redirect to /dev/null this still hits phabricator
[06:39:10] and may be bringing it down
[06:39:30] please +1 it so it stops hitting phab
[06:39:46] I suppose that it will not bring down phabricator, I assume it was tested
[06:40:00] anyway, you can +2 yourself, no need for my +1
[06:40:01] I don't think this is supposed to run every minute
[06:40:14] but weekly and monthly as it is on puppet
[06:40:24] change is good, ok to merge
[06:42:27] oh heh I did not see this conversation, I left a message in the secteam channel about it but no one is watching right now
[06:42:41] could I get a +1 from someone?
[06:43:00] this is obviously only a stopgap, not a permanent solution
[06:43:21] jynus: done
[06:43:22] but I fear extra phab load, and this only has to run every week/month
[06:43:41] thanks
[06:44:49] once we contact security we can talk with them about what the original intention was
[06:45:04] because it makes no sense to send the report to stdout
[06:45:42] cron disabled
[06:45:46] ty
[06:45:48] I think Chase meant that to run once per month, on the 01 day of the month.
[06:45:54] I am pretty sure that the script sends an email, and they forgot > /dev/null
[06:45:56] yeah, that part is mostly clear
[06:46:10] but why send the report to root@?
[06:46:53] given they only need 1 every month and they got 400 I think no harm will be done :-D
[06:47:01] I will create a revert now
[06:47:01] lol
[06:47:15] and mention the 2 issues
[06:47:19] it is not supposed to send anything to root@, the email target is surely the security team
[06:47:34] I repeat, it was surely not intended for root@ since it comes from cron
[06:48:13] the best next step will also be to transform the cron into a timer
[06:48:24] but it is strange that they hid the error output
[06:48:28] but not the stdout
[06:48:29] so proper exit code validation/alarming and logging to journald
[06:48:34] yeah
[06:48:48] so maybe they expected root to redirect to security?
[06:49:08] that is why I prefer to ask than to configure something guessing the intentions
[06:49:44] okok makes sense
[06:50:21] do we have a task about systemd timers?
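A note on the schedule bug under discussion: in a Puppet cron resource, setting monthday to 1 while leaving minute and hour at their '*' defaults produces a job that fires every minute for the entire first day of the month. The sketch below is a hypothetical reconstruction, not the real peek::cron class: the resource titles and the /usr/local/bin/peek-report path are made up, and the actual stopgap merged above simply disabled the cron.

```puppet
# Hypothetical reconstruction; resource titles and the command path are
# illustrative, not taken from the real peek::cron class.

# Buggy shape: monthday is pinned but minute/hour default to '*', so cron
# fires every minute throughout the 1st of the month (~1440 runs instead
# of 1). ensure => absent removes it, and is listed first per the
# puppet-lint complaint mentioned above.
cron { 'peek_report_every_minute':
    ensure   => absent,
    command  => '/usr/local/bin/peek-report 2>&1',
    user     => 'root',
    monthday => 1,
}

# Plausible monthly shape: once at 00:00 on the 1st. Redirecting stdout to
# /dev/null keeps the report body out of cron mail to root@, while stderr
# is left alone so real failures still surface (the "should not fail
# silently" concern).
cron { 'peek_report_monthly':
    ensure   => present,
    command  => '/usr/local/bin/peek-report > /dev/null',
    user     => 'root',
    minute   => 0,
    hour     => 0,
    monthday => 1,
}
```

Whether the report should keep going to stdout, to a mail target, or somewhere else entirely is exactly the question deferred to the security team above.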
[06:52:17] found it T210818
[06:52:18] T210818: Move admin cron jobs to systemd timers - https://phabricator.wikimedia.org/T210818
[06:55:51] I wouldn't have disabled it if I wasn't confident about what it does / whether it does more than just reporting
[07:43:16] it's just reporting and it's a brand-new VM / service that is probably still WIP
[07:43:24] i think you did right just disabling it. and thanks for it
[07:44:58] fyi, there are tons of alerts on https://icinga.wikimedia.org/alerts I could use more eyes to triage them
[07:47:56] ok, i am taking the restbase2009 ones. i know that was failed RAID and hw replacement. a ticket was closed but reopening it
[07:51:38] "idp-test" yet another thing with "test" or "dev" in the name but in prod monitoring.
[07:52:04] that's for jbond42 ^
[07:52:18] the eqsin CP servers are for ema
[07:52:33] the BGP status ones are for me
[07:52:53] labstore for arturo?
[07:53:38] brooke i think
[07:54:07] yeah but it's night for her
[07:54:22] an-launcher for elukey?
[07:55:06] yes I am working on it
[07:55:07] same for "Netbox report puppetdb_physical" as it's about "an-presto1004 missing physical device in PuppetDB: state Active in Netbox"
[07:55:15] cool, thanks!
[07:55:20] elukey: good morning :)
[07:55:52] an-presto1004 is down for hw failure, does it need a change in netbox?
[07:55:52] cloudbackup for arturo too?
[07:56:06] good morning :)
[07:56:23] deneb has half-installed/re-install required packages and apt history.log is empty.. manual package installs (meh?)
[07:56:25] prometheus1003 for godog ?
[07:56:50] (disk space at 99% usage)
[07:58:31] "UNKNOWN: More than half of the datapoints are undefined" <- the typical issue with grafana-based monitoring
[07:58:34] and idp1001/2001 performing a change on every puppet run for jbond42 too I guess, or Moritz
[08:00:10] XioNoX: let's check if we can fix those alerts before pinging a lot of people, it might be something resolvable quickly
[08:00:17] mw1331 - SAL tells me there was some experiment done but it should also be over. re-enabling puppet
[08:01:24] i am not sure it scales to look in detail at all the different alerts, tbh
[08:02:25] mutante: I think a simple triage is sufficient, nothing deep
[08:02:36] XioNoX: elukey: idp1001/2001 is a known issue, i deployed a change on friday and they are currently trying to deploy a package that doesn't exist for buster. sorry for the noise, will fix both later today. as to test/dev being in production monitoring, that's something different, i didn't realise staging services were not meant to go into icinga, however it is good to monitor these systems as they
[08:02:42] need to work
[08:03:08] no issue don't worry
[08:03:57] jbond42: thanks! I think they should still be in monitoring as even if they are test hosts they're in the production realm
[08:04:14] eg. we want to know if iptables gets turned off
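Returning to the cron-to-timer migration tracked in T210818, found at the top of this exchange: running the job as a systemd timer gives the exit-code alarming and journald logging mentioned in the earlier discussion, since a failed run leaves the unit in a failed state that the existing icinga systemd checks catch. The following is a hedged sketch using the systemd::timer::job define from the operations/puppet tree; the job name and command are hypothetical, and the define's exact parameters are recalled from memory, so check them against the repo before reuse.

```puppet
# Hedged sketch: job name and command are made up, and the parameter names
# of systemd::timer::job are from memory of operations/puppet, so verify
# against the real define.
systemd::timer::job { 'peek-monthly-report':
    ensure      => present,
    description => 'Generate the monthly peek report',
    command     => '/usr/local/bin/peek-report',  # hypothetical path
    user        => 'root',
    # systemd OnCalendar form of "00:00 on the 1st of every month".
    interval    => {
        'start'    => 'OnCalendar',
        'interval' => '*-*-01 00:00:00',
    },
}
```

Compared with the cron version, output lands in the journal rather than in root@'s mailbox, and what gets monitored is the unit's exit status rather than cronspam volume.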
[08:04:25] that's my thinking
[08:07:33] XioNoX: I think that we should try to find a compromise, pinging people in here as a 1:1 proxy for icinga may be overkill in my opinion
[08:09:47] elukey: there are tons of alerts, so if the people that are most likely "owners" of the services are around it's better they have a look rather than 2 or 3 of us trying to solve or open tasks about them
[08:10:36] XioNoX: sure but people should look at icinga first (in theory) when they start the day, and we should triage/ping people in here (in my opinion) when something urgent requires attention
[08:10:59] if something is low priority it can stay there in icinga for a bit in my opinion
[08:11:22] hmmm if it's low priority it should not be a CRIT in the first place?
[08:12:51] I agree, but this is not what happens with our alarms at the moment, we know it and things are improving of course
[08:21:12] For example, there were some criticals that I didn't ack this morning while working on some fires, including the one mentioned for the labstore nodes earlier on in here
[08:21:15] Bad Luca
[08:21:27] and a ping is ok
[08:21:33] next time I should do better
[08:25:10] elukey: If there were fewer alerts I would not have bothered people, but I felt like it was a bit too much right now. Especially after the weekend there is a higher risk that there are "more" critical alerts as nobody looked at them for 2 days. I don't think pinging people for that reason is bad (especially during working hours).
[08:26:15] XioNoX: nono, didn't mean it was bad, my point was to try to find a good balance
[08:27:00] it was a suggestion, as written above the ping for me was good, a WIP problem should have been acked
[08:30:48] no big deal at the end of the day and icinga looks better, thanks!
[08:31:53] unrelated, https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?orgId=1&from=now-6h&to=now is that the 1st of the month mailing list reminder?
[08:35:03] XioNoX: I think we'd need a discussion about this with the broader team, and set some standards, otherwise we may end up with more work and extra pings than needed for a better-looking icinga page (that could be looking better with some ground work on TOIL)
[08:35:56] 100% agree :)
[08:36:37] I started editing Icinga#How_to_handle_active_alerts precisely to have better standards, but it is not complete
[08:36:58] I added things like "If domain specific (e.g. Databases) ask on the relevant IRC channel/preferred mechanism (e.g. #wikimedia-databases). Avoid if possible single person pinging"
[08:37:29] as well as "Our goal is to keep the number of CRITICALs low and reasonable, in a way that balances the "good knowledge of unhandled ongoing issues" and "spam alerting" (making more difficult to detect problems)"
[08:38:13] personally, if we ping people all the time 2 bad things can happen: stressing people that are already high on stress
[08:38:27] and making people not create alerts to avoid pinging
[08:39:12] as I said in my email, crits are like SLAs - it cannot be 0, it has to be >0
[08:45:11] fixed deneb. there is no more CRIT now besides the netbox alert. i won't be getting into all the warnings though
[08:49:02] yes please
[08:49:15] (in the sense, "don't go through them please" :)
[08:57:42] elukey: re: netbox and broken hardware: yes it should be marked as failed: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed
[08:59:17] * elukey marks as failed..
[08:59:21] be back a little while later, cya
[09:00:17] done
[19:15:37] https://www.nanog.org/meetings/nanog-79/agenda/ tomorrow has a "Demystifying Open Source Network Operating Systems" talk
[19:32:41] XioNoX: do you know if they're posting videos at some point?
[19:33:06] yep, they're usually on Youtube the following days
[19:33:26] cool