[03:21:32] crud, NO_PAYMENT_PRODUCTS_AVAILABLE via Connect in Chile?
[03:24:11] Fundraising Sprint They Live, Fundraising Sprint USB stands for underhanded socket bureaucracy, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: Extend deletion to multiple silverpop databases - https://phabricator.wikimedia.org/T205332 (Eileenmcnaughton) @CCogdill_WMF I...
[03:27:00] ejegg: is that the failmail cause?
[03:27:27] that's what it looks like
[03:27:45] guessing the recurring-ness might have something to do with it
[03:28:13] sending an email to PPena to ask if she can confirm what we should have available
[03:28:42] ejegg: ok - do we need to take something down for tonight?
[03:29:53] let's see if this is just one donor
[03:35:52] hmph, that description should be translated
[08:31:28] PROBLEM - Host americium is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:48] PROBLEM - check_rsyslog_backlog on frdb1001 is CRITICAL: CRITICAL frlog1001=11 [critical = 10]
[09:26:50] PROBLEM - check_rsyslog_backlog on payments1003 is CRITICAL: CRITICAL frlog1001=14 [critical = 10]
[09:50:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=10 [critical = 10]
[10:05:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=12 [critical = 10]
[10:15:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=13 [critical = 10]
[10:25:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=13 [critical = 10]
[10:35:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=14 [critical = 10]
[10:45:10] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=16 [critical = 10]
[10:55:10] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=17 [critical = 10]
[11:05:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=18 [critical = 10]
[11:15:10] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=19 [critical = 10]
[11:25:10] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=20 [critical = 10]
[11:35:00] PROBLEM - check_rsyslog_backlog on frpig1001 is CRITICAL: CRITICAL frlog1001=21 [critical = 10]
[11:41:01] Jeff_Green, having some issues
[11:41:11] I've mailed a couple of notes
[11:41:40] currently looking at switching off the job related to this:
[11:41:41] Fail Mail (civi1001) run-job: Banner impressions loader timed out after 10 minutes
[11:41:48] as civi1001 is showing 600 in top
[11:48:34] jgleeson: hey, yup I just replied
[11:49:17] thanks Jeff_Green, read your reply but struggling to restart rsynclog. I don't have sudo perms
[11:49:25] yup
[11:49:28] I'm trying sudo service rsyslog restart
[11:49:30] i'm looking at it
[11:49:35] cool, thanks
[11:50:50] RECOVERY - check_rsyslog_backlog on payments1001 is OK: OK
[11:51:06] rsyslog*
[11:51:37] there are rsync problems too, it looks like one of the banner loggers fell over
[11:51:40] RECOVERY - check_rsyslog_backlog on payments1003 is OK: OK
[11:51:46] what the heck happened last night?!
[11:51:50] RECOVERY - check_rsyslog_backlog on payments1002 is OK: OK
[11:52:08] yeah I'm getting them mixed up lol
[11:52:37] I was looking at the log files for that and noticed the extremely high load in tp
[11:52:38] top
[11:52:49] although I can't see an offending process or CPU load to explain it
[11:52:53] on which host?
[11:53:22] civi1001
[11:53:45] load average: 666.33, 653.45, 619.75
[11:53:48] Jeff_Green, ^
[11:53:51] oh really
[11:54:05] is that still happening?
[11:55:00] RECOVERY - check_rsyslog_backlog on frpig1001 is OK: OK
[11:55:05] yes
[11:55:11] viewing top now
[11:55:15] looking
[11:55:25] this doesn't behave like a machine with load >600
[11:56:07] I've never seen load that high!
[11:57:39] if it were actually working that hard it should be impossible to do anything b/c there wouldn't be resources for ssh
[11:58:25] it is really slow for me
[11:58:39] but yes, I would imagine 600 would mean ground to a halt
[11:58:50] RECOVERY - Host americium is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms
[11:59:44] well americium had some kind of kernel panic
[12:00:34] killing prometheus_node_exporter seems to have coincided with load recovery
[12:01:22] I can see it dropping
[12:01:24] woah
[12:01:27] that was crazy
[12:01:44] I wonder if that is the root cause to the banner job timeouts
[12:02:12] OH!
[12:02:18] ok you just explained it right there
[12:02:26] the root cause was americium falling over
[12:02:47] americium is the banner logger, it exports its banner log archive by nfs
[12:03:02] ahh
[12:03:07] so when it fell over, civi1001 freaked out trying to access that nfs export
[12:03:21] nfs is notorious for not handling outages gracefully
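Note on the load figures above: on Linux, the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, and processes blocked on an unreachable hard-mounted NFS export typically sit in state D. That would be consistent with a load above 600 and no visible CPU usage in top. A minimal sketch for checking this on a Linux host, assuming /proc is mounted; the function names are illustrative and not part of any tooling mentioned in this log:

```python
import os


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes per state letter by scanning /proc/<pid>/stat."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                state = state_from_stat(fh.read())
        except (OSError, IndexError):
            continue  # process exited while we were scanning
        counts[state] = counts.get(state, 0) + 1
    return counts


# Only runs on hosts that expose /proc/loadavg (Linux).
if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as fh:
        print("load average:", parse_loadavg(fh.read()))
    counts = count_states()
    print("running (R):", counts.get("R", 0))
    print("uninterruptible sleep (D):", counts.get("D", 0))
```

A large D count next to a small R count would point at blocked I/O rather than CPU work.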
[12:04:55] hmmm
[12:05:09] Jeff_Green, the load on civi1001 is increasing again
[12:05:12] 68+
[12:05:47] yup, watching too
[12:06:21] something is hammering rsyslog
[12:08:40] PROBLEM - check_ipsec on americium is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: civi1001_v4
[12:10:10] PROBLEM - check_load on civi1001 is CRITICAL: CRITICAL - load average: 79.54, 142.74, 354.62
[12:10:21] hahahah
[12:10:30] :)
[12:10:51] i think we should just reboot civi1001
[12:12:52] Jeff_Green, do we need to disable paymentswiki first
[12:12:59] no