[01:48:35] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 8.61, 8.96, 6.67
[01:51:38] RECOVERY - jobrunner1 Current Load on jobrunner1 is OK: OK - load average: 3.72, 6.30, 6.02
[02:04:25] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 5.17, 7.10, 6.66
[02:07:23] RECOVERY - jobrunner1 Current Load on jobrunner1 is OK: OK - load average: 5.20, 6.75, 6.64
[02:21:24] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 15.29, 9.51, 7.97
[02:24:26] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 6.01, 7.75, 7.55
[02:30:24] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 8.54, 7.49, 7.57
[02:51:20] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 5.20, 7.02, 7.99
[03:06:15] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 12.27, 8.63, 8.16
[03:09:12] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 7.25, 7.79, 7.88
[03:12:10] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 8.20, 7.76, 7.83
[04:11:03] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 4.44, 6.94, 7.80
[04:16:54] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 10.83, 9.01, 8.37
[04:31:48] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 3.80, 5.71, 7.67
[04:37:44] RECOVERY - jobrunner1 Current Load on jobrunner1 is OK: OK - load average: 3.81, 4.22, 6.34
[08:26:46] PROBLEM - wiki.counterculturelabs.org - LetsEncrypt on sslhost is CRITICAL: connect to address wiki.counterculturelabs.org and port 443: Connection refused HTTP CRITICAL - Unable to open TCP socket
[08:31:20] ^ ack
[08:32:19] ^ lapsed registration on the full domain
[08:32:42] RECOVERY - wiki.counterculturelabs.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.counterculturelabs.org' will expire on Sat 27 Jun 2020 09:13:35 GMT +0000.
[08:33:19] flapping
[08:56:00] PROBLEM - wiki.counterculturelabs.org - LetsEncrypt on sslhost is CRITICAL: connect to address wiki.counterculturelabs.org and port 443: Connection refused HTTP CRITICAL - Unable to open TCP socket
[08:56:14] ^ pr open for that
[09:13:33] RECOVERY - wiki.counterculturelabs.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.counterculturelabs.org' will expire on Sat 27 Jun 2020 09:13:35 GMT +0000.
[09:18:59] Reception123: will you merge my ssl pr as well
[09:19:04] looking
[09:23:18] PROBLEM - wiki.counterculturelabs.org - LetsEncrypt on sslhost is CRITICAL: connect to address wiki.counterculturelabs.org and port 443: Connection refused HTTP CRITICAL - Unable to open TCP socket
[09:29:07] RECOVERY - wiki.counterculturelabs.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.counterculturelabs.org' will expire on Sat 27 Jun 2020 09:13:35 GMT +0000.
[10:29:27] Reception123: Your ACK for cp3 goes tomorrow, ns1 is warning (maybe ack that), and test2's puppet is dead
[10:29:45] RhinosF1: I'll ack it again, just did it temp because it was BOLD
[10:29:50] in case someone wanted to do it differently
[10:30:02] Reception123: If cp3's check needs changing, maybe downtime the check and open a task
[10:32:14] RhinosF1: well it doesn't really, high ping from a far away server is kind of normal
[10:32:44] Reception123: then the check's useless. It shouldn't WARN for something known and expected.
[10:32:57] anyway, ns1 & test2 need looking at
[10:36:09] RhinosF1: but the other way is changing the threshold which will affect other cps
[10:36:56] Reception123: that might not be true, we could create a new check I think that runs on faraway-cpus with a higher threshold
[10:46:22] paladox: ^ could you look at that as I've got lost with puppet config
[10:46:37] Reception123: still ns1 and test2 to handle/ack/downtime
[10:48:07] * RhinosF1 creates a phab task
[10:48:23] ok
[10:51:27] Reception123: https://phabricator.miraheze.org/T5669
[10:51:28] [ ⚓ T5669 ping4 on cp3 thresholds too high ] - phabricator.miraheze.org
[10:52:07] RhinosF1: I'll make it low prio since there's really way more important things to do than work on making a new check for that
[10:55:47] Reception123: oh ye, I just don’t agree with permanently ack’ing CRITICAL alerts or known non-issues
[10:56:18] A CRITICAL should be something well critical not impactless
[12:45:00] Well high ping is critical rhinosF1...
[12:45:57] paladox: then handle it
[12:56:11] PROBLEM - espiral.org - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'espiral.org' expires in 15 day(s) (Sun 14 Jun 2020 12:51:25 GMT +0000).
[12:56:34] paladox: well no, it's because cp3 is far away
[12:56:56] paladox: we've gone through the fact that we can't expect that server to not have high ping, it's normal. It's been acknowledged in icinga for the time being
[13:03:58] [miraheze/ssl] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/JfKIH
[13:04:00] [miraheze/ssl] MirahezeSSLBot 97ea9be - Bot: Update SSL cert for espiral.org
[13:14:16] RECOVERY - espiral.org - LetsEncrypt on sslhost is OK: OK - Certificate 'espiral.org' will expire on Thu 27 Aug 2020 12:03:52 GMT +0000.
[13:26:12] RhinosF1 are you being rude to me?
[13:29:14] * hispano76 greetings
[13:36:28] paladox: no
[13:36:45] well your message was definitely sounded rude to me
[13:37:23] paladox: if the alert is actually critical, it need handling rather than being ACK’d. It wasn’t aimed at anyone. More a if it’s critical, it should be dealt with.
[13:37:51] ok, then your message should have been "Please could you handle it?"
[13:37:59] If as Reception123 said, it’s expected, we should change the threshold
[13:38:14] paladox: yeah, sorry. I was mid multiple things.
[13:38:42] ok, if you had multiple things to do, you could have gotten back later :)
[13:59:10] I could have
[13:59:27] I also have a habit of trying to do things quickly
[14:19:32] Please remember to be kind here and to ask for things to be done nicely :)
[14:29:14] *op
[16:08:40] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is CRITICAL: CRITICAL - NGINX Error Rate is 89%
[16:15:50] [miraheze/puppet] paladox pushed 1 commit to paladox-patch-8 [+0/-0/±1] https://git.io/JfKs4
[16:15:51] [miraheze/puppet] paladox afd589e - monitoring::hosts: Add ipv6 address
[16:15:53] [puppet] paladox created branch paladox-patch-8 - https://git.io/vbiAS
[16:15:54] [puppet] paladox opened pull request #1385: monitoring::hosts: Add ipv6 address - https://git.io/JfKsR
[16:16:34] [puppet] paladox closed pull request #1385: monitoring::hosts: Add ipv6 address - https://git.io/JfKsR
[16:16:35] [miraheze/puppet] paladox pushed 2 commits to master [+0/-0/±2] https://git.io/JfKsu
[16:16:37] [miraheze/puppet] paladox 3ba9cba - Merge pull request #1385 from miraheze/paladox-patch-8 monitoring::hosts: Add ipv6 address
[16:16:38] [puppet] paladox deleted branch paladox-patch-8 - https://git.io/vbiAS
[16:16:40] [miraheze/puppet] paladox deleted branch paladox-patch-8
[16:21:04] [miraheze/puppet] paladox pushed 1 commit to paladox-patch-8 [+0/-0/±1] https://git.io/JfKsi
[16:21:06] [miraheze/puppet] paladox d660652 - monitoring: Increase ping critical/warning levels
[16:21:07] [puppet] paladox created branch paladox-patch-8 - https://git.io/vbiAS
[16:21:09] [puppet] paladox opened pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:21:36] PROBLEM - ping6 on ns1 is WARNING: PING WARNING - Packet loss = 0%, RTA = 106.82 ms
[16:22:19] [miraheze/puppet] paladox pushed 1 commit to paladox-patch-8 [+1/-0/±0] https://git.io/JfKs1
[16:22:20] [miraheze/puppet] paladox e1f7f93 - Create services.conf
[16:22:22] [puppet] paladox synchronize pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:24:02] [miraheze/puppet] paladox pushed 1 commit to paladox-patch-8 [+0/-0/±1] https://git.io/JfKsH
[16:24:03] [miraheze/puppet] paladox 35354f3 - Update services.conf
[16:24:05] [puppet] paladox synchronize pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:24:59] paladox: that’s probably why I couldn’t find the file in the config if it wasn’t there
[16:25:09] * RhinosF1 attempted to look but got lost
[16:25:14] [puppet] paladox edited pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:26:01] PROBLEM - ping6 on dbt1 is CRITICAL: CRITICAL - Destination Unreachable (fe80::f816:3eff:fe5d:3eae)
[16:26:55] [puppet] paladox edited pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:27:18] [puppet] paladox synchronize pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKsP
[16:27:20] [miraheze/puppet] paladox pushed 1 commit to paladox-patch-8 [+0/-0/±1] https://git.io/JfKsp
[16:27:21] [miraheze/puppet] paladox 26dfaa4 - Update services.conf
[16:28:18] paladox: do they apply to ping6 as well
[16:28:24] yes
[16:28:33] As there’s 2 ping4 issues and ping6 currently
[16:28:37] But looks good
[16:30:55] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is WARNING: WARNING - NGINX Error Rate is 58%
[16:32:07] paladox: 2 ping6 alerts have gone off now. ns1 warning and dbt1 critical (unreachable)
[16:32:12] Are they expected?
[16:32:22] Also cp6 above
[16:32:37] cp6 is not in use apart from one custom domain using it.
[16:32:51] and i am aware of the issues
[16:33:01] paladox: good and which wiki?
[16:33:22] i would have to look, which i'm not going to do atm
[16:33:51] Fair enough
[16:35:22] !log root@dbt1:/home/paladox# ip addr del fe80::f816:3eff:fe5d:3eae/64 dev eth0
[16:35:26] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master
[16:35:31] [puppet] JohnFLewis commented on pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKGc
[16:36:39] RhinosF1 looks like you'll need to take your task up with JohnLewis ^
[16:36:44] (going to decline it)
[16:36:49] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 0%, RTA = 261.25 ms
[16:37:01] paladox: I’m replying now
[16:37:26] (declined now)
[16:37:32] [puppet] RhinosF1 commented on pull request #1386: monitoring: Increase ping critical/warning levels - https://git.io/JfKGR
[16:37:43] JohnLewis: ^
[16:40:02] Hello GG95! If you have any questions, feel free to ask and someone should answer soon.
[16:40:29] hi
[16:40:46] Hi GG95, how can we help?
[16:40:57] just admit it
[16:41:24] GG95: what
[16:41:45] RECOVERY - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is OK: OK - NGINX Error Rate is 33%
[16:41:55] THAT RHINOSF1 IS MARRIED TO RECEPTION123 BUT HAS AN AFFAIR WITH EXAMKNOW AND VOIDWALKER. I FELL THAT THIS IS VERY TRUE BECAUSE HE TOLD ME!
[16:42:28] *lockdown
[16:44:51] RhinosF1: I'm saying if we call "499" OK, that really isn't
[16:44:52] JohnLewis: 2020-05-26 - 17:52:24UTC tell JohnLewis miraheze wiki emails are taking hours
[16:44:52] JohnLewis: 2020-05-26 - 17:58:31UTC tell JohnLewis see https://phabricator.miraheze.org/T5659
[16:44:54] JohnLewis: 2020-05-27 - 15:19:36UTC tell JohnLewis if no one else has done it already, please fix the user rights problem in T5661
[16:46:07] JohnLewis: that was fixed and anything above 251ms is fine
[16:46:40] so 499ms is acceptable?
[16:48:10] JohnLewis: if the site loads, yes
[16:48:18] I did suggest just upping for cp3
[16:48:27] But 252ms will stop the alert
[16:48:34] 499 isn't acceptable
[16:48:35] though
[16:49:02] I'd argue anything >50ms is probably not acceptable either from icinga barring international cps
[16:49:53] JohnLewis: then maybe create an international cp check
[16:50:03] With higher parais
[16:50:13] (Which I said this morning)
[16:50:50] Yes
[16:51:01] That would be an acceptable solution
[16:51:31] paladox: ^
[17:05:44] note that i got the value from wikimedia
[17:06:11] https://github.com/wikimedia/puppet/blob/4d9909af05dc3a9c63f6c3ed5cca79052b961e45/modules/monitoring/manifests/host.pp#L113
[17:06:11] [ puppet/host.pp at 4d9909af05dc3a9c63f6c3ed5cca79052b961e45 · wikimedia/puppet · GitHub ] - github.com
[17:07:07] JohnLewis ^
[17:23:09] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 251.86 ms
[17:23:58] i don't even need to create a separate check
[17:23:59] woo
[17:24:05] i can do conditionals!
[17:39:42] PROBLEM - mon1 Puppet on mon1 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[icinga2]
[17:39:59] paladox: yey!
[17:42:15] RECOVERY - mon1 Puppet on mon1 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[17:43:22] RECOVERY - ping4 on ns1 is OK: PING OK - Packet loss = 0%, RTA = 103.15 ms
[17:45:41] paladox: https://phabricator.miraheze.org/T5669#110668 resolved then?
[17:45:42] [ ⚓ T5669 ping4 on cp3 thresholds too high ] - phabricator.miraheze.org
[17:53:20] RECOVERY - ping6 on ns1 is OK: PING OK - Packet loss = 0%, RTA = 107.14 ms
[17:53:30] RECOVERY - ping6 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 260.68 ms
[17:53:53] Yey!
[17:54:09] Just test2 puppet broke now
[18:15:24] done
[20:23:47] .in 5m purge IP gbans
[20:23:48] Examknow: Okay, will remind at 2020-05-29 - 15:28:48CDT
[20:28:48] Examknow: purge IP gbans
[20:29:01] ty bot
[21:46:57] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is WARNING: WARNING - NGINX Error Rate is 57%
[22:38:40] RECOVERY - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is OK: OK - NGINX Error Rate is 39%
[22:52:58] Hello NNJS! If you have any questions, feel free to ask and someone should answer soon.
[22:53:08] NOTE: I have scheduled a Maint Window for ZppixBot to perform some database cleanup & cache and log deletion. ZppixBot will be unavailable on Monday between 10am and 10:30 am UTC+1
[22:53:12] hi NNJS
[22:53:21] hi RhinosF1
[22:53:25] is it true?
[22:54:14] NNJS: what?
[22:54:42] THAT YOU ARE GAY WITH RECEPTION123 EXAMKNOW AND VOIDWALKER AT THE SAME FUCKING TIME!!!!!!!!!!!!!!!!!
[22:55:10] * Examknow called it
[23:01:50] Voidwalker: ^
[23:02:06] ty
[23:08:14] *unlockdown
[23:13:52] PROBLEM - cp3 Disk Space on cp3 is WARNING: DISK WARNING - free space: / 2649 MB (10% inode=93%);
[23:50:19] [miraheze/services] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/JfKgb
[23:50:21] [miraheze/services] MirahezeSSLBot d53d12f - BOT: Updating services config for wikis