[00:00:19] PROBLEM - test2.miraheze.org - Sectigo on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:04:23] [02puppet] 07paladox closed pull request 03#1638: Remove old infrastructure - 13https://git.io/JtauB
[00:04:25] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-19/±1] 13https://git.io/JtrQJ
[00:04:26] [02miraheze/puppet] 07Universal-Omega 03fe7ef6b - Remove old infrastructure (#1638)
[00:04:33] [02dns] 07paladox closed pull request 03#191: Remove old infra - 13https://git.io/JtrSh
[00:04:34] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtrQU
[00:04:36] [02miraheze/dns] 07paladox 0312c0acc - Remove old infra (#191)
[00:04:37] [02dns] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbQXl
[00:04:39] [02miraheze/dns] 07paladox deleted branch 03paladox-patch-1
[00:07:14] RECOVERY - test2.miraheze.org - Sectigo on sslhost is OK: OK - Certificate '*.miraheze.org' will expire on Sat 23 Oct 2021 23:59:59 GMT +0000.
[00:09:01] !log removed cloud1 from cluster
[00:09:04] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[00:10:13] RECOVERY - test2.miraheze.org - reverse DNS on sslhost is OK: rDNS OK - test2.miraheze.org reverse DNS resolves to cp11.miraheze.org
[00:55:04] He said I didn't need to back that up, if my memory is correct
[01:00:40] paladox, yeah now that you mention it, I think that sounds right. Or actually, I think I remember him saying he moved it to jobrunner3 because I asked him how you do a migration of home directories across servers
[01:13:16] @Site Reliability Engineers I received an e-mail notification 35 minutes ago of an e-mail message from a user on metawiki, but I've checked all my e-mail account's folders, and there's nothing. Can we check for jobs on metawiki or, alternatively, if everything has been configured correctly for the on-wiki e-mails post server migration? Graylog may potentially have something useful as well
[01:33:35] RECOVERY - thesimswiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'www.thesimswiki.com' will expire on Thu 25 Mar 2021 22:57:41 GMT +0000.
[02:01:34] PROBLEM - ping6 on dbbackup2 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 102.47 ms
[02:03:37] PROBLEM - ping6 on dbbackup2 is WARNING: PING WARNING - Packet loss = 0%, RTA = 101.66 ms
[03:45:53] PROBLEM - ping6 on dbbackup2 is CRITICAL: PING CRITICAL - Packet loss = 100%
[03:47:55] PROBLEM - ping6 on dbbackup2 is WARNING: PING WARNING - Packet loss = 0%, RTA = 102.61 ms
[07:05:31] PROBLEM - jobrunner3 APT on jobrunner3 is CRITICAL: APT CRITICAL: 33 packages available for upgrade (1 critical updates).
[07:06:37] PROBLEM - mw8 APT on mw8 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (1 critical updates).
[07:07:33] PROBLEM - cp11 APT on cp11 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (1 critical updates).
[07:08:22] PROBLEM - ldap2 APT on ldap2 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:11:22] PROBLEM - services4 APT on services4 is CRITICAL: APT CRITICAL: 28 packages available for upgrade (1 critical updates).
[07:12:37] PROBLEM - services3 APT on services3 is CRITICAL: APT CRITICAL: 28 packages available for upgrade (1 critical updates).
[07:14:17] PROBLEM - cp3 APT on cp3 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (1 critical updates).
[07:16:07] PROBLEM - cloud4 APT on cloud4 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates).
[07:20:34] PROBLEM - cloud3 APT on cloud3 is CRITICAL: APT CRITICAL: 91 packages available for upgrade (2 critical updates).
[07:20:47] PROBLEM - bacula2 APT on bacula2 is CRITICAL: APT CRITICAL: 1 packages available for upgrade (1 critical updates).
[07:20:52] PROBLEM - ns2 APT on ns2 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (1 critical updates).
[07:21:02] PROBLEM - gluster3 APT on gluster3 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:21:10] PROBLEM - db12 APT on db12 is CRITICAL: APT CRITICAL: 66 packages available for upgrade (1 critical updates).
[07:21:12] PROBLEM - db11 APT on db11 is CRITICAL: APT CRITICAL: 66 packages available for upgrade (1 critical updates).
[07:23:26] PROBLEM - mon2 APT on mon2 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (1 critical updates).
[07:24:12] PROBLEM - cp12 APT on cp12 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:24:36] PROBLEM - cp10 APT on cp10 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (1 critical updates).
[07:25:28] PROBLEM - db13 APT on db13 is CRITICAL: APT CRITICAL: 28 packages available for upgrade (1 critical updates).
[07:25:46] PROBLEM - puppet3 APT on puppet3 is CRITICAL: APT CRITICAL: 31 packages available for upgrade (1 critical updates).
[07:26:39] PROBLEM - gluster4 APT on gluster4 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:27:42] PROBLEM - ns1 APT on ns1 is CRITICAL: APT CRITICAL: 23 packages available for upgrade (1 critical updates).
[07:28:44] PROBLEM - cloud5 APT on cloud5 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates).
[07:30:50] PROBLEM - rdb4 APT on rdb4 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:30:58] PROBLEM - mail2 APT on mail2 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (1 critical updates).
[07:31:48] PROBLEM - graylog2 APT on graylog2 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (1 critical updates).
[07:34:08] Someone else (Matsu) reported that exact issue as well.
[07:34:09] PROBLEM - rdb3 APT on rdb3 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:34:11] PROBLEM - mw10 APT on mw10 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (1 critical updates).
[07:34:39] PROBLEM - phab2 APT on phab2 is CRITICAL: APT CRITICAL: 25 packages available for upgrade (1 critical updates).
[07:35:09] PROBLEM - jobrunner4 APT on jobrunner4 is CRITICAL: APT CRITICAL: 31 packages available for upgrade (1 critical updates).
[07:35:28] PROBLEM - test3 APT on test3 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (1 critical updates).
[07:36:22] PROBLEM - mw9 APT on mw9 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates).
[07:36:22] PROBLEM - mw11 APT on mw11 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (1 critical updates).
[10:01:08] PROBLEM - ping6 on dbbackup2 is CRITICAL: PING CRITICAL - Packet loss = 28%, RTA = 103.68 ms
[10:03:10] PROBLEM - ping6 on dbbackup2 is WARNING: PING WARNING - Packet loss = 0%, RTA = 102.55 ms
[10:04:39] PROBLEM - dbbackup2 Current Load on dbbackup2 is WARNING: WARNING - load average: 3.53, 2.44, 1.25
[10:06:37] PROBLEM - dbbackup2 Current Load on dbbackup2 is CRITICAL: CRITICAL - load average: 4.90, 3.25, 1.69
[10:08:37] RECOVERY - dbbackup2 Current Load on dbbackup2 is OK: OK - load average: 1.08, 2.29, 1.52
[11:10:19] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 128.199.139.216/cpweb, 2607:5300:205:200::1c30/cpweb
[11:12:19] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[11:28:06] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 1 datacenter is down: 51.195.236.219/cpweb
[11:30:02] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
[12:01:00] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 1 datacenter is down: 51.195.236.219/cpweb
[12:02:55] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
[12:06:02] PROBLEM - ping6 on dbbackup2 is CRITICAL: PING CRITICAL - Packet loss = 100%
[12:08:04] PROBLEM - ping6 on dbbackup2 is WARNING: PING WARNING - Packet loss = 0%, RTA = 101.83 ms
[12:27:27] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 51.195.236.250/cpweb
[12:29:21] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[13:10:37] PROBLEM - dbbackup2 Current Load on dbbackup2 is WARNING: WARNING - load average: 3.93, 2.66, 1.47
[13:12:37] RECOVERY - dbbackup2 Current Load on dbbackup2 is OK: OK - load average: 1.58, 2.45, 1.56
[14:01:07] Yeah... I'm thinking I should create a Phabricator task for that.
[15:05:06] !log sudo -u www-data php /srv/mediawiki/w/maintenance/deleteBatch.php --wiki marveloustvshowepisodeswiki --r "[[phab:T6836|Requested]]" /home/reception/bwmdel.txt
[15:05:09] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[15:05:20] !log sudo -u www-data php /srv/mediawiki/w/maintenance/deleteBatch.php --wiki wandavisionwiki --r "[[phab:T6836|Requested]]" /home/reception/bwmdel.txt
[15:05:24] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[16:06:41] PROBLEM - rdb3 Current Load on rdb3 is CRITICAL: CRITICAL - load average: 4.86, 2.82, 1.28
[16:08:38] RECOVERY - rdb3 Current Load on rdb3 is OK: OK - load average: 0.74, 1.94, 1.14
[16:16:19] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtogS
[16:16:20] [02miraheze/mediawiki] 07paladox 03c7fe39b - Update moderation
[16:17:37] [02MatomoAnalytics] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/fN4LT
[16:17:38] [02miraheze/MatomoAnalytics] 07paladox deleted branch 03paladox-patch-1
[16:47:03] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. mw10 mw11
[16:48:12] PROBLEM - cp3 NTP time on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
[16:48:42] PROBLEM - cp3 Stunnel Http for mw10 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
[16:49:04] PROBLEM - cp3 Stunnel Http for mw8 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
[16:49:26] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 2 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb
[16:50:21] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb
[16:50:24] PROBLEM - cp3 Puppet on cp3 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sysctl.d]
[16:51:07] RECOVERY - cp3 Stunnel Http for mw8 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15131 bytes in 1.060 second response time
[16:51:26] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
[16:52:14] RECOVERY - cp3 NTP time on cp3 is OK: NTP OK: Offset -0.0006739795208 secs
[16:52:19] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[16:52:54] RECOVERY - cp3 Stunnel Http for mw10 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15132 bytes in 1.006 second response time
[16:53:06] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 7 backends are healthy
[17:14:52] RECOVERY - cp3 Puppet on cp3 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[17:50:37] PROBLEM - dbbackup2 Current Load on dbbackup2 is WARNING: WARNING - load average: 3.90, 2.93, 1.58
[17:52:38] PROBLEM - dbbackup2 Current Load on dbbackup2 is CRITICAL: CRITICAL - load average: 4.27, 3.38, 1.91
[17:54:37] PROBLEM - dbbackup2 Current Load on dbbackup2 is WARNING: WARNING - load average: 3.97, 3.55, 2.15
[17:56:39] RECOVERY - dbbackup2 Current Load on dbbackup2 is OK: OK - load average: 2.90, 3.36, 2.27
[18:12:21] PROBLEM - test3 Puppet on test3 is WARNING: WARNING: Puppet is currently disabled, message: John, last run 9 minutes ago with 0 failures
[18:21:15] PROBLEM - ping6 on dbbackup2 is CRITICAL: PING CRITICAL - Packet loss = 44%, RTA = 101.95 ms
[18:23:18] PROBLEM - ping6 on dbbackup2 is WARNING: PING WARNING - Packet loss = 0%, RTA = 101.86 ms
[18:33:19] JohnLewis, there's currently an issue, confirmed by two users, Pine/Matsu and myself, whereby we receive the Echo notification when a user has sent us an e-mail message, but no e-mail message was received. Both of us have checked all our e-mail accounts' folders, and there's not been anything. Reception123 did receive an e-mail I sent him, but that was also to his reception123[at]miraheze[dot]org e-mail address. In any case, with multiple users reporting issues not receiving mail, I suspect there's a configuration issue with `mail2` and MediaWiki, likely stemming from the server migration. Reception123 checked where he thought the mail2 logs would be, but noted only an empty folder
[18:33:34] The mail log on mail2 seemed empty, and I don't see any mail-related logs on Graylog either
[18:33:36] Are you able to troubleshoot this?
[18:33:57] I very much doubt it's a "configuration issue" though since I received one
[18:34:11] But that's also inside the Miraheze cluster, Reception123
[18:35:41] I received a MediaWiki-generated email earlier today. Unless you would be able to provide exact times or reproduce it, debugging may not be eventful
[18:35:42] We need to track down the logs of MediaWiki communicating with mail2, which is possibly handled by Redis, but I'm not sure exactly, and then track down potential mail server message delivery errors
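As background on what "configured correctly" means here: MediaWiki relays outbound mail (including Special:EmailUser messages) through whatever SMTP server is set in $wgSMTP, so after a migration that setting has to point at the new relay. A minimal sketch, with placeholder host/port/auth values rather than Miraheze's actual configuration:

```php
// Sketch only - the hostname, port and auth values are assumptions for
// illustration, not the real Miraheze settings.
$wgSMTP = [
	'host'   => 'mail2.miraheze.org', // SMTP relay MediaWiki should hand mail to
	'IDHost' => 'miraheze.org',       // domain used when generating Message-ID headers
	'port'   => 25,                   // placeholder; 587 with 'auth' => true is also common
	'auth'   => false,                // placeholder; set true plus username/password if the relay requires it
];
```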
[18:36:00] JohnLewis, yes, I can give you exact times.
[18:36:15] We also had someone email us today who said they were struggling to reset their password via email, but they then emailed back and said it was an issue on their side, as when they changed the email it worked
[18:36:26] MediaWiki auto-generated emails don't seem to be the issue; it seems to be related only to Special:EmailUser emails
[18:37:08] Also, there were no issues with Special:EmailUser before the server migration
[18:38:02] I've just sent an email to myself via EmailUser and I received it
[18:38:53] dmehus: oh that's interesting yeah, try sending yourself an email and see what happens
[18:43:03] JohnLewis, the timestamps are 18 minutes, 29 minutes, and 18 hours ago, respectively
[18:44:33] Reception123, okay, just tried that
[18:45:40] Reception123, no Echo notification, but I may not get an Echo notification for emails from myself, and no email either
[18:46:04] I could try it on another wiki and see if it's a metawiki issue perhaps?
[18:46:19] yeah, sure
[18:46:25] it's very strange that it's only happening to you
[18:47:24] Not just me
[18:47:29] Pine/Matsu as well
[18:48:23] I can get their exact on-wiki username, as it may be helpful for JohnLewis when looking in the error/job/access logs for the email sent 18 hours ago
[18:48:53] Pine/Matsu is `松` on-wiki
[18:48:58] Time is all I need; the username is useless as it's not known to mail
[18:49:47] JohnLewis, yeah, unless it would show up in the MediaWiki error logs, but probably not, yeah?
[18:50:14] I can't get you a closer time than 18 hours ago as MediaWiki doesn't report the minutes when > 1 hour ago, afaik
[18:52:12] No e-mail from the email I sent to myself from `testwiki` a few minutes ago either
[18:52:58] JohnLewis and Reception123, I think I might've found the issue
[18:52:59] Has to be an issue on your end
[18:53:20] Logs show at 18:23:26, a mail was sent by noreply@miraheze.org to your email
[18:54:26] Never mind, Outlook is rejecting our emails
[18:54:38] Hrm, maybe that's a problem with not getting past Outlook's mail server; can we try changing that to something like `mediawiki@miraheze.org`, in case it's getting ensnared by spam filters?
[18:54:39] ah
[18:55:07] Not sure if 松 also uses Outlook/Hotmail, though...
[18:56:13] Reception123: ^ can you check via the User object?
[18:56:41] will check
[18:57:57] I wonder if it could be something related to our certificates / IPs of our mail server that's causing Outlook/Hotmail and potentially other mail servers to reject our emails? 🤔
[19:01:42] it's not outlook for the user in question
[19:02:27] It's nothing on our end
[19:02:44] It's literally Outlook rejecting the emails because of an IP-based rejection
[19:03:04] https://mxtoolbox.com/SuperTool.aspx?action=blacklist%3amail.miraheze.org&run=toolpage we do seem to be on a blacklist here, though I'm not quite sure what that is/means
[19:03:04] [ Network Tools: DNS,IP,Email ] - mxtoolbox.com
[19:04:55] JohnLewis, yeah, I get that, but that's still problematic. We should not be having any issues as we didn't before, and other wiki farms, like Wikimedia, don't have the same issue.
[19:05:12] Reception123, yeah, that's interesting. We definitely don't want to be on any email blacklists
[19:05:41] Wikimedia own their IP space, we don't - so unfortunately we're at the mercy of what others are doing inside the same /24 as us, or in this case, what they were doing with the IP before us
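The mxtoolbox page linked above is doing DNS blacklist (DNSBL) lookups. As a rough sketch of the mechanism only, not of any Miraheze tooling: reverse the IPv4 octets, append the blacklist zone, and ask for an A record. The IP below is a placeholder and the two zones are simply well-known public lists.

```php
<?php
// Sketch of a DNSBL lookup; any A record in the response means "listed".
function isListedOnDnsbl( string $ip, string $zone ): bool {
	// e.g. 192.0.2.10 checked against dnsbl.example -> 10.2.0.192.dnsbl.example
	$query = implode( '.', array_reverse( explode( '.', $ip ) ) ) . '.' . $zone;
	return checkdnsrr( $query, 'A' );
}

$ip = '192.0.2.10'; // placeholder address, not the real mail server IP
foreach ( [ 'zen.spamhaus.org', 'dnsbl-2.uceprotect.net' ] as $zone ) {
	echo "$ip on $zone: " . ( isListedOnDnsbl( $ip, $zone ) ? 'listed' : 'not listed' ) . PHP_EOL;
}
```

UCEPROTECT Level 2/3 list whole provider ranges rather than single addresses, which is why a tenant of a badly reputed /24 can end up listed without having sent any spam itself.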
[19:07:40] I suspect it's our choice of host. OVH is a known source of spam. I've personally had to globally soft rangeblock OVH servers in France, Canada, and the UK due to the tendency of spam-only accounts to use their services
[19:08:15] It's nothing to do with the host, anyone and anything can be marked as spam and have spam activity on it
[19:08:57] JohnLewis, I disagree. OVH should be more responsive to abuse reports, so they don't end up on so many spam blacklists
[19:10:01] I disagree, there's no obligation to send abuse reports - responding to them likewise isn't a protection
[19:10:20] as that website Reception123 linked to says, "If you are on the UCEPROTECTL2 / L3, you have an IP Address from your ISP that falls into a poor reputation range; i.e. the entire range of IP Addresses is blocked as a result of the provider hosting spammers."
[19:10:55] it is difficult for a whole provider to actively combat spam and suspend spammers though
[19:10:59] Mhm, that's exactly what I said before. It doesn't prove your point though
[19:11:13] I suppose, but how do we resolve this? And why was I able to receive emails on the old server (which was also OVH, I think, right?). Could it be a problem with the OVH server location for `mail2`?
[19:11:49] Maybe we could migrate `mail2` onto a different VM in a different server location
[19:12:23] Changing the cloud server won't change anything, because the IP choices to us are limited and all in the same range
[19:12:45] Do we need to have them all in the same range?
[19:13:08] We didn't really get a say, we were given them like that
[19:13:08] Could we swap out an IP from our allocation to one in a different range?
[19:13:14] oh
[19:13:19] that sucks :(
[19:14:23] Can we raise an upstream ticket with OVH to have them contact Outlook and other major email service providers to have the IP removed from the blacklist(s)?
[19:17:22] Nope, that's for us to do
[19:17:28] which I have already done for Outlook
[19:19:02] oh, do you think they'd give us a whitelist on our particular IP or something?
[19:19:17] They did for us in 2016 and then in 2018
[19:19:24] oh, that's hopeful then
[19:19:30] they've unfortunately never answered regarding https://phabricator.miraheze.org/T5824
[19:19:31] [ ⚓ T5824 Deploy MediaModeration ] - phabricator.miraheze.org
[19:19:37] JohnLewis, okay thanks :)
[19:20:09] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/Jto1I
[19:20:10] [02miraheze/puppet] 07paladox 03482f020 - base::syslog: Use rsyslog to log remotely
[19:20:12] [02puppet] 07paladox created branch 03paladox-patch-1 - 13https://git.io/vbiAS
[19:20:13] [02puppet] 07paladox opened pull request 03#1642: base::syslog: Use rsyslog to log remotely - 13https://git.io/Jto1t
[19:20:46] Reception123, oh interesting, hopefully Outlook is more responsive, then, than Microsoft's PhotoDNA team :)
[19:21:03] based on what John said regarding the last years, they must be
[19:21:03] JohnLewis, do you want me to create a Phabricator task so you can resolve it?
[19:21:34] Reception123, yeah to Outlook... weird that Microsoft's PhotoDNA team hasn't replied though :(
[19:21:44] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+1/-0/±0] 13https://git.io/Jto1O
[19:21:45] [02miraheze/puppet] 07paladox 03120625e - Create remote_syslog.conf.erb
[19:21:47] Well, it's already been resolved so the need for a task has gone now
[19:21:47] [02puppet] 07paladox synchronize pull request 03#1642: base::syslog: Use rsyslog to log remotely - 13https://git.io/Jto1t
[19:22:02] Well, 'resolved' in the sense that it's being dealt with as best it can be
[19:23:39] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/Jto1C
[19:23:40] [02miraheze/puppet] 07paladox 03051ff31 - Update common.yaml
[19:23:42] [02puppet] 07paladox synchronize pull request 03#1642: base::syslog: Use rsyslog to log remotely - 13https://git.io/Jto1t
[19:23:43] JohnLewis... I didn't want to create a task that would be closed as invalid, though I guess a basic task couldn't hurt, and you and Reception123 can add in anything important I miss?
[19:24:06] [02puppet] 07paladox edited pull request 03#1642: base::syslog: Add support to log remotely using rsyslog - 13https://git.io/Jto1t
[19:24:14] well it's one of those upstream tasks where there's nothing left for us to do anyway, so it wouldn't be valid :)
[19:24:16] Reception123, did you give John Pine/Matsu's email service provider as well, so he can contact them too?
[19:24:38] well it's valid in the sense of sending an email to the email service provider
[19:24:42] I don't think an email service provider is private information that can't be mentioned here, is it?
[19:25:00] you can DM him though
[19:25:12] and just confirm that you DMed it to him
[19:25:33] well, I can say it's one of the most popular ones and that there shouldn't be any issues with it
[19:25:35] unless you meant "can be mentioned here"
[19:25:52] Reception123, but there are issues, so we obviously need to contact them as well
[19:25:55] dmehus: and yes, I meant I'm not sure whether it's appropriate to mention a user's email provider here, whether that's considered private information
[19:26:10] as it's not an email address, it's just a (very famous) provider
[19:26:13] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/Jto1u
[19:26:14] [02miraheze/puppet] 07paladox 0306ddf28 - Update syslog.pp
[19:26:16] [02puppet] 07paladox synchronize pull request 03#1642: base::syslog: Add support to log remotely using rsyslog - 13https://git.io/Jto1t
[19:27:48] Reception123, yeah, maybe Pine/Matsu didn't check their spam folder
[19:27:55] so not sure it was the same issue, good point
[19:28:10] yeah, if we want to investigate their case we should ask them more questions
[19:28:14] yeah
[19:28:28] and anyway, John would've seen an error message if it wasn't able to be delivered
[19:28:35] and he only mentioned mine was blocked
[19:30:32] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/Jto1P
[19:30:33] [02miraheze/puppet] 07paladox 0369ae875 - Update remote_syslog.conf.erb
[19:30:35] [02puppet] 07paladox synchronize pull request 03#1642: base::syslog: Add support to log remotely using rsyslog - 13https://git.io/Jto1t
[19:44:53] Without a time, I can't investigate the other one further unless I fancy trawling through thousands of lines of logs
[19:49:47] JohnLewis, yeah, that's fine...see DM
[19:56:42] [02puppet] 07JohnFLewis assigned pull request 03#1642: base::syslog: Add support to log remotely using rsyslog - 13https://git.io/Jto1t
[19:57:49] Reception123: can you follow up with the MediaModeration thing then if possible? I just saw the PR, and if it's been waiting since Nov 2020 that's bad :)
[20:04:14] PROBLEM - guia.cineastas.pt - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - guia.cineastas.pt All nameservers failed to answer the query.
[20:11:09] RECOVERY - guia.cineastas.pt - reverse DNS on sslhost is OK: rDNS OK - guia.cineastas.pt reverse DNS resolves to cp11.miraheze.org
[20:30:07] JohnLewis: I've asked RhinosF1, who sent the original email, quite a few times and he's said he's not received a response
[20:33:40] If there's no response then, unless we do it again, we should drop the idea, noting it's not pursuable
[20:37:51] Yeah, that's probably the best idea. I mean, we could potentially pursue the idea again in the future, but no point in keeping the task open indefinitely
[20:39:47] I disagree, JohnLewis; considering what the extension does, we should make another attempt before moving on
[20:47:28] Well, to be fair, JohnLewis didn't say never pursue the idea again. Closing the task as declined just means it's not pursued at this time (in this case, due to the lack of communication from Microsoft), but someone else can always pursue it again in the future.
[20:47:38] We can probably try again (restart the application), but if we get no response the only logical (and unfortunate) choice will be to close it
[20:48:20] yeah
[20:50:08] Zppix: Re-read what I said, I clearly said "unless we do it again"
[20:50:22] Ah, I missed that, my bad
[20:52:30] JohnLewis and Reception123 (the latter before you go to bed), regarding https://phabricator.miraheze.org/T6006, is that still feasible with the new infrastructure, or should we close that as declined at this time? Also, would migrating to a Kafka job queue system be in scope of the EM (Infrastructure) or EM (MediaWiki)? That is, do non-MediaWiki services use jobrunner?
[20:52:31] [ ⚓ T6006 Migrate to a Kafka Job Queue ] - phabricator.miraheze.org
[20:53:04] John and I had a talk about that, and it was indeed difficult to qualify whether that's part of the Infra team or the MW team, and in the end, as you can see with the tag, we want for the MW team
[20:53:15] but this is definitely a task that, if we want to do it, will require collaboration between the two teams
[20:53:30] *went not want
[20:53:36] Reception123, yeah 💯 that it would require cross-coordination between the two teams
[20:55:02] Reception123: wanted to take on responsibility so it's his team to ensure it gets completed, which makes sense given the biggest effect (100% actually) is on MediaWiki and the person who wants it is in the MW team
[20:55:28] yeah
[20:55:40] as for the feasibility, @RhinosF1 has taken responsibility to look into that, and I did ask him about it recently and iirc he said he still had to ask someone (though I admit I may have not fully retained that conversation :P)
[20:55:55] heh
[20:56:04] and JohnLewis's initial comment seems to suggest that the chances of implementation aren't too high
[20:56:10] * dmehus thinks that's what IRC and Discord DM logs are for :D
[20:56:58] yeah, I agree it seems less likely we can implement it, based on comments I've read from SPF|Cloud being disappointed in the performance results of the dbbackup server(s)
[20:57:24] Seems system resource usage will be the continued issue
[20:57:30] I'm unsure of the relevance between those two topics
[20:57:37] dbbackup isn't even with OVH
[20:57:44] oh
[20:58:10] even less related, but if I had to choose my favorite task that's not my team's responsibility, it would definitely be https://phabricator.miraheze.org/T6759 :)
[20:58:11] [ ⚓ T6759 Automate the adding of SSL private keys to puppet2 ] - phabricator.miraheze.org
[20:58:38] lol, Reception123, because it would be difficult for your team to implement?
[20:59:01] dmehus: well it's part of Infra because my team can't implement it without help from Infra
[20:59:09] and puppet3 is infra
[20:59:46] oh, I misread what you wrote, Reception123
[20:59:58] you said your favourite task that's not your team's responsibility
[21:00:14] as for difficulty, unless perhaps there's something I'm missing, the approach RhinosF1 suggested seems reasonable
[21:00:17] I thought you said your least favourite task of your team heh
[21:00:40] oh no
[21:00:49] dmehus: my favourite task that's my team's is https://phabricator.miraheze.org/T6788
[21:00:50] [ ⚓ T6788 Enhancements to RequestWiki workflow for both requestors and creators ] - phabricator.miraheze.org
[21:01:07] oh, does that mean JohnLewis isn't going to claim that? :(
[21:01:13] I was hoping he would. :)
[21:01:51] oh no, I'm just saying it's my personal favourite :D
[21:02:19] oh yeah, I misread what you wrote again, lol
[21:02:35] and urgh, https://phabricator.miraheze.org/T5222 is going to officially be *1 year old* tomorrow
[21:02:36] [ ⚓ T5222 MediaWiki response time can fluctuate due to messages ] - phabricator.miraheze.org
[21:02:38] * dmehus thinks he needs to clean his glasses
[21:02:45] that's definitely not good for a normal priority task
[21:02:58] oh wow
[21:03:01] dmehus: there seems to be something in this channel today with misreading comments heh :P
[21:03:30] that might be an Infra task actually if it's related to the cache proxies/Varnish, no?
[21:03:44] or just a cross-coordination task but technically MediaWiki :)
[21:03:48] tbf, SPF looked into this task a few weeks ago, but it's not very clear what the next steps are
[21:03:55] Reception123, lol yeah :P
[21:04:30] and no, it's the MediaWiki cache, not the Varnish cache, so it's MW :)
[21:05:13] it's quite related to T6765#134610 as well
[21:05:47] Reception123: it's 100% related
[21:06:01] Resolving T6765 would resolve that task to my knowledge
[21:06:13] JohnLewis: yeah, that wasn't the best wording, it definitely is related :)
[21:07:00] dmehus: it's funny because, while the task is my team's, the only comments have been from the DSRE and the infra team :P
[21:08:06] Reception123, ah, yeah, and that's not that surprising actually considering the level of expertise of the DSRE and Infrastructure team :)
[21:16:52] JohnLewis, there still haven't been any CreateWiki experimental approval scores on wiki requests since the server migration was completed
[21:23:04] really?
[21:25:18] Let me check jbr to see if the job queue is OK
[21:26:14] Jbr is backed up
[21:26:20] I'm manually running
[21:27:19] Where are we logging now?
[21:29:04] graylog2
[21:30:31] paladox: I should have been more specific, I meant the SAL
[21:31:46] we log in this channel
[21:32:28] Ok
[21:33:01] !log runjobs.php on jbr3 on metawiki (130+ jobs were queued)
[21:33:05] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[21:34:07] 130?
[21:35:05] Prior to running, yes
[21:35:14] Looked like a bunch of CreateWiki
[21:35:21] According to showJobs.php
[21:35:26] I got 1 when I ran it before you put that message though
[21:35:37] Bit delayed on log
[21:36:05] I ran it after Doug's comment and before you said you'd check it
[21:36:09] php /srv/mediawiki/w/maintenance/showJobs.php --wiki metawiki
[21:36:09] 1
[21:37:33] * Zppix looks again
[21:41:41] All I have in backscroll is a bunch of CreateWiki-related job output before PuTTY cuts off, ugh, but I know showJobs.php showed at least 130, so maybe a wiki request was just approved
[21:42:19] There are about 125 abandoned CWAI jobs
[21:43:09] Most of the output I have is namespace-related
[21:43:21] examples?
[21:44:01] One sec
[21:44:32] dmehus: the model file can't be opened, so my only guess is somehow the file got corrupted during the migration, and all I can do in that case is ping paladox while I go and re-gen the file and see if that fixes it
[21:45:45] /mnt/mediawiki-static/requestmodel.phpml doesn't look corrupt to me
[21:45:46] JohnLewis
[21:45:49] it's owned by root
[21:45:53] but doesn't look corrupt
[21:47:12] It shouldn't be owned by root, iirc
[21:47:15] Notice: unserialize(): Error at offset 67108862 of 121734760 bytes in /srv/mediawiki/w/extensions/CreateWiki/vendor/php-ai/php-ml/src/ModelManager.php on line 35
[21:49:25] so I guess it only needs to be regenerated
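That notice comes from php-ml's ModelManager failing to unserialize the saved CreateWiki model. A minimal sketch of how such a file can be sanity-checked (and shown to need regeneration), assuming the extension's bundled php-ml autoloader; the model path is the one from the log, and the generous memory limit is only there because the notice shows a ~120 MB serialized model:

```php
<?php
// Sketch: try to restore the saved model with php-ml's ModelManager.
// The autoloader path is assumed from the ModelManager.php path in the notice above.
require_once '/srv/mediawiki/w/extensions/CreateWiki/vendor/autoload.php';

use Phpml\ModelManager;

ini_set( 'memory_limit', '1500M' ); // the serialized model is large (~120 MB per the notice)

$path = '/mnt/mediawiki-static/requestmodel.phpml';

try {
	// restoreFromFile() unserializes the estimator; a truncated or partially
	// copied file fails here, matching the unserialize() notice above.
	$estimator = ( new ModelManager() )->restoreFromFile( $path );
	echo 'Model restored OK: ' . get_class( $estimator ) . PHP_EOL;
} catch ( Throwable $e ) {
	echo 'Model unusable, regenerate it: ' . $e->getMessage() . PHP_EOL;
}
```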
[21:59:12] paladox: https://github.com/miraheze/puppet/pull/1642/files why?
[21:59:13] [ base::syslog: Add support to log remotely using rsyslog by paladox · Pull Request #1642 · miraheze/puppet · GitHub ] - github.com
[22:00:07] SPF|Cloud because to me stopping logging locally is a failure waiting to happen. What if the central logger goes down and we need logs? What if the logger is backed up?
[22:00:15] I don't even think the WMF disable local logging
[22:00:21] introducing two log daemons is a very bad idea; if your concerns are merely regarding potential unavailability of graylog, then you should configure syslog-ng to log to multiple destinations (remote AND local)
[22:00:33] [02miraheze/puppet] 07JohnFLewis pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtoQQ
[22:00:34] [02miraheze/puppet] 07JohnFLewis 034406ef1 - increase AI memory to 1500M
[22:01:28] !log manually run AI creation job for two wiki requests + regenerate model file
[22:01:31] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[22:01:41] I'm not sure how to configure syslog-ng to log locally, from when I looked earlier, SPF|Cloud
[22:01:50] then you need to look again
[22:04:00] I understand the need for local logging in certain situations, but journalctl already offers lots of logs locally, even on hosts currently running syslog-ng. Wikimedia is not a fair comparison, totally different stack and size; I'd understand scaling elasticsearch to their requirements is a much harder task than it is for us (to scale elasticsearch)
[22:06:44] writing logs twice comes at an I/O and disk space cost; my plan is not to eliminate local logs completely, but let's keep local logging restricted to certain situations: where log sizes become an issue (for example, logging the same request twice by logging from cache proxy nginx *and* mediawiki nginx) or where local logs are needed for monitoring (icinga-miraheze bot?)
[22:07:47] [02puppet] 07paladox closed pull request 03#1642: base::syslog: Add support to log remotely using rsyslog - 13https://git.io/Jto1t
[22:07:50] [02puppet] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbiAS
[22:07:52] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-1
[22:08:37] for the record, you could use a syslog_ng::destination stanza in any role (like role::mediawiki) to record syslog messages onto the local disk
[22:09:15] defining multiple sources (input) and destinations (output) is not a problem for syslog-ng, works fine
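For illustration only, here is roughly what that dual-destination layout looks like in raw syslog-ng syntax (the puppet module's syslog_ng::destination stanza ultimately renders something along these lines); the hostname, port, and file path are placeholders, not values from the puppet repo:

```
# One source, two destinations: ship to the central graylog host *and* keep a local copy.
source s_local {
    system();    # kernel/journal/device log messages
    internal();  # syslog-ng's own messages
};

destination d_graylog {
    # placeholder host and port for the central syslog input
    syslog("graylog2.miraheze.org" transport("tcp") port(514));
};

destination d_file {
    # placeholder local path; keeps logs readable even if graylog is unreachable
    file("/var/log/syslog-ng/${HOST}.log");
};

log {
    source(s_local);
    destination(d_graylog);
    destination(d_file);
};
```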
[22:13:06] we'll need a filebeat for gluster's logs
[22:15:28] JohnLewis, ah I figured you must've manually run some AI wiki request jobs as I saw approval scores pop up on three quite old wiki requests (https://meta.miraheze.org/wiki/Special:RequestWikiQueue/16367#mw-section-comments) heh
[22:15:28] dmehus: 2021-02-10 - 21:36:58UTC tell dmehus can you help me set up a wiki on my own VPS, if so, DM me
[22:15:29] [ Wiki requests queue - Miraheze Meta ] - meta.miraheze.org
[22:15:50] filebeat for gluster logs, how so?
[22:16:11] I've been looking to see if gluster logs to syslog
[22:16:36] the only thing I found is https://lists.gluster.org/pipermail/gluster-users/2011-December/009167.html
[22:16:37] [ [Gluster-users] syslog options for gluster ] - lists.gluster.org
[22:16:51] you can try that
[22:17:08] dmehus: yeah, going to produce an eval script to put all the others in the queue in a little bit
[22:17:15] if gluster only logs locally, syslog-ng could 'tail -f' those files, not an issue
[22:17:44] it can read logs?
[22:17:57] yes
[22:18:13] JohnLewis, responding to your comment, ack, that makes sense and sounds reasonable then, thanks :)
[22:19:09] depending on your use case, you can even input a log file from mw10 to mw10's syslog-ng and let graylog1's syslog-ng dump the output to both graylog and a local log file on graylog1
[22:19:24] s/graylog1/graylog2/g
[22:19:25] SPF|Cloud meant to say: depending on your use case, you can even input a log file from mw10 to mw10's syslog-ng and let graylog2's syslog-ng dump the output to both graylog and a local log file on graylog2
[22:19:42] root@gluster3:/home/paladox# gluster volume set static client.syslog-level info
[22:19:42] volume set: failed: option : client.syslog-level does not exist
[22:19:42] Did you mean client.ssl or ...strict-locks?
[22:19:42] root@gluster3:/home/paladox# gluster volume set static brick.syslog-level info
[22:19:43] volume set: failed: option : brick.syslog-level does not exist
[22:19:44] Did you mean ctime.noatime?
[22:20:17] > yeah, going to produce an eval script to put all the others in the queue in a little bit
[22:20:18] JohnLewis, ah, SGTM :)
[22:20:20] https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
[22:20:21] [ 15.4. Configuring the Log Level Red Hat Gluster Storage 3 | Red Hat Customer Portal ] - access.redhat.com
[22:20:34] says gluster volume set VOLNAME diagnostics.client-sys-log-level
[22:20:51] and gluster volume set VOLNAME diagnostics.brick-sys-log-level
[22:21:04] yeh
[22:21:08] found all the logs with
[22:21:08] gluster volume set help | grep log
[22:22:06] !log set diagnostics.brick-sys-log-level and diagnostics.client-sys-log-level to INFO on gluster
[22:22:11] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[22:22:36] good luck, I am offline again
[22:22:44] So I see https://graylog.miraheze.org/messages/graylog_86/67d5deb3-6bee-11eb-b57d-0200001a24a4
[22:24:52] PROBLEM - cp12 Current Load on cp12 is CRITICAL: CRITICAL - load average: 1.26, 2.06, 1.30
[22:26:47] !log ran the following on eval.php: for ( $id = 16383; $id <= 16586; $id++ ) { $wr = new WikiRequest( $id ); if ( $wr->language == 'en' ) { $wr->tryAutoCreate(); } }
[22:26:50] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log
[22:26:51] RECOVERY - cp12 Current Load on cp12 is OK: OK - load average: 0.61, 1.59, 1.22
[22:27:25] dmehus: done
[22:31:09] JohnLewis, heh I could tell...I have like 30+ Echo notifications to review :P
[22:31:20] and ty
[22:32:33] I might do a query tonight to create a table of 'actual' v 'predicted'
[22:33:47] Oh, that'd be nice actually. So the actual score would be updated after the wiki is created to include the comments from the wiki creator / assess the updated information of the requestor?
[22:34:31] No, just the query to see an actual wiki creator's decision v what CW thinks it should be
[22:35:08] tbf, that's the reason at the minute why the score is added at the end of wiki creation, to ensure it's the most 'accurate' description
[22:38:30] Ah, true, but would the 'actual' result be different from the 'predicted' result, if the description has not changed? If not, I'm wondering if we should call it 'predicted' and 'post-approval', but I guess 'actual' is that...I'm probably splitting hairs lol
[22:39:12] actual == what a wiki creator decided
[22:39:20] predicted == what the AI thinks
[22:39:55] Ah, that makes sense then, I like that. That should help us better assess CW's accuracy
[23:27:39] PROBLEM - ping4 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 257.02 ms
[23:29:42] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 257.80 ms