[09:00:25] PROBLEM - cp10 Current Load on cp10 is CRITICAL: CRITICAL - load average: 4.36, 6.41, 3.03
[09:04:27] RECOVERY - cp10 Current Load on cp10 is OK: OK - load average: 0.37, 3.18, 2.48
[12:24:26] PROBLEM - cp10 Current Load on cp10 is CRITICAL: CRITICAL - load average: 4.95, 5.32, 2.56
[12:25:11] [miraheze/services] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/Jt6lm
[12:25:12] [miraheze/services] MirahezeSSLBot 0b88fab - BOT: Updating services config for wikis
[12:26:25] PROBLEM - cp10 Current Load on cp10 is WARNING: WARNING - load average: 1.45, 3.76, 2.31
[12:28:25] RECOVERY - cp10 Current Load on cp10 is OK: OK - load average: 0.28, 2.55, 2.04
[12:59:13] Award for best edit summary when making a change to on-wiki documentation goes to me, right? https://meta.miraheze.org/w/index.php?title=Tech:Bacula&diff=162148&oldid=158051&diffmode=source
[12:59:14] [ Difference between revisions of "Tech:Bacula" - Miraheze Meta ] - meta.miraheze.org
[13:08:40] :P
[13:13:28] Reception123: https://meta.miraheze.org/wiki/Tech:MediaWiki_appserver task for you and MWE :D
[13:13:29] [ Tech:MediaWiki appserver - Miraheze Meta ] - meta.miraheze.org
[13:14:04] heh, if only I knew where to start
[13:24:26] https://meta.miraheze.org/wiki/Tech:Goals#Review_of_Past_Goals
[13:24:27] [ Tech:Goals - Miraheze Meta ] - meta.miraheze.org
[13:24:36] 2018 we were on fire, since then... not so much
[14:33:02] RECOVERY - pol.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'pol.wiki' will expire on Fri 26 Mar 2021 01:19:25 GMT +0000.
[14:37:28] PROBLEM - dbbackup2 Current Load on dbbackup2 is CRITICAL: CRITICAL - load average: 4.05, 2.85, 1.69
[14:41:27] PROBLEM - dbbackup2 Current Load on dbbackup2 is WARNING: WARNING - load average: 3.86, 3.49, 2.23
[14:42:26] PROBLEM - pol.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:43:28] RECOVERY - dbbackup2 Current Load on dbbackup2 is OK: OK - load average: 2.62, 3.20, 2.28
[14:52:21] :(
[16:12:59] * dmehus wonders if Reception123 was reacting with `:(` to John's assigning his team the MediaWiki appserver task, the Tech:Goals comment, or both ;)
[16:13:44] lol at your edit summary, JohnLewis
[16:29:36] dmehus: lol, it was Tech:Goals
[16:29:41] but the appserver task is quite difficult to
[16:29:42] *too
[16:35:19] very very difficult, that's why I gave it to you ;)
[16:35:48] Tech:Goals?
[16:37:24] the difficult part is figuring out what to even write :D
[16:38:26] JohnLewis, lol yeah, I gathered that from your somewhat rarely used `;)` emoji :P
[16:40:00] darkmatterman450, TechGoals is a listing of medium- to long-term goals/projects the system administrator team feels are important objectives for Miraheze's technical infrastructure. Some are completed more easily / quickly than others
[16:40:14] s/TechGoals/Tech:Goals
[16:40:14] dmehus meant to say: darkmatterman450, Tech:Goals is a listing of medium- to long-term goals/projects the system administrator team feels are important objectives for Miraheze's technical infrastructure. Some are completed more easily / quickly than others
[16:40:45] @dmehus Oh.
[16:41:16] > but the appserver task is quite difficult to
[16:41:16] ^ dmehus wonders if Reception123 was secretly taken over by Universal_Omega :P
[16:48:58] :P
[18:05:13] JohnLewis: where does eir go to now?
[18:05:27] #miraheze-ops
[18:05:44] JohnLewis: joined
[18:07:24] heh, I haven't been in that channel since May
[18:07:25] joined now
[18:09:48] Reception123: oh, I didn't know it existed before. Was it not registered before, because it didn't appear in channel lists?
[18:10:19] Reception123, I didn't know #miraheze-ops existed before either. lol
[18:10:26] it would've been since 2020
[18:10:33] ah
[18:10:34] and I knew it existed, just not that we were using it again now :P
[18:10:42] lol
[18:10:54] yeah, that's why JohnLewis kicked everyone from #miraheze-staff
[18:11:08] He did give a link to join the channel :P
[18:11:32] Reception123: using alis lists on `*miraheze*` it didn't appear as registered.
[18:11:47] hmm, I guess maybe it was abandoned at some point?
[18:11:58] yeah... it was probably not officially registered, but used unofficially
[18:12:08] before now, of course
[18:13:01] Reception123: Probably. Speaking of which, can you run `DROP` on #miraheze-testwiki-es? Since that's unused now, and you have the +F flag there it seems, so you can.
[18:13:40] ok
[18:13:58] Shouldn't PuppyKun be voiced?
[18:14:11] Universal_Omega: know how to do that so I don't search for it? :P
[18:15:03] RhinosF1|NotHere I think he just joined the channel. John probably hasn't gotten around to it yet
[18:15:24] Reception123: `/msg ChanServ DROP #miraheze-testwiki-es` then it'll need confirmation.
[18:15:37] I'm thinking out loud tbh
[18:21:00] Does icinga-miraheze not announce acks, paladox?
[18:21:21] RhinosF1|NotHere if thinking out loud, you should use `/me thought bubble` :P
[18:21:27] dmehus: true
[18:21:54] what do you mean re: icinga-miraheze?
[18:22:43] I just ack'd a few alerts
[18:22:51] Expected it to show up on irc
[18:26:41] oh
[18:27:05] Trying to clear unhandled alerts for the MediaWiki team
[18:27:26] I like a nice clean board
[18:27:29] You mean you actually mark alerts as handled in icinga?
[18:27:39] I thought it was just a logging mechanism
[18:28:08] I know paladox or JohnLewis did a commit that changed the icinga access levels, but that may not be related
[18:28:35] No, we don't want icinga to announce acks on irc, for the simple reason it's spammy
[18:28:46] ah, thanks paladox
[18:28:52] dmehus: that gave me access to do it, but when an alert goes off you can say it's known and who's working on it
[18:29:00] Or downtime stuff to stop them going off
[18:29:04] ah
[18:29:09] paladox: ty, makes sense tbh
[18:29:30] yeah, we get one alert for the alert, no need to get another alert saying who tended to it
[18:29:33] dmehus: if you sign in as guest to https://icinga.miraheze.org/monitoring/service/show?host=sslhost&service=wiki.yapsavun.com%20-%20reverse%20DNS then you'll see
[18:29:34] [ Icinga Web 2 Login ] - icinga.miraheze.org
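The acknowledgement workflow described above maps onto the Icinga 2 REST API's acknowledge-problem action. Below is a minimal sketch, assuming the API listener is on its default port 5665; the host, credentials, service names, and comment text are placeholders for illustration, not Miraheze's actual configuration:

```python
# Minimal sketch: acknowledge an Icinga 2 service problem with an expiry,
# roughly matching the "ack until it needs looking at again" workflow above.
# Host, credentials, and object names are hypothetical placeholders.
import time
import requests

ICINGA_API = "https://icinga.example.org:5665"  # assumption: placeholder host, default API port
AUTH = ("apiuser", "secret")                    # assumption: hypothetical API user

payload = {
    "type": "Service",
    # Standard Icinga 2 filter DSL; the host/service names are examples only.
    "filter": 'host.name == "sslhost" && service.name == "example.com - reverse DNS"',
    "author": "dmehus",
    "comment": "Waiting on the wiki owner to fix their DNS",
    "expiry": int(time.time()) + 7 * 24 * 3600,  # acknowledgement lapses after a week
    "notify": False,  # don't send extra notifications just for the ack
}

r = requests.post(
    f"{ICINGA_API}/v1/actions/acknowledge-problem",
    headers={"Accept": "application/json"},  # required by the Icinga 2 API
    json=payload,
    auth=AUTH,
    verify=False,  # assumption: self-signed certificate; verify properly in real use
)
print(r.status_code, r.json())
```

Setting `notify` to false mirrors the point about not wanting a second alert just for the ack, and the `expiry` timestamp makes the acknowledgement lapse automatically so the alert resurfaces when it needs to be looked at again.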
[18:30:20] ah, and were you always able to acknowledge icinga alerts before?
[18:30:32] dmehus: no
[18:30:35] oh ok
[18:31:08] Make sure you don't acknowledge alerts unless (a) you actually handled them or (b) they don't need to be handled, so they don't otherwise get missed :)
[18:32:43] I acked the ones that the MediaWiki team were waiting on replies from wiki owners to fix stuff with
[18:32:56] To expire when they need to be looked at
[18:38:56] ok
[19:29:49] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 249.76 ms
[19:43:24] ping
[19:43:28] pong
[19:45:11] ping?
[19:51:18] why is mh-discord here
[19:51:34] Because there's a relay
[19:52:37] To -sre?
[19:52:40] I thought it was only to #miraheze
[19:52:48] Universal_Omega: ^
[19:53:09] @Naleksuh, #miraheze was split into #miraheze and #miraheze-sre, so we wanted to have relays to both channels on Discord
[19:54:02] Universal_Omega, I answered RhinosF1|NotHere's ping to you for you
[19:54:21] Thanks!
[19:54:35] np
[20:23:34] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 251.11 ms
[20:25:37] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 249.50 ms
[21:12:28] https://phabricator.miraheze.org/T5044 I am planning to remove the nginx access log -> graylog logging from mediawiki; instead, cache proxies will log everything to graylog and mediawiki servers will only log errors in graylog
[21:12:30] [ ⚓ T5044 Setup centralised logging for services ] - phabricator.miraheze.org
[21:12:38] thoughts?
[21:14:24] logging both is possible, but requires a lot of extra disk space...
[21:14:26] SGTM. Can't see a downside to centralized logging
[21:14:53] Assuming graylog will log everything from the former nginx access log
[21:16:37] graylog2 has 750 GB of space, so it should be able to handle a lot of logging, but if we make an effort to reduce the amount of logs, we'll be able to cope with the 750 GB limit for a lot longer :-)
[21:16:49] Access wise, does Graylog allow you to specify which roles have access? My only small-ish concern would be: if MW Engineers didn't have access to the cache proxy/nginx access logs, will they now have access to those logs in Graylog?
[21:17:09] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 250.51 ms
[21:17:36] yes, mediawiki engineers only have access to logs they were able to see before the migration
[21:17:45] ah, SGTM then
[21:18:25] I don't want to make access expansion (e.g. to cache proxy logs) prohibited, as long as such a change has been agreed upon
[21:18:54] Yeah... not prohibited, but it should definitely be a discussion all SREs or EMs/DSRE agree to
[21:19:11] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 249.10 ms
[21:20:35] it is important to keep access limited where possible, but if the lack of access prevents people from doing their work properly, then it is up to us to determine how to expand access
[21:21:46] and lack of access reduces responsibilities: if a server crashes because someone thought it was a good idea to rm -rf the server, we can rule out mediawiki engineers as the cause ;)
[21:24:08] yeah, that latter point is a particularly good one :)
[21:25:32] We do also have a currently unused `cache-admin` that could be added to a trusted and technically competent MediaWiki Engineer in the future, too
[21:48:54] SPF|Cloud that seems fine to me, though we have matomo also going through the cache proxy.
[21:56:21] matomo traffic is traffic too :)
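For context on the centralised-logging plan in T5044: Graylog ingests structured records over GELF, so whatever ships the logs (the cache proxies' nginx access logs, or only errors from the mediawiki servers) ends up emitting messages of roughly this shape. Below is a minimal sketch of one GELF record sent to a Graylog UDP input; the hostname, port, and field names are assumptions for illustration, and in practice nginx/syslog-ng would do the shipping rather than a Python script:

```python
# Minimal sketch: ship one structured log record to a Graylog GELF UDP input,
# the kind of centralised logging discussed in T5044. Hostname and fields are
# illustrative assumptions, not Miraheze's actual configuration.
import json
import socket
import time
import zlib

GRAYLOG_HOST = "graylog2.example.org"  # assumption: placeholder hostname
GELF_UDP_PORT = 12201                  # conventional default GELF UDP port

record = {
    "version": "1.1",                  # GELF spec version
    "host": "cp10",                    # which server produced the record
    "short_message": "GET /wiki/Main_Page 200",
    "timestamp": time.time(),
    "level": 6,                        # syslog severity: informational
    # Custom GELF fields must be prefixed with an underscore.
    "_http_status": 200,
    "_request_time": 0.042,
}

# Graylog's GELF UDP input accepts zlib-compressed JSON datagrams.
payload = zlib.compress(json.dumps(record).encode("utf-8"))
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload, (GRAYLOG_HOST, GELF_UDP_PORT))
sock.close()
```

Logging access records from the cache proxies only, as proposed, avoids duplicating records like this from every mediawiki server and helps stay within the 750 GB limit mentioned above.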
[22:00:44] paladox: can mon2 puppet be enabled?
[22:00:51] yes
[22:00:52] oh
[22:01:00] i forgot to re enable it
[22:01:04] let me re enable
[22:01:25] done
[22:02:19] RECOVERY - mon2 Puppet on mon2 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[22:16:08] https://wiki.trezor.io/SSH something for SRE?
[22:16:09] [ SSH - Trezor Wiki ] - wiki.trezor.io
[22:18:13] [miraheze/puppet] Southparkfan pushed 1 commit to master [+0/-0/±1] https://git.io/Jt6dv
[22:18:15] [miraheze/puppet] Southparkfan c7adf30 - Deploy syslog-ng to mon2
[22:21:36] PROBLEM - mon2 Puppet on mon2 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[22:22:17] [miraheze/puppet] Southparkfan pushed 1 commit to master [+0/-0/±3] https://git.io/Jt6dY
[22:22:19] [miraheze/puppet] Southparkfan 3bce55c - Remove systemd::syslog calls for irc/monitoring
[22:24:40] ^ icinga-miraheze reads a log file that is not affected
[22:24:59] Ok
[22:25:36] RECOVERY - mon2 Puppet on mon2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:25:55] ^ talking about proof
[22:27:51] [miraheze/puppet] Southparkfan pushed 1 commit to master [+0/-0/±1] https://git.io/Jt6du
[22:27:52] [miraheze/puppet] Southparkfan a44bf35 - mon2: use graylog for nginx
[22:33:33] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 250.15 ms
[22:35:35] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 249.34 ms
[22:51:53] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 250.68 ms
[22:53:55] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.78 ms
[23:24:32] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 251.62 ms
[23:26:33] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 249.90 ms