[07:11:57] PROBLEM - jobrunner3 Disk Space on jobrunner3 is CRITICAL: DISK CRITICAL - free space: / 704 MB (1% inode=86%); [08:02:22] .op [08:03:01] pinging paladox Reception123 SPF|Cloud Universal_Omega [08:08:38] !log stop/start jobrunner&jobcron on jobrunner* [08:08:41] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:10:28] jbr3 ran out of disk :( [08:11:31] jbr4 seems to have space left but they're both spewing errors [08:12:52] Reception123: it's wikibackups [08:14:01] RhinosF1: I'll have access in around 20 mins but I started zipping the dir a while ago so feel free to delete the actual dir from my homedir [08:14:11] As the zip should be done [08:14:21] Reception123: it's wikibackups16022021.tar.gz that's the biggest [08:14:34] but is /home/reception/wikibackups safe to delete then [08:14:56] Yes, it should be. In any case we can't have jbr3 out of disk space [08:16:35] !log deleted /home/reception/wikibackups/* [08:16:37] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:16:51] Reception123: /home/reception/images1 is fairly big too [08:17:22] RhinosF1: oh, I also started an SCP of that yeah, was going to start the import soon [08:17:58] RECOVERY - jobrunner3 Disk Space on jobrunner3 is OK: DISK OK - free space: / 16449 MB (37% inode=86%); [08:18:20] Reception123: looking in size order it's /home/reception/wikibackups16022021.tar.gz, /home/reception/delbackups2 then /home/reception/images1 [08:18:25] but icinga-miraheze is happy [08:19:36] Ok, will figure it out when I have access [08:20:07] Reception123: errors on jobrunner* have stopped but mw* is still spewing [08:20:22] oh no [08:20:26] i see errors still [08:21:53] Hmm [08:22:08] !log stop/start job services again [08:22:11] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:26:25] not seeing anything yet [08:26:35] nope [08:26:38] it's back [08:26:49] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 266.85 ms [08:28:49] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 267.34 ms [08:29:18] Reception123: error rates are lower but still happening [08:38:16] after all no space was left so the zip didn't complete [08:38:23] will have to do them again and exclude the top 20 instead of top 10 [08:39:33] Reception123: can we try a full reboot of jobrunner* ?
I'm seeing about 1-2 errors a minute persist [08:39:42] (down though from 50-100) [08:40:02] okay [08:40:11] !log rebooted jobrunner3 [08:40:15] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:41:04] !log rebooted jobrunner4 [08:41:07] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:42:15] !log reception@jobrunner3:~$ sudo -u www-data php /srv/mediawiki/w/maintenance/importImages.php --wiki kelevarwiki /home/reception/images --search-recursively [08:42:18] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:43:29] PROBLEM - jobrunner4 Check Gluster Clients on jobrunner4 is CRITICAL: PROCS CRITICAL: 0 processes with args '/usr/sbin/glusterfs' [08:43:53] Reception123: ^ might want to fix the mount [08:44:22] * Reception123 is fixing [08:44:55] !log umount/mount mediawiki-static on jobrunner4 [08:44:58] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:45:29] RECOVERY - jobrunner4 Check Gluster Clients on jobrunner4 is OK: PROCS OK: 1 process with args '/usr/sbin/glusterfs' [08:46:27] Reception123: I still see errors on https://graylog.miraheze.org/search?q=NOT+mediawiki_exception_file%3A%5C%2Fsrv%5C%2Fmediawiki%5C%2Fw%5C%2Fextensions%5C%2FCargo%5C%2F%2A+AND+mediawiki_exception_class%3AJobQueueError&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=7200 [08:46:52] and https://graylog.miraheze.org/search?q=mediawiki_host%3Ajobrunner%2A&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=7200 now [08:47:01] RhinosF1: should I perhaps try running manually on mw10 and see what happens? [08:47:28] Reception123: how do you manually insert a job? [08:48:11] oh no, I was just thinking of running everything manually and see if there's any error or if it's just because there's a backlog [08:49:25] Reception123: showJobs.php is giving none [08:49:37] because they're not getting to the jobrunners [08:50:04] Reception123: something between them seems stuck in a failed state [08:50:16] but that's redis and I don't understand redis [08:50:33] hmm, yeah redis isn't really my thing either [08:51:23] Reception123: i'm not sure what's safe to restart too [08:51:40] yeah, I wouldn't want to mess with something and make things even worse [08:51:54] given the much lower rate, i'd say hope paladox wakes up soon [08:59:09] RhinosF1: similar issue here: https://phabricator.miraheze.org/T5939 [08:59:10] [ ⚓ T5939 Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report er ] - phabricator.miraheze.org [08:59:38] where it said it couldn't insert X [09:00:27] Reception123: Paladox said they killed redis-server but not where [09:00:39] If they mean mw* then it might be safe to do [09:08:13] hmm yeah, not sure aobut that [09:17:28] Reception123: we should write docs about when redis goes mad [09:26:39] yeah, there are some now but not on this specific issue [09:28:12] Redis is a fairly important component but also the one with the least knowledge [09:43:54] hey folks, there is someone complaining about "Fatal exception of type "JobQueueError"" at a miraheze wiki on mediawiki.org support desk, known issue or not? [09:44:25] Majavah: known but mostly fixed [09:44:32] ack, thanks [09:46:22] PROBLEM - guia.cineastas.pt - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - guia.cineastas.pt All nameservers failed to answer the query. 
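For reference, the checks discussed above can be reproduced from a shell on the affected hosts; a rough sketch, where the wiki name and the GlusterFS mount point (/mnt/mediawiki-static, assumed to be in fstab) are assumptions rather than confirmed Miraheze paths:

  # is anything actually queued for a given wiki?
  sudo -u www-data php /srv/mediawiki/w/maintenance/showJobs.php --wiki=metawiki --group
  # remount the GlusterFS client if its process has died, then confirm what the Icinga check expects
  sudo umount -l /mnt/mediawiki-static
  sudo mount /mnt/mediawiki-static
  pgrep -af /usr/sbin/glusterfs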
[09:53:07] Reception123: delete your images folder if you're done with it [09:53:08] RECOVERY - guia.cineastas.pt - reverse DNS on sslhost is OK: rDNS OK - guia.cineastas.pt reverse DNS resolves to cp10.miraheze.org [09:55:12] Already done [09:56:32] Ty [09:56:52] Reception123: if that tar.gz is useless then that can go too [09:56:55] As that was huge [09:57:33] (Or just delete anything that can't or won't be used again from /home/reception [09:57:34] ) [09:58:52] I already deleted that, as it will have to be redone with less large wikis [09:59:12] Okay cool [09:59:34] Reception123: what does df show we have free currently? [10:00:37] 30G, I deleted the backups a while ago anyway [10:02:38] Ah k [11:14:52] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 281.79 ms [11:15:09] hi JohnLewis [11:16:52] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 259.01 ms [11:17:38] JohnLewis: can you look at the last few flowing errors on https://graylog.miraheze.org/search?q=NOT+mediawiki_exception_file%3A%5C%2Fsrv%5C%2Fmediawiki%5C%2Fw%5C%2Fextensions%5C%2FCargo%5C%2F%2A+AND+mediawiki_exception_class%3AJobQueueError&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=28800 and [11:17:38] https://graylog.miraheze.org/search?q=mediawiki_host%3Ajobrunner%2A&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=28800. Last time it happened paladox restarted redis-server but didn't see on what. both jbrs have been rebooted which caused most to stop. [11:18:56] I don’t have access but redis-server only runs on jobrunner for jobs [12:15:12] JohnLewis: ok [12:20:42] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 252.04 ms [12:22:44] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.66 ms [12:49:32] What happened? [12:49:40] I’m mobile [12:51:14] You’ll see errors logging to graylog for a while [12:51:19] Because of a backlog [12:51:37] https://phabricator.miraheze.org/T6858 [12:51:37] [ ⚓ T6858 Messages take a while to be sent to graylog ] - phabricator.miraheze.org [13:42:17] PROBLEM - services4 APT on services4 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (3 critical updates). [13:43:10] PROBLEM - services3 APT on services3 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (3 critical updates). [13:43:41] PROBLEM - cp3 APT on cp3 is CRITICAL: APT CRITICAL: 28 packages available for upgrade (2 critical updates). [13:44:54] PROBLEM - cloud4 APT on cloud4 is CRITICAL: APT CRITICAL: 39 packages available for upgrade (2 critical updates). [13:50:05] paladox: if you can, please check JobQueue doesn't have any lagging issues from this morning [13:50:14] PROBLEM - cloud3 APT on cloud3 is CRITICAL: APT CRITICAL: 99 packages available for upgrade (2 critical updates). [13:50:24] Jbr3 ran out of space [13:50:31] PROBLEM - bacula2 APT on bacula2 is CRITICAL: APT CRITICAL: 3 packages available for upgrade (2 critical updates). [13:50:49] Rebooting both fixed most but they still seem to be pockets of errors being logged [13:51:01] PROBLEM - ns2 APT on ns2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). [13:51:36] PROBLEM - db12 APT on db12 is CRITICAL: APT CRITICAL: 67 packages available for upgrade (2 critical updates). [13:51:52] PROBLEM - gluster3 APT on gluster3 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). 
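Reception123's df answer above (30G free after cleanup) is the kind of thing that can be double-checked quickly; a minimal sketch using standard tools, with /home/reception taken from the log and everything else generic:

  # free space on the root filesystem
  df -h /
  # largest items under the suspect home directory, biggest last
  sudo du -sh /home/reception/* 2>/dev/null | sort -h | tail -n 10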
[13:52:00] PROBLEM - db11 APT on db11 is CRITICAL: APT CRITICAL: 67 packages available for upgrade (2 critical updates). [13:53:35] PROBLEM - db13 APT on db13 is CRITICAL: APT CRITICAL: 29 packages available for upgrade (2 critical updates). [13:53:55] PROBLEM - mon2 APT on mon2 is CRITICAL: APT CRITICAL: 29 packages available for upgrade (3 critical updates). [13:54:30] PROBLEM - puppet3 APT on puppet3 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (2 critical updates). [13:54:42] PROBLEM - cp12 APT on cp12 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [13:55:00] PROBLEM - cp10 APT on cp10 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). [13:56:43] PROBLEM - cloud5 APT on cloud5 is CRITICAL: APT CRITICAL: 39 packages available for upgrade (2 critical updates). [13:57:21] PROBLEM - gluster4 APT on gluster4 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [13:57:58] PROBLEM - ns1 APT on ns1 is CRITICAL: APT CRITICAL: 24 packages available for upgrade (2 critical updates). [13:59:44] PROBLEM - rdb4 APT on rdb4 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:01:39] PROBLEM - mail2 APT on mail2 is CRITICAL: APT CRITICAL: 31 packages available for upgrade (3 critical updates). [14:03:26] PROBLEM - graylog2 APT on graylog2 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (2 critical updates). [14:04:29] PROBLEM - rdb3 APT on rdb3 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:04:38] PROBLEM - phab2 APT on phab2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (3 critical updates). [14:05:02] PROBLEM - mw11 APT on mw11 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:05] PROBLEM - jobrunner3 APT on jobrunner3 is CRITICAL: APT CRITICAL: 35 packages available for upgrade (3 critical updates). [14:05:10] PROBLEM - mw9 APT on mw9 is CRITICAL: APT CRITICAL: 4 packages available for upgrade (3 critical updates). [14:05:11] PROBLEM - test3 APT on test3 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:18] PROBLEM - mw10 APT on mw10 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:32] PROBLEM - jobrunner4 APT on jobrunner4 is CRITICAL: APT CRITICAL: 33 packages available for upgrade (3 critical updates). [14:06:01] PROBLEM - mw8 APT on mw8 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:07:31] PROBLEM - ldap2 APT on ldap2 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:08:26] PROBLEM - cp11 APT on cp11 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). 
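The APT alerts above come from Icinga's packaging check; to see by hand which of the flagged packages are the security-relevant ones on a given host, something like the following should work on Debian (generic tooling, not Miraheze's actual runbook):

  # likely the plugin behind the Icinga APT check (from monitoring-plugins-basic)
  /usr/lib/nagios/plugins/check_apt
  # or list pending upgrades and pick out the security archive
  sudo apt-get update
  apt list --upgradable 2>/dev/null | grep -i security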
[14:17:02] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 28%, RTA = 295.23 ms [14:21:03] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 264.72 ms [14:23:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 44%, RTA = 293.55 ms [14:25:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 277.27 ms [14:31:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 275.48 ms [14:35:08] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 275.99 ms [15:01:06] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 277.28 ms [15:03:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 278.90 ms [15:11:14] !log root@jobrunner3:/home/paladox# /usr/local/bin/foreachwikiindblist /srv/mediawiki/w/cache/databases.json /srv/mediawiki/w/maintenance/runJobs.php [15:11:17] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:21:06] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 277.29 ms [15:21:37] paladox: are you happy with https://phabricator.miraheze.org/T6859#135242 then or? [15:21:38] [ ⚓ T6859 Huge rise in Redis server error: Could not insert job(s). ] - phabricator.miraheze.org [15:21:55] Yes. [15:22:10] paladox: lower to normal as I'll do an IR [15:22:15] assign to me [15:22:29] IR doesn't require that the task stays open though [15:22:32] if it's resolved. [15:22:43] paladox: yeah but it's a way to track [15:22:51] ok [15:23:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 271.78 ms [15:23:10] we can give the jobrunners more disk space if necessary [15:23:13] but also [15:23:24] we do have 500gb of backup space for each cloud server [15:23:43] paladox: i've got a few ideas for followup things [15:23:49] oh? [15:23:53] one will be the mess that is wikibackups [15:24:04] one documenting better how to fix the issue [15:24:17] what do you mean by documenting how to fix the issue? [15:24:28] and one making things more redundant as jobrunner4 was useless too as a result [15:24:29] simple as don't fill up the space so much :) [15:25:03] paladox: we need a way to work out based on what you said if redis has actually failed or graylog is slow [15:25:18] the service looks fine [15:25:27] and if so if there's a way other than rebooting the whole server to unstick it [15:25:29] and it's not graylog being slow [15:25:35] it's syslog-ng being slow [15:25:41] graylog is still showing things come through now [15:25:42] i gave you the task [15:25:49] yes [15:25:57] did you read what i said earlier? :) [15:26:34] stuff like: [15:26:35] that needs to be documented as an aggravating factor in the IR as both me and Reception123 couldn't tell if it was still spewing errors for a reason as a result [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: 2021-02-17T08:47:30+0000 ERROR: Runner loop 0 process in slot 1 gave status '0': [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: json_decode() error (4): Syntax error [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: php /srv/mediawiki/w/maintenance/runJobs.php --wiki='kelevarwiki' --type='refreshLinks' --maxtime='60' --memory-limit='192M' --result=json STDOUT: [15:26:50] really needs to be fixed, like why is it throwing a json syntax error?
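The json_decode() error in that excerpt comes from the jobrunner service parsing the output of runJobs.php --result=json, so a reasonable debugging step is to re-run the same command by hand and look at the raw output; both invocations below are lifted from the log itself (anything printed before the JSON, such as a PHP notice, would plausibly break the decode):

  # single wiki and job type, mirroring the failing invocation
  sudo -u www-data php /srv/mediawiki/w/maintenance/runJobs.php --wiki='kelevarwiki' --type='refreshLinks' --maxtime='60' --memory-limit='192M' --result=json
  # drain the queue across every wiki, as logged at 15:11
  sudo /usr/local/bin/foreachwikiindblist /srv/mediawiki/w/cache/databases.json /srv/mediawiki/w/maintenance/runJobs.php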
[15:27:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 275.70 ms [15:27:11] unless i know what the json is then i can't look [15:29:07] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 273.76 ms [15:32:38] [02puppet] 07paladox deleted branch 03revert-1645-patch-28 - 13https://git.io/vbiAS [15:32:40] [02miraheze/puppet] 07paladox deleted branch 03revert-1645-patch-28 [15:33:53] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-1 [15:33:54] [02puppet] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbiAS [15:35:04] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMiI [15:35:06] [02miraheze/mediawiki] 07paladox 0376c08ee - Update MirahezeMagic [15:37:04] !log rebuild lc on mw* and jobrunner* [15:37:07] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:41:08] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 256.78 ms [15:41:16] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 8.40, 6.19, 4.58 [15:41:35] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 9.66, 6.89, 4.95 [15:41:49] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 8.23, 6.35, 4.97 [15:43:37] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 7.10, 6.82, 5.16 [15:44:51] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 9.20, 8.02, 5.93 [15:45:13] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.80 ms [15:45:37] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 10.10, 8.02, 5.81 [15:53:18] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 7.31, 7.89, 6.80 [15:53:36] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 6.62, 7.64, 6.67 [15:55:13] PROBLEM - mw8 Current Load on mw8 is WARNING: WARNING - load average: 5.82, 7.91, 7.23 [15:56:35] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 6.11, 7.53, 7.11 [15:57:16] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 5.07, 6.75, 6.59 [15:57:39] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 5.23, 6.53, 6.48 [16:00:30] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 2.90, 5.41, 6.37 [16:01:13] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 4.24, 5.95, 6.63 [16:03:16] [02miraheze/mw-config] 07Universal-Omega created branch 03Test 13https://git.io/JtMPp [16:03:17] [02mw-config] 07Universal-Omega created branch 03Test - 13https://git.io/vbvb3 [16:04:34] [02mw-config] 07Universal-Omega deleted branch 03Test - 13https://git.io/vbvb3 [16:04:35] [02miraheze/mw-config] 07Universal-Omega deleted branch 03Test [16:04:39] [02mw-config] 07Universal-Omega deleted branch 03Universal-Omega-patch-4 - 13https://git.io/vbvb3 [16:04:41] [02miraheze/mw-config] 07Universal-Omega deleted branch 03Universal-Omega-patch-4 [16:04:43] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 259.17 ms [16:06:45] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.19 ms [16:21:02] RECOVERY - mw11 APT on mw11 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:21:05] RECOVERY - jobrunner3 APT on jobrunner3 is OK: APT OK: 32 packages available for upgrade (0 critical updates). [16:21:13] RECOVERY - mw9 APT on mw9 is OK: APT OK: 1 packages available for upgrade (0 critical updates). 
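The "rebuild lc" entry above refers to MediaWiki's localisation cache; a plain sketch of the equivalent manual command (the wiki name is an example, and Miraheze may well drive this through a wrapper rather than calling the script directly):

  sudo -u www-data php /srv/mediawiki/w/maintenance/rebuildLocalisationCache.php --wiki=metawiki --force
  # limit to one language for a quicker partial rebuild
  sudo -u www-data php /srv/mediawiki/w/maintenance/rebuildLocalisationCache.php --wiki=metawiki --lang=en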
[16:21:19] RECOVERY - mw10 APT on mw10 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:21:31] RECOVERY - ldap2 APT on ldap2 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:32] RECOVERY - jobrunner4 APT on jobrunner4 is OK: APT OK: 30 packages available for upgrade (0 critical updates). [16:21:33] RECOVERY - db13 APT on db13 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:21:36] RECOVERY - db12 APT on db12 is OK: APT OK: 65 packages available for upgrade (0 critical updates). [16:21:42] RECOVERY - graylog2 APT on graylog2 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:21:45] RECOVERY - rdb4 APT on rdb4 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:45] !log install security updates on all servers [16:21:52] RECOVERY - gluster3 APT on gluster3 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [16:21:56] RECOVERY - mon2 APT on mon2 is OK: APT OK: 26 packages available for upgrade (0 critical updates). [16:21:58] RECOVERY - ns1 APT on ns1 is OK: APT OK: 22 packages available for upgrade (0 critical updates). [16:22:16] RECOVERY - cloud3 APT on cloud3 is OK: APT OK: 97 packages available for upgrade (0 critical updates). [16:22:17] RECOVERY - mw8 APT on mw8 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:22:19] RECOVERY - services4 APT on services4 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:22:23] RECOVERY - cp3 APT on cp3 is OK: APT OK: 26 packages available for upgrade (0 critical updates). [16:22:26] RECOVERY - cp11 APT on cp11 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:22:31] RECOVERY - puppet3 APT on puppet3 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:22:40] RECOVERY - phab2 APT on phab2 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:22:44] RECOVERY - cloud5 APT on cloud5 is OK: APT OK: 37 packages available for upgrade (0 critical updates). [16:22:52] RECOVERY - cp12 APT on cp12 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:22:56] RECOVERY - db11 APT on db11 is OK: APT OK: 65 packages available for upgrade (0 critical updates). [16:23:00] RECOVERY - cp10 APT on cp10 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:23:01] RECOVERY - ns2 APT on ns2 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:23:10] RECOVERY - services3 APT on services3 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:23:11] RECOVERY - test3 APT on test3 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:23:21] RECOVERY - gluster4 APT on gluster4 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:23:30] RECOVERY - cloud4 APT on cloud4 is OK: APT OK: 37 packages available for upgrade (0 critical updates). [16:23:39] RECOVERY - mail2 APT on mail2 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:24:13] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 8.86, 6.61, 5.58 [16:25:30] RECOVERY - rdb3 APT on rdb3 is OK: APT OK: 24 packages available for upgrade (0 critical updates). 
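Complementing the earlier look at which packages were security-flagged, the "install security updates on all servers" step above could be done per host with unattended-upgrades' own tooling; a generic Debian sketch, not necessarily how Miraheze rolls these out:

  sudo apt-get update
  # preview what unattended-upgrades would install, with debug output
  sudo unattended-upgrade --dry-run -d
  # then apply the pending security updates
  sudo unattended-upgrade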
[16:27:13] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 8.89, 7.74, 6.28 [16:27:22] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtM1z [16:27:24] [02miraheze/puppet] 07paladox 030143508 - Increase warning ping for cp3 to 300 [16:28:08] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 4.45, 6.37, 5.81 [16:28:49] RECOVERY - ping6 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 265.45 ms [16:29:05] PROBLEM - mw8 Current Load on mw8 is WARNING: WARNING - load average: 5.89, 7.27, 6.29 [16:33:02] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 5.80, 6.45, 6.19 [16:44:16] .deop [16:53:55] paladox, or Universal_Omega, can either of you take a look at https://meta.miraheze.org/w/index.php?title=Community_noticeboard&diff=next&oldid=163081&diffmode=source#ManageWiki/Settings_broken? ManageWiki/settings isn't accepting changes, and I wonder if ManageWiki cache needs to be cleared maybe? I can reproduce the user's error as I tried to change it, but it didn't accept the change [16:53:57] [ Difference between revisions of "Community noticeboard" - Miraheze Meta ] - meta.miraheze.org [16:55:57] Doubt it's cache [16:56:14] we don't use cache for showing settings, so what you see is what's in the database [16:56:58] Oh, okay. Any idea what could cause it then? Could it be a ManageWiki bug with the incorrect syntax of that logo URL? [16:57:22] Maybe we can use `sql.php` or `eval.php` to clear the logo URL from that database table on that wiki? [17:00:10] dmehus: I know the issue [17:00:27] One second. [17:02:15] Universal_Omega, oh, what's the issue? [17:02:39] dmehus: Have them enable SocialProfile on that wiki, then make a change to wgCosmosProfileTagGroups and see if that saves. The value there is an invalid option. [17:02:41] and can you correct the issue so it doesn't reoccur in the future on other wikis? [17:03:06] What if they don't want SocialProfile though? [17:03:14] Disable it afterwards. [17:03:19] Just temporarily [17:03:27] Universal_Omega: can we get a proper fix? [17:03:36] ^ [17:03:50] How did a CosmosProfileTagGroups get set though if SocialProfile's not enabled? [17:04:12] Can't you make that greyed out if SocialProfile's not enabled, or fix it to not require SocialProfile? [17:04:43] RhinosF1: no it's an upstream bug. And dmehus it is grayed out; they somehow changed it a while ago. Or deleted the default groups possibly. [17:04:59] It's an issue with input validation [17:05:09] Universal_Omega: okay [17:05:13] Not displaying error messages except on that specific tab [17:05:19] We can't change our input validation to use another validation method? [17:05:23] No [17:05:26] oh [17:06:04] It's a valid error but it can only be seen on the specific tab. Not everywhere appearing like it's not doing anything. [17:06:09] Well, since it's related to a bug, you can probably change it for them, or I could. Seems less bureaucratic than having them temporarily enable and disable SocialProfile [17:06:16] do you want to or me? [17:06:43] I can't change it unless SP is temporarily enabled. [17:06:50] Because it's grayed out. [17:07:01] Yeah that's what I'm saying, you temporarily enable it to resolve a bug [17:07:23] Oh. Sure give me a minute. [17:07:33] you can probably do that, since it's related to the bug, or I can, if you'd prefer that [17:08:12] * dmehus hopes this apparent upstream bug is resolved soon as this fix seems quite hacky heh [17:08:49] Universal_Omega do you know how to fix the upstream bug?
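Picking up dmehus's sql.php/eval.php idea, the suspect value could at least be inspected without enabling SocialProfile by evaluating PHP in the affected wiki's context; a hedged sketch where the wiki name is a placeholder and only the variable name comes from the conversation (actually resetting it would go through ManageWiki's own storage, which isn't shown here):

  echo 'var_dump( $wgCosmosProfileTagGroups );' | sudo -u www-data php /srv/mediawiki/w/maintenance/eval.php --wiki=examplewiki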
[17:10:04] paladox: no idea. The bug may just be missing functionality tbh. All fixing it would do is allow us to display errors about why managewiki isn't saving when validation fails. [17:10:12] oh [17:14:21] dmehus: fixed [17:15:50] Universal_Omega, thanks, except the SocialProfile-added user rights were not removed. [17:16:04] I thought we fixed this when we fixed the Report extension? [17:16:58] RhinosF1: since you first responded to the incident, could you do the IR for https://phabricator.miraheze.org/T6859#135260 ? [17:16:59] [ ⚓ T6859 Huge rise in Redis server error: Could not insert job(s). ] - phabricator.miraheze.org [17:17:16] (assuming you already are but just to make sure) [17:17:20] Reception123: yep I plan to [17:17:26] ok, sounds good [17:17:57] Oh. No that didn't fix that. I still have open fixes to ManageWiki which will do namespaces. But I still need to recently done ManageWiki fix to mw-config to remove them. I will soon. Sorry did not think about that. [17:18:11] dmehus: ^ [17:18:55] dmehus: actually I still need to finish that fix on ManageWiki. [17:32:46] Universal_Omega, oh okay. So when you're done with that wiki, can we just remove the SP-added user rights with `eval.php` or `sql.php` or something so we don't have to re-add the extension again, remove rights, and remove the extension? [17:33:05] like you did with those other wikis before, basically [17:33:37] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 21.54, 18.78, 14.50 [17:34:01] PROBLEM - cp10 Current Load on cp10 is CRITICAL: CRITICAL - load average: 4.25, 7.59, 3.85 [17:35:37] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 11.15, 15.76, 13.91 [17:38:03] PROBLEM - cp10 Current Load on cp10 is WARNING: WARNING - load average: 0.22, 3.50, 3.01 [17:40:01] RECOVERY - cp10 Current Load on cp10 is OK: OK - load average: 0.17, 2.40, 2.66 [19:06:31] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JtM7F [19:06:33] [02miraheze/mw-config] 07paladox 030057e53 - redis: Stop using nutcracker [19:06:34] [02mw-config] 07paladox created branch 03paladox-patch-2 - 13https://git.io/vbvb3 [19:06:36] [02mw-config] 07paladox opened pull request 03#3724: redis: Stop using nutcracker - 13https://git.io/JtM7b [19:07:44] miraheze/mw-config - paladox the build passed. [19:39:19] [02mw-config] 07paladox closed pull request 03#3724: redis: Stop using nutcracker - 13https://git.io/JtM7b [19:39:21] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMds [19:39:22] [02miraheze/mw-config] 07paladox 03104c461 - redis: Stop using nutcracker (#3724) [19:39:23] [02mw-config] 07paladox deleted branch 03paladox-patch-2 - 13https://git.io/vbvb3 [19:39:25] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-2 [19:40:17] miraheze/mw-config - paladox the build passed. 
[19:46:37] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMdK [19:46:39] [02miraheze/mediawiki] 07paladox 039265095 - Update Cargo [19:47:57] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMd6 [19:47:59] [02miraheze/mediawiki] 07paladox 032f2745e - Update Cargo [19:56:41] PROBLEM - cp12 Current Load on cp12 is CRITICAL: CRITICAL - load average: 1.27, 2.02, 1.47 [19:58:40] RECOVERY - cp12 Current Load on cp12 is OK: OK - load average: 0.63, 1.50, 1.34 [22:07:43] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/JtMpj [22:07:44] [02miraheze/puppet] 07paladox 0391536c7 - Reinstall rdb4 as mc2 [22:07:46] [02puppet] 07paladox created branch 03paladox-patch-1 - 13https://git.io/vbiAS [22:07:47] [02puppet] 07paladox opened pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:08:55] [02puppet] 07paladox edited pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:13] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-1/±0] 13https://git.io/JtMhJ [22:09:14] [02miraheze/puppet] 07paladox 03c27a0ee - Delete rdb4.yaml [22:09:16] [02puppet] 07paladox synchronize pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:20] [02puppet] 07paladox closed pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:22] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-1/±1] 13https://git.io/JtMhT [22:09:23] [02miraheze/puppet] 07paladox 0399e8a1c - Reinstall rdb4 as mc2 (#1647) [22:09:25] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-1 [22:09:26] [02puppet] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbiAS [22:10:56] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMhq [22:10:58] [02miraheze/dns] 07paladox 03a432612 - rdb4 -> mc2 [22:11:57] !log renaming rdb4 to mc2 [22:12:00] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [22:18:11] PROBLEM - rdb4 APT on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:18:11] PROBLEM - rdb4 Disk Space on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:18:19] PROBLEM - rdb4 Current Load on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:12] PROBLEM - rdb4 SSH on rdb4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:12] PROBLEM - ping6 on rdb4 is CRITICAL: CRITICAL - Destination Unreachable (2001:41d0:800:1bbd::12) [22:20:13] PROBLEM - rdb4 Puppet on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:13] PROBLEM - rdb4 NTP time on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:13] PROBLEM - rdb4 Redis Process on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:15] PROBLEM - ping4 on rdb4 is CRITICAL: PING CRITICAL - Packet loss = 100% [22:20:20] PROBLEM - Host rdb4 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:43] I assume that will clear when puppet runs on mon2 paladox ? [22:22:52] yes [22:22:55] Should I downtime? [22:23:14] I don't even see them now [22:23:51] PROBLEM - mon2 APT on mon2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (1 critical updates). [22:24:25] Yeah it's gone already [22:27:52] Another wikimedia page. I'll stay awake and see if it calms. 
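On the rdb4 alerts above: assuming the monitoring host exposes the standard Icinga 2 REST API (endpoint, credentials and timestamps below are placeholders, not Miraheze's actual setup), a downtime for the host and its services could be scheduled ahead of a rename like this instead of letting it page:

  curl -k -s -u icinga-api-user:SECRET -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
    -d '{"type": "Host", "filter": "host.name == \"rdb4\"", "all_services": true, "author": "ops", "comment": "renaming rdb4 to mc2", "start_time": 1613600400, "end_time": 1613604000}'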
Might make stuff slow again [22:34:53] PROBLEM - test3 APT on test3 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:35:21] PROBLEM - mw9 APT on mw9 is CRITICAL: APT CRITICAL: 19 packages available for upgrade (18 critical updates). [22:35:29] PROBLEM - mw8 APT on mw8 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:35:36] paladox: replied to your graylog lag task [22:35:43] thanks! [22:35:57] https://phabricator.miraheze.org/T6862 [22:35:58] [ ⚓ T6862 Use memcache for the cache ] - phabricator.miraheze.org [22:35:59] PROBLEM - mw10 APT on mw10 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:36:02] eh? [22:36:17] PROBLEM - mw11 APT on mw11 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:36:18] PROBLEM - jobrunner3 APT on jobrunner3 is CRITICAL: APT CRITICAL: 50 packages available for upgrade (18 critical updates). [22:36:26] PROBLEM - jobrunner4 APT on jobrunner4 is CRITICAL: APT CRITICAL: 48 packages available for upgrade (18 critical updates). [22:37:08] I recalled moving mediawiki caching to local memcache instances made wikis slower, last time [22:37:15] SPF|Cloud https://phabricator.miraheze.org/T6858#135316 [22:37:16] [ ⚓ T6858 Messages take a while to be sent to graylog ] - phabricator.miraheze.org [22:37:31] SPF|Cloud yeh, this will be on a dedicated instance and the last time we tried we were using openvz [22:38:07] is openvz slower than kvm? [22:38:49] i mean we found that we can handle more using the dedicated server than on openvz :) [22:39:13] though we can see if things get slower with the dedicated instance, we can easily revert back to redis. [22:39:56] of course we can handle more, we have some high power CPUs nowadays [22:40:51] kvm instances run on a dedicated kernel, openvz shares the host kernel. if you see performance improvements on the new VMs, that is due to better hardware, not because kvm is performing better than openvz[1] [22:41:07] [1] unless your sysctl tuning helps, which is possible in kvm but not in openvz [22:41:15] yeh [22:41:36] but also because we were doing it locally, it meant that each load used a different memcache [22:41:44] *page [22:43:09] !sre I can't access `mc2wiki`, `mc2.miraheze.org`. Getting connection timed out / error 500s [22:43:19] correct, but with only four mediawiki servers running, the chance of a miss is no more than 25% [22:43:57] err, 75%, which will eventually be reduced to 0% as cache warms up [22:43:58] dmehus: looking [22:44:00] I think I know why though and if so not much we can do [22:44:06] Is it just that one wiki? [22:44:18] oh, its a wiki [22:44:19] Seems to be, yes, RhinosF1 [22:44:35] yeah, paladox, you can't use `mc2` for the name of the new `rdb4` [22:44:44] ok [22:44:46] maybe use `mc` if it's available [22:44:50] dmehus: oh yeah that would be an issue [22:44:56] Yeah ignore me [22:45:05] yeah that's why I was pointing that out in a very (too?) 
subtle way [22:45:06] heh [22:45:10] I was hoping I was wrong tbh but the server conflict would be an issue [22:45:16] https://phabricator.miraheze.org/T6862#135320 [22:45:17] [ ⚓ T6862 Use memcache for the cache ] - phabricator.miraheze.org [22:45:23] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMj0 [22:45:24] [02miraheze/dns] 07paladox 034c8de18 - mc2 -> mem2 [22:45:27] !log install security patches on test2 [22:45:31] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [22:45:32] uh... test3 [22:45:32] that works paladox, mem2 [22:46:19] dmehus: i wouldn't be shocked at you seeing lag though as you load a lot from wikimedia sites and they're having tech issues [22:46:53] RECOVERY - test3 APT on test3 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [22:46:53] RhinosF1, have had a bit of lag loading scripts, yep. Miraheze performance has actually been decent today though unlike yesterday [22:47:18] dmehus: I mean like especially now [22:47:26] Theyve got an ongoing incident [22:47:35] Was one about an hour earlier yesterday too [22:48:19] RhinosF1, what's the incident related to? [22:49:00] dmehus: not enough php fpm childs remaining causing high latency [22:49:30] !log installing security patches on mw* [22:49:35] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [22:50:00] !log installing security patches on jobrunner* [22:50:03] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [22:51:13] [02mw-config] 07dmehus opened pull request 03#3725: Add `mem` to CreateWiki blacklist - 13https://git.io/JtMjM [22:52:35] [02mw-config] 07paladox closed pull request 03#3725: Add `mem` to CreateWiki blacklist - 13https://git.io/JtMjM [22:52:36] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMj9 [22:52:38] [02miraheze/mw-config] 07dmehus 0320418fc - Add `mem` to CreateWiki blacklist (#3725) [22:52:39] RhinosF1, ack,. interesting [22:52:57] dmehus: I'm keeping a close eye [22:53:43] miraheze/mw-config - dmehus the build passed. [22:54:38] RhinosF1, ah [22:55:20] dmehus: I'm used to us running out of childs not wikimedia. Looks like something is causing a lot of load for them. [22:57:27] .op [22:58:05] RECOVERY - jobrunner4 APT on jobrunner4 is OK: APT OK: 30 packages available for upgrade (0 critical updates). [22:58:11] RECOVERY - jobrunner3 APT on jobrunner3 is OK: APT OK: 32 packages available for upgrade (0 critical updates). [22:59:11] dmehus: updated topic [22:59:21] I'll get my laptop out [23:01:13] RECOVERY - mw10 APT on mw10 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [23:01:15] RECOVERY - mw8 APT on mw8 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [23:01:34] RECOVERY - mw11 APT on mw11 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [23:01:38] RECOVERY - mw9 APT on mw9 is OK: APT OK: 1 packages available for upgrade (0 critical updates). 
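The mc2 clash above could have been spotted before the rename, since mc2.miraheze.org already resolved for an existing wiki; a trivial pre-flight check (hostnames taken from the log):

  dig +short mc2.miraheze.org
  dig +short mem2.miraheze.org
  # an existing wiki subdomain will also answer over HTTPS
  curl -sI https://mc2.miraheze.org/ | head -n 1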
[23:01:39] PROBLEM - jobrunner3 MediaWiki Rendering on jobrunner3 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 344 bytes in 0.012 second response time [23:02:15] PROBLEM - jobrunner4 MediaWiki Rendering on jobrunner4 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 344 bytes in 0.006 second response time [23:02:20] !log second attempt of installing security patches, this time the patches were actually applied [23:02:23] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [23:02:39] miraheze/mw-config - paladox the build passed. [23:02:43] SPF|Cloud: jbrs critical? [23:03:07] icinga probably performed the checks while I was updating php-fpm [23:03:11] we really need to remove php-fpm from the jobrunners, its not needed. [23:03:17] RhinosF1, ok [23:03:33] well, we really need to separate jobrunners from mediawiki servers [23:03:38] yeh [23:03:40] RECOVERY - jobrunner3 MediaWiki Rendering on jobrunner3 is OK: HTTP OK: HTTP/1.1 200 OK - 20688 bytes in 2.254 second response time [23:03:42] no nginx, no php-fpm [23:03:49] well [23:03:51] we need nginx [23:03:53] for the ssl stuff [23:04:07] that's part of the 'letsencrypt' suite, not of the jobrunner suite [23:04:13] SPF|Cloud, do we have the resources on the new cloud server to migrate to a Kafka job queue? [23:04:16] RECOVERY - jobrunner4 MediaWiki Rendering on jobrunner4 is OK: HTTP OK: HTTP/1.1 200 OK - 20685 bytes in 0.278 second response time [23:04:16] otherwise you'll install nginx on all jobrunners instead of one [23:04:24] that's a long outstanding Phabricator task [23:04:37] I guess we have the technical resources, but not the human resources [23:04:44] oh yeh [23:04:53] SPF|Cloud, ah [23:04:54] but in the first place, John is responsible for both [23:05:49] well, yes, but I'm sure he'd discuss such changes with you and Paladox and gain input from the MediaWiki team as related to their aspects [23:06:13] of course, I am not opposed, but the kafka job queue is a typical Wikimedia product [23:06:31] it works well for them, but implementing the same here is a painful task [23:06:49] yeah... exactly [23:06:57] could be a challenging migration [23:06:57] and unless you love reinventing the wheel, you have no choice but doing the same [23:07:27] MediaWiki is built for Wikimedia, it works well for them and may work well for Miraheze, until it doesn't anymore [23:08:00] the kafka docs were a mess last time we looked [23:08:08] and the constant firefighting and shoestring budgets here don't help either [23:08:09] paladox: glad to see we are finally switching to memcache. I think I recommended that awhile ago though and John said it was tried before but decided not to keep memcache. Why the change now, just wondering. Maybe I'm remembering wrong though. [23:08:29] jobrunners tbh have always been fragile [23:08:47] because like spf says it's built for them and like a single wiki [23:08:50] I realise that's a debatable thing to say in my role, but funds are very, very important here [23:09:17] Well i was looking at wikimedia, and it more came because they were using kask for sessions (we're not going to use kask yet as it really needs more looking into whether its worth it). Memcache supports tls so we could add support for tls in mediawiki if it doesn't support it. 
and memcache is faster according to the redis vs memcache reviews (seeing as it is multi threaded) [23:10:00] in the past we installed memcache locally on each mw* which is why i think things were slow [23:10:03] for the record, I love our donors and volunteers, working with the board has never been an issue, but there's certainly a trend where fewer investments in infrastructure increases the workload of the volunteers [23:10:21] Yeah that's why I recommended it a while ago actually, it was because of that paladox. [23:11:14] Universal_Omega, did you recommend it on the old servers? Maybe that's why it was declined at the time? [23:12:00] I recommended it period dmehus, but yeah it was when we were on old infrastructure. [23:12:24] I didn't mean for the specific servers. I just thought it'd be beneficial [23:12:27] a capital injection of 10k would be welcome :P [23:13:58] * RhinosF1 will try and donate something when he starts earning [23:14:03] Universal_Omega, yeah I remember when you raised the task as I was subscribed to it, but yeah maybe it was just declined because of our infrastructure at the time. Not sure [23:14:09] i don't have 10k though [23:14:36] RhinosF1: you can, but I won't endorse it [23:16:03] engineers donating time /and/ £ seems a tad wrong [23:16:53] * RhinosF1 doesn't really mind. Miraheze has had quite an impact in a good way on me. [23:17:15] ^ what RhinosF1 said. [23:18:47] RECOVERY - mon2 APT on mon2 is OK: APT OK: 26 packages available for upgrade (0 critical updates). [23:18:51] that's great to hear ;-) [23:19:49] I will endorse sysadmins / stewards donating time + financial resources, so long as it's not more than the sysadmin / steward can afford and won't break their budget. [23:20:08] there's no way i'd have the confidence and skills i do if it weren't for miraheze [23:20:18] :) [23:20:21] If the kask job queue task was followed up with, arguably we could potentially get rid of jobqueue temporarily as an expire meant as it would increase traffic serving servers by 2 and increase job running servers by 4, but would need to be monitored closely [23:20:42] *experiment not expire meant :P [23:20:57] oh, heh I wonder what 'expire' meant [23:21:02] thanks for clarifying, JohnLewis [23:21:29] But that jobqueue task lies with Reception and MWE who wanted to take ownership of if [23:21:37] heh [23:21:45] *it - late nights and autocorrect == bad [23:21:46] PROBLEM - mem2 APT on mem2 is CRITICAL: APT CRITICAL: 5 packages available for upgrade (5 critical updates). [23:21:49] GPG errors when upgrading packages on bacula2 [23:22:02] it's been a while since I have seen that [23:22:27] Now I'm going silent for about 40 minutes so any replies will follow that [23:22:27] * dmehus imagines John eats a fair bit of pizza and ramen [23:22:43] ha, John and late nights [23:22:57] lol, and SPF|Cloud likes his late nights, too :P [23:23:01] I do [23:23:04] heh [23:23:10] have you gone for your walk yet? [23:23:22] but I can't remember what the usual times were back when John and I drafted the concepts for miraheze [23:23:36] heh, yeah, that'd be interesting [23:23:40] yep, curfew started at 21h and it's 00:23, so.. [23:23:48] ah [23:24:50] I have heard the price for dogs has ramped up since...
(context: walking your dog is allowed at all times, regardless of current restrictions) [23:25:07] LOL [23:25:58] ah, /me wondered why updates fail on bacula2 [23:26:04] Write error - write (28: No space left on device) [23:26:51] even more interesting: df -h reports 931G usage, yet the filesystem has a size of 981G [23:27:38] Hrm, is there a hidden partition for non-OS related things? [23:28:02] wait, this could be the 5% that's usually reserved [23:28:15] https://wiki.archlinux.org/index.php/ext4#Reserved_blocks [23:28:16] yeah... that's what I was thinking [23:28:16] [ Ext4 - ArchWiki ] - wiki.archlinux.org [23:28:37] Don't we use Debian, though, or is Arch Linux based on Debian? [23:28:57] ext4 is not specific to either arch or debian [23:29:01] oh, right [23:29:03] duh heh [23:29:04] just saw the email SPF|Cloud [23:29:20] the email? [23:29:47] SPF|Cloud: about bacula2 [23:30:14] I didn't see this email in the first place.. [23:30:47] SPF|Cloud: i assumed that's why you were on about upgrade failures [23:31:08] it went to sre@ saying unattended upgrades failed [23:31:09] [02miraheze/puppet] 07Southparkfan pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtDvG [23:31:11] [02miraheze/puppet] 07Southparkfan 03b68e535 - unattended-upgrade: send email to sre-infra [23:32:42] paladox: how to fix bacula2? [23:32:50] SPF|Cloud fix? [23:33:01] oh, you mean because it's out of space? [23:33:03] to me, the database backups are not worth that much [23:33:04] yes [23:33:17] Um you could delete backups and then regenerate them i guess [23:33:24] i have some scripts in my home dir [23:33:54] i just fetch the volumes, decide which ones to delete, write to a file called delete, and then after deleting from bacula, i remove the backups in /bacula/backup [23:33:59] *backups [23:34:03] when https://phabricator.miraheze.org/T5877 is fixed, less space will be used on bacula2 [23:34:04] [ ⚓ T5877 Revise MariaDB backup strategy ] - phabricator.miraheze.org [23:35:36] I don't know how to use your scripts [23:37:14] so i cat fetchVolume [23:37:18] i then run it [23:37:24] echo list volume | bconsole | awk '{print $4}' [23:37:44] i then choose the volumes i want deleted, for example STATIC_ (and copy and paste all that into delete) [23:37:51] i then run deleteBackups.sh [23:38:03] after, i then remove the same named files from /bacula/backups [23:38:13] * /bacula/backup [23:39:20] !log free up space by deleting one db backup [23:39:23] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [23:39:58] after this + performing the security patch, it's time to head off for today [23:41:38] !log installed security update on bacula2 [23:41:41] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [23:41:50] thanks paladox :) [23:41:56] yw [23:41:59] and night [23:42:13] RECOVERY - bacula2 APT on bacula2 is OK: APT OK: 1 packages available for upgrade (0 critical updates).
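The df puzzle and the volume-pruning recipe above can be condensed into a rough sketch: the reserved-blocks check is generic ext4 tooling with an assumed device name, and the bconsole steps simply restate what paladox described (fetchVolume and deleteBackups.sh are his personal scripts and aren't reproduced here; the volume name is a placeholder):

  # confirm the ~5% of blocks ext4 reserves for root, which would explain df's 931G-of-981G reading
  sudo tune2fs -l /dev/sda1 | grep -i 'reserved block count'
  # list Bacula volumes, as in the fetchVolume one-liner from the log
  echo 'list volume' | sudo bconsole | awk '{print $4}'
  # remove a chosen volume from the Bacula catalog, then delete its file on disk
  echo 'delete volume=STATIC_example yes' | sudo bconsole
  sudo rm /bacula/backup/STATIC_example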