[07:11:57] PROBLEM - jobrunner3 Disk Space on jobrunner3 is CRITICAL: DISK CRITICAL - free space: / 704 MB (1% inode=86%); [08:02:22] .op [08:03:01] pinging paladox Reception123 SPF|Cloud Universal_Omega [08:08:38] !log stop/start jobrunner&jobcron on jobrunner* [08:08:41] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:10:28] jbr3 ran out of disk :( [08:11:31] jbr4 seems to have left but there both spewing errors [08:12:52] Reception123: it's wikibackups [08:14:01] RhinosF1: I'll have access in around 20 mins but I started zipping the dir a while ago so feel free to delete the actual dir from my homedir [08:14:11] As the zip should be done [08:14:21] Reception123: it's wikibackups16022021.tar.gz that's the biggest [08:14:34] but is /home/reception/wikibackups safe to delete then [08:14:56] Yes, it should be. In any case we can't have jbr3 out of disk space [08:16:35] !log deleted /home/reception/wikibackups/* [08:16:37] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:16:51] Reception123: /home/reception/images1 is fairly big tooo [08:17:22] RhinosF1: oh, I also started an SCP of that yeah, was going to start the import soon [08:17:58] RECOVERY - jobrunner3 Disk Space on jobrunner3 is OK: DISK OK - free space: / 16449 MB (37% inode=86%); [08:18:20] Reception123: looking in size order it's /home/reception/wikibackups16022021.tar.gz, /home/reception/delbackups2 then /home/reception/images1 [08:18:25] but icinga-miraheze happy [08:19:36] Ok, will figure it out when I have access [08:20:07] Reception123: errors on jobrunner* have stopped but mw* is still spewing [08:20:22] oh no [08:20:26] i see errors still [08:21:53] Hmm [08:22:08] !log stop/start job services again [08:22:11] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:26:25] not seeing anything yet [08:26:35] nope [08:26:38] it's back [08:26:49] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 266.85 ms [08:28:49] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 267.34 ms [08:29:18] Reception123: error rates are lower but still happening [08:38:16] after all no space was left so the zip didn't complete [08:38:23] will have to do them again and exclude the top 20 instead of top 10 [08:39:33] Reception123: can we try a full reboot of jobrunner* ? I'm seeing about 1-2 errors a minute persist [08:39:42] (down though from 50-100) [08:40:02] okay [08:40:11] !log rebooted jobrunner3 [08:40:15] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:41:04] !log rebooted jobrunner4 [08:41:07] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:42:15] !log reception@jobrunner3:~$ sudo -u www-data php /srv/mediawiki/w/maintenance/importImages.php --wiki kelevarwiki /home/reception/images --search-recursively [08:42:18] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:43:29] PROBLEM - jobrunner4 Check Gluster Clients on jobrunner4 is CRITICAL: PROCS CRITICAL: 0 processes with args '/usr/sbin/glusterfs' [08:43:53] Reception123: ^ might want to fix the mount [08:44:22] * Reception123 is fixing [08:44:55] !log umount/mount mediawiki-static on jobrunner4 [08:44:58] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [08:45:29] RECOVERY - jobrunner4 Check Gluster Clients on jobrunner4 is OK: PROCS OK: 1 process with args '/usr/sbin/glusterfs' [08:46:27] Reception123: I still see errors on https://graylog.miraheze.org/search?q=NOT+mediawiki_exception_file%3A%5C%2Fsrv%5C%2Fmediawiki%5C%2Fw%5C%2Fextensions%5C%2FCargo%5C%2F%2A+AND+mediawiki_exception_class%3AJobQueueError&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=7200 [08:46:52] and https://graylog.miraheze.org/search?q=mediawiki_host%3Ajobrunner%2A&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=7200 now [08:47:01] RhinosF1: should I perhaps try running manually on mw10 and see what happens? [08:47:28] Reception123: how do you manually insert a job? [08:48:11] oh no, I was just thinking of running everything manually and see if there's any error or if it's just because there's a backlog [08:49:25] Reception123: showJobs.php is giving none [08:49:37] because they're not getting to the jobrunners [08:50:04] Reception123: something between them seems stuck in a failed state [08:50:16] but that's redis and I don't understand redis [08:50:33] hmm, yeah redis isn't really my thing either [08:51:23] Reception123: i'm not sure what's safe to restart too [08:51:40] yeah, I wouldn't want to mess with something and make things even worse [08:51:54] given the much lower rate, i'd say hope paladox wakes up soon [08:59:09] RhinosF1: similar issue here: https://phabricator.miraheze.org/T5939 [08:59:10] [ ⚓ T5939 Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report er ] - phabricator.miraheze.org [08:59:38] where it said it couldn't insert X [09:00:27] Reception123: Paladox said they killed redis-server but not where [09:00:39] If they mean mw* then it might be safe to do [09:08:13] hmm yeah, not sure aobut that [09:17:28] Reception123: we should write docs about when redis goes mad [09:26:39] yeah, there are some now but not on this specific issue [09:28:12] Redis is a fairly important component but also the one with the least knowledge [09:43:54] hey folks, there is someone complaining about "Fatal exception of type "JobQueueError"" at a miraheze wiki on mediawiki.org support desk, known issue or not? [09:44:25] Majavah: known but mostly fixed [09:44:32] ack, thanks [09:46:22] PROBLEM - guia.cineastas.pt - reverse DNS on sslhost is CRITICAL: rDNS CRITICAL - guia.cineastas.pt All nameservers failed to answer the query. [09:53:07] Reception123: delete your images folder if you're done with it [09:53:08] RECOVERY - guia.cineastas.pt - reverse DNS on sslhost is OK: rDNS OK - guia.cineastas.pt reverse DNS resolves to cp10.miraheze.org [09:55:12] Already done [09:56:32] Ty [09:56:52] Reception123: if that tar.gz is useless then that can go too [09:56:55] As that was huge [09:57:33] (Or just delete anything that can't or won't be used again from /home/reception [09:57:34] ) [09:58:52] I already deleted that, as it will have to be redone with less large wikis [09:59:12] Okay cool [09:59:34] Reception123: what does df show we have free currently? [10:00:37] 30G, I deleted the backups a while ago anyway [10:02:38] Ah k [11:14:52] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 281.79 ms [11:15:09] hi JohnLewis [11:16:52] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 259.01 ms [11:17:38] JohnLewis: can you look at the last few flowing errors on https://graylog.miraheze.org/search?q=NOT+mediawiki_exception_file%3A%5C%2Fsrv%5C%2Fmediawiki%5C%2Fw%5C%2Fextensions%5C%2FCargo%5C%2F%2A+AND+mediawiki_exception_class%3AJobQueueError&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=28800 and [11:17:38] https://graylog.miraheze.org/search?q=mediawiki_host%3Ajobrunner%2A&rangetype=relative&streams=5f8c6fd446640840f104b0ba&relative=28800. Last time it happened paladox restarted redis-server but didn't see on what. both jbrs have been rebooted which caused most to stop. [11:18:56] I don’t have access but redis-server only runs on jobrunner for jobs [12:15:12] JohnLewis: ok [12:20:42] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 252.04 ms [12:22:44] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.66 ms [12:49:32] What happened? [12:49:40] I’m mobile [12:51:14] You’ll see errors logging to graylog for a while [12:51:19] Because of a backlog [12:51:37] https://phabricator.miraheze.org/T6858 [12:51:37] [ ⚓ T6858 Messages take a while to be sent to graylog ] - phabricator.miraheze.org [13:42:17] PROBLEM - services4 APT on services4 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (3 critical updates). [13:43:10] PROBLEM - services3 APT on services3 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (3 critical updates). [13:43:41] PROBLEM - cp3 APT on cp3 is CRITICAL: APT CRITICAL: 28 packages available for upgrade (2 critical updates). [13:44:54] PROBLEM - cloud4 APT on cloud4 is CRITICAL: APT CRITICAL: 39 packages available for upgrade (2 critical updates). [13:50:05] paladox: if you can, please check JobQueue doesn't have any lagging issues from this morning [13:50:14] PROBLEM - cloud3 APT on cloud3 is CRITICAL: APT CRITICAL: 99 packages available for upgrade (2 critical updates). [13:50:24] Jbr3 ran out of space [13:50:31] PROBLEM - bacula2 APT on bacula2 is CRITICAL: APT CRITICAL: 3 packages available for upgrade (2 critical updates). [13:50:49] Rebooting both fixed most but they still seem to be pockets of errors being logged [13:51:01] PROBLEM - ns2 APT on ns2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). [13:51:36] PROBLEM - db12 APT on db12 is CRITICAL: APT CRITICAL: 67 packages available for upgrade (2 critical updates). [13:51:52] PROBLEM - gluster3 APT on gluster3 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [13:52:00] PROBLEM - db11 APT on db11 is CRITICAL: APT CRITICAL: 67 packages available for upgrade (2 critical updates). [13:53:35] PROBLEM - db13 APT on db13 is CRITICAL: APT CRITICAL: 29 packages available for upgrade (2 critical updates). [13:53:55] PROBLEM - mon2 APT on mon2 is CRITICAL: APT CRITICAL: 29 packages available for upgrade (3 critical updates). [13:54:30] PROBLEM - puppet3 APT on puppet3 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (2 critical updates). [13:54:42] PROBLEM - cp12 APT on cp12 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [13:55:00] PROBLEM - cp10 APT on cp10 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). [13:56:43] PROBLEM - cloud5 APT on cloud5 is CRITICAL: APT CRITICAL: 39 packages available for upgrade (2 critical updates). [13:57:21] PROBLEM - gluster4 APT on gluster4 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [13:57:58] PROBLEM - ns1 APT on ns1 is CRITICAL: APT CRITICAL: 24 packages available for upgrade (2 critical updates). [13:59:44] PROBLEM - rdb4 APT on rdb4 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:01:39] PROBLEM - mail2 APT on mail2 is CRITICAL: APT CRITICAL: 31 packages available for upgrade (3 critical updates). [14:03:26] PROBLEM - graylog2 APT on graylog2 is CRITICAL: APT CRITICAL: 30 packages available for upgrade (2 critical updates). [14:04:29] PROBLEM - rdb3 APT on rdb3 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:04:38] PROBLEM - phab2 APT on phab2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (3 critical updates). [14:05:02] PROBLEM - mw11 APT on mw11 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:05] PROBLEM - jobrunner3 APT on jobrunner3 is CRITICAL: APT CRITICAL: 35 packages available for upgrade (3 critical updates). [14:05:10] PROBLEM - mw9 APT on mw9 is CRITICAL: APT CRITICAL: 4 packages available for upgrade (3 critical updates). [14:05:11] PROBLEM - test3 APT on test3 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:18] PROBLEM - mw10 APT on mw10 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:05:32] PROBLEM - jobrunner4 APT on jobrunner4 is CRITICAL: APT CRITICAL: 33 packages available for upgrade (3 critical updates). [14:06:01] PROBLEM - mw8 APT on mw8 is CRITICAL: APT CRITICAL: 32 packages available for upgrade (3 critical updates). [14:07:31] PROBLEM - ldap2 APT on ldap2 is CRITICAL: APT CRITICAL: 26 packages available for upgrade (2 critical updates). [14:08:26] PROBLEM - cp11 APT on cp11 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (2 critical updates). [14:17:02] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 28%, RTA = 295.23 ms [14:21:03] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 264.72 ms [14:23:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 44%, RTA = 293.55 ms [14:25:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 277.27 ms [14:31:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 275.48 ms [14:35:08] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 275.99 ms [15:01:06] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 277.28 ms [15:03:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 278.90 ms [15:11:14] !log root@jobrunner3:/home/paladox# /usr/local/bin/foreachwikiindblist /srv/mediawiki/w/cache/databases.json /srv/mediawiki/w/maintenance/runJobs.php [15:11:17] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:21:06] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 277.29 ms [15:21:37] paladox: are you happy with https://phabricator.miraheze.org/T6859#135242 then or? [15:21:38] [ ⚓ T6859 Huge rise in Redis server error: Could not insert

job(s). ] - phabricator.miraheze.org [15:21:55] Yes. [15:22:10] paladox: lower to normal as I'll do an IR [15:22:15] assign to me [15:22:29] IR doesn't require that the task stays open though [15:22:32] if its resolved. [15:22:43] paladox: yeah but it's a way to track [15:22:51] ok [15:23:06] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 271.78 ms [15:23:10] we can give the jobrunners more disk space if necessary [15:23:13] but also [15:23:24] we do have 500gb of backup space for each cloud server [15:23:43] paladox: i've got a few ideas for followup things [15:23:49] oh? [15:23:53] one will be the mess that is wikibackups [15:24:04] one documenting better how to fix the issue [15:24:17] what do you me documenting how to fix the issue? [15:24:28] and one making things better redundant as jobrunner4 was useless too as a result [15:24:29] simple as don't fill up the space so much :) [15:25:03] paladox: we need a way to work out based on what you said if redis is actually failed or graylog slow [15:25:18] the service looks fine [15:25:27] and if so if there's a way other than rebooting the whole server to unstick it [15:25:29] and its not graylog being slow [15:25:35] its syslog-ng being slow [15:25:41] graylog is still showing things come through now [15:25:42] i gave you the task [15:25:49] yes [15:25:57] did you read what i said earlier? :) [15:26:34] stuff like: [15:26:35] that needs to be documented as a aggravating the issue in the IR as both me and Reception123 couldn't tell if it was still spewing errors for a reason as a result [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: 2021-02-17T08:47:30+0000 ERROR: Runner loop 0 process in slot 1 gave status '0': [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: json_decode() error (4): Syntax error [15:26:35] Feb 17 08:47:30 jobrunner3 jobrunner[3115]: php /srv/mediawiki/w/maintenance/runJobs.php --wiki='kelevarwiki' --type='refreshLinks' --maxtime='60' --memory-limit='192M' --result=json STDOUT: [15:26:50] really needs to be fixed, like why is it throwing a json syntax error? [15:27:07] PROBLEM - ping6 on cp3 is CRITICAL: PING CRITICAL - Packet loss = 16%, RTA = 275.70 ms [15:27:11] unless i know what the json is then i can't look [15:29:07] PROBLEM - ping6 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 273.76 ms [15:32:38] [02puppet] 07paladox deleted branch 03revert-1645-patch-28 - 13https://git.io/vbiAS [15:32:40] [02miraheze/puppet] 07paladox deleted branch 03revert-1645-patch-28 [15:33:53] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-1 [15:33:54] [02puppet] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbiAS [15:35:04] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMiI [15:35:06] [02miraheze/mediawiki] 07paladox 0376c08ee - Update MirahezeMagic [15:37:04] !log rebuild lc on mw* and jobrunner* [15:37:07] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:41:08] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 256.78 ms [15:41:16] PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 8.40, 6.19, 4.58 [15:41:35] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 9.66, 6.89, 4.95 [15:41:49] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 8.23, 6.35, 4.97 [15:43:37] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 7.10, 6.82, 5.16 [15:44:51] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 9.20, 8.02, 5.93 [15:45:13] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.80 ms [15:45:37] PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 10.10, 8.02, 5.81 [15:53:18] PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 7.31, 7.89, 6.80 [15:53:36] PROBLEM - mw10 Current Load on mw10 is WARNING: WARNING - load average: 6.62, 7.64, 6.67 [15:55:13] PROBLEM - mw8 Current Load on mw8 is WARNING: WARNING - load average: 5.82, 7.91, 7.23 [15:56:35] PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 6.11, 7.53, 7.11 [15:57:16] RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 5.07, 6.75, 6.59 [15:57:39] RECOVERY - mw10 Current Load on mw10 is OK: OK - load average: 5.23, 6.53, 6.48 [16:00:30] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 2.90, 5.41, 6.37 [16:01:13] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 4.24, 5.95, 6.63 [16:03:16] [02miraheze/mw-config] 07Universal-Omega created branch 03Test 13https://git.io/JtMPp [16:03:17] [02mw-config] 07Universal-Omega created branch 03Test - 13https://git.io/vbvb3 [16:04:34] [02mw-config] 07Universal-Omega deleted branch 03Test - 13https://git.io/vbvb3 [16:04:35] [02miraheze/mw-config] 07Universal-Omega deleted branch 03Test [16:04:39] [02mw-config] 07Universal-Omega deleted branch 03Universal-Omega-patch-4 - 13https://git.io/vbvb3 [16:04:41] [02miraheze/mw-config] 07Universal-Omega deleted branch 03Universal-Omega-patch-4 [16:04:43] PROBLEM - ping4 on cp3 is WARNING: PING WARNING - Packet loss = 0%, RTA = 259.17 ms [16:06:45] RECOVERY - ping4 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 248.19 ms [16:21:02] RECOVERY - mw11 APT on mw11 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:21:05] RECOVERY - jobrunner3 APT on jobrunner3 is OK: APT OK: 32 packages available for upgrade (0 critical updates). [16:21:13] RECOVERY - mw9 APT on mw9 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [16:21:19] RECOVERY - mw10 APT on mw10 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:21:31] RECOVERY - ldap2 APT on ldap2 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:32] RECOVERY - jobrunner4 APT on jobrunner4 is OK: APT OK: 30 packages available for upgrade (0 critical updates). [16:21:33] RECOVERY - db13 APT on db13 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:21:36] RECOVERY - db12 APT on db12 is OK: APT OK: 65 packages available for upgrade (0 critical updates). [16:21:42] RECOVERY - graylog2 APT on graylog2 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:21:45] RECOVERY - rdb4 APT on rdb4 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:45] !log install security updates on all servers [16:21:52] RECOVERY - gluster3 APT on gluster3 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:21:54] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [16:21:56] RECOVERY - mon2 APT on mon2 is OK: APT OK: 26 packages available for upgrade (0 critical updates). [16:21:58] RECOVERY - ns1 APT on ns1 is OK: APT OK: 22 packages available for upgrade (0 critical updates). [16:22:16] RECOVERY - cloud3 APT on cloud3 is OK: APT OK: 97 packages available for upgrade (0 critical updates). [16:22:17] RECOVERY - mw8 APT on mw8 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:22:19] RECOVERY - services4 APT on services4 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:22:23] RECOVERY - cp3 APT on cp3 is OK: APT OK: 26 packages available for upgrade (0 critical updates). [16:22:26] RECOVERY - cp11 APT on cp11 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:22:31] RECOVERY - puppet3 APT on puppet3 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:22:40] RECOVERY - phab2 APT on phab2 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:22:44] RECOVERY - cloud5 APT on cloud5 is OK: APT OK: 37 packages available for upgrade (0 critical updates). [16:22:52] RECOVERY - cp12 APT on cp12 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:22:56] RECOVERY - db11 APT on db11 is OK: APT OK: 65 packages available for upgrade (0 critical updates). [16:23:00] RECOVERY - cp10 APT on cp10 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:23:01] RECOVERY - ns2 APT on ns2 is OK: APT OK: 25 packages available for upgrade (0 critical updates). [16:23:10] RECOVERY - services3 APT on services3 is OK: APT OK: 27 packages available for upgrade (0 critical updates). [16:23:11] RECOVERY - test3 APT on test3 is OK: APT OK: 29 packages available for upgrade (0 critical updates). [16:23:21] RECOVERY - gluster4 APT on gluster4 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:23:30] RECOVERY - cloud4 APT on cloud4 is OK: APT OK: 37 packages available for upgrade (0 critical updates). [16:23:39] RECOVERY - mail2 APT on mail2 is OK: APT OK: 28 packages available for upgrade (0 critical updates). [16:24:13] PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 8.86, 6.61, 5.58 [16:25:30] RECOVERY - rdb3 APT on rdb3 is OK: APT OK: 24 packages available for upgrade (0 critical updates). [16:27:13] PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 8.89, 7.74, 6.28 [16:27:22] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtM1z [16:27:24] [02miraheze/puppet] 07paladox 030143508 - Increase warning ping for cp3 to 300 [16:28:08] RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 4.45, 6.37, 5.81 [16:28:49] RECOVERY - ping6 on cp3 is OK: PING OK - Packet loss = 0%, RTA = 265.45 ms [16:29:05] PROBLEM - mw8 Current Load on mw8 is WARNING: WARNING - load average: 5.89, 7.27, 6.29 [16:33:02] RECOVERY - mw8 Current Load on mw8 is OK: OK - load average: 5.80, 6.45, 6.19 [16:44:16] .deop [16:53:55] paladox, or Universal_Omega, can either of you take a look at https://meta.miraheze.org/w/index.php?title=Community_noticeboard&diff=next&oldid=163081&diffmode=source#ManageWiki/Settings_broken? ManageWiki/settings isn't accepting changes, and I wonder if ManageWiki cache needs to be cleared maybe? I can reproduce the user's error as I tried to change it, but it didn't accept the change [16:53:57] [ Difference between revisions of "Community noticeboard" - Miraheze Meta ] - meta.miraheze.org [16:55:57] Doubt its cache [16:56:14] we don't use cache for showing settings, so what you see is whats in the database [16:56:58] Oh, okay. Any idea what could cause it then? Could it be a ManageWiki bug with the incorrect syntax of that logo URL? [16:57:22] Maybe we can use `sql.php` or `eval.php` to clear the logo URL from that database table on that wiki? [17:00:10] dmehus: I know the issue [17:00:27] One second. [17:02:15] Universal_Omega, oh, what's the issue? [17:02:39] dmehus: Have them enable SocialProfile on that wiki, then make a change to wgCosmosProfileTagGroups and see if that saves. The value there is an invalid option. [17:02:41] and can you correct the issue so it doesn't reoccur in the future on other wikis? [17:03:06] What if they don't want SocialProfile though? [17:03:14] Disable it afterwards. [17:03:19] Just temporarily [17:03:27] Universal_Omega: can we get a proper fix? [17:03:36] ^ [17:03:50] How did a CosmosProfileTagGroups get set though if SocialProfile's not enabled? [17:04:12] Can't you make that greyed out if SocialProfile's not enabled, or fix it to not require SocialProfile? [17:04:43] RhinosF1: no it's an upstream bug. And dmehus it is grayed out they somehow changed it awhile ago. Or deleted the default groups possibly. [17:04:59] It's an issue with input validation [17:05:09] Universal_Omega: okay [17:05:13] Not displaying error messages except on that specifictab [17:05:19] We can't change our input validation to use another validation method? [17:05:23] No [17:05:26] oh [17:06:04] It's a valid error but it can only be seen on the specific tab. Not everywhere appearing like it's not doing anything. [17:06:09] Well, since it's related to a bug, you can probably change it for them, or I could. Seems less bureaucratic than having them temporarily enable and disable SocialProfile [17:06:16] do you want to or me? [17:06:43] I can't change it unless SP is temporarily enabled. [17:06:50] Because it's grayed out. [17:07:01] Yeah that's what I'm saying, you temporarily enable it to resolve a bug [17:07:23] Oh. Sure give me a minute. [17:07:33] you can probably do that, since it's related to the bug, or I can, if you'd prefer that [17:08:12] * dmehus hopes this apparent upstream bug is resolved soon as this fix seems quite hacky heh [17:08:49] Universal_Omega do you know how to fix the upstream bug? [17:10:04] paladox: no idea. The bug may just be missing functionality tbh. All fixing it would do is allow us to display errors about why managewiki isn't saving when validation fails. [17:10:12] oh [17:14:21] dmehus: fixed [17:15:50] Universal_Omega, thanks, except the SocialProfile-added user rights were not removed. [17:16:04] I thought we fixed this when we fixed the Report extension? [17:16:58] RhinosF1: since you first responded to the incident, could you do the IR for https://phabricator.miraheze.org/T6859#135260 ? [17:16:59] [ ⚓ T6859 Huge rise in Redis server error: Could not insert

job(s). ] - phabricator.miraheze.org [17:17:16] (assuming you already are but just to make sure) [17:17:20] Reception123: yep I plan to [17:17:26] ok, sounds good [17:17:57] Oh. No that didn't fix that. I still have open fixes to ManageWiki which will do namespaces. But I still need to recently done ManageWiki fix to mw-config to remove them. I will soon. Sorry did not think about that. [17:18:11] dmehus: ^ [17:18:55] dmehus: actually I still need to finish that fix on ManageWiki. [17:32:46] Universal_Omega, oh okay. So when you're done with that wiki, can we just remove the SP-added user rights with `eval.php` or `sql.php` or something so we don't have to re-add the extension again, remove rights, and remove the extension? [17:33:05] like you did with those other wikis before, basically [17:33:37] PROBLEM - cloud4 Current Load on cloud4 is WARNING: WARNING - load average: 21.54, 18.78, 14.50 [17:34:01] PROBLEM - cp10 Current Load on cp10 is CRITICAL: CRITICAL - load average: 4.25, 7.59, 3.85 [17:35:37] RECOVERY - cloud4 Current Load on cloud4 is OK: OK - load average: 11.15, 15.76, 13.91 [17:38:03] PROBLEM - cp10 Current Load on cp10 is WARNING: WARNING - load average: 0.22, 3.50, 3.01 [17:40:01] RECOVERY - cp10 Current Load on cp10 is OK: OK - load average: 0.17, 2.40, 2.66 [19:06:31] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JtM7F [19:06:33] [02miraheze/mw-config] 07paladox 030057e53 - redis: Stop using nutcracker [19:06:34] [02mw-config] 07paladox created branch 03paladox-patch-2 - 13https://git.io/vbvb3 [19:06:36] [02mw-config] 07paladox opened pull request 03#3724: redis: Stop using nutcracker - 13https://git.io/JtM7b [19:07:44] miraheze/mw-config - paladox the build passed. [19:39:19] [02mw-config] 07paladox closed pull request 03#3724: redis: Stop using nutcracker - 13https://git.io/JtM7b [19:39:21] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMds [19:39:22] [02miraheze/mw-config] 07paladox 03104c461 - redis: Stop using nutcracker (#3724) [19:39:23] [02mw-config] 07paladox deleted branch 03paladox-patch-2 - 13https://git.io/vbvb3 [19:39:25] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-2 [19:40:17] miraheze/mw-config - paladox the build passed. [19:46:37] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMdK [19:46:39] [02miraheze/mediawiki] 07paladox 039265095 - Update Cargo [19:47:57] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_35 [+0/-0/±1] 13https://git.io/JtMd6 [19:47:59] [02miraheze/mediawiki] 07paladox 032f2745e - Update Cargo [19:56:41] PROBLEM - cp12 Current Load on cp12 is CRITICAL: CRITICAL - load average: 1.27, 2.02, 1.47 [19:58:40] RECOVERY - cp12 Current Load on cp12 is OK: OK - load average: 0.63, 1.50, 1.34 [22:07:43] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/JtMpj [22:07:44] [02miraheze/puppet] 07paladox 0391536c7 - Reinstall rdb4 as mc2 [22:07:46] [02puppet] 07paladox created branch 03paladox-patch-1 - 13https://git.io/vbiAS [22:07:47] [02puppet] 07paladox opened pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:08:55] [02puppet] 07paladox edited pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:13] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-1/±0] 13https://git.io/JtMhJ [22:09:14] [02miraheze/puppet] 07paladox 03c27a0ee - Delete rdb4.yaml [22:09:16] [02puppet] 07paladox synchronize pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:20] [02puppet] 07paladox closed pull request 03#1647: Reinstall rdb4 as mc2 - 13https://git.io/JtMhe [22:09:22] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-1/±1] 13https://git.io/JtMhT [22:09:23] [02miraheze/puppet] 07paladox 0399e8a1c - Reinstall rdb4 as mc2 (#1647) [22:09:25] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-1 [22:09:26] [02puppet] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbiAS [22:10:56] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JtMhq [22:10:58] [02miraheze/dns] 07paladox 03a432612 - rdb4 -> mc2 [22:11:57] !log renaming rdb4 to mc2 [22:12:00] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [22:18:11] PROBLEM - rdb4 APT on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:18:11] PROBLEM - rdb4 Disk Space on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:18:19] PROBLEM - rdb4 Current Load on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:12] PROBLEM - rdb4 SSH on rdb4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:12] PROBLEM - ping6 on rdb4 is CRITICAL: CRITICAL - Destination Unreachable (2001:41d0:800:1bbd::12) [22:20:13] PROBLEM - rdb4 Puppet on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:13] PROBLEM - rdb4 NTP time on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:13] PROBLEM - rdb4 Redis Process on rdb4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [22:20:15] PROBLEM - ping4 on rdb4 is CRITICAL: PING CRITICAL - Packet loss = 100% [22:20:20] PROBLEM - Host rdb4 is DOWN: PING CRITICAL - Packet loss = 100% [22:22:43] I assume that will clear when puppet runs on mon2 paladox ? [22:22:52] yes [22:22:55] Should I downtime? [22:23:14] I don't even see them now [22:23:51] PROBLEM - mon2 APT on mon2 is CRITICAL: APT CRITICAL: 27 packages available for upgrade (1 critical updates). [22:24:25] Yeah it's gone already [22:27:52] Another wikimedia page. I'll stay awake and see if it calms. Might make stuff slow again [22:34:53] PROBLEM - test3 APT on test3 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:35:21] PROBLEM - mw9 APT on mw9 is CRITICAL: APT CRITICAL: 19 packages available for upgrade (18 critical updates). [22:35:29] PROBLEM - mw8 APT on mw8 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:35:36] paladox: replied to your graylog lag task [22:35:43] thanks! [22:35:57] https://phabricator.miraheze.org/T6862 [22:35:58] [ ⚓ T6862 Use memcache for the cache ] - phabricator.miraheze.org [22:35:59] PROBLEM - mw10 APT on mw10 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:36:02] eh? [22:36:17] PROBLEM - mw11 APT on mw11 is CRITICAL: APT CRITICAL: 47 packages available for upgrade (18 critical updates). [22:36:18] PROBLEM - jobrunner3 APT on jobrunner3 is CRITICAL: APT CRITICAL: 50 packages available for upgrade (18 critical updates). [22:36:26] PROBLEM - jobrunner4 APT on jobrunner4 is CRITICAL: APT CRITICAL: 48 packages available for upgrade (18 critical updates). [22:37:08] I recalled moving mediawiki caching to local memcache instances made wikis slower, last time [22:37:15] SPF|Cloud https://phabricator.miraheze.org/T6858#135316 [22:37:16] [ ⚓ T6858 Messages take a while to be sent to graylog ] - phabricator.miraheze.org [22:37:31] SPF|Cloud yeh, this will be on a dedicated instance and the last time we tried we were using openvz [22:38:07] is openvz slower than kvm? [22:38:49] i mean we found that we can handle more using the dedicated server than on openvz :) [22:39:13] though we can see if things get slower with the dedicated instance, we can easily revert back to redis. [22:39:56] of course we can handle more, we have some high power CPUs nowadays [22:40:51] kvm instances run on a dedicated kernel, openvz shares the host kernel. if you see performance improvements on the new VMs, that is due to better hardware, not because kvm is performing better than openvz[1] [22:41:07] [1] unless your sysctl tuning helps, which is possible in kvm but not in openvz [22:41:15] yeh [22:41:36] but also because we were doing it locally, it meant that each load used a different memcache [22:41:44] *page [22:43:09] !sre I can't access `mc2wiki`, `mc2.miraheze.org`. Getting connection timed out / error 500s [22:43:19] correct, but with only four mediawiki servers running, the chance of a miss is no more than 25% [22:43:57] err, 75%, which will eventually be reduced to 0% as cache warms up [22:43:58] dmehus: looking [22:44:00] I think I know why though and if so not much we can do [22:44:06]