[00:00:19] oh, glusterfs OOM'd before I had a chance to restart it [00:01:32] RECOVERY - cp3 Disk Space on cp3 is OK: DISK OK - free space: / 3593 MB (14% inode=93%); [00:02:20] RECOVERY - mon1 Disk Space on mon1 is OK: DISK OK - free space: / 4892 MB (13% inode=93%); [00:02:31] PROBLEM - bacula2 Bacula Static on bacula2 is CRITICAL: CRITICAL: Timeout or unknown client: gluster1-fd [00:03:35] !log start bacula-fd on gluster1 [00:03:39] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [00:04:25] RECOVERY - bacula2 Bacula Static on bacula2 is OK: OK: Full, 1664252 files, 197.7GB, 2020-07-19 23:58:00 (6.4 minutes ago) [00:05:29] PROBLEM - gluster1 Puppet on gluster1 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[bacula-fd],Exec[/mnt/mediawiki-static] [00:10:21] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJcg2 [00:10:23] [02miraheze/services] 07MirahezeSSLBot 03ebb0433 - BOT: Updating services config for wikis [00:13:31] !log db12 high on mem usage, added 2G swap + entry in /etc/fstab [00:13:38] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [00:16:23] PROBLEM - mw6 APT on mw6 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [00:20:06] <-CloudGuy38-> I also see the email that the Scratch Team wants to delete Bad Scratch Wiki https://badscratch.miraheze.org , not just privated, also what are the steps to delete it? I'll try to tell the stewards about it. [00:20:07] [ Bad Scratch Wiki ] - badscratch.miraheze.org [00:33:24] !log db12: set table_(definition|open)_cache to 4000 [00:33:27] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [00:43:21] RECOVERY - gluster1 Puppet on gluster1 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:44:48] !log root@gluster1:/var/log/glusterfs# gluster volume set mvol open-behind off [00:44:51] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [00:48:29] !log reboot jobrunner1 - gluster mount [00:48:32] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [00:50:28] PROBLEM - jobrunner1 MirahezeRenewSsl on jobrunner1 is CRITICAL: connect to address 51.89.160.135 and port 5000: Connection refused [00:51:17] PROBLEM - jobrunner1 APT on jobrunner1 is CRITICAL: APT CRITICAL: 4 packages available for upgrade (3 critical updates). [00:52:24] RECOVERY - jobrunner1 MirahezeRenewSsl on jobrunner1 is OK: TCP OK - 0.000 second response time on 51.89.160.135 port 5000 [01:10:14] PROBLEM - cloud1 APT on cloud1 is CRITICAL: APT CRITICAL: 54 packages available for upgrade (1 critical updates). 
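The two db12 entries above ("added 2G swap + entry in /etc/fstab" and "set table_(definition|open)_cache to 4000") don't show the exact commands; a minimal sketch of what that typically looks like — the /swapfile path and the use of SET GLOBAL are assumptions, not taken from the log:

    # assumed path; the actual swap file location on db12 isn't in the log
    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap sw 0 0' >> /etc/fstab   # persist across reboots

    # the later table cache change, applied at runtime; matching lines in the
    # MariaDB config are needed for it to survive a server restart
    mysql -e 'SET GLOBAL table_definition_cache = 4000; SET GLOBAL table_open_cache = 4000;'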
[01:16:55] PROBLEM - miraheze.wiki - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210' [01:17:04] PROBLEM - ml.gyaanipedia.co.in - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.89.160.142' [01:17:14] PROBLEM - dariawiki.org - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210' [01:23:37] RECOVERY - miraheze.wiki - DNS on sslhost is OK: DNS OK: 0.052 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [01:23:53] RECOVERY - dariawiki.org - DNS on sslhost is OK: DNS OK: 0.041 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [01:23:57] RECOVERY - ml.gyaanipedia.co.in - DNS on sslhost is OK: DNS OK: 0.049 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [01:41:14] PROBLEM - mw4 APT on mw4 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [02:20:44] PROBLEM - cp3 APT on cp3 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [02:57:09] PROBLEM - cloud3 APT on cloud3 is CRITICAL: APT CRITICAL: 13 packages available for upgrade (1 critical updates). [03:06:41] PROBLEM - services1 APT on services1 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [03:20:19] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJcr9 [03:20:21] [02miraheze/services] 07MirahezeSSLBot 03a08bf11 - BOT: Updating services config for wikis [04:44:16] PROBLEM - mw5 APT on mw5 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [05:31:29] PROBLEM - cloud2 APT on cloud2 is CRITICAL: APT CRITICAL: 54 packages available for upgrade (1 critical updates). [05:37:42] PROBLEM - services2 APT on services2 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [06:04:15] RECOVERY - cloud1 APT on cloud1 is OK: APT OK: 53 packages available for upgrade (0 critical updates). [06:11:14] RECOVERY - mw4 APT on mw4 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:11:41] PROBLEM - rdb2 APT on rdb2 is CRITICAL: APT CRITICAL: 3 packages available for upgrade (2 critical updates). [06:13:14] RECOVERY - jobrunner1 APT on jobrunner1 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:19:41] RECOVERY - services2 APT on services2 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:26:21] RECOVERY - mw6 APT on mw6 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:36:41] RECOVERY - services1 APT on services1 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:39:09] RECOVERY - cloud3 APT on cloud3 is OK: APT OK: 12 packages available for upgrade (0 critical updates). [06:49:40] RECOVERY - rdb2 APT on rdb2 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:51:30] RECOVERY - cloud2 APT on cloud2 is OK: APT OK: 53 packages available for upgrade (0 critical updates). 
[06:52:53] RECOVERY - cp3 APT on cp3 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [06:54:16] RECOVERY - mw5 APT on mw5 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [07:19:49] .help [07:19:49] Hang on, I'm creating a list. [07:19:51] I've posted a list of my commands at https://clbin.com/uZGd3 - You can see more info about any of these commands by doing .help (e.g. .help time) [07:20:39] .tmask Welcome to the IRC channel of Miraheze, a free non-profit wiki hosting provider! | https://meta.miraheze.org | Status: {} | SRE Duty: {} | This channel is publicly logged at http://wm-bot.wmflabs.org/browser/index.php?display=%23miraheze | By participating in this channel, you agree to abide by our Code of Conduct: https://meta.miraheze.org/m/PA [07:20:39] Gotcha, RhinosF1 [07:20:40] [ Meta ] - meta.miraheze.org [07:20:40] [ Wikimedia IRC logs browser ] - wm-bot.wmflabs.org [07:20:41] [ Code of Conduct - Miraheze Meta ] - meta.miraheze.org [07:20:50] .topic Up RhinosF1 [07:20:50] Please wait... [07:20:51] Not enough arguments. You gave 1, it requires 2. [07:21:33] .topic Up,RhinosF1 [07:21:33] Not enough arguments. You gave 1, it requires 2. [07:21:40] .help topic [07:21:41] Change the channel topic. The bot must be a channel operator for this command to work. [07:21:41] e.g. .topic Your Great New Topic [07:21:56] .showmask [07:21:56] Welcome to the IRC channel of Miraheze, a free non-profit wiki hosting provider! | https://meta.miraheze.org | Status: {} | SRE Duty: {} | This channel is publicly logged at http://wm-bot.wmflabs.org/browser/index.php?display=%23miraheze | By participating in this channel, you agree to abide by our Code of Conduct: https://meta.miraheze.org/m/PA [07:22:09] .topic . Up RhinosF1 [07:22:10] Not enough arguments. You gave 1, it requires 2. [07:22:18] Hmm you're broke [08:07:36] RhinosF1: add DT right to sysops in ManageWikiExtensions [08:12:34] MirahezeBot: later I got osticket to fail over [11:08:35] PROBLEM - www.wikimicrofinanza.it - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,51.77.107.210,51.89.160.142' [11:08:36] PROBLEM - wiki.nowchess.org - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,51.77.107.210,51.89.160.142' [11:09:08] Huh [11:15:27] RECOVERY - www.wikimicrofinanza.it - DNS on sslhost is OK: DNS OK: 0.051 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [11:15:30] RECOVERY - wiki.nowchess.org - DNS on sslhost is OK: DNS OK: 0.038 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [12:00:09] PROBLEM - cyberlaw.ccdcoe.org - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'cyberlaw.ccdcoe.org' expires in 15 day(s) (Wed 05 Aug 2020 11:53:28 GMT +0000). [12:00:30] [02miraheze/ssl] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJcjF [12:00:31] [02miraheze/ssl] 07MirahezeSSLBot 03765780f - Bot: Update SSL cert for cyberlaw.ccdcoe.org [12:14:07] RECOVERY - cyberlaw.ccdcoe.org - LetsEncrypt on sslhost is OK: OK - Certificate 'cyberlaw.ccdcoe.org' will expire on Sun 18 Oct 2020 11:00:23 GMT +0000. 
[12:34:20] PROBLEM - cloud2 Current Load on cloud2 is WARNING: WARNING - load average: 15.14, 21.56, 16.27 [12:34:21] PROBLEM - cp7 Current Load on cp7 is CRITICAL: CRITICAL - load average: 4.73, 8.74, 5.38 [12:36:10] @System Administrators [0b27d8430956b39af4ffb220] 2020-07-20 12:35:19: Fatal exception of type "JobQueueError" trying to decline a wiki request [12:36:18] RECOVERY - cloud2 Current Load on cloud2 is OK: OK - load average: 12.77, 17.88, 15.51 [12:36:20] RECOVERY - cp7 Current Load on cp7 is OK: OK - load average: 1.96, 6.22, 4.86 [12:38:38] Tried it again, so seems to be more than a one-off error... "[c1b6ce924ec7ee40f0a47f16] 2020-07-20 12:38:06: Fatal exception of type "JobQueueError"" [12:38:58] Which request? [12:39:35] https://meta.miraheze.org/wiki/Special:RequestWikiQueue/13281#mw-section-decline pings @RhinosF1 and @paladox, in case they're just hidden [12:39:35] [ Wiki requests queue - Miraheze Meta ] - meta.miraheze.org [12:40:25] @Doug: What did you type? [12:40:25] PROBLEM - ta.gyaanipedia.co.in - DNS on sslhost is CRITICAL: DNS CRITICAL - expected '2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142' but got '2001:41d0:800:1056::2,51.77.107.210,51.89.160.142' [12:40:54] @RhinosF1 Same thing I usually type for most requests. It wasn't too long. It was the same length as my previous notes. [12:41:04] Exactly? [12:41:21] "Procedural decline to notify user in case they're not watching this wiki request to go back into [[Special:RequestWikiEdit]] for this request and define the wiki's purpose, scope, and type of content." [12:42:05] Every other time it goes through fine; plus, @paladox increased the character limit with that Phab ticket from last week. This is a different error than the last time. [12:42:25] RhinosF1: are you trying, or do want me to? [12:44:19] @RhinosF1 Just tried approving a request...same error, [3c36b21f9e3e057f62fcd1e1] 2020-07-20 12:43:49: Fatal exception of type "JobQueueError" [12:44:45] So problem occurs with creating wikis too, not just declining them. [12:45:08] I just tried declining a different request, same error [12:45:43] @Sario Yeah, I tried approving/creating the ToastyMC one with the above error. [12:45:44] I think it's time for a phab ticket [12:46:12] 2020-07-20 12:44:13 mw6 metawiki: [6e724fc4e36bcfc852763030] /wiki/Special:RequestWikiQueue/13281 JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: Could not insert 1 EchoNotificationDeleteJob job(s). [12:46:45] RhinosF1: need me or Doug to write a phab ticket? [12:47:05] RECOVERY - ta.gyaanipedia.co.in - DNS on sslhost is OK: DNS OK: 0.037 seconds response time. sslhost returns 2001:41d0:800:1056::2,2001:41d0:800:105a::10,51.77.107.210,51.89.160.142 [12:47:28] no [12:47:30] i on it [12:47:38] ack [12:47:48] 1983 entries in the log with "Redis server error" [12:47:56] I think we can guess the cause [12:48:01] Oof. Puppet? [12:48:06] Redis [12:48:20] sorry not familiar with redis server [12:48:25] Subscribe me on the ticket please [12:48:43] calls this UBN [12:48:51] UBN? [12:49:03] Unbreak now! [12:49:16] LOL [12:49:42] Highest priority [12:49:49] Should we take CreateWiki down for maintenance until this is fixed? [12:49:50] And I agree [12:50:16] We should probably stop trying to create or decline for now [12:50:40] yes, but with 1,983 log entries, I don't think those were all wiki creators. [12:51:00] I wonder what the other ones were from. People editing their requests and not being able to? 
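The "1983 entries in the log" figure and a per-job breakdown of the failed insertions can be pulled straight from the exception log; a hedged sketch, assuming a log path of /var/log/mediawiki/exception.log (the real path isn't shown here):

    LOG=/var/log/mediawiki/exception.log   # assumed path
    grep -c 'Redis server error' "$LOG"    # overall count (1983 at this point)

    # break the failures down by job type (EchoNotificationDeleteJob, etc.)
    grep -oE 'Could not insert 1 [A-Za-z]+ job' "$LOG" | awk '{print $5}' | sort | uniq -c | sort -rn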
[12:51:51] Echo [12:52:53] Ah, thanks. [12:53:16] What's the Phabricator ticket? I don't see it in my subscriptions. Did you ping me in the details? [12:53:19] https://phabricator.miraheze.org/T5939 [12:53:20] [ ⚓ T5939 Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report er ] - phabricator.miraheze.org [12:53:26] > https://phabricator.miraheze.org/T5939 @RhinosF1 thanks [12:53:27] [ ⚓ T5939 Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report er ] - phabricator.miraheze.org [12:54:35] @RhinosF1 Looks good. 🙂 [12:54:38] paging @Site Reliability Engineers [12:54:56] paladox, SPF|Cloud, JohnLewis: ^ [12:56:53] Ooh, @NDKilla's online. Pinging him just in case he's around. [12:57:40] the sre ping should have done it [12:57:48] true [12:58:24] !log Mediawiki & Redis logs are flooded with errors related to https://phabricator.miraheze.org/T5939, cause currently unknown. [12:58:26] [ ⚓ T5939 Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report er ] - phabricator.miraheze.org [12:58:28] No need yo spam pings [12:58:30] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [12:58:41] s/yo/to [12:58:41] Sario meant to say: No need to spam pings [12:59:17] I get a lot of mileage out of that function [13:02:43] 2020-07-20 12:43:49 mw6 metawiki: [3c36b21f9e3e057f62fcd1e1] /wiki/Special:RequestWikiQueue/13282 JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: Could not insert 1 NamespaceMigrationJob job(s). [13:03:06] paladox: between 2 mw servers there's about 4000 variations of it [13:04:13] 1195:C 20 Jul 2020 13:03:29.017 # Failed opening the RDB file dump.rdb (in server root dir /srv/redis) for saving: Read-only file system [13:05:23] paladox: since and why? [13:05:51] a chmod issue? [13:06:21] since 06:12:23 [13:06:53] ok [13:07:44] PROBLEM - jobrunner1 Redis Process on jobrunner1 is CRITICAL: PROCS CRITICAL: 4 processes with args 'redis-server' [13:09:44] RECOVERY - jobrunner1 Redis Process on jobrunner1 is OK: PROCS OK: 1 process with args 'redis-server' [13:10:25] !log killed redis-server and started up (had to kill -9 it as it was just hanging on sudo service redis-server stop) [13:10:29] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [13:10:35] on jobrunner1 [13:10:43] ah [13:11:32] PROBLEM - cp3 Disk Space on cp3 is WARNING: DISK WARNING - free space: / 2646 MB (10% inode=93%); [13:11:34] paladox: is that working now? [13:12:01] logs indicate yes [13:12:50] paladox: how do we recover all the failed jobs? [13:12:50] yep, seems to be working for creating and declining wikis [13:13:10] That i'm not sure about [13:13:16] Logs indicate that various RecentChanges + refreshLinks jobs failed [13:13:35] If they couldn't be inserted then it would be impossible to get them (i think). [13:13:36] Good question; I wonder how many were FuzzyBot jobs. 
[13:13:49] paladox: could we maybe run rebuildall.php on all wikis although that might break a lot for a while [13:13:58] No we cannot [13:14:01] that would take weeks [13:14:24] and also as you said would break wikis [13:14:46] (i take back my weeks comment, it would take maybe a month or 2) [13:14:53] lol [13:14:57] paladox: how else we going to rebuild rc + links, they've not been refreshed on any wiki with edits since 6am [13:15:30] JohnLewis ^ (should we run rebuildAll on all wikis?) [13:15:35] @Doug: They don't get rebuilt. [13:15:43] @RhinosF1 Will the edits since 6 am just show up as patrolled edits, or what would be the impact in this case? [13:16:02] paladox: is there a way to work out any wiki that has had an edit since 6am? [13:16:19] I would imagine that would take quite a while [13:16:58] [02mw-config] 07Amanda-Catherine opened pull request 03#3184: Remove 'skin' from wgHiddenPrefs on dcmultiversewiki T5938 - 13https://git.io/JJCUH [13:17:23] @Doug: I'm not sure how I just see that there's a lot of "Could not insert 1 recentChangesUpdate job(s). " [13:18:10] > @Doug: I'm not sure how I just see that there's a lot of "Could not insert 1 recentChangesUpdate job(s). > " @RhinosF1 Ah, interesting. What are some of the wiki db names related to that error? I could check out the front end to see the way in which it's been impacted. [13:18:22] paladox: we could grep the exception logs for recentChangesUpdate and refreshLinksDynamic and get the line it's on [13:18:31] @Doug: Any wiki with an edit since 6am [13:19:09] then we'd have to take that list and me build a fancy script to go and parse it for db names [13:19:18] >  @Doug: Any wiki with an edit since 6am Which log is that? Is that log in a web-accessible folder? something cleaner than #wiki-feed [13:19:48] we don't have a list yet [13:19:52] ah [13:20:31] https://phabricator.miraheze.org/P332 [13:20:32] [ ✎ P332 (An Untitled Masterwork) ] - phabricator.miraheze.org [13:20:49] paladox: it was on every mw [13:20:59] I like that idea; doesn't sound like it would take long to write the fancy script....not sure how long it would take to extract the data from the exception logs [13:21:07] yes i know that [13:21:10] @Doug: Nearly done [13:21:39] if we strip the colon and dupes from P332 when every mw is added then we should have the list of affected wikis [13:24:08] paladox: update that paste with the list from both jobs on each mw and then I can strip dupes [13:24:55] RhinosF1 done [13:30:52] "Read-only file system" - are we sure that was resolved? File systems don't generally just go in and out of read-only on their own [13:31:44] @Void i mean i could write on the disk, searching the error brought me to https://stackoverflow.com/questions/44814351/failed-opening-the-rdb-file-read-only-file-system but that's set in the unit. [13:31:44] [ linux - Failed opening the RDB file ... Read-only file system - Stack Overflow ] - stackoverflow.com [13:31:54] The logs indicate it's saving sucessfully now [13:32:02] after killing redis-server with kill -9 [13:32:33] > "Read-only file system" - are we sure that was resolved? File systems don't generally just go in and out of read-only on their own @Void Yeah, wondered about that, and also why the redis server/service got hung like that. Something had to have caused it. 
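Void's question above — whether the read-only filesystem is really resolved — can be checked directly with standard tools; a sketch (the /srv/redis data directory comes from the error message quoted earlier):

    # is any filesystem still mounted read-only?
    awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2, $4}' /proc/mounts

    # can Redis persist its RDB snapshot to /srv/redis again?
    redis-cli CONFIG GET dir
    redis-cli BGSAVE
    redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_last_save_time'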
wonders if this will be the type of thing there will be a sysadmin post-mortem meeting on [13:33:08] !log root@jobrunner1:/home/paladox# /usr/local/bin/foreachwikiindblist /srv/mediawiki/w/cache/databases.json /srv/mediawiki/w/maintenance/runJobs.php [13:33:12] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [13:33:42] Hmm, well if you can write files, then it's probably fine, but finding out what caused it to go read-only would be a very good idea [13:35:08] paladox: generating the list now [13:35:24] ok [13:36:27] The last time I had an issue with a read-only file system, though, the cause couldn't really be identified, but the VHD had been corrupted and needed "repairs" [13:36:54] something caused it to "Received SIGTERM scheduling shutdown..." at 06:11: [13:37:08] *06:11:21 [13:38:04] found it [13:38:56] Oh? [13:38:59] was there a redis upgrade... [13:39:29] redis was upgraded [13:39:31] that's why [13:39:45] the unit from redis package is broken hence why we do our own [13:39:45] paladox: https://phabricator.miraheze.org/P332 [13:39:46] [ Login ] - phabricator.miraheze.org [13:39:58] That seems... Undesirable [13:39:58] 180 affected wikis [13:40:21] att is huge [13:40:36] i would like to hear JohnLewis opinion on running rebuildAll on all those wikis. [13:40:42] @paladox, att? learning acronyms [13:40:49] allthetropes [13:40:55] ah [13:41:01] thanks 🙂 [13:41:26] didn't RhinosF1 post something above about JohnLewis suggesting rebuildAll? [13:41:50] no [13:41:57] I suggested rebuildall [13:42:13] yeah, but I thought I saw something from John concurring with you [13:42:35] paladox: would running rebuild rc and rebuild links individually be better? [13:42:43] JohnLewis ^ (should we run rebuildAll on all wikis?) [13:42:44] yeh [13:42:59] paladox: we could do that as they are two affected parts [13:44:29] [02dns] 07MacFan4000 opened pull request 03#168: update phab endpoint - 13https://git.io/JJCkZ [13:44:44] PROBLEM - test2 APT on test2 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [13:45:15] Also mind documenting exactly what happened on the task? Would be a good idea to keep records of this kind of stuff in a better place than just IRC/Discord [13:45:28] paladox: can you merge dns? [13:45:28] (or incident report) [13:45:32] hi AmandaCath [13:45:46] !log root@jobrunner1:/home/paladox# ./foreachwikiindblist /home/paladox/all.dblist /srv/mediawiki/w/maintenance/rebuildrecentchanges.php [13:45:50] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [13:45:52] * AmandaCath waves to RhinosF1 [13:46:11] [02dns] 07paladox closed pull request 03#168: update phab endpoint - 13https://git.io/JJCkZ [13:46:11] paladox: why all.dblist when you have a list of only affected wikis? [13:46:13] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCkB [13:46:14] [02miraheze/dns] 07MacFan4000 03688ea3f - update phab endpoint (#168) [13:46:22] why do 4000 when you can do 180 [13:47:13] [02mw-config] 07paladox closed pull request 03#3184: Remove 'skin' from wgHiddenPrefs on dcmultiversewiki T5938 - 13https://git.io/JJCUH [13:47:15] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCkE [13:47:16] [02miraheze/mw-config] 07Amanda-Catherine 03f07a57f - Remove 'skin' from wgHiddenPrefs on dcmultiversewiki T5938 (#3184) [13:47:26] it's not 4000... 
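P332 was built by grepping the exception logs on each mw host and stripping the colons and duplicates, as discussed above; a hedged sketch of that step for one host — the log path is an assumption, and the field positions follow the sample line "2020-07-20 12:44:13 mw6 metawiki: ..." quoted earlier:

    LOG=/var/log/mediawiki/exception.log   # assumed path
    grep -hE 'recentChangesUpdate|refreshLinksDynamic' "$LOG" \
      | awk '{print $4}' | tr -d ':' | sort -u > /home/paladox/all.dblist
    wc -l < /home/paladox/all.dblist       # 180 affected wikis at this point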
[13:48:19] oh [13:48:24] I saw all.dblist [13:48:31] not /home/paladox [13:49:49] PROBLEM - cp9 APT on cp9 is CRITICAL: APT CRITICAL: 37 packages available for upgrade (1 critical updates). [13:51:26] https://www.irccloud.com/pastebin/x0wWCVYs/ [13:51:27] [ Snippet | IRCCloud ] - www.irccloud.com [13:52:06] paladox: them 7 wikis all missed categoryMembershipChange - could you please rebuild categories for them [13:52:17] ok [13:52:47] PROBLEM - rdb1 APT on rdb1 is CRITICAL: APT CRITICAL: 3 packages available for upgrade (2 critical updates). [13:52:49] paladox: could you check via salt for grep "categoryMembershipChange" exception.log to check other mw's [13:52:51] PROBLEM - mw4 Current Load on mw4 is WARNING: WARNING - load average: 7.87, 6.85, 5.79 [13:53:33] ^ gluster and php-fpm are higest [13:53:39] i don't see a script for rebuilding categories [13:54:03] paladox: grep on all mw's to check there's none on any but mw6 while I find it [13:54:44] RhinosF1 i don't see a script for rebuilding categories [13:55:07] paladox: https://www.mediawiki.org/wiki/Manual:RecountCategories.php, Will you please get a list from mw's not mw6 [13:55:07] [ Manual:recountCategories.php - MediaWiki ] - www.mediawiki.org [13:55:11] oh [13:56:20] PROBLEM - mw7 Puppet on mw7 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/mnt/mediawiki-static] [13:56:44] categoryMembershipChange,userGroupExpiry,userOptionsUpdate [13:56:45] RhinosF1 https://phabricator.miraheze.org/P333 [13:56:46] RECOVERY - mw4 Current Load on mw4 is OK: OK - load average: 5.80, 6.67, 5.99 [13:56:46] [ ✎ P333 (An Untitled Masterwork) ] - phabricator.miraheze.org [13:57:19] paladox: them 3 jobs I care about failing [13:58:33] https://www.irccloud.com/pastebin/DeDln2i9/ [13:58:33] [ Snippet | IRCCloud ] - www.irccloud.com [13:58:54] paladox: ^ that's for category, could you get a count from userGroupExpiry and userOptionsUpdate [14:01:02] PROBLEM - cp3 Stunnel Http for mw4 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:01:11] PROBLEM - cp7 Stunnel Http for mw4 on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:01:59] !log depool and repool mw4 [14:02:02] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [14:02:05] paladox: I think gluester and php just died [14:02:07] !log reboot mw4 too [14:02:08] PROBLEM - cp9 Varnish Backends on cp9 is CRITICAL: 1 backends are down. mw4 [14:02:08] PROBLEM - cp9 Stunnel Http for mw4 on cp9 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:02:09] PROBLEM - cp6 Varnish Backends on cp6 is CRITICAL: 1 backends are down. mw4 [14:02:09] yup [14:02:10] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [14:02:14] PROBLEM - cp6 Stunnel Http for mw4 on cp6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:02:20] RECOVERY - mw7 Puppet on mw7 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:02:25] PROBLEM - mw4 HTTPS on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:35] PROBLEM - cp7 Varnish Backends on cp7 is CRITICAL: 1 backends are down. mw4 [14:02:58] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 1 backends are down. mw4 [14:03:31] @RhinosF1 Yeah, not sure if this was related, but I timed out a couple times trying to load Meta...third time refreshing, it loaded. Could be a cache proxy issue, though. 
[14:03:43] it's related [14:04:19] PROBLEM - mw6 Puppet on mw6 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/mnt/mediawiki-static] [14:04:25] paladox: CentralAuthCreateLocalAccountJob [14:04:37] PROBLEM - mw4 Puppet on mw4 is CRITICAL: connect to address 51.89.160.128 port 5666: Connection refusedconnect to host 51.89.160.128 port 5666: Connection refused [14:04:44] PROBLEM - mw4 Current Load on mw4 is CRITICAL: connect to address 51.89.160.128 port 5666: Connection refusedconnect to host 51.89.160.128 port 5666: Connection refused [14:05:10] PROBLEM - mw4 NTP time on mw4 is CRITICAL: connect to address 51.89.160.128 port 5666: Connection refusedconnect to host 51.89.160.128 port 5666: Connection refused [14:05:13] PROBLEM - mw4 APT on mw4 is CRITICAL: connect to address 51.89.160.128 port 5666: Connection refusedconnect to host 51.89.160.128 port 5666: Connection refused [14:05:58] PROBLEM - mw4 Disk Space on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:06:08] PROBLEM - mw4 SSH on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:06:19] PROBLEM - mw4 php-fpm on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [14:06:38] that sounds like the function that attaches your account to a wiki when you visit it the first time [14:06:44] yep [14:09:18] RECOVERY - cp7 Stunnel Http for mw4 on cp7 is OK: HTTP OK: HTTP/1.1 200 OK - 15595 bytes in 0.016 second response time [14:09:37] RECOVERY - cp3 Stunnel Http for mw4 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15595 bytes in 1.036 second response time [14:10:04] RECOVERY - cp9 Varnish Backends on cp9 is OK: All 7 backends are healthy [14:10:06] RECOVERY - cp6 Varnish Backends on cp6 is OK: All 7 backends are healthy [14:10:18] RECOVERY - cp9 Stunnel Http for mw4 on cp9 is OK: HTTP OK: HTTP/1.1 200 OK - 15595 bytes in 1.513 second response time [14:10:28] RECOVERY - cp6 Stunnel Http for mw4 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 15609 bytes in 0.018 second response time [14:10:39] RECOVERY - cp7 Varnish Backends on cp7 is OK: All 7 backends are healthy [14:10:56] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 7 backends are healthy [14:14:18] RECOVERY - mw6 Puppet on mw6 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:39:19] !log root@jobrunner1:/home/paladox# ./foreachwikiindblist /home/paladox/7wikis.dblist /srv/mediawiki/w/maintenance/recountCategories.php --mode=pages [14:39:22] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [14:39:40] !log root@jobrunner1:/home/paladox# ./foreachwikiindblist /home/paladox/7wikis.dblist /srv/mediawiki/w/maintenance/recountCategories.php --mode=subcats [14:39:43] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [14:39:46] !log root@jobrunner1:/home/paladox# ./foreachwikiindblist /home/paladox/7wikis.dblist /srv/mediawiki/w/maintenance/recountCategories.php --mode=files [14:39:47] RhinosF1 ^ [14:39:49] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [14:40:15] paladox: ty [14:50:20] I had an inkling something may be wrong before I encountered that first fatal exception error when declining a request...in my "Watchlist," I couldn't get the pages to stay unbolded after visiting them. Now, the pages seem to stay unbolded. Probably was related to those job queue errors. 
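With the recountCategories.php passes above and the earlier runJobs.php rerun in flight, one way to confirm the backlog has actually drained is MediaWiki's stock showJobs.php over the same affected-wiki list; a sketch, not a command taken from the log:

    # per-wiki job counts broken down by type; should read 0 everywhere once the reruns finish
    /usr/local/bin/foreachwikiindblist /home/paladox/all.dblist \
        /srv/mediawiki/w/maintenance/showJobs.php --group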
[14:57:59] just noticed something odd in [[Special:RecentChanges]] on Meta, her user talk page needed patrolling except she's autopatrolled and never lost it so it's not related to that RC bug [14:59:02] Oh god, now a whole bunch of revisions I just patrolled need patrolling again [14:59:09] @System Administrators ^ [15:00:25] @Doug: It's because we're fixing the broken jobs [15:00:28] Just wait [15:00:32] It'll fix itself [15:00:38] oh okay [15:00:41] thanks [15:03:19] You don't need to manually do them [15:04:43] !log root@jobrunner1:/home/paladox# ./foreachwikiindblist /home/paladox/all.dblist /srv/mediawiki/w/maintenance/refreshLinks.php [15:04:47] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:06:36] RECOVERY - mw4 Disk Space on mw4 is OK: DISK OK - free space: / 6124 MB (32% inode=67%); [15:06:36] RECOVERY - mw4 HTTPS on mw4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 557 bytes in 0.055 second response time [15:06:36] RECOVERY - mw4 php-fpm on mw4 is OK: PROCS OK: 27 processes with command name 'php-fpm7.3' [15:06:36] RECOVERY - mw4 SSH on mw4 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [15:06:41] RECOVERY - mw4 Puppet on mw4 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:06:46] RECOVERY - mw4 Current Load on mw4 is OK: OK - load average: 2.12, 2.32, 2.21 [15:07:16] RECOVERY - mw4 NTP time on mw4 is OK: NTP OK: Offset -0.00661033392 secs [15:07:16] RECOVERY - mw4 APT on mw4 is OK: APT OK: 1 packages available for upgrade (0 critical updates). [15:12:24] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-11 [+0/-0/±1] 13https://git.io/JJCmc [15:12:25] [02miraheze/puppet] 07paladox 031f9485d - base: Do not run unintended upgrades for redis Redis broke because we use a different systemd unit (e.g it overwrote our unit). [15:12:26] [02puppet] 07paladox created branch 03paladox-patch-11 - 13https://git.io/vbiAS [15:12:28] [02puppet] 07paladox opened pull request 03#1456: base: Do not run unintended upgrades for redis - 13https://git.io/JJCmC [15:12:41] [02puppet] 07paladox closed pull request 03#1456: base: Do not run unintended upgrades for redis - 13https://git.io/JJCmC [15:12:42] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCmW [15:12:44] [02miraheze/puppet] 07paladox 0310e5bc1 - base: Do not run unintended upgrades for redis (#1456) Redis broke because we use a different systemd unit (e.g it overwrote our unit). 
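The commit above only says unintended redis upgrades are now blocked (the package upgrade had overwritten the custom systemd unit); the mechanism isn't shown, so the following is an assumption about how such a block is commonly done, not the actual puppet change:

    apt-mark hold redis-server     # apt (and, by default, unattended-upgrades) then skips the package
    apt-mark showhold              # verify the hold is in place

    # alternatively, a drop-in under /etc/systemd/system/redis-server.service.d/
    # survives package upgrades, since packaged units live in /lib/systemd/system
    systemctl daemon-reload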
[15:12:45] [02puppet] 07paladox deleted branch 03paladox-patch-11 - 13https://git.io/vbiAS [15:12:47] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-11 [15:14:47] [02puppet] 07Pix1234 opened pull request 03#1457: Add a brand new generated ssh key for mobile access - 13https://git.io/JJCmz [15:15:10] [02puppet] 07paladox closed pull request 03#1457: Add a brand new generated ssh key for mobile access - 13https://git.io/JJCmz [15:15:11] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCmV [15:15:13] [02miraheze/puppet] 07Pix1234 03b45315c - Add a brand new generated ssh key for mobile access (#1457) [15:17:21] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCmX [15:17:22] [02miraheze/puppet] 07paladox 0390fd317 - matomo: Upgrade to 3.14.0 [15:18:20] PROBLEM - jobrunner1 Current Load on jobrunner1 is CRITICAL: CRITICAL - load average: 12.30, 8.37, 5.55 [15:21:59] PROBLEM - mon1 Puppet on mon1 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_checkout_matomo] [15:22:16] paladox: ^ [15:22:23] yeh [15:22:55] I assume that means you know [15:23:59] RECOVERY - mon1 Puppet on mon1 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:25:42] well yeh [15:26:06] And given icinga-miraheze it's now fine [15:30:26] !log install gdb on mw4 [15:30:29] Ping me when the rebuilding is finished and I can check Meta for unpatrols. [15:30:30] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [15:31:03] paladox: ^ [15:31:44] it's done meta [15:33:11] @Doug: meta is good [15:39:05] @RhinosF1, okay, thanks 🙂 [15:40:10] Thanks for working on Miraheze 👍 [15:41:07] RECOVERY - espiri.wiki - LetsEncrypt on sslhost is OK: OK - Certificate 'espiri.wiki' will expire on Sun 27 Sep 2020 22:04:35 GMT +0000. [15:41:08] @RhinosF1 and @paladox, well sort of, there's still a boatload of unpatrolled revs on CN and SN I patrolled days ago. [15:41:22] Hmmm [15:46:20] PROBLEM - jobrunner1 Current Load on jobrunner1 is WARNING: WARNING - load average: 4.02, 6.23, 7.70 [15:47:00] Back to the 12th now and still have revisions to repatrol [15:47:11] on CN [15:48:26] can't temporarily add autopatrol to Meta:Users, either, because once we remove autopatrol again, the revs will be unpatrolled due to the deployed extension bug I need to get on reporting [15:48:38] is not sure what the solution is [15:50:33] PROBLEM - espiri.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:07] maybe a 5-10 minute shutdown and manually edit the database(s) to mark all unpatrolled revisions as patrolled? Any high traffic pages will be monitored by multiple people, so locally they can revert or adjust any mass patrolled revisions. [15:52:20] RECOVERY - jobrunner1 Current Load on jobrunner1 is OK: OK - load average: 4.40, 4.83, 6.56 [15:52:30] I don't get what's going on with your patrol bugs [15:56:15] Going back through Community noticeboard to find out how far back I have to go to get an unpatrolled user whose revision I don't need to patrol. [15:56:52] I think it's two bugs, but they're related. One seems to be related to this RC job queue thing and the other one is the one I've told you about [16:01:31] I've run into a couple of bugs when patrolling as well [16:01:40] Regarding the more pressing bug, how does this job queue/jobrunner thing tie in with the redis-server? 
[16:01:55] I'm using the gadget that adds the "Mark as patrolled" link to Special:RecentChanges and it doesn't always work [16:02:42] @AmandaCath, ah, okay, I'm not using that. This is just going back through the previous diffs and marking as patrolled. [16:03:17] Yeah, that's what I've had to do when the gadget fails [16:03:29] Usual I have no problem with that though [16:03:34] Usually* [16:09:37] yeah, I'm tempted to report this patrol bug as UBN; I normally wouldn't, but the one about a former administrator's or autopatroller's revisions made while administrator/autopatrolled has been outstanding, seemingly, forever. It doesn't happen on Wikimedia-owned wikis, so I'm sure it's a configuration issue with the way the extension is configured in MediaWiki [16:13:52] It's not UBN [16:14:22] which one? the main patrol bug, or this patrol bug related to the rebuilding? [16:15:10] okay I went back and repatrolled the revisions to June 18th, all revisions previously patrolled, and I'm still getting unpatrolled revisions that have been previously patrolled that are again unpatrolled. [16:15:36] looks like today is going to be a big Phab bug filing task day [16:16:40] Either [16:16:47] The rebuilding should be done [16:16:54] It makes no sense that that caused it [16:21:35] !log set i/o threads at 9 on gluster [16:21:39] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [16:43:23] PROBLEM - cloud2 Current Load on cloud2 is CRITICAL: CRITICAL - load average: 25.09, 21.02, 16.57 [16:45:14] RECOVERY - gluster2 GlusterFS port 49152 on gluster2 is OK: TCP OK - 0.000 second response time on 51.68.201.37 port 49152 [16:45:23] RECOVERY - cloud2 Current Load on cloud2 is OK: OK - load average: 18.52, 19.31, 16.45 [16:47:11] paladox: https://phabricator.miraheze.org/T5941 [16:47:13] [ ⚓ T5941 Error in to upload images... ] - phabricator.miraheze.org [16:48:54] !log restarted gluster on glusterfs2 [16:48:58] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [16:50:27] PROBLEM - wiki.xysspon.com - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'wiki.xysspon.com' expires in 15 day(s) (Wed 05 Aug 2020 16:45:23 GMT +0000). [16:56:56] [02miraheze/ssl] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCZt [16:56:57] [02miraheze/ssl] 07MirahezeSSLBot 038eb02ee - Bot: Update SSL cert for wiki.xysspon.com [17:04:04] RECOVERY - wiki.xysspon.com - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.xysspon.com' will expire on Sun 18 Oct 2020 15:56:48 GMT +0000. [17:06:04] >  It makes no sense that that caused it @RhinosF1 But you did say there were problems with RecentChanges in the error logs; I still think it would be worthwhile to file a separate bug, as, though it may be somewhat related, this newer bug is related to regular users who had their revisions already manually patrolled become unpatrolled again [17:07:06] paladox: has all the scripts on every wiki finished? [17:34:21] + done one more [17:37:12] !log set i/o threads to 12 on gluster [17:37:15] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [17:37:20] I have many emails with your name on @Doug [17:37:32] <-CloudGuy38-> What is happening to Void [17:38:52] >  I have many emails with your name on @Doug @RhinosF1 Sorry. I disabled e-mails for Phabricator; I can appreciate you get a lot of notifications. [17:39:11] I do get a lot [17:39:32] > What is happening to Void @-CloudGuy38- What do you mean? 
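The "set i/o threads at 9 / to 12 on gluster" log entries above don't include the command; assuming the standard GlusterFS volume option and the mvol volume named elsewhere in this log, it would look like:

    gluster volume set mvol performance.io-thread-count 12
    gluster volume get mvol performance.io-thread-count    # verify the new value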
[17:40:06] <-CloudGuy38-> Void today [17:40:10] @Doug: can you see if any other wiki on https://phabricator.miraheze.org/P332 has the same issue? [17:40:11] [ Login ] - phabricator.miraheze.org [17:40:19] As the patrol breaking bug [17:40:37] I can nosy in the db later [17:40:55] >  @Doug: can you see if any other wiki on https://phabricator.miraheze.org/P332 has the same issue? @RhinosF1 Sure. I don't know if they've been already patrolled, but should be easy to tell with duplicate patrol log entries for the same revision [17:40:56] [ Login ] - phabricator.miraheze.org [17:41:12] !log depool mw4 [17:41:17] You'll have to check manually [17:41:17] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [17:41:21] @-CloudGuy38- Not sure. I assume @Void has a full-time job and is busy with that? [17:41:25] paladox: why? [17:43:17] !log repool mw4 [17:43:22] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [17:43:23] RhinosF1 because the mount was broken [17:43:38] paladox: oh [17:43:51] * RhinosF1 grrrrs at both mw4 and gluster [17:44:30] <-CloudGuy38-> I think a full time job, and busy. [17:44:44] <-CloudGuy38-> Also it's under a month before my 15th birthday [17:45:02] @RhinosF1 can you unrestrict that Paste for me? [17:45:12] @Doug: link me to a revision that's still broke [17:45:13] (or @paladox) [17:45:21] And that would help wouldn't it if you could see it [17:45:29] https://phabricator.miraheze.org/P332 [17:45:30] [ Login ] - phabricator.miraheze.org [17:45:37] I'll check later and do a sanity check on the tables [17:45:59] Something's wrong and I'm hoping it's something obvious [17:46:02] >  I'll check later and do a sanity check on the tables @RhinosF1 Okay, sounds good, and I will check some of those other wikis. [17:46:15] Me too. I feel like this could be a hard to solve bug. 😦 [17:46:56] !log root@gluster1:/mnt# gluster volume set mvol performance.open-behind on [17:46:59] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [17:47:00] If it's what I hope it is, hopefully not [17:47:15] If it's not, I'm going to slam my head against my pillow tonight [17:47:20] >  If it's what I hope it is, hopefully not Oh, that's good. crosses fingers [17:47:42] It's whether next it affected all 180 wikis or just meta [17:48:02] Yeah. [17:48:55] I love redis at times [17:49:01] * RhinosF1 sarcastic [17:50:42] haha [17:50:59] [02miraheze/MatomoAnalytics] 07translatewiki pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCcK [17:51:01] [02miraheze/MatomoAnalytics] 07translatewiki 035530320 - Localisation updates from https://translatewiki.net. [17:51:02] [ Main page - translatewiki.net ] - translatewiki.net [17:51:02] [02miraheze/CreateWiki] 07translatewiki pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCc6 [17:51:04] [02miraheze/CreateWiki] 07translatewiki 0349bbb33 - Localisation updates from https://translatewiki.net. [17:51:05] [ Main page - translatewiki.net ] - translatewiki.net [17:51:05] [02miraheze/MirahezeMagic] 07translatewiki pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCci [17:51:07] [02miraheze/MirahezeMagic] 07translatewiki 034da7435 - Localisation updates from https://translatewiki.net. 
[17:51:08] [ Main page - translatewiki.net ] - translatewiki.net [17:51:08] [02miraheze/WikiDiscover] 07translatewiki pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCcP [17:51:10] [02miraheze/WikiDiscover] 07translatewiki 03ecdf93d - Localisation updates from https://translatewiki.net. [17:51:11] [ Main page - translatewiki.net ] - translatewiki.net [17:51:21] It's got a lovely habit of being an utter pain in the neck when it fails [17:59:23] PROBLEM - mw7 APT on mw7 is CRITICAL: APT CRITICAL: 2 packages available for upgrade (1 critical updates). [18:26:17] [02miraheze/MatomoAnalytics] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCC9 [18:26:19] [02miraheze/MatomoAnalytics] 07paladox 038579422 - Matomo: Update javascript Matches changes done in matomo 3.14.0: https://developer.matomo.org/guides/tracking-javascript-guide [18:26:19] [ JavaScript Tracking Client: Integrate - Matomo Analytics (formerly Piwik Analytics) - Developer Docs - v4 ] - developer.matomo.org [18:26:59] [02miraheze/mediawiki] 07paladox pushed 031 commit to 03REL1_34 [+0/-0/±1] 13https://git.io/JJCCF [18:27:01] [02miraheze/mediawiki] 07paladox 0340ff75d - Update MatomoAnalytics [18:38:25] PROBLEM - cloud2 Current Load on cloud2 is WARNING: WARNING - load average: 23.09, 17.22, 14.86 [18:40:16] PROBLEM - mw5 Puppet on mw5 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_MediaWiki core] [18:40:19] PROBLEM - mw6 Puppet on mw6 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_MediaWiki core] [18:40:20] PROBLEM - mw7 Puppet on mw7 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_MediaWiki core] [18:40:23] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJCWR [18:40:23] RECOVERY - cloud2 Current Load on cloud2 is OK: OK - load average: 16.03, 16.04, 14.69 [18:40:24] [02miraheze/services] 07MirahezeSSLBot 03f0c922c - BOT: Updating services config for wikis [18:41:22] PROBLEM - jobrunner1 Puppet on jobrunner1 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_MediaWiki core] [18:41:40] PROBLEM - mw4 Puppet on mw4 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_MediaWiki core] [18:43:21] RECOVERY - jobrunner1 Puppet on jobrunner1 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:43:39] RECOVERY - mw4 Puppet on mw4 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:44:16] RECOVERY - mw5 Puppet on mw5 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:44:20] RECOVERY - mw6 Puppet on mw6 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:44:21] RECOVERY - mw7 Puppet on mw7 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:21] PROBLEM - cloud2 Current Load on cloud2 is WARNING: WARNING - load average: 19.45, 21.12, 17.53 [18:48:20] RECOVERY - cloud2 Current Load on cloud2 is OK: OK - load average: 10.74, 17.78, 16.75 [18:50:15] [02MatomoAnalytics] 07paladox created branch 03paladox-patch-1 - 13https://git.io/fN4LT [18:50:28] [02MatomoAnalytics] 07paladox opened pull request 03#27: Bump version to 1.0.4 - 13https://git.io/JJClv [18:51:20] [02miraheze/MatomoAnalytics] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/JJClU [18:51:22] [02miraheze/MatomoAnalytics] 07paladox 03f51c408 - Update CHANGELOG [18:52:19] PROBLEM - mon1 Disk Space on mon1 is WARNING: DISK WARNING - free space: / 3906 MB (10% inode=93%); [18:53:35] [02MatomoAnalytics] 07paladox synchronize pull request 03#27: Bump version to 1.0.4 - 13https://git.io/JJClv [18:58:39] !log increase mon1 by 10gb [18:58:45] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [18:59:10] [02miraheze/MatomoAnalytics] 07paladox pushed 031 commit to 03paladox-patch-1 [+0/-0/±1] 13https://git.io/JJClq [18:59:12] [02miraheze/MatomoAnalytics] 07paladox 037e20714 - Bump version to 1.0.4 [19:00:19] RECOVERY - mon1 Disk Space on mon1 is OK: DISK OK - free space: / 13117 MB (29% inode=94%); [19:02:00] [02MatomoAnalytics] 07paladox closed pull request 03#27: Bump version to 1.0.4 - 13https://git.io/JJClv [19:02:02] [02miraheze/MatomoAnalytics] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JJCls [19:02:03] [02miraheze/MatomoAnalytics] 07paladox 0367ccaea - Bump version to 1.0.4 (#27) * Bump version to 1.0.4 * Update CHANGELOG [19:02:05] [02miraheze/MatomoAnalytics] 07paladox deleted branch 03paladox-patch-1 [19:02:21] [02MatomoAnalytics] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/fN4LT [19:41:35] PROBLEM - cp7 Current Load on cp7 is CRITICAL: CRITICAL - load average: 15.68, 10.19, 5.35 [19:41:50] PROBLEM - cloud2 Current Load on cloud2 is CRITICAL: CRITICAL - load average: 27.24, 20.90, 16.35 [19:43:52] RECOVERY - cloud2 Current Load on cloud2 is OK: OK - load average: 16.55, 19.92, 16.60 [19:45:28] PROBLEM - cp7 Current Load on cp7 is WARNING: WARNING - load average: 6.06, 7.58, 5.30 [19:47:24] RECOVERY - cp7 Current Load on cp7 is OK: OK - load average: 3.02, 5.98, 4.99 [19:52:33] !log root@jobrunner1:/srv/mediawiki/w/extensions/MatomoAnalytics/maintenance# sudo -u www-data php modifyMatomo.php --wiki loginwiki [19:52:36] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [19:54:52] !log root@jobrunner1:/srv/mediawiki/w/extensions/MatomoAnalytics/maintenance# sudo -u www-data php modifyMatomo.php --wiki test2wiki [19:54:56] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [19:55:22] !log MariaDB [mhglobal]> delete from matomo where matomo_wiki = 'test1wiki'; - db11 
[19:55:28] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log [20:10:45] PROBLEM - cp7 Current Load on cp7 is CRITICAL: CRITICAL - load average: 5.16, 8.54, 5.85 [20:12:43] RECOVERY - cp7 Current Load on cp7 is OK: OK - load average: 1.68, 6.09, 5.26 [21:00:08] PROBLEM - fr.gyaanipedia.co.in - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'fr.gyaanipedia.co.in' expires in 15 day(s) (Wed 05 Aug 2020 20:51:22 GMT +0000). [21:07:54] [02miraheze/ssl] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JJC0o [21:07:56] [02miraheze/ssl] 07MirahezeSSLBot 0329bfa7c - Bot: Update SSL cert for fr.gyaanipedia.co.in [21:13:57] RECOVERY - fr.gyaanipedia.co.in - LetsEncrypt on sslhost is OK: OK - Certificate 'fr.gyaanipedia.co.in' will expire on Sun 18 Oct 2020 20:07:48 GMT +0000.