[00:00:48] RECOVERY - cp3 Disk Space on cp3 is OK: DISK OK - free space: / 3422 MB (14% inode=93%); [01:41:27] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 1.51, 1.83, 1.39 [01:44:07] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 0.54, 1.30, 1.26 [03:10:35] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 1.72, 1.62, 1.23 [03:13:22] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 0.77, 1.35, 1.19 [03:33:19] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 2.04, 1.84, 1.42 [03:35:56] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 1.01, 1.56, 1.38 [05:55:10] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfT6d [05:55:12] [02miraheze/services] 07MirahezeSSLBot 039a5abce - BOT: Updating services config for wikis [06:24:53] PROBLEM - rdb2 Puppet on rdb2 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[nagios-plugins] [06:33:04] RECOVERY - rdb2 Puppet on rdb2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:40:18] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTPL [06:40:20] [02miraheze/services] 07MirahezeSSLBot 03d664d80 - BOT: Updating services config for wikis [06:49:49] hello! are we aware of the 503s? [06:52:39] musikanimal: no, monitoring just quit as well [06:52:41] Thx [06:53:21] paladox, Reception123, PuppyKun, SPF|Cloud: ^ [06:53:43] Everything seems dead [06:55:31] Zppix is on "SRE Duty" [06:58:30] musikanimal: Zppix will be asleep [06:58:32] .t Zppix [06:58:33] 2020-04-21 - 01:58:33CDT [06:58:57] Phab is showing an error on db7, so could it be a database error? [07:04:13] Could it be http://travaux.ovh.net/?do=details&id=44159 [07:04:13] [ OVH Tasks ] - travaux.ovh.net [07:04:36] Times exactly to our outage [07:11:02] PROBLEM - cp7 Stunnel Http for test2 on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:11:15] Hmm [07:12:38] There must be something then [07:12:47] But still no service [07:23:23] PROBLEM - mw5 Puppet on mw5 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:23:39] PROBLEM - cp8 Stunnel Http for mw5 on cp8 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:24:09] PROBLEM - mw7 HTTPS on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:19] PROBLEM - cp8 Stunnel Http for mw6 on cp8 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:24:19] PROBLEM - cp6 Stunnel Http for mw5 on cp6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:24:20] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is CRITICAL: CRITICAL - NGINX Error Rate is 91% [07:24:23] yes well done icinga-miraheze [07:24:29] PROBLEM - db7 SSH on db7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:30] PROBLEM - db6 SSH on db6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:39] PROBLEM - mw7 SSH on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:40] PROBLEM - db7 Puppet on db7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
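Nearly every alert in the flood that starts here is the same failure: an NRPE check against a host that has stopped answering, timing out after ten seconds. As a rough illustration only (not Icinga's actual plugin code), a bare TCP probe against the NRPE port reproduces the symptom; the host names below are placeholders, and the port number matches the "port 5666" seen in later alerts:

```python
# Minimal sketch: probe the NRPE port with the same 10-second timeout the
# alerts above use. Host names are placeholders; the real checks run
# check_nrpe against these hosts rather than this script.
import socket

HOSTS = ["mw5.miraheze.org", "db7.miraheze.org"]  # hypothetical examples
NRPE_PORT = 5666
TIMEOUT = 10  # seconds, matching "Socket timeout after 10 seconds"

for host in HOSTS:
    try:
        with socket.create_connection((host, NRPE_PORT), timeout=TIMEOUT):
            print(f"{host}: NRPE port reachable")
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        print(f"{host}: CRITICAL - {exc}")
```

When the VM itself is down, as here, every such probe fails the same way, so every NRPE-based service check on that host goes CRITICAL at once, which is why a single host outage produces dozens of alerts.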
[07:24:49] PROBLEM - mw5 HTTPS on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:50] PROBLEM - bebaskanpengetahuan.id - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:24:59] PROBLEM - mw4 Current Load on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:25:09] PROBLEM - db7 Disk Space on db7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:25:10] PROBLEM - cp6 HTTPS on cp6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4205 bytes in 0.007 second response time [07:25:12] PROBLEM - phab1 phabricator.miraheze.org HTTPS on phab1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 5483 bytes in 2.040 second response time [07:25:20] PROBLEM - Host cp7 is DOWN: PING CRITICAL - Packet loss = 100% [07:25:23] PROBLEM - cp3 Stunnel Http for mw6 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:25:29] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [07:25:47] PROBLEM - mw4 Puppet on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:25:49] PROBLEM - ping4 on mw6 is CRITICAL: PING CRITICAL - Packet loss = 100% [07:25:59] PROBLEM - mw7 Disk Space on mw7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:26:00] PROBLEM - mw5 Disk Space on mw5 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:26:11] PROBLEM - mw7 php-fpm on mw7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:26:21] PROBLEM - mw4 php-fpm on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:26:21] PROBLEM - cp6 Varnish Backends on cp6 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [07:26:30] PROBLEM - mw6 SSH on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:26:40] PROBLEM - mon1 grafana.miraheze.org HTTPS on mon1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.004 second response time [07:26:50] PROBLEM - cp6 Stunnel Http for mw4 on cp6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:26:51] PROBLEM - Host db6 is DOWN: PING CRITICAL - Packet loss = 100% [07:26:51] PROBLEM - mon1 icinga.miraheze.org HTTPS on mon1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.005 second response time [07:27:04] PROBLEM - ping4 on mw5 is CRITICAL: PING CRITICAL - Packet loss = 100% [07:27:10] PROBLEM - publictestwiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:27:10] PROBLEM - mw5 MediaWiki Rendering on mw5 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4276 bytes in 0.045 second response time [07:27:40] PROBLEM - Host db7 is DOWN: PING CRITICAL - Packet loss = 100% [07:28:02] PROBLEM - mw7 Current Load on mw7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:28:32] PROBLEM - Host mw5 is DOWN: PING CRITICAL - Packet loss = 100% [07:28:36] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 4 backends are down. 
mw4 mw5 mw6 mw7 [07:28:56] PROBLEM - cp8 Stunnel Http for mon1 on cp8 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.232 second response time [07:28:58] PROBLEM - mw6 Disk Space on mw6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:29:08] PROBLEM - mw7 Puppet on mw7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:29:23] PROBLEM - mw6 php-fpm on mw6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:29:26] PROBLEM - Host mw6 is DOWN: PING CRITICAL - Packet loss = 100% [07:29:27] PROBLEM - test2 MediaWiki Rendering on test2 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4276 bytes in 0.044 second response time [07:29:36] PROBLEM - phab1 phab.miraheze.wiki HTTPS on phab1 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 500 Internal Server Error [07:29:42] PROBLEM - mw7 MediaWiki Rendering on mw7 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4272 bytes in 0.044 second response time [07:29:50] PROBLEM - mw4 HTTPS on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:29:52] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [07:30:02] PROBLEM - nonciclopedia.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:30:15] PROBLEM - Host mw4 is DOWN: PING CRITICAL - Packet loss = 100% [07:30:26] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 56% [07:30:27] PROBLEM - cp6 Stunnel Http for mon1 on cp6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.002 second response time [07:30:36] PROBLEM - cp3 Stunnel Http for mw4 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:30:36] PROBLEM - jobrunner1 MediaWiki Rendering on jobrunner1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4274 bytes in 0.045 second response time [07:30:41] PROBLEM - cp8 Stunnel Http for mw7 on cp8 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:30:41] PROBLEM - mon1 HTTPS on mon1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.005 second response time [07:30:46] PROBLEM - cp6 Stunnel Http for mw7 on cp6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:30:47] PROBLEM - gluster1 GlusterFS port 49152 on gluster1 is CRITICAL: connect to address 51.77.107.209 and port 49152: Connection refused [07:30:57] PROBLEM - ping4 on mw7 is CRITICAL: PING CRITICAL - Packet loss = 100% [07:31:06] PROBLEM - cp3 Stunnel Http for mw7 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:31:17] PROBLEM - cp8 Stunnel Http for mw4 on cp8 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:31:38] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [07:31:48] PROBLEM - www.thesimswiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:32:11] PROBLEM - cp3 Stunnel Http for mw5 on cp3 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. 
[07:32:12] PROBLEM - cp3 Stunnel Http for mon1 on cp3 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.708 second response time [07:32:49] PROBLEM - cp8 HTTPS on cp8 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4222 bytes in 0.325 second response time [07:33:01] PROBLEM - wiki.thesimswiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:11] PROBLEM - cp6 Stunnel Http for mw6 on cp6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [07:33:16] PROBLEM - wiki.conworlds.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:26] PROBLEM - thesimswiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:34:39] PROBLEM - Host mw7 is DOWN: PING CRITICAL - Packet loss = 100% [07:35:05] PROBLEM - bacula2 Bacula Databases db7 on bacula2 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. [07:38:37] Reception123: now would be a great time to wake up [07:41:21] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 65% [07:42:04] PROBLEM - phab1 Puppet on phab1 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 44 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[phd] [07:42:08] RhinosF1: looks like an OVH problem [07:42:19] Can't SSH to db7 [07:42:31] Reception123: worked that out 32 mins ago [07:43:02] RhinosF1: yeah, so I can't do anything about it [07:43:04] PROBLEM - phab1 phd on phab1 is CRITICAL: PROCS CRITICAL: 0 processes with args 'phd' [07:43:10] It's up to them to fix it [07:43:24] Reception123: be here until we recover, send alerts [07:43:35] Yeah that's all I can [07:43:42] Though I hope they do recover soon [07:43:52] * RhinosF1 is watching the status page [07:43:58] PM’d you a lot [07:44:19] Borschts-zhwiki: hi [07:44:44] Hi :) [07:45:04] Borschts-zhwiki: we are aware of the outage, host is down [07:45:12] Reception123: update topic pls [07:46:42] Reception123: I’m here virtually all day [07:47:09] Also just to note afaik SRE Duty is only for Phabricator and doesn't imply that the user is available all week to deal with every issue [07:47:18] they're just in charge of making sure Phab triage is done properly [07:47:24] When I do go for lunch break, hopefully we’ll be back [07:47:34] hopefully before [07:48:01] I’ll be scared if we’re not but paladox and john should be back by then [07:48:09] I would email but mail is likely dead [07:49:14] * RhinosF1 will try anyway [08:12:10] you can send it but it'll sit at some other mailserver until misc1 comes back up [08:12:23] wait no i lied mail looks like its working? [08:12:41] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 85% [08:13:44] PuppyKun: We think there might be just some stuff on the affected racks [08:13:52] Not sure how to confirm though [08:15:30] Linux misc1.miraheze.org Thu Jun 27 15:10:55 MSK 2019 x86_64 [08:15:33] welp, misc1 is def up [08:15:51] PuppyKun: Can you remember which cloud server hosts which? 
[08:16:31] erm [08:16:35] *visible confusion* [08:16:48] ssh ndkilla@mw1.miraheze.org successfully logs me into cp8 [08:16:49] wutface [08:16:53] PuppyKun: It’s stored on phab and wiki somewhere [08:17:22] Unless [08:17:25] I have idea [08:17:30] Let me get laptop up [08:19:27] RECOVERY - bacula2 Bacula Databases db7 on bacula2 is OK: OK: Diff, 3407 files, 84.88GB, 2020-04-19 04:09:00 (2.2 days ago) [08:19:50] RECOVERY - Host cp7 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [08:19:53] PROBLEM - cp7 Stunnel Http for mon1 on cp7 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 295 bytes in 0.004 second response time [08:19:53] PROBLEM - cp7 Varnish Backends on cp7 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [08:19:54] PROBLEM - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:54] PROBLEM - cp7 HTTPS on cp7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:19:54] PROBLEM - cp7 Current Load on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:54] PROBLEM - ping4 on cp7 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:19:54] PROBLEM - cp7 Disk Space on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:55] PROBLEM - cp7 Stunnel Http for mw4 on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:55] PROBLEM - cp7 Puppet on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:56] PROBLEM - cp7 Stunnel Http for mw5 on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:56] PROBLEM - cp7 SSH on cp7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:19:57] PROBLEM - cp7 Stunnel Http for mw6 on cp7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:19:57] RECOVERY - mon1 icinga.miraheze.org HTTPS on mon1 is OK: HTTP OK: HTTP/1.1 302 Found - 335 bytes in 0.007 second response time [08:19:57] OwO [08:20:20] That's good [08:20:24] RECOVERY - publictestwiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'publictestwiki.com' will expire on Mon 01 Jun 2020 16:16:12 GMT +0000. [08:20:33] RECOVERY - Host db7 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [08:20:34] PROBLEM - ping4 on db7 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:20:34] PROBLEM - db7 Current Load on db7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:20:34] PROBLEM - db7 MySQL on db7 is CRITICAL: Can't connect to MySQL server on '51.89.160.143' (115) [08:20:44] !restarted db7/mw* [08:20:54] RhinosF1: for some reason it seems that many servers were marked as "stop" [08:20:55] RECOVERY - Host db6 is UP: PING WARNING - Packet loss = 81%, RTA = 0.46 ms [08:20:58] PROBLEM - db6 Current Load on db6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:20:58] PROBLEM - db7 MySQL on db7 is CRITICAL: Can't connect to MySQL server on '51.89.160.143' (115) [08:20:58] PROBLEM - db6 MySQL on db6 is CRITICAL: Can't connect to MySQL server on '51.89.160.130' (115) [08:20:59] PROBLEM - db6 Disk Space on db6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:20:59] PROBLEM - db6 Puppet on db6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. 
[08:20:59] PROBLEM - ping4 on db6 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:21:00] RECOVERY - Host mw5 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [08:21:01] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 42% [08:21:03] RECOVERY - test2 MediaWiki Rendering on test2 is OK: HTTP OK: HTTP/1.1 200 OK - 19331 bytes in 0.468 second response time [08:21:04] RECOVERY - cp7 Current Load on cp7 is OK: OK - load average: 0.66, 0.48, 0.19 [08:21:04] RECOVERY - cp7 Disk Space on cp7 is OK: DISK OK - free space: / 11528 MB (43% inode=95%); [08:21:04] PROBLEM - mw5 php-fpm on mw5 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:21:04] PROBLEM - mw5 Current Load on mw5 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:21:05] !log restarted db7/mw* [08:21:11] I guess after whatever issue they had the servers likely shut down and never restarted? [08:21:13] RECOVERY - jobrunner1 MediaWiki Rendering on jobrunner1 is OK: HTTP OK: HTTP/1.1 200 OK - 19319 bytes in 0.122 second response time [08:21:13] RECOVERY - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is OK: OK - NGINX Error Rate is 2% [08:21:15] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 29% [08:21:15] RECOVERY - nonciclopedia.org - LetsEncrypt on sslhost is OK: OK - Certificate 'nonciclopedia.org' will expire on Tue 30 Jun 2020 02:32:05 GMT +0000. [08:21:17] RECOVERY - cp7 Stunnel Http for mw4 on cp7 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.139 second response time [08:21:21] RECOVERY - Host mw6 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [08:21:23] RECOVERY - mw6 Disk Space on mw6 is OK: DISK OK - free space: / 7528 MB (39% inode=72%); [08:21:23] RECOVERY - mw6 php-fpm on mw6 is OK: PROCS OK: 19 processes with command name 'php-fpm7.3' [08:21:24] RECOVERY - cp7 HTTPS on cp7 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1609 bytes in 0.011 second response time [08:21:24] PROBLEM - mw6 MediaWiki Rendering on mw6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4278 bytes in 0.043 second response time [08:21:24] PROBLEM - mw6 Current Load on mw6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:21:24] PROBLEM - mw6 Puppet on mw6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. 
[08:21:25] PuppyKun: yeah [08:21:26] RECOVERY - cp3 Stunnel Http for mw4 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 1.015 second response time [08:21:27] RECOVERY - cp8 Stunnel Http for mw7 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.307 second response time [08:21:28] RECOVERY - cp6 Stunnel Http for mw7 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.003 second response time [08:21:28] that's my guess too [08:21:29] PuppyKun: no update from OVH [08:21:31] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [08:21:35] RhinosF1: well at least we're back online now [08:21:38] RECOVERY - cp7 SSH on cp7 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:21:40] RECOVERY - Host mw4 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [08:21:42] Reception123: that's good [08:21:43] RECOVERY - mw4 HTTPS on mw4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.247 second response time [08:21:43] RECOVERY - cp8 Stunnel Http for mw4 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.306 second response time [08:21:44] PROBLEM - ping4 on mw4 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:21:44] PROBLEM - mw4 Disk Space on mw4 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:21:44] PROBLEM - mw4 SSH on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:44] PROBLEM - mw4 MediaWiki Rendering on mw4 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4276 bytes in 0.046 second response time [08:21:44] didn't think it'd be as easy as restarting the servers [08:21:45] RECOVERY - cp3 Stunnel Http for mw7 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 1.003 second response time [08:21:45] RECOVERY - db7 Current Load on db7 is OK: OK - load average: 1.28, 0.75, 0.31 [08:21:45] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online [08:21:46] RECOVERY - mw5 php-fpm on mw5 is OK: PROCS OK: 19 processes with command name 'php-fpm7.3' [08:21:46] RECOVERY - cp7 Puppet on cp7 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [08:21:50] icinga-miraheze: seems really happy too [08:21:53] PROBLEM - ping4 on cp7 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:21:54] RECOVERY - www.thesimswiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'www.thesimswiki.com' will expire on Mon 01 Jun 2020 16:23:50 GMT +0000. [08:21:55] RECOVERY - ping4 on cp7 is OK: PING OK - Packet loss = 0%, RTA = 0.23 ms [08:21:58] RECOVERY - mw4 MediaWiki Rendering on mw4 is OK: HTTP OK: HTTP/1.1 200 OK - 19321 bytes in 0.045 second response time [08:22:01] RECOVERY - cp7 Stunnel Http for mw6 on cp7 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.003 second response time [08:22:02] RECOVERY - ping4 on mw4 is OK: PING OK - Packet loss = 0%, RTA = 0.34 ms [08:22:04] RECOVERY - cp3 Stunnel Http for mw5 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.953 second response time [08:22:04] RECOVERY - cp8 HTTPS on cp8 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1611 bytes in 0.391 second response time [08:22:05] RECOVERY - mw4 Disk Space on mw4 is OK: DISK OK - free space: / 7234 MB (38% inode=73%); [08:22:08] PROBLEM - db6 Puppet on db6 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [08:22:12] Reception123: Which ones? 
[08:22:15] PROBLEM - db6 Puppet on db6 is WARNING: WARNING: Puppet last ran 1 hour ago [08:22:16] RECOVERY - db6 Disk Space on db6 is OK: DISK OK - free space: / 175091 MB (52% inode=99%); [08:22:16] RECOVERY - mw6 Puppet on mw6 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:22:17] RECOVERY - mw5 Current Load on mw5 is OK: OK - load average: 3.19, 1.17, 0.43 [08:22:18] RECOVERY - wiki.thesimswiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'www.thesimswiki.com' will expire on Mon 01 Jun 2020 16:23:50 GMT +0000. [08:22:19] RECOVERY - mw6 MediaWiki Rendering on mw6 is OK: HTTP OK: HTTP/1.1 200 OK - 19321 bytes in 0.046 second response time [08:22:23] RECOVERY - cp6 Stunnel Http for mw6 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.005 second response time [08:22:28] PROBLEM - ping4 on db7 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:22:31] RECOVERY - ping4 on db7 is OK: PING OK - Packet loss = 0%, RTA = 0.18 ms [08:22:36] RECOVERY - wiki.conworlds.org - LetsEncrypt on sslhost is OK: OK - Certificate 'wiki.conworlds.org' will expire on Wed 13 May 2020 18:33:36 GMT +0000. [08:22:36] RECOVERY - thesimswiki.com - LetsEncrypt on sslhost is OK: OK - Certificate 'www.thesimswiki.com' will expire on Mon 01 Jun 2020 16:23:50 GMT +0000. [08:22:38] PROBLEM - ping4 on db6 is CRITICAL: PING CRITICAL - Packet loss = 100% [08:22:41] erm [08:22:41] RECOVERY - cp7 Stunnel Http for mw5 on cp7 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.003 second response time [08:22:41] RECOVERY - mw4 SSH on mw4 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:22:42] RECOVERY - ping4 on db6 is OK: PING OK - Packet loss = 0%, RTA = 0.25 ms [08:22:46] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is WARNING: WARNING - NGINX Error Rate is 59% [08:22:46] PROBLEM - cp7 Stunnel Http for test2 on cp7 is WARNING: HTTP WARNING: HTTP/1.1 404 Not Found - 322 bytes in 0.011 second response time [08:22:46] RECOVERY - mw6 Current Load on mw6 is OK: OK - load average: 1.42, 1.33, 0.58 [08:22:52] PROBLEM - mw5 Puppet on mw5 is WARNING: WARNING: Puppet last ran 1 hour ago [08:22:53] im slightly confused by the fact that ping on several hosts is flapping a lot [08:22:53] RECOVERY - cp8 Stunnel Http for mw5 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.305 second response time [08:22:54] RECOVERY - cp8 Stunnel Http for mw6 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.313 second response time [08:22:54] RECOVERY - db7 SSH on db7 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:22:54] RECOVERY - db6 SSH on db6 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:22:55] RECOVERY - cp6 HTTPS on cp6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1610 bytes in 0.026 second response time [08:22:55] RECOVERY - cp6 Stunnel Http for mw5 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.005 second response time [08:22:56] RECOVERY - db7 Puppet on db7 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:22:56] RECOVERY - mw4 Current Load on mw4 is OK: OK - load average: 2.70, 1.35, 0.53 [08:22:57] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [08:22:57] RECOVERY - mw5 HTTPS on mw5 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.004 second response time [08:22:58] RECOVERY - db7 Disk Space on db7 is OK: DISK OK - free space: / 337557 MB (54% inode=99%); [08:22:58] RECOVERY - bebaskanpengetahuan.id - LetsEncrypt on sslhost is 
OK: OK - Certificate 'bebaskanpengetahuan.id' will expire on Thu 11 Jun 2020 21:05:17 GMT +0000. [08:22:59] RECOVERY - Host mw7 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [08:23:03] RECOVERY - mw7 HTTPS on mw7 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.009 second response time [08:23:03] RECOVERY - mw7 SSH on mw7 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:23:05] RECOVERY - cp3 Stunnel Http for mw6 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.941 second response time [08:23:05] RECOVERY - cp7 Varnish Backends on cp7 is OK: All 7 backends are healthy [08:23:08] RECOVERY - ping4 on mw6 is OK: PING OK - Packet loss = 0%, RTA = 0.20 ms [08:23:09] PuppyKun: It looks live everything dead was cloud1 [08:23:11] PuppyKun: yeah I don't know many they're still having some issues [08:23:13] PROBLEM - mw4 Puppet on mw4 is WARNING: WARNING: Puppet last ran 1 hour ago [08:23:14] RECOVERY - mw5 Disk Space on mw5 is OK: DISK OK - free space: / 7236 MB (38% inode=72%); [08:23:14] RECOVERY - cp6 Varnish Backends on cp6 is OK: All 7 backends are healthy [08:23:18] RECOVERY - mw7 php-fpm on mw7 is OK: PROCS OK: 19 processes with command name 'php-fpm7.3' [08:23:18] RECOVERY - mw7 Disk Space on mw7 is OK: DISK OK - free space: / 7558 MB (39% inode=72%); [08:23:21] RhinosF1: not only, all mw servers seemed to be dead and db7 also [08:23:28] RECOVERY - mw6 SSH on mw6 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [08:23:28] RECOVERY - mw4 php-fpm on mw4 is OK: PROCS OK: 19 processes with command name 'php-fpm7.3' [08:23:29] RECOVERY - mw5 MediaWiki Rendering on mw5 is OK: HTTP OK: HTTP/1.1 200 OK - 19322 bytes in 0.107 second response time [08:23:29] RECOVERY - cp6 Stunnel Http for mw4 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 15316 bytes in 0.003 second response time [08:23:42] RECOVERY - ping4 on mw5 is OK: PING OK - Packet loss = 0%, RTA = 0.32 ms [08:23:42] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 7 backends are healthy [08:23:43] RECOVERY - mw7 Current Load on mw7 is OK: OK - load average: 0.94, 1.24, 0.59 [08:23:43] RECOVERY - db6 Current Load on db6 is OK: OK - load average: 1.12, 0.80, 0.34 [08:23:44] Reception123: Which rack does it say they're on? If it does [08:23:45] RECOVERY - db6 MySQL on db6 is OK: Uptime: 171 Threads: 8 Questions: 212 Slow queries: 6 Opens: 97 Flush tables: 1 Open tables: 91 Queries per second avg: 1.239 [08:23:49] !log restarted cp7 (same time as the others) [08:23:52] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [08:23:54] RhinosF1: I can't see where that info would be [08:23:59] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 20% [08:24:00] RECOVERY - mw7 Puppet on mw7 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:24:00] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 7 backends are healthy [08:24:03] RECOVERY - mw7 MediaWiki Rendering on mw7 is OK: HTTP OK: HTTP/1.1 200 OK - 19322 bytes in 0.046 second response time [08:24:17] im kind of assuming that each of the 2 cloud servers we have is some sort of rack. 
so likely everything on cloud1 is on the same rack but \o/ [08:24:23] RECOVERY - ping4 on mw7 is OK: PING OK - Packet loss = 0%, RTA = 0.14 ms [08:24:32] I only have access to something that allows me to manage servers that's it [08:25:11] PuppyKun: That's my thought but if both cloud1 and 2 had issues then we need to confirm that they both are affected [08:25:33] Reception123: E101E17 or E101E18 mentioned anywhere [08:26:13] nope, I only have access to the virtual cloud thing [08:26:17] I can't see other info [08:26:23] except directly relating to servers [08:27:03] hmm [08:27:34] Reception123: OVH are still showing as down so I wonder if it was them or that delayed us [08:28:01] RhinosF1: I don't know but it must be something from them our servers wouldn't just shut down like that [08:28:15] Reception123: I get that, we need to know why [08:32:58] PROBLEM - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is CRITICAL: CRITICAL - NGINX Error Rate is 79% [08:33:00] RECOVERY - db6 Puppet on db6 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:33:08] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is CRITICAL: CRITICAL - NGINX Error Rate is 73% [08:33:31] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTMV [08:33:32] [02miraheze/puppet] 07paladox 03343cb7b - db7: Lower mariadb::config::innodb_buffer_pool_instances to 1 [08:33:35] RECOVERY - mw5 Puppet on mw5 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:02] RECOVERY - mw4 Puppet on mw4 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:40:52] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is WARNING: WARNING - NGINX Error Rate is 59% [08:43:10] !log fallover db7 to db6 [08:43:13] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [08:44:35] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTDv [08:44:37] [02miraheze/puppet] 07paladox 03d9c3285 - Switch db6 to being the db master [08:46:30] [02puppet] 07paladox closed pull request 03#1327: Switch over to db6 from db7 - 13https://git.io/JvhY5 [08:46:31] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±11] 13https://git.io/JfTDL [08:46:33] [02miraheze/puppet] 07paladox 038baf6cc - Switch over to db6 from db7 (#1327) [08:46:34] [02puppet] 07paladox deleted branch 03paladox-patch-2 - 13https://git.io/vbiAS [08:46:36] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-2 [08:46:50] [02mw-config] 07paladox closed pull request 03#2991: Switch to db6 - 13https://git.io/JvhOO [08:46:51] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTDt [08:46:53] [02miraheze/mw-config] 07paladox 03842d27f - Switch to db6 (#2991) [08:46:54] [02mw-config] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbvb3 [08:46:56] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-1 [08:49:41] RECOVERY - cp3 Stunnel Http for mon1 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 4.503 second response time [08:49:52] RECOVERY - cp7 Stunnel Http for mon1 on cp7 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 0.011 second response time [08:49:53] RECOVERY - mon1 grafana.miraheze.org HTTPS on mon1 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 0.016 second response time [08:50:45] RECOVERY - phab1 phabricator.miraheze.org HTTPS on phab1 is OK: HTTP OK: HTTP/1.1 200 OK - 19068 bytes in 0.238 second response time [08:51:05] RECOVERY 
- cp8 Stunnel Http for mon1 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 0.336 second response time [08:51:08] PROBLEM - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is WARNING: WARNING - NGINX Error Rate is 53% [08:51:23] RECOVERY - mon1 HTTPS on mon1 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 0.053 second response time [08:51:37] RECOVERY - gluster1 GlusterFS port 49152 on gluster1 is OK: TCP OK - 0.000 second response time on 51.77.107.209 port 49152 [08:51:37] RECOVERY - cp6 Stunnel Http for mon1 on cp6 is OK: HTTP OK: HTTP/1.1 200 OK - 30936 bytes in 0.016 second response time [08:51:52] PROBLEM - bacula2 Bacula Private Git on bacula2 is UNKNOWN: NRPE: Unable to read output [08:51:53] RECOVERY - phab1 Puppet on phab1 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:52:05] RECOVERY - phab1 phd on phab1 is OK: PROCS OK: 1 process with args 'phd' [08:52:11] RECOVERY - phab1 phab.miraheze.wiki HTTPS on phab1 is OK: HTTP OK: Status line output matched "HTTP/1.1 200" - 17719 bytes in 0.160 second response time [08:53:11] PROBLEM - bacula2 Bacula Databases db6 on bacula2 is UNKNOWN: NRPE: Unable to read output [08:53:22] RECOVERY - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is OK: OK - NGINX Error Rate is 9% [08:54:31] RECOVERY - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is OK: OK - NGINX Error Rate is 5% [08:54:39] PROBLEM - bacula2 Bacula Databases dbt1 on bacula2 is UNKNOWN: NRPE: Unable to read output [08:55:27] PROBLEM - bacula2 Bacula Static on bacula2 is UNKNOWN: NRPE: Unable to read output [08:56:12] PROBLEM - db6 Current Load on db6 is WARNING: WARNING - load average: 7.02, 7.55, 4.31 [08:56:35] PROBLEM - bacula2 Puppet on bacula2 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[bacula-director] [08:56:43] PROBLEM - bacula2 Bacula Daemon on bacula2 is CRITICAL: PROCS CRITICAL: 1 process with UID = 112 (bacula) [08:57:32] PROBLEM - bacula2 Bacula Phabricator Static on bacula2 is UNKNOWN: NRPE: Unable to read output [08:58:07] paladox: what's happened? [08:58:38] mysql wouldn't start on db7 due to a mariadb bug that we worked around on db6, so we've fallen over to db6 [08:58:51] paladox: is that what caused earlier? [08:59:02] no [08:59:21] paladox: is that the outstanding ovh issue on E101E17 & E101E18 or? [08:59:37] i have no idea, i only had time to fall over to db6 [08:59:39] RECOVERY - db6 Current Load on db6 is OK: OK - load average: 3.46, 6.16, 4.45 [09:00:21] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTDg [09:00:23] [02miraheze/puppet] 07paladox 032d94491 - Update bacula-dir.conf [09:00:30] paladox: do we know what racks they're on? We could do eventually with an idea why servers randomly shut down. [09:03:21] PROBLEM - bacula2 Bacula Databases db6 on bacula2 is CRITICAL: CRITICAL: no terminated jobs [09:03:28] cloud1 & 2 are in rack E101E17 [09:04:11] RECOVERY - bacula2 Bacula Phabricator Static on bacula2 is OK: OK: Diff, 5376 files, 9.432MB, 2020-04-19 18:36:00 (1.6 days ago) [09:04:31] paladox: so the random shutdown likely was http://travaux.ovh.net/?do=details&id=44159 [09:04:32] [ OVH Tasks ] - travaux.ovh.net [09:04:43] RECOVERY - bacula2 Bacula Databases dbt1 on bacula2 is OK: OK: Diff, 65403 files, 19.89GB, 2020-04-19 05:30:00 (2.1 days ago) [09:04:46] We'll have to hope we stay up as they're still having issues [09:04:49] ok [09:05:01] still having issues? 
[09:05:05] RECOVERY - bacula2 Bacula Static on bacula2 is OK: OK: Diff, 150413 files, 9.520GB, 2020-04-19 18:33:00 (1.6 days ago) [09:05:07] RECOVERY - bacula2 Bacula Private Git on bacula2 is OK: OK: Full, 4097 files, 11.90MB, 2020-04-19 18:37:00 (1.6 days ago) [09:05:35] paladox: read the task page, there's been no update since the incident was created [09:05:43] ok [09:05:49] RECOVERY - bacula2 Puppet on bacula2 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:05:52] RECOVERY - bacula2 Bacula Daemon on bacula2 is OK: PROCS OK: 2 processes with UID = 112 (bacula) [09:06:49] paladox: thx for sorting db7 [09:07:08] Reception123, PuppyKun: see what paladox just said. [09:29:35] MH-Discord: test [09:29:50] works [09:59:40] [02miraheze/IncidentReporting] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JfT9l [09:59:41] [02miraheze/IncidentReporting] 07paladox 03f5bf4e7 - Fix bug where ->i_responders could be empty [09:59:43] [02IncidentReporting] 07paladox created branch 03paladox-patch-2 - 13https://git.io/fh5YJ [09:59:44] [02IncidentReporting] 07paladox opened pull request 03#12: Fix bug where ->i_responders could be empty - 13https://git.io/JfT98 [10:00:45] * RhinosF1 sees no update from OVH [10:01:55] miraheze/IncidentReporting/paladox-patch-2/f5bf4e7 - paladox The build passed. https://travis-ci.com/miraheze/IncidentReporting/builds/161214308 [11:23:13] paladox, Reception123: 12:16 UK TIME / 11:16 UTC - OVH INCIDENT RESOLVED [11:24:32] Reception123: PM me your updated IR draft [11:54:00] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 1.32, 1.79, 1.10 [11:56:47] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 2.23, 2.09, 1.34 [11:59:31] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 0.96, 1.55, 1.25 [12:59:48] Hmm... can someone take a look at this please https://meta.miraheze.org/m/4Gw [12:59:50] [ Difference between revisions of "Stewards' noticeboard" - Miraheze Meta ] - meta.miraheze.org [13:00:02] I'm not exactly sure what they are asking [13:04:25] AmandaCath: think they want crat rights so needs a Steward [13:05:25] Reception123: well, as I said on their misplaced Phab task, if they are not the creator of the wiki they should ask the creator/current crat(s) for a promotion [13:17:48] AmandaCath: https://meta.miraheze.org/w/index.php?title=User_talk:RhinosF1&diff=103605&oldid=103603 [13:17:49] [ Difference between revisions of "User talk:RhinosF1" - Miraheze Meta ] - meta.miraheze.org [13:22:25] AmandaCath: also re https://phabricator.miraheze.org/T5445#106213, Zppix was right to remove asignee as you didn’t work on it. Zppix is also on SRE Duty and in charge on phabricator triage for the week. 
[13:22:26] [ ⚓ T5445 Installing a Mandatory Skin ] - phabricator.miraheze.org [13:23:57] Well, I did work on it [13:24:10] I gave the user the help they needed [13:25:25] AmandaCath: For the purpose of assignee, you were one of 3 and no one worked on the issue as it wasn’t one [13:26:21] Meh, whatever [13:28:39] Mainly, don’t revert sre as well without discussion [13:28:43] Reception123: https://phabricator.miraheze.org/T5453 [13:28:44] [ ⚓ T5453 Alert when icinga-miraheze disconnects from IRC ] - phabricator.miraheze.org [13:31:53] And https://phabricator.miraheze.org/T5454 [13:31:54] [ ⚓ T5454 Reduce number of alerts in an outage ] - phabricator.miraheze.org [13:43:21] Reception123, paladox, PuppyKun: you’ve got mail [13:45:04] ok, will look [13:46:41] it’s good mail [13:54:03] hey everyone [13:54:27] hi Examknow [13:55:29] how is Reception123 this morning? [13:55:50] Examknow: well it's the afternoon for me ;) but I'm fine. You? [13:56:19] Reception123: So far very well. I assume you are UK time as well? [13:56:26] .t Reception123 [13:56:26] Could not find a timezone for this nick. Reception123 can set a timezone with `.settz ` [13:56:43] currently GMT + 2 (so one hour after UK time) [13:57:25] .t Reception123 [13:57:25] 2020-04-21 - 15:57:25CEST [13:57:27] there you go :) [13:58:02] JohnLewis I edit conflicted with you on T5455 :P [13:58:13] Oh, he's not even here [13:58:46] Reception123: :) [13:59:10] AmandaCath: I take it you are back from a wikibreak? [13:59:25] Yes, hopefully [13:59:30] nice [13:59:34] Still uncertain though [14:10:13] hi JohnLewis [14:11:58] hi [14:15:11] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfTjx [14:15:13] [02miraheze/services] 07MirahezeSSLBot 03e63917a - BOT: Updating services config for wikis [14:16:25] JohnLewis: how you doing? [14:19:18] JohnLewis: to reply to your last comment, true but this takes me probably 15 mins to do if Examknow doesn’t beat me to it and is free. [14:20:52] But it really doesn’t negate that we need the monitoring in icinga [14:21:48] No, but it tells us that there's an issue rather than waiting for users to notice [14:25:36] JohnLewis: I could add complexity to it to check if things are visibly up. That shouldn’t take much longer. [14:28:39] You can do whatever with your own software components, but the issue you described has identified a shortcoming in icinga. Namely, we don't monitor what you identified [14:34:40] I’ll build that monitoring :) [14:34:50] Apologies, laggy phone [14:57:32] AmandaCath: there's no reason to change priority on a closed task [14:59:36] !log reinstall db7 [14:59:40] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [15:05:25] PROBLEM - db7 Current Load on db7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds. [15:05:41] PROBLEM - db7 Puppet on db7 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
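db7 is being reinstalled here and, per the dbreplication.pp commits that follow, re-pointed as a replica of the new master db6. A minimal sketch of what that re-pointing step typically looks like; the driver (pymysql), hostnames, credentials, and binlog coordinates are all assumed placeholders, not Miraheze's actual tooling or values:

```python
# Hedged sketch: point a freshly reinstalled MariaDB server at the new master.
import pymysql  # assumed client library; any MySQL/MariaDB driver works

# These coordinates would come from `SHOW MASTER STATUS` on db6; placeholders here.
MASTER_LOG_FILE = "db6-bin.000001"
MASTER_LOG_POS = 4

conn = pymysql.connect(host="db7.miraheze.org", user="root", password="...")
try:
    with conn.cursor() as cur:
        cur.execute(
            "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, MASTER_PASSWORD=%s, "
            "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
            ("db6.miraheze.org", "repl", "secret", MASTER_LOG_FILE, MASTER_LOG_POS),
        )
        cur.execute("START SLAVE")
        cur.execute("SHOW SLAVE STATUS")
        print(cur.fetchone())  # replication IO/SQL threads should report running
finally:
    conn.close()
```

On the real servers this is driven through Puppet (the dbreplication.pp changes below) rather than an ad-hoc script.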
[15:06:47] PROBLEM - Host db7 is DOWN: PING CRITICAL - Packet loss = 100% [15:13:49] RECOVERY - Host db7 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [15:13:51] PROBLEM - db7 SSH on db7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:51] PROBLEM - db7 Disk Space on db7 is CRITICAL: connect to address 51.89.160.143 port 5666: Connection refusedconnect to host 51.89.160.143 port 5666: Connection refused [15:13:51] PROBLEM - ping4 on db7 is CRITICAL: PING CRITICAL - Packet loss = 100% [15:13:52] PROBLEM - db7 Disk Space on db7 is CRITICAL: connect to address 51.89.160.143 port 5666: Connection refusedconnect to host 51.89.160.143 port 5666: Connection refused [15:13:57] PROBLEM - ping4 on db7 is CRITICAL: PING CRITICAL - Packet loss = 100% [15:14:01] RECOVERY - ping4 on db7 is OK: PING OK - Packet loss = 0%, RTA = 0.87 ms [15:14:17] PROBLEM - db7 SSH on db7 is CRITICAL: connect to address 51.89.160.143 and port 22: Connection refused [15:16:13] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfkfV [15:16:14] [02miraheze/puppet] 07paladox 035191d68 - Update db7.yaml [15:16:53] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jfkfw [15:16:55] [02miraheze/puppet] 07paladox 0322305d7 - db: Make db7 replicatator [15:17:39] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jfkf6 [15:17:41] [02miraheze/puppet] 07paladox 03a9c5acb - Update dbreplication.pp [15:19:52] RECOVERY - db7 SSH on db7 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) [15:24:42] RECOVERY - db7 Disk Space on db7 is OK: DISK OK - free space: / 331932 MB (99% inode=99%); [15:26:44] PROBLEM - db7 Puppet on db7 is UNKNOWN: UNKNOWN: Failed to check. 
Reason is: no_summary_file [15:26:55] RECOVERY - db7 Current Load on db7 is OK: OK - load average: 0.38, 0.39, 0.18 [15:29:40] RECOVERY - db7 Puppet on db7 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:39:24] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 1.82, 2.05, 1.67 [15:42:28] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 1.42, 1.66, 1.57 [15:50:06] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2-1 [+0/-0/±1] 13https://git.io/JfkJa [15:50:07] [02miraheze/puppet] 07paladox 036360f37 - Update dbreplication.pp [15:50:09] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JfkJw [15:50:10] [02miraheze/puppet] 07paladox 03d68d40c - db: update dbcopy ssh key [15:50:12] uh [15:50:22] that better not saved [15:52:07] paladox: huh [15:59:24] [02puppet] 07paladox created branch 03paladox-patch-2 - 13https://git.io/vbiAS [16:01:03] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-3-1 [+0/-0/±1] 13https://git.io/JfkJS [16:01:04] [02miraheze/puppet] 07paladox 033e96363 - Update dbreplication.pp [16:02:12] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-3-2 [+0/-0/±1] 13https://git.io/JfkJQ [16:02:14] [02miraheze/puppet] 07paladox 03c3cf517 - Update dbreplication.pp [16:15:23] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-3-1 [16:18:04] [02puppet] 07paladox created branch 03paladox-patch-3-1 - 13https://git.io/vbiAS [16:18:43] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-3-2 [16:18:57] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-2-1 [16:19:12] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/JfkUJ [16:19:13] [02miraheze/puppet] 07paladox 0393addb2 - db: Update dbcopy ssh key [16:19:15] JohnLewis: monitoring via sigmabot should shortly now ping !staff when icinga-miraheze quits or leaves. It will give a short status check and a full one can be done by SRE, Examknow or me by running !status [16:20:03] [02puppet] 07paladox created branch 03paladox-patch-3-2 - 13https://git.io/vbiAS [16:20:05] !status [16:20:08] Miraheze Service Status: [16:20:09] https://meta.miraheze.org is 03UP03 [16:20:10] https://icinga.miraheze.org is 03UP03 [16:20:12] https://phabricator.miraheze.org is 03UP03 [16:20:14] https://grafana.miraheze.org is 03UP03 [16:20:16] https://publictestwiki.com is 03UP03 [16:20:16] Thats spammy [16:20:18] https://miraheze.org is 03UP03 [16:20:21] RhinosF1: Status report finished. There are currently 0 dead services and 6 alive services. [16:20:30] lets not [16:20:32] and say we did [16:20:45] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-3 [16:20:45] a simple we're up sufficies [16:20:48] Zppix: it’s to be used rarely. Making sure it was on [16:21:00] Zppix: It should be for emergencies really [16:21:04] It checks multiple services as it varies. [16:21:59] someone take us down so we can test if it works [16:22:15] Heh [16:22:32] Should be deployed now [16:22:52] Reception123: The bot tries to visit the webpages and if it gets a normal page it passes. If it gets any other page, it fails. 
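A rough sketch of the check Examknow describes above, assuming the requests library: fetch each service page and count it as alive only on a normal response. The URL list mirrors the earlier !status output; the real SigmaBot code may differ:

```python
# Rough sketch of the !status check: a page that answers normally passes,
# anything else (error page, timeout, connection failure) counts as dead.
import requests  # assumed HTTP client

SERVICES = [
    "https://meta.miraheze.org",
    "https://icinga.miraheze.org",
    "https://phabricator.miraheze.org",
    "https://grafana.miraheze.org",
    "https://publictestwiki.com",
    "https://miraheze.org",
]

def is_alive(url: str, timeout: int = 10) -> bool:
    try:
        return requests.get(url, timeout=timeout).ok  # status < 400 passes
    except requests.RequestException:
        return False

statuses = {url: is_alive(url) for url in SERVICES}
dead = sum(not alive for alive in statuses.values())
print(f"{dead} dead services and {len(statuses) - dead} alive services")
```

A 502 from nginx or a 503 "Backend fetch failed" page fails the `.ok` test, which is how an outage like the one above shows up as dead services in the report.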
[16:24:08] brb [16:24:48] Docs for it: https://phabricator.miraheze.org/T5453#106310 [16:24:49] [ ⚓ T5453 Alert when icinga-miraheze disconnects from IRC ] - phabricator.miraheze.org [16:25:29] [02puppet] 07paladox created branch 03paladox-patch-3 - 13https://git.io/vbiAS [16:25:51] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-7 [+0/-0/±1] 13https://git.io/JfkUn [16:25:53] [02miraheze/puppet] 07paladox 0335942c2 - Update db.pp [16:26:01] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-7-1 [+0/-0/±1] 13https://git.io/JfkUC [16:26:03] [02miraheze/puppet] 07paladox 03a4fa6c1 - Update db.pp [16:26:09] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-2 [16:26:11] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-3 [16:26:18] [02puppet] 07paladox deleted branch 03paladox-patch-3-1 - 13https://git.io/vbiAS [16:26:43] [02puppet] 07paladox deleted branch 03paladox-patch-3-2 - 13https://git.io/vbiAS [16:26:50] [02puppet] 07paladox deleted branch 03paladox-patch-3 - 13https://git.io/vbiAS [16:26:54] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-7-1 [16:26:57] [02puppet] 07paladox deleted branch 03paladox-patch-2-1 - 13https://git.io/vbiAS [16:27:27] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-7 [16:28:45] [02puppet] 07paladox opened pull request 03#1342: db: Update dbcopy ssh key - 13https://git.io/JfkUg [16:28:50] [02puppet] 07paladox edited pull request 03#1342: db: Update dbcopy ssh key - 13https://git.io/JfkUg [16:28:59] [02puppet] 07paladox created branch 03paladox-patch-7 - 13https://git.io/vbiAS [16:29:06] [02puppet] 07paladox closed pull request 03#1342: db: Update dbcopy ssh key - 13https://git.io/JfkUg [16:29:09] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JfkUV [16:29:11] [02miraheze/puppet] 07paladox 034977e4f - Update db.pp [16:29:12] [02puppet] 07paladox deleted branch 03paladox-patch-2 - 13https://git.io/vbiAS [16:29:15] [02puppet] 07paladox created branch 03paladox-patch-7-1 - 13https://git.io/vbiAS [16:29:19] [02puppet] 07paladox deleted branch 03paladox-patch-3 - 13https://git.io/vbiAS [16:29:52] [02puppet] 07paladox deleted branch 03paladox-patch-7-1 - 13https://git.io/vbiAS [16:29:54] [02puppet] 07paladox opened pull request 03#1343: db: Update dbcopy ssh key - 13https://git.io/JfkU6 [16:29:58] [02puppet] 07paladox deleted branch 03paladox-patch-3-3 - 13https://git.io/vbiAS [16:30:05] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JfkUP [16:30:07] [02miraheze/puppet] 07paladox 039ac1fd4 - db: Update dbcopy ssh key (#1343) [16:30:08] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JfkUX [16:30:10] [02miraheze/puppet] 07paladox 03331346c - Update dbreplication.pp [16:30:15] [02puppet] 07paladox closed pull request 03#1343: db: Update dbcopy ssh key - 13https://git.io/JfkU6 [16:30:26] [02puppet] 07paladox deleted branch 03paladox-patch-7 - 13https://git.io/vbiAS [16:30:30] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-2 [16:32:05] ty [16:36:34] * Examknow is back [16:37:04] paladox: big puppet change today? [16:37:17] what big puppet change? 
[16:37:50] paladox: it looked like you were merging a change [16:38:02] yes, but not big [16:38:13] ah [16:54:17] PROBLEM - mon1 IRCEcho on mon1 is CRITICAL: PROCS CRITICAL: 2 processes with args 'ircecho' [17:02:36] PROBLEM - cp6 Current Load on cp6 is CRITICAL: CRITICAL - load average: 2.94, 4.15, 2.89 [17:05:44] RECOVERY - cp6 Current Load on cp6 is OK: OK - load average: 0.80, 2.57, 2.49 [17:15:42] RECOVERY - mon1 IRCEcho on mon1 is OK: PROCS OK: 1 process with args '/usr/local/bin/ircecho' [17:26:19] PROBLEM - mon1 IRCEcho on mon1 is CRITICAL: PROCS CRITICAL: 2 processes with args 'ircecho' [17:30:18] paladox: ^ [17:30:34] yup, we're aware. [17:30:43] Alert to Miraheze Staff: It looks like the icinga-miraheze bot has stopped! Ping !staff. [17:30:44] https://meta.miraheze.org is 03UP03 [17:30:45] https://icinga.miraheze.org is 03UP03 [17:31:00] It works! [17:31:01] !log restart ircecho [17:31:05] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [17:31:34] thats ironic [17:31:37] * RhinosF1 smiles at Examknow even though he doesn’t want to hear that so soon [17:31:47] icinga-miraheze: comes back revi leaves [17:32:07] Zppix: Everyone hates that bot lol [17:32:44] why was ircecho restarted? [17:33:03] JohnLewis: Because the icinga bot stopped [17:33:46] I don't think so [17:33:52] I feel it left because of the restart [17:34:23] Examknow: can you change the !_staff ping to another one as it’s conflicting for revi. Maybe @sre or something [17:34:58] RhinosF1: !sre sound good? [17:35:08] I can change after lunch [17:35:12] Examknow: fine by me [17:35:35] RhinosF1: {{done}} [17:35:37] https://meta.miraheze.org/wiki/Template:done [17:35:40] I couldn't wait [17:35:53] revi: now !_sre [17:36:13] Awesome [17:36:27] np, it was hours old [17:38:06] RhinosF1: so !staff is no longer a ping but !sre is? [17:38:13] just so I know what I put as my stalkword [17:41:59] RECOVERY - mon1 IRCEcho on mon1 is OK: PROCS OK: 1 process with args '/usr/local/bin/ircecho' [17:42:34] Reception123: !_sre now [17:46:20] who set their nick to ping? [17:46:55] RhinosF1: heh, why the _? [17:47:11] Reception123: to stop the ping going off [17:47:12] so it won't ping everyone [17:47:16] Ignore the _ [17:47:29] oh, I thought you meant that was the ping and I was confused [17:47:38] well I doubt it would since you just announced the stalkword [17:47:57] time to add it and wait for it to be abused [17:48:35] Reception123: Just ban people that abuse it [17:48:46] * Reception123 looks at Zppix [17:50:13] Reception123: why zppix? [17:51:29] Btw, ping is a Freenode staffer [17:52:11] that is not a very good nick [17:52:17] no offense ping [17:54:35] JohnLewis correct, i restarted due to " PROBLEM - mon1 IRCEcho on mon1 is CRITICAL: PROCS CRITICAL: 2 processes with args 'ircecho'" [17:55:03] paladox: but that wasn't the problem :) [17:56:12] JohnLewis what was the problem? 
[17:56:28] icinga-miraheze> RECOVERY - mon1 IRCEcho on mon1 is OK: PROCS OK: 1 process with args '/usr/local/bin/ircecho' [17:57:04] oh [17:57:09] but then it did [18:26:20] PROBLEM - mon1 IRCEcho on mon1 is CRITICAL: PROCS CRITICAL: 2 processes with args 'ircecho' [17:57:15] Reception123: why you looking at me [17:57:17] what did i break [17:57:40] Zppix: Thinks you abuse pingwords [17:58:43] paladox: yes, that was the original issue [17:59:49] oh [17:59:53] Zppix: I was implying you'd be the one to abuse the stalkerword :P [18:01:26] Reception123: why would i need the sre stalkword considering i have access to the servers, and -staff :P [18:03:17] Zppix: someone needs to take the blame [18:03:32] Reception123: that's paladox's job [18:04:03] oh, my bad [18:04:11] X) [18:15:27] paladox: can be unquiet GH now? [18:15:28] *we [18:15:32] yup [18:15:58] [02mw-config] 07paladox closed pull request 03#3031: Add notice for db maintenance - 13https://git.io/JfktF [18:16:00] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfkqJ [18:16:01] [02miraheze/mw-config] 07paladox 0362771e6 - Add notice for db maintenance (#3031) [18:16:03] [02mw-config] 07paladox deleted branch 03paladox-patch-1 - 13https://git.io/vbvb3 [18:16:04] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-1 [18:16:27] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JfkqU [18:16:29] [02miraheze/puppet] 07paladox 037056a06 - dbt1: Lower buffer pool to 1 [19:08:17] paladox: who is charge of keeping community up to date for db maint [19:08:55] no one [19:09:19] can someone make sure they are kept informed [19:09:22] Zppix: ^ [19:10:01] they have, through the site notice [19:10:04] RhinosF1: if its not affecting them [19:10:05] Zppix: 2020-04-21 - 13:05:23CDT tell Zppix Please see your email, regarding the last message [19:10:27] RhinosF1: if its not affecting them theres not really a need [19:10:28] I’ll be around i suppose [19:10:46] RhinosF1: but im around [19:10:54] Great [19:28:48] Alert to Miraheze Staff: It looks like the MirahezeRC bot has stopped! Recent Changes are no longer avalible from IRC. [19:29:22] The misspelling of "available" grinds my gears [19:33:46] k6ka fixing [19:34:08] staff be advised that the bot just nipped out of -feed [19:39:11] Hello eth01! If you have any questions, feel free to ask and someone should answer soon. [19:44:01] Nice project [19:47:05] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 1.51, 2.19, 1.75 [19:47:11] Hi eth01 [19:59:13] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 0.70, 1.54, 1.83 [20:01:31] Hi Zppix [20:02:12] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 1.08, 1.28, 1.67 [21:06:28] !log stopping mysql on dbt1 [21:06:31] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master [21:06:57] 3:28:48 PM ⇐ +MirahezeRC quit (~MirahezeR@miraheze/bots) Excess Flood [21:06:58] lol [21:07:39] hey eth01 [21:08:06] hey :) [21:09:28] is there an email address I can email the project on generally, or the person in charge? [21:09:42] (I tried looking at your website - but it's down - says server issues) [21:10:08] !s [21:10:11] Please wait while I check the status of Miraheze Services. [21:10:12] RhinosF1: Status report finished. There are currently 1 dead services and 5 alive services. To view the full report, say !status. 
[21:10:17] :( [21:10:27] * RhinosF1 checks in another channel [21:11:03] paladox: ^ we getting issues [21:11:22] ?? [21:11:28] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [21:11:34] PROBLEM - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is WARNING: WARNING - NGINX Error Rate is 41% [21:11:34] PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [21:11:43] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [21:11:43] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 63% [21:11:47] PROBLEM - cp6 Varnish Backends on cp6 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [21:11:56] paladox: miraheze.org and wikis down [21:12:16] yes, but you knew about the maintenance [21:12:33] oh the db maintenance [21:12:53] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [21:12:55] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 51% [21:13:21] paladox: 10pm UTC, isn’t it 9pm UTC [21:13:23] PROBLEM - cp7 Varnish Backends on cp7 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [21:13:27] oh [21:13:31] .t UTC [21:13:31] 2020-04-21 - 21:13:31UTC [21:13:43] grrr, i mixed up the timezones [21:13:53] paladox: why are you stopping it now? [21:13:59] i'm not [21:14:00] PROBLEM - dbt1 MySQL on dbt1 is CRITICAL: Can't connect to MySQL server on '51.77.109.151' (115) [21:14:06] oh [21:14:11] JohnLewis i mixed up the timezones [21:14:15] so DBs are being moved again? [21:14:24] PROBLEM - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is CRITICAL: CRITICAL - NGINX Error Rate is 81% [21:14:30] dreamcast99 no, we're just restarting for a config change. [21:14:40] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 57% [21:14:44] so, is there an email I can email? [21:14:52] so that was the maintenance? paladox [21:15:03] eth01: it depends what you are after :) [21:15:16] dreamcast99 yes [21:15:22] JohnLewis i'll write up an incident report i guess afterwards [21:15:25] People who make decisions , JohnLewis [21:15:37] !log starting mysql [21:16:04] paladox: not technically needed [21:16:07] eth01: about? [21:16:08] it was a planned outage [21:16:11] Mistimed, but planned [21:16:18] JohnLewis well yeh but wrong time. [21:16:29] so, after this is fixed, stuff won't go down again? [21:16:34] But we don't do reports for planned outages [21:16:36] JohnLewis: planned for the wrong time, a message somewhere from SRE would be nice [21:17:39] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 35% [21:18:47] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 69% [21:24:20] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 55% [21:26:17] RECOVERY - cp6 HTTP 4xx/5xx ERROR Rate on cp6 is OK: OK - NGINX Error Rate is 22% [21:27:16] paladox: how long until 503s recover? Still down [21:27:19] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 61% [21:27:55] Should be soon. mysql i think is going over all dbs so it takes abit. 
[21:28:07] Ah, np
[21:28:30] * hispano76 greetings
[21:28:30] At least we got a chance to test & fix SigmaBot’s status function
[21:28:38] hi hispano76
[21:28:44] hey
[21:28:47] hey hispano76
[21:28:51] hey dreamcast99
[21:28:54] how are you doing hispano?
[21:28:54] :D
[21:29:28] perfect! dreamcast99
[21:29:35] nice
[21:30:17] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 32%
[21:33:54] https://phabricator.miraheze.org/T1979#106378 I answered
[21:33:55] [ ⚓ T1979 Error: 503 Backend fetch failed ] - phabricator.miraheze.org
[21:35:56] lol someone should take phab down so ppl can't moan about the server being down
[21:36:31] Used a ticket from 2017. New issue == new ticket
[21:36:35] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 57%
[21:36:52] yeah I saw
[21:36:53] we're up
[21:37:00] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 63%
[21:37:05] thankfully I was able to rescue my edit
[21:37:11] !s
[21:37:13] Please wait while I check the status of Miraheze Services.
[21:37:14] Examknow: Status report finished. There are currently 1 dead services and 5 alive services. To view the full report, say !status.
[21:37:19] [miraheze/puppet] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/JfkZA
[21:37:20] yayyy
[21:37:20] [miraheze/puppet] paladox 6e63d4b - Fix variable name
[21:37:30] RECOVERY - cp6 Varnish Backends on cp6 is OK: All 7 backends are healthy
[21:37:31] RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
[21:37:32] RECOVERY - dbt1 MySQL on dbt1 is OK: Uptime: 1320 Threads: 22 Questions: 20057 Slow queries: 577 Opens: 2492 Flush tables: 1 Open tables: 2486 Queries per second avg: 15.194
[21:37:33] What’s the dead one
[21:37:44] thanks for your services
[21:38:04] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 7 backends are healthy
[21:38:23] dreamcast99: your welcome
[21:38:32] [miraheze/puppet] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/JfkZx
[21:38:33] Nothing is dead, the bot is wrong
[21:38:34] [miraheze/puppet] paladox b802ef2 - Update dbt1.yaml
[21:39:09] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 7 backends are healthy
[21:39:31] RECOVERY - cp7 Varnish Backends on cp7 is OK: All 7 backends are healthy
[21:39:32] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 6%
[21:39:57] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 2%
[21:39:58] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[21:40:04] RECOVERY - cp7 HTTP 4xx/5xx ERROR Rate on cp7 is OK: OK - NGINX Error Rate is 1%
[21:40:05] paladox: icinga guest account just said ‘no dashlet available’
[21:40:22] works for me
[21:41:46] paladox: working now
[21:43:51] [miraheze/mw-config] paladox pushed 1 commit to master [+0/-0/±1] https://git.io/JfknO
[21:43:52] [miraheze/mw-config] paladox 2e6990a - Remove Sitenotice for db maintenance
[21:56:08] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 2.66, 2.68, 1.85
[21:59:11] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 0.65, 1.76, 1.64
[22:02:13] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 0.76, 1.29, 1.47
[22:10:18] [miraheze/services] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/Jfkca
[22:10:20] [miraheze/services] MirahezeSSLBot 09c434d - BOT: Updating services config for wikis
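The "All 7 backends are healthy" recoveries above come from Varnish's backend health probes against the mw* application servers. A quick way to confirm the same state from a cache proxy's shell, assuming a stock Varnish install rather than anything Miraheze-specific:

    # List each backend and whether its health probe currently passes.
    varnishadm backend.list
    # End-to-end spot check: this should now return a 200 (or a redirect)
    # instead of the "503 Backend fetch failed" seen earlier.
    curl -sI https://meta.miraheze.org/ | head -n 1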
[22:22:48] [ssl] Pix1234 created branch augovwiki - https://git.io/vxP9L
[22:22:50] [miraheze/ssl] Pix1234 created branch augovwiki https://git.io/JfkCq
[22:24:52] [ssl] Pix1234 deleted branch augovwiki - https://git.io/vxP9L
[22:24:54] [miraheze/ssl] Pix1234 deleted branch augovwiki
[22:25:14] [miraheze/services] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/JfkCO
[22:25:16] [miraheze/services] MirahezeSSLBot ca34e93 - BOT: Updating services config for wikis
[23:35:11] [miraheze/services] MirahezeSSLBot pushed 1 commit to master [+0/-0/±1] https://git.io/JfklT
[23:35:13] [miraheze/services] MirahezeSSLBot 26aafb3 - BOT: Updating services config for wikis
[23:39:30] [miraheze/mediawiki] paladox pushed 1 commit to REL1_34 [+0/-0/±1] https://git.io/JfklY
[23:39:32] [miraheze/mediawiki] paladox 2cd9905 - Update Moderation
[23:46:37] PROBLEM - cloud1 Current Load on cloud1 is WARNING: WARNING - load average: 21.48, 18.60, 15.89
[23:49:38] RECOVERY - cloud1 Current Load on cloud1 is OK: OK - load average: 19.59, 19.09, 16.57
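The Current Load alerts scattered through the log (cp8 earlier, cloud1 here) report the standard 1/5/15-minute load averages from the usual load check; a sketch with placeholder thresholds, since the per-host warning and critical levels Miraheze actually uses are not visible in the log:

    # Hypothetical thresholds only: warn above 1.5 and go critical above 2.0
    # on any of the 1, 5 and 15 minute load averages. cloud1 evidently runs
    # with far higher limits than the cp* proxies.
    /usr/lib/nagios/plugins/check_load -w 1.5,1.5,1.5 -c 2,2,2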