[00:00:00] PROBLEM - cp4 Stunnel Http for misc2 on cp4 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2319 bytes in 0.080 second response time [00:00:04] SPF|Cloud: how much of an impact though? [00:00:18] PROBLEM - cp3 HTTPS on cp3 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4055 bytes in 0.660 second response time [00:00:48] RhinosF1: large, faster i/o would also make the process faster [00:01:07] it is expected - doing lots of checks in the background [00:01:14] ok [00:02:37] PROBLEM - cp3 Stunnel Http for test2 on cp3 is CRITICAL: NRPE: Command 'check_stunnel_test2' not defined [00:02:47] PROBLEM - cp3 Stunnel Http for mw7 on cp3 is CRITICAL: NRPE: Command 'check_stunnel_mw7' not defined [00:02:54] PROBLEM - cp3 Stunnel Http for mw6 on cp3 is CRITICAL: NRPE: Command 'check_stunnel_mw6' not defined [00:02:58] PROBLEM - cp3 Stunnel Http for mw4 on cp3 is CRITICAL: NRPE: Command 'check_stunnel_mw4' not defined [00:03:28] it's okay icinga-miraheze, you're next [00:03:29] PROBLEM - cp3 Stunnel Http for mw5 on cp3 is CRITICAL: NRPE: Command 'check_stunnel_mw5' not defined [00:04:03] PROBLEM - cp4 Current Load on cp4 is CRITICAL: CRITICAL - load average: 0.30, 3.31, 2.48 [00:04:53] PROBLEM - cp4 HTTPS on cp4 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4136 bytes in 0.008 second response time [00:06:00] PROBLEM - cp4 Current Load on cp4 is WARNING: WARNING - load average: 0.09, 1.57, 1.95 [00:07:50] paladox: still starting? [00:07:54] yup [00:07:58] RECOVERY - cp4 Current Load on cp4 is OK: OK - load average: 0.04, 0.75, 1.52 [00:07:58] oh [00:08:01] it started now [00:08:09] awesome [00:08:13] !log running mysql_upgrade [00:08:28] sorry, I set read_only=1 quickly :P [00:08:46] RECOVERY - cp4 HTTPS on cp4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1531 bytes in 0.047 second response time [00:08:52] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19704 bytes in 6.025 second response time [00:09:39] RECOVERY - cp3 HTTPS on cp3 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1532 bytes in 1.391 second response time [00:09:42] RECOVERY - cp8 HTTPS on cp8 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1531 bytes in 0.487 second response time [00:11:55] paladox: if you see progress (how many dbs done?), it is mandatory to provide info to us :P [00:12:11] it says "Phase 3/7: Fixing views" [00:12:14] and that's about it [00:13:01] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:13:18] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 1.066 second response time [00:13:53] uh oh, meta is loading [00:14:17] It'll be slow as db6 load is at 7 [00:14:25] due to i/o i think (due to mysql_upgrade) [00:14:26] paladox: shall I kill nginx temporarily? 
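For context on the mysql_upgrade run being discussed above: the tool walks every database and table in seven phases (the log quotes "Phase 3/7: Fixing views" shortly after this). A minimal sketch of how such a run is typically started, assuming a local MariaDB socket and credentials in a defaults file; the exact flags used on db6 are not shown in the log:

    # Sketch only: run after the upgraded MariaDB server has started.
    # --verbose prints the per-phase and per-database progress quoted in the log.
    mysql_upgrade --defaults-file=/root/.my.cnf --verbose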
[00:14:30] yeh [00:14:43] there's a salt master on the new server [00:14:45] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 51% [00:14:45] (puppet2) [00:15:40] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is WARNING: WARNING - NGINX Error Rate is 59% [00:16:43] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 90% [00:17:09] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4197 bytes in 0.063 second response time [00:17:43] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is CRITICAL: CRITICAL - NGINX Error Rate is 68% [00:18:10] !log stop nginx on mediawiki servers to prevent load on db6, while performing upgrade [00:20:24] PROBLEM - mw1 HTTPS on mw1 is CRITICAL: connect to address 185.52.1.75 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [00:22:07] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [00:22:08] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 1.102 second response time [00:22:22] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 0.327 second response time [00:22:24] RECOVERY - mw1 HTTPS on mw1 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.009 second response time [00:22:35] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 11% [00:22:49] SPF|Cloud ^ Guess you forgot puppet :P [00:22:53] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 0.822 second response time [00:22:54] oh [00:22:57] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [00:22:59] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19704 bytes in 0.520 second response time [00:23:03] they shouldn't even be pooled, nvm [00:23:07] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 0.321 second response time [00:23:37] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [00:23:43] SPF|Cloud it's done "poserdazfreebieswiki.dpl_clview OK" and now still running.. [00:24:02] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 26% [00:24:20] !log stopped most nginxes again [00:24:55] * RhinosF1 is logging off so he can sleep. Thanks to everyone both in SRE and as users. You've been great to support tonight and good luck for the rest of the update. [00:25:05] paladox: but so far it does look like everything went fine [00:25:10] yup [00:25:31] RhinosF1: sleep well. Thanks so much for the assistance and positivity. [00:25:44] PROBLEM - mw2 HTTPS on mw2 is CRITICAL: connect to address 185.52.2.113 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [00:25:57] SPF|Cloud: no problem. You've done great! [00:26:04] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 4 backends are down. 
mw4 mw5 mw6 mw7 [00:26:08] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4195 bytes in 0.106 second response time [00:26:16] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4195 bytes in 0.077 second response time [00:26:24] PROBLEM - mw1 HTTPS on mw1 is CRITICAL: connect to address 185.52.1.75 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [00:26:26] SPF|Cloud it's running "Phase 4/7: Running 'mysql_fix_privilege_tables'" now [00:26:33] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 71% [00:26:44] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4199 bytes in 0.202 second response time [00:27:01] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4199 bytes in 0.200 second response time [00:27:06] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [00:27:07] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4197 bytes in 0.077 second response time [00:27:30] PROBLEM - mw3 HTTPS on mw3 is CRITICAL: connect to address 81.4.121.113 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [00:27:32] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [00:27:51] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 98% [00:29:47] !log disable puppet on lizardfs6 and mw4-7 [00:30:12] PROBLEM - lizardfs6 Puppet on lizardfs6 is WARNING: WARNING: Puppet is currently disabled, message: reason not specified, last run 8 minutes ago with 0 failures [00:31:06] It's running "Phase 5/7: Fixing table and database names" now [00:31:18] SPF|Cloud ^ [00:31:22] woo [00:31:46] RECOVERY - mw2 HTTPS on mw2 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.012 second response time [00:32:24] RECOVERY - mw1 HTTPS on mw1 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.008 second response time [00:32:26] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 58% [00:33:00] paladox: I presume it's at the next phase now? [00:33:07] yup [00:33:16] it's on the a's [00:33:17] * SPF|Cloud is spying on you with show full processlist; [00:33:27] RECOVERY - mw3 HTTPS on mw3 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 541 bytes in 0.007 second response time [00:33:28] now b's [00:33:30] heh [00:34:24] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 78% [00:35:54] on the c's [00:37:42] on the d's [00:38:04] on the e's [00:40:31] on the f's [00:41:51] on the g's [00:42:17] on the h's [00:42:42] is this phase 6 or 7? [00:44:18] on the i's [00:44:55] SPF|Cloud 6 [00:45:03] ah [00:48:46] on the m's [00:54:57] on the q's [00:55:04] on the r's [00:59:52] on the t's [01:04:21] on the v's [01:08:02] SPF|Cloud finished! [01:08:24] alright! 
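The per-letter progress reports above came from watching the server rather than the tool itself ("spying on you with show full processlist"). A minimal sketch of that kind of monitoring from a second session, assuming client credentials are available locally:

    # Show which database/table mysql_upgrade is currently touching.
    mysql -e "SHOW FULL PROCESSLIST;"

    # Or refresh it every few seconds, hiding idle threads.
    watch -n 5 'mysql -e "SHOW FULL PROCESSLIST;" | grep -v Sleep'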
[01:08:30] !log starting nginx [01:09:54] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19685 bytes in 0.270 second response time [01:09:56] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 0.374 second response time [01:10:04] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 1.912 second response time [01:11:13] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [01:11:19] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 43% [01:11:49] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 23% [01:12:48] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19705 bytes in 1.605 second response time [01:13:13] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19704 bytes in 0.894 second response time [01:13:19] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 14% [01:13:38] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [01:13:51] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [01:15:28] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [01:15:33] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online [01:15:52] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is WARNING: WARNING - NGINX Error Rate is 46% [01:16:39] time for c2 wikis [01:17:52] RECOVERY - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is OK: OK - NGINX Error Rate is 25% [01:21:43] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [01:21:45] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.77.107.210/cpweb, 2001:41d0:800:1056::2/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [01:23:15] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:50] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 2 backends are down. mw4 mw7 [01:23:53] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 1 backends are down. mw7 [01:25:22] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 8.601 second response time [01:25:34] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:56] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [01:26:01] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [01:27:36] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 7.254 second response time [01:29:57] !log db6 MariaDB [(none)]> set global innodb_io_capacity=100; [01:36:48] !log MariaDB [(none)]> set global innodb_io_capacity_max=500; [01:41:51] !log stopping nginx on mw[4567] [01:42:00] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:42:06] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 2 backends are down. 
mw4 mw5 [01:42:24] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:42:29] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. mw4 mw7 [01:42:58] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:45:19] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 92% [01:45:42] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 90% [01:46:24] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [01:46:27] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4199 bytes in 0.064 second response time [01:46:42] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4199 bytes in 0.102 second response time [01:50:25] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is WARNING: WARNING - NGINX Error Rate is 57% [01:52:25] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is CRITICAL: CRITICAL - NGINX Error Rate is 67% [02:11:44] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl66 [02:11:45] [02miraheze/puppet] 07paladox 03cee92b9 - grafana: Switch to db6 [02:12:11] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-5 [+0/-0/±1] 13https://git.io/Jvl6i [02:12:13] [02miraheze/puppet] 07paladox 03f3e353a - matomo: Switch to db6 [02:12:14] [02puppet] 07paladox created branch 03paladox-patch-5 - 13https://git.io/vbiAS [02:12:16] [02puppet] 07paladox opened pull request 03#1233: matomo: Switch to db6 - 13https://git.io/Jvl6P [02:15:00] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvl6y [02:15:02] [02miraheze/puppet] 07paladox 031053534 - icinga: Switch to db6 [02:15:03] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [02:15:08] [02puppet] 07paladox opened pull request 03#1234: icinga: Switch to db6 - 13https://git.io/Jvl6S [02:15:39] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvl69 [02:15:41] [02miraheze/puppet] 07paladox 0365c215c - Update init.pp [02:15:42] [02puppet] 07paladox synchronize pull request 03#1234: icinga: Switch to db6 - 13https://git.io/Jvl6S [02:15:53] [02puppet] 07paladox closed pull request 03#1234: icinga: Switch to db6 - 13https://git.io/Jvl6S [02:15:55] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/Jvl6Q [02:15:56] [02miraheze/puppet] 07paladox 034531e22 - icinga: Switch to db6 (#1234) * icinga: Switch to db6 * Update init.pp [02:16:09] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl65 [02:16:10] [02miraheze/puppet] 07paladox 039de4c51 - roundcubemail: Switch to db6 [02:16:17] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [02:16:19] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [02:16:30] [02puppet] 07paladox deleted branch 03paladox-patch-2 - 13https://git.io/vbiAS [02:16:31] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-2 [02:21:06] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl6p [02:21:07] [02miraheze/puppet] 07paladox 0329f3758 - Update main.pp [02:23:34] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 53% [02:23:47] 
[02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvlif [02:23:48] [02miraheze/puppet] 07paladox 03309da41 - Update init.pp [02:25:30] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 92% [02:28:52] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JvliT [02:28:54] [02miraheze/puppet] 07paladox 031412023 - mariadb: Bind to 0.0.0.0 This will allow port 3306 to be served over both ipv6 and ipv4. [02:28:55] [02puppet] 07paladox created branch 03paladox-patch-2 - 13https://git.io/vbiAS [02:28:57] [02puppet] 07paladox opened pull request 03#1235: mariadb: Bind to 0.0.0.0 - 13https://git.io/Jvlik [02:31:59] [02puppet] 07paladox synchronize pull request 03#1235: mariadb: Bind to 0.0.0.0 - 13https://git.io/Jvlik [02:32:01] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/JvliI [02:32:02] [02miraheze/puppet] 07paladox 03a3d7baa - Update mw.cnf.erb [03:24:25] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 59% [03:26:25] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 84% [03:38:21] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 56% [03:40:20] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 67% [04:10:11] PROBLEM - misc1 HTTPS on misc1 is CRITICAL: connect to address 185.52.1.76 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [04:10:29] PROBLEM - misc1 Puppet on misc1 is WARNING: WARNING: Puppet is currently disabled, message: paladox, last run 9 minutes ago with 0 failures [04:10:46] PROBLEM - misc1 grafana.miraheze.org HTTPS on misc1 is CRITICAL: connect to address 185.52.1.76 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [04:11:32] PROBLEM - misc1 icinga.miraheze.org HTTPS on misc1 is CRITICAL: connect to address 185.52.1.76 and port 443: Connection refusedHTTP CRITICAL - Unable to open TCP socket [04:13:16] !log MariaDB [(none)]> set global innodb_io_capacity_max=200; [04:48:04] [02mw-config] 07paladox synchronize pull request 03#2881: database: Switch to db6 - 13https://git.io/JvluL [04:48:05] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/JvlXU [04:48:07] [02miraheze/mw-config] 07paladox 03a635c3b - Update Database.php [04:50:20] [02mw-config] 07paladox closed pull request 03#2881: database: Switch to db6 - 13https://git.io/JvluL [04:50:21] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlXI [04:50:23] [02miraheze/mw-config] 07paladox 03cce9213 - database: Switch to db6 (#2881) * database: Switch to db6 * Update Database.php [04:50:24] [02mw-config] 07paladox deleted branch 03paladox-patch-3 - 13https://git.io/vbvb3 [04:50:26] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-3 [04:51:46] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/JvlXt [04:51:48] [02miraheze/mw-config] 07paladox 035c19f72 - Unset readonly from all wikis apart from a few [04:51:49] [02mw-config] 07paladox created branch 03paladox-patch-3 - 13https://git.io/vbvb3 [04:51:51] [02mw-config] 07paladox opened pull request 03#2882: Unset readonly from all wikis apart from a few - 13https://git.io/JvlXq [04:52:25] RECOVERY - test1 Puppet on test1 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [04:53:29] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/JvlXY [04:53:30] [02miraheze/mw-config] 07paladox 03ee04372 - Update LocalSettings.php [04:53:32] [02mw-config] 07paladox synchronize pull request 03#2882: Unset readonly from all wikis apart from a few - 13https://git.io/JvlXq [04:57:19] [02mw-config] 07paladox closed pull request 03#2882: Unset readonly from all wikis apart from a few - 13https://git.io/JvlXq [04:57:20] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JvlXZ [04:57:22] [02miraheze/mw-config] 07paladox 031d790b3 - Unset readonly from all wikis apart from a few (#2882) * Unset readonly from all wikis apart from a few * Update LocalSettings.php [04:57:23] [02mw-config] 07paladox deleted branch 03paladox-patch-3 - 13https://git.io/vbvb3 [04:57:25] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-3 [04:59:12] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19705 bytes in 0.387 second response time [04:59:43] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [04:59:45] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [04:59:59] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 56% [05:01:55] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 84% [05:03:12] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4197 bytes in 0.102 second response time [05:03:44] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [05:03:45] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 4 backends are down. mw4 mw5 mw6 mw7 [05:05:12] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19685 bytes in 0.478 second response time [05:06:21] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is WARNING: WARNING - NGINX Error Rate is 55% [05:06:34] PROBLEM - yellowiki.xyz - LetsEncrypt on sslhost is WARNING: WARNING - Certificate 'yellowiki.xyz' expires in 15 day(s) (Mon 02 Mar 2020 05:03:40 AM GMT +0000). [05:06:47] [02miraheze/ssl] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlXW [05:06:48] [02miraheze/ssl] 07MirahezeSSLBot 038d9b924 - Bot: Update SSL cert for yellowiki.xyz [05:08:21] PROBLEM - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is CRITICAL: CRITICAL - NGINX Error Rate is 83% [05:09:26] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:10:44] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 0.589 second response time [05:11:55] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is WARNING: WARNING - NGINX Error Rate is 53% [05:12:23] RECOVERY - cp8 HTTP 4xx/5xx ERROR Rate on cp8 is OK: OK - NGINX Error Rate is 34% [05:12:35] RECOVERY - yellowiki.xyz - LetsEncrypt on sslhost is OK: OK - Certificate 'yellowiki.xyz' will expire on Fri 15 May 2020 04:06:41 AM GMT +0000. 
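The innodb_io_capacity changes logged above (01:29, 01:36, and again at 04:13) cap how much background flushing I/O InnoDB schedules, which matters on an HDD-backed host like db6. A sketch of inspecting and adjusting them at runtime; the values are simply the ones quoted in the log, not a recommendation:

    # Current settings.
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';"

    # Lower the background flushing budget, as logged for db6.
    mysql -e "SET GLOBAL innodb_io_capacity = 100;"
    mysql -e "SET GLOBAL innodb_io_capacity_max = 500;"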
[05:13:55] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 76% [05:14:12] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 0.410 second response time [05:14:57] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:55] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 20% [05:18:14] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 8.762 second response time [05:18:27] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:20:27] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is WARNING: WARNING - NGINX Error Rate is 59% [05:21:15] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 0.993 second response time [05:24:26] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 0.474 second response time [05:26:18] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19704 bytes in 4.908 second response time [05:28:53] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:04] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:44] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 1351 bytes in 0.067 second response time [05:33:16] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19704 bytes in 8.960 second response time [05:34:32] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is CRITICAL: CRITICAL - NGINX Error Rate is 60% [05:35:03] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19687 bytes in 0.640 second response time [05:36:29] PROBLEM - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is WARNING: WARNING - NGINX Error Rate is 59% [05:36:58] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:37:43] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:38:58] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19686 bytes in 2.792 second response time [06:38:01] paladox, PuppyKun, Reception123, SPF|Cloud: should we still be down? It’s 503ing [06:38:32] @Site Reliability Engineers [06:40:47] I'm not sure since I didn't do the migration [06:40:59] I can try taking a look but I doubt I can do anything [06:42:24] Reception123: paladox said we we’re back at 01:17 on discord but icinga,phab and wikis are all down [06:46:52] RhinosF1: I can just see that db6 is really slow [06:46:58] Reception123: wiki just loaded very slow but not rendered rigjt [06:47:18] Reception123: hmm, that’s not good. 
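When "db6 is really slow" as noted above, the first question is whether the box is I/O-bound or query-bound. A sketch of the quick checks that answer that, assuming sysstat is installed (the log only shows it being installed later, and on cloud1):

    # Load average and per-device I/O wait; sustained high %iowait points at the disks.
    uptime
    iostat -x 5 3

    # Pending InnoDB I/O requests are another sign the storage is the bottleneck.
    mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -i pending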
[06:47:34] Back to 503 [06:51:21] Reception123: no edits since 2pm my time [06:52:01] Reception123: this is UBM [06:52:03] UBN* [06:52:21] well we have to wait for the others, I'm looking at what I can but I don't think I'll get far [06:53:04] Zppix: I’ve pinged everyone [06:53:12] Even owen to wake john [06:53:24] But paladox sleeps well usually [06:57:44] I have to go to bed its late and i work tomorrow or else id stay up and help [06:58:49] Zppix: I don’t want to go yet but I need to get another 2-3 hrs sleep as well [06:59:13] Reception123: how long u around for [06:59:22] quite a while [07:10:46] Zppix: RhinosF1 shouldn't https://github.com/miraheze/mw-config/blob/master/LocalSettings.php#L24 be DB6? [07:10:46] [ mw-config/LocalSettings.php at master · miraheze/mw-config · GitHub ] - github.com [07:14:15] Try it Reception123 [07:14:23] Worse thing it cAn do is nothing [07:15:33] true [07:15:38] we're completely down anyway [07:17:30] [02miraheze/mw-config] 07Reception123 pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl1h [07:17:31] [02miraheze/mw-config] 07Reception123 0313df231 - try to change $wgLocalVirtualHosts to db6 IP [07:17:56] Reception123: force puppet runs [07:18:07] yeah doing [07:19:37] Reception123: any luck? [07:19:55] nope :( reverting [07:20:01] Wait dont [07:20:05] I think its fine [07:20:19] Reception123: [07:20:27] Zppix: well technically it should be like that [07:20:39] when paladox or someone comes I'll let them know of the change in case it shouldn't [07:20:57] Reception123: it was showing meta for second fhen it died [07:21:09] I wonder if the import is breaking it [07:21:13] Zppix: https://github.com/miraheze/mw-config/commit/05b3c870b20ac07d0922a8cf0169de2965332759 this shows it's the correct procedure when moving dbs [07:21:14] [ 81.4.125.112 -> 81.4.109.166 · miraheze/mw-config@05b3c87 · GitHub ] - github.com [07:21:26] Zppix: don't see an active import if I check htop on db6 [07:21:53] Then what the hell is causing this [07:22:08] Now im just getting 502 Reception123 [07:22:15] Zppix: I got that a few times last time [07:22:23] *before not last time [07:23:25] I just announced on discord that our down status is known [07:23:46] Zppix: I think ATT might be the culprit [07:23:54] How so? [07:24:11] Zppix: Sat Feb 15 7:18:46 UTC 2020 mw4 allthetropeswiki SqlBagOStuff::fetchBlobMulti db6.miraheze.org 1146 Table 'allthetropeswiki.objectcache' doesn't exist (db6.miraheze.org) SELECT keyname,value,exptime FROM `objectcache` WHERE keyname = 'allthetropeswiki:messages:en:status' [07:24:34] How do we fix that [07:25:26] Zppix: I don't know why it says it doesn't exist though, maybe a migration problem? [07:25:38] Idk how so we dix [07:25:39] Fix [07:25:45] Zppix: 'use allthetropeswiki' is slow for now not even working [07:26:18] it just won't do it [07:26:42] Maybe we just need to drop it :P jk Reception123 [07:27:21] Zppix: well db6 is just too slow, a select for meta doesn't work and neither does USE att [07:27:46] So basically we have no db? [07:28:01] Can we switch back to the old db temp? [07:28:17] Zppix: We probably could but then it would mean the migration was for nothing [07:28:39] We need to do something until the other sre are back [07:28:49] We cant have an outage overnight [07:28:53] Well lose wikis [07:29:21] Zppix: We could switch ATT back to db5 maybe? [07:29:42] Reception123: if you know how go for it [07:29:57] Did they keep db5 database intact? 
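A sketch of narrowing down the objectcache error quoted above, i.e. checking whether that one table failed to arrive on db6 or is missing entirely; the database and host names are taken from the error message, and db5's hostname is assumed to follow the same scheme:

    # Did the table make it to the new server?
    mysql -h db6.miraheze.org -e "SHOW TABLES IN allthetropeswiki LIKE 'objectcache';"

    # Compare with the old server.
    mysql -h db5.miraheze.org -e "SHOW TABLES IN allthetropeswiki LIKE 'objectcache';"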
[07:30:24] yeah, they should've [07:30:29] and I'm not sure how but I'll look [07:31:05] Im just not sure why an error for att would break all wikis Reception123 [07:31:19] That sounds like bad error handling [07:33:02] Zppix: I know, it's probably not that [07:33:13] sounds like I should have exported a backup of our wiki before this maintenance [07:34:07] Tegu: no data loss is expected [07:34:11] phew :) [07:34:21] We still have the old db intact [07:38:04] Reception123: GL im off to bed im tired [07:38:17] Zppix: good night [07:38:43] Reception123: wait have you tried restarting the db? [07:39:02] Zppix: would that not make things worse? [07:39:33] How so? [07:39:52] Zppix: data loss? [07:40:22] Idk [07:40:46] I cant think of any other solutions [07:41:22] Besides literally dropping ATT and maybe reimport [07:42:15] But anyway before i come up with more crazy ideas im going to bed [07:44:07] well, good night [07:44:13] Reception123: is jobrunner operating? [07:44:20] Reception123: i just thought of that [07:44:36] let me see which server that's on [07:44:48] Reception123: jobrunner1 [07:44:59] oh yes I forgot it has its own [07:45:34] !log started jobrunner on jobrunner1 [07:45:50] I wonder if thats why [07:46:29] Zppix: yup, that was why... [07:46:37] Im a genius [07:46:39] Phabricator is still down but Meta seems to be back up [07:46:41] Zppix: you saved the day :) [07:47:18] Zppix: it's still really slow but it seems to work [07:47:44] I was legit getting ready to lay down and i was like oh i wonder if jobrunner is working Reception123 [07:48:15] Zppix: heh that was a good idea [07:48:26] though the others should really come and see why it's so slow and what's up with Phab [07:49:20] The main thing is the wikis work, the other services arent the end of the world [07:49:51] Reception123: it may be varnish hasnt had time to recache? [07:49:58] could be [07:50:07] Zppix: well they don't really because it's still really really slow [07:51:03] Landing page is fast Reception123 [07:51:43] Yeah [07:52:01] Anyway im going to bed (maybe?) [09:50:36] Reception123: Are we supposed to still be read only [09:50:50] Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later. [09:50:50] The system administrator who locked it offered this explanation: The master database server is running in read-only mode. [09:52:57] We just 503’d [10:13:40] Update: We are aware of an issue causing slowness for wikis and most services to be unavailable. We are waiting on either <@585842771256934431> or <@484010048004030484> to be available as it is an issue with the migration. There is a good chance they won’t be available for a few hours unfortunately as they are asleep. We will try to get an incident report as soon as we can. [10:14:17] That’s paladox or SPF|Cloud [10:29:24] No idea, but I'm not going to mess with it [10:30:14] Reception123: leave it. I have a good idea what they’ve done but why baffles me. It’s likely the cause as well. [12:16:06] paladox, SPF|Cloud: ? [13:30:28] i'm around [13:35:46] paladox: wikis are slow + readonly, phab is showing a db error + everything else is err_connection_failed [13:36:00] yup, i am aware. [13:36:05] Please work out wtf is going on. Anyway, I’m off out [13:36:17] paladox: that’s not good 20 hours on [13:36:34] RhinosF1 we are on hdd's, things are going to be slow. 
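The fix that eventually brought the wikis back (see 07:45 above) was simply starting the job runner again on jobrunner1. A minimal check-and-start sketch; the systemd unit name used here is an assumption, the log only says "started jobrunner":

    # Is the job runner actually running? (unit name assumed)
    systemctl status jobrunner

    # If not, start it and watch its log to confirm jobs are being picked up.
    systemctl start jobrunner
    journalctl -u jobrunner -f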
[13:36:54] paladox: can you make an update on discord [13:44:03] ok [14:38:20] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl71 [14:38:22] [02miraheze/mw-config] 07paladox 03ba42894 - Only keep allthetropeswiki in read only [14:40:22] paladox: almost there? [14:40:37] SPF|Cloud i've restored access (which seems to be holding) [14:40:43] i'm still restoring att though [14:40:52] piwik i can do later after all wikis are done [14:41:04] How much of it has been done? [14:41:36] paladox: wikis are read only even without that for some reason [14:41:54] where? [14:42:20] oh [14:42:20] paladox: try edit on meta. i said when you turned up [14:42:40] 13:35:46 paladox: wikis are slow + readonly, phab is showing a db error + everything else is err_connection_failed [14:42:47] 13:36:01 yup, i am aware. [14:44:21] SPF|Cloud where else did you set things into read only? [14:44:51] SET GLOBAL read_only=1; [14:44:52] "The master database server is running in read-only mode." [14:44:53] oh [14:44:57] !log SET GLOBAL read_only=0; [14:45:29] RhinosF1 now it's not. [14:48:21] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl77 [14:48:23] [02miraheze/puppet] 07paladox 033aeef34 - Update puppet2.yaml [15:00:07] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl5q [15:00:09] [02miraheze/mw-config] 07paladox 035c3231b - Switch att back to db5 [15:04:22] paladox: should createwiki still be blocked? [15:04:42] yes, for the moment. [15:04:48] we're not ready to unlock that yet [15:05:08] paladox: k, can you respond to https://meta.miraheze.org/w/index.php?diff=96561&oldid=96479&rcid=378506 [15:05:16] [ Difference between revisions of "Stewards' noticeboard" - Miraheze Meta ] - meta.miraheze.org [15:05:28] not at the moment, we're still trying to sort out a few issues, sorry. 
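The read-only confusion above involves two separate switches: the per-wiki read-only flag in mw-config and MariaDB's own global read_only flag that was left set from the migration (SET GLOBAL read_only=1). A sketch of checking and clearing the server-side one, which is what the 14:44 !log entry records; run this on the master only:

    # Is the server itself refusing writes?
    mysql -e "SELECT @@global.read_only;"

    # Clear it once maintenance is over.
    mysql -e "SET GLOBAL read_only = 0;"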
[15:06:47] I’m not able to either [15:09:18] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl5l [15:09:20] [02miraheze/puppet] 07paladox 032fab380 - db6: Up innodb_buffer_pool_size to 18G [15:13:33] [02puppet] 07paladox synchronize pull request 03#1235: mariadb: Bind to 0.0.0.0 - 13https://git.io/Jvlik [15:13:35] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-2 [+0/-0/±1] 13https://git.io/Jvl5E [15:13:36] [02miraheze/puppet] 07paladox 03fd4a6dc - Update mw.cnf.erb [15:16:34] !log restart mysql on db6 [15:25:07] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvl5Q [15:25:08] [02miraheze/puppet] 07paladox 03d6c6bf2 - salt: Allow master to listen on ipv6 address [15:25:10] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [15:25:15] [02puppet] 07paladox opened pull request 03#1236: salt: Allow master to listen on ipv6 address - 13https://git.io/Jvl57 [15:25:37] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl55 [15:25:38] [02miraheze/puppet] 07paladox 03df1c832 - Update masters.pp [15:26:31] [02puppet] 07paladox synchronize pull request 03#1236: salt: Allow master to listen on ipv6 address - 13https://git.io/Jvl57 [15:26:32] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvl5F [15:26:34] [02miraheze/puppet] 07paladox 03a5e5fd8 - Update master.erb [15:26:39] [02puppet] 07paladox closed pull request 03#1236: salt: Allow master to listen on ipv6 address - 13https://git.io/Jvl57 [15:26:40] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl5b [15:26:42] [02miraheze/puppet] 07paladox 03ad09b1e - salt: Allow master to listen on ipv6 address (#1236) * salt: Allow master to listen on ipv6 address * Update master.erb [15:26:43] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [15:26:45] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [15:28:00] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvl5p [15:28:01] [02miraheze/puppet] 07paladox 03e947551 - Update master.erb [15:41:04] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvld4 [15:41:05] [02miraheze/puppet] 07paladox 03e6e4a18 - salt: Run as salt [16:05:47] !log db6: MariaDB [(none)]> set global innodb_flush_log_at_trx_commit=2; [16:11:44] !log install iotop on mw1/cloud1 [16:16:52] !log install sysstat on cloud1 [16:25:23] * hispano76 greetings [16:27:48] [02miraheze/dns] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/JvlFA [16:27:50] [02miraheze/dns] 07paladox 03d120ccf - Rollback to old cluster [16:27:51] [02dns] 07paladox created branch 03paladox-patch-3 - 13https://git.io/vbQXl [16:27:53] [02dns] 07paladox opened pull request 03#127: Rollback to old cluster - 13https://git.io/JvlFx [16:28:52] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlFh [16:28:53] [02miraheze/mw-config] 07paladox 039ae4b6b - Set all wikis to read only [16:29:28] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/Jvlbe [16:29:30] [02miraheze/mw-config] 07paladox 0361af94b - Switch back to old cluster [16:29:31] [02mw-config] 07paladox created branch 03paladox-patch-3 - 13https://git.io/vbvb3 [16:29:33] [02mw-config] 07paladox opened pull request 
03#2883: Switch back to old cluster - 13https://git.io/Jvlbv [16:33:08] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlbJ [16:33:10] [02miraheze/puppet] 07paladox 03a0e6c02 - Switch back to the old cluster [16:33:11] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [16:33:32] [02puppet] 07paladox opened pull request 03#1237: Switch back to the old cluster - 13https://git.io/JvlbU [16:35:39] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlbL [16:35:41] [02miraheze/puppet] 07paladox 03ce7c014 - Update stunnel.conf [16:35:42] [02puppet] 07paladox synchronize pull request 03#1237: Switch back to the old cluster - 13https://git.io/JvlbU [16:35:49] !log db6: MariaDB [(none)]> set global innodb_flush_log_at_trx_commit=1; [16:36:53] paladox: shall I review those PRs? [16:38:20] SPF|Cloud yes please [16:39:05] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlb3 [16:39:06] [02miraheze/puppet] 07paladox 03519d27e - Update varnish.pp [16:39:08] [02puppet] 07paladox synchronize pull request 03#1237: Switch back to the old cluster - 13https://git.io/JvlbU [16:39:16] paladox: https://github.com/miraheze/mw-config/pull/2883/files [16:39:17] [ Switch back to old cluster by paladox · Pull Request #2883 · miraheze/mw-config · GitHub ] - github.com [16:39:24] there are wikis missing on c2 [16:39:36] oh [16:39:37] right [16:40:08] So did we figure out the issues? [16:40:20] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/Jvlbl [16:40:22] [02miraheze/mw-config] 07paladox 03a68fe74 - Update Database.php [16:40:23] SPF|Cloud done [16:40:23] Zppix: too high I/O on cloud1 [16:40:23] [02mw-config] 07paladox synchronize pull request 03#2883: Switch back to old cluster - 13https://git.io/Jvlbv [16:40:48] SPF|Cloud: so what are we doing about it? 
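iotop and sysstat were installed above to see which processes are generating the I/O on cloud1. A sketch of the usual invocations; the log records the installs but not the exact commands run:

    # Per-process I/O in batch mode, showing only processes actually doing I/O.
    iotop -obn 3

    # Per-device utilisation, sampled live via sysstat.
    sar -d 1 5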
[16:40:55] [02mw-config] 07Southparkfan closed pull request 03#2883: Switch back to old cluster - 13https://git.io/Jvlbv [16:40:57] [02miraheze/mw-config] 07Southparkfan pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvlb4 [16:40:58] [02miraheze/mw-config] 07paladox 031a06fd8 - Switch back to old cluster (#2883) * Switch back to old cluster * Update Database.php [16:40:59] Zppix reverting back to the old cluster [16:41:01] cannot comment on that yet [16:41:10] but yes - for now, we're rolling back [16:41:17] Damn [16:42:29] [02miraheze/dns] 07paladox pushed 031 commit to 03paladox-patch-3 [+0/-0/±1] 13https://git.io/Jvlbu [16:42:31] [02miraheze/dns] 07paladox 03e4f5e98 - Update config [16:42:32] [02dns] 07paladox synchronize pull request 03#127: Rollback to old cluster - 13https://git.io/JvlFx [16:44:41] [02dns] 07paladox closed pull request 03#127: Rollback to old cluster - 13https://git.io/JvlFx [16:44:43] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlbX [16:44:44] [02miraheze/dns] 07paladox 0368bcbfd - Rollback to old cluster (#127) * Rollback to old cluster * Update config [16:44:46] [02dns] 07paladox deleted branch 03paladox-patch-3 - 13https://git.io/vbQXl [16:44:47] [02miraheze/dns] 07paladox deleted branch 03paladox-patch-3 [16:46:13] [02puppet] 07paladox closed pull request 03#1237: Switch back to the old cluster - 13https://git.io/JvlbU [16:46:15] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±3] 13https://git.io/JvlbM [16:46:16] [02miraheze/puppet] 07paladox 032f65aa3 - Switch back to the old cluster (#1237) * Switch back to the old cluster * Update stunnel.conf * Update varnish.pp [16:46:18] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [16:46:19] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [16:47:19] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvlby [16:47:21] [02miraheze/puppet] 07paladox 03ad14ba0 - Revert "salt: Run as salt" This reverts commit e6e4a18381bd259ba15555e6bb697e193a89fcaf. 
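Stopping and starting nginx across all the MediaWiki servers, and the pooling changes around it, are the kind of fan-out the salt master mentioned earlier (on puppet2) is for. The log does not show the exact commands used, so this is only a sketch of the standard Salt CLI for that pattern:

    # Confirm the MediaWiki minions respond from the master.
    salt 'mw*' test.ping

    # Stop or start nginx on all of them in one step.
    salt 'mw*' service.stop nginx
    salt 'mw*' service.start nginx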
[16:50:07] [02miraheze/dns] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvlb5 [16:50:09] [02miraheze/dns] 07paladox 03e48673b - Update config [16:51:27] !log set set global read_only=1 [16:51:40] !log starting mysql on db4 & set global read_only=1 [16:53:01] RECOVERY - cp8 Stunnel Http for misc2 on cp8 is OK: HTTP OK: HTTP/1.1 200 OK - 43457 bytes in 0.429 second response time [16:53:15] RECOVERY - db5 Puppet on db5 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:53:15] RECOVERY - misc1 Puppet on misc1 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:53:21] RECOVERY - misc1 grafana.miraheze.org HTTPS on misc1 is OK: HTTP OK: HTTP/1.1 200 OK - 29370 bytes in 0.493 second response time [16:53:41] RECOVERY - misc1 icinga.miraheze.org HTTPS on misc1 is OK: HTTP OK: HTTP/1.1 302 Found - 334 bytes in 0.008 second response time [16:54:01] RECOVERY - cp3 Stunnel Http for misc2 on cp3 is OK: HTTP OK: HTTP/1.1 200 OK - 43457 bytes in 0.749 second response time [16:54:02] RECOVERY - cp3 Disk Space on cp3 is OK: DISK OK - free space: / 2991 MB (12% inode=94%); [16:54:05] RECOVERY - misc2 HTTPS on misc2 is OK: HTTP OK: HTTP/1.1 200 OK - 43465 bytes in 0.057 second response time [16:54:10] RECOVERY - cp4 Stunnel Http for misc2 on cp4 is OK: HTTP OK: HTTP/1.1 200 OK - 43457 bytes in 0.060 second response time [16:54:23] RECOVERY - misc1 HTTPS on misc1 is OK: HTTP OK: HTTP/1.1 200 OK - 29340 bytes in 0.162 second response time [16:54:24] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4208 bytes in 0.025 second response time [16:55:22] PROBLEM - cp8 HTTPS on cp8 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4138 bytes in 0.380 second response time [16:55:22] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4208 bytes in 0.024 second response time [16:55:23] PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 100% [16:55:39] RECOVERY - db4 MySQL on db4 is OK: Uptime: 274 Threads: 26 Questions: 9206 Slow queries: 362 Opens: 917 Flush tables: 1 Open tables: 911 Queries per second avg: 33.598 [16:55:57] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [16:56:35] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19723 bytes in 0.287 second response time [16:57:09] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 4.547 second response time [16:57:10] RECOVERY - cp4 HTTP 4xx/5xx ERROR Rate on cp4 is OK: OK - NGINX Error Rate is 23% [16:57:29] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 8.254 second response time [16:57:30] RECOVERY - cp8 HTTPS on cp8 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1532 bytes in 0.462 second response time [16:57:31] RECOVERY - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is OK: OK - NGINX Error Rate is 11% [16:58:42] wich is Read-only for rollback? curiosity [16:59:39] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 4.307 second response time [16:59:53] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:01:22] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [17:01:38] paladox: is this normal performance for the old cluster? 
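The innodb_flush_log_at_trx_commit flips above trade durability for less fsync traffic: with 2 the redo log is only flushed to disk about once per second, while 1 (the default, restored at 16:35) flushes on every commit. A sketch of the runtime toggle, values as logged:

    # Relax redo-log flushing while the host is I/O-starved
    # (risk: up to about a second of committed transactions lost on a crash).
    mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"

    # Restore full durability once load settles.
    mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1;"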
[17:03:01] SPF|Cloud yes [17:03:14] I'm going to re-enable the file system on the old gluster [17:03:17] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [17:03:31] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 7.581 second response time [17:03:49] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:22] !log gluster volume set mvol features.read-only off - lizardfs6 [17:04:58] RECOVERY - lizardfs6 Puppet on lizardfs6 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:05:24] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:06:01] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [17:07:58] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:14] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:38] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4212 bytes in 0.067 second response time [17:10:11] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:10:33] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. mw1 [17:12:22] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 4.783 second response time [17:14:00] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [17:14:16] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 0.277 second response time [17:14:31] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [17:14:37] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 2.20, 2.16, 1.23 [17:15:51] RECOVERY - mw1 MediaWiki Rendering on mw1 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 7.604 second response time [17:16:26] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 0.461 second response time [17:16:34] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 1.03, 1.70, 1.16 [17:19:51] PROBLEM - misc1 grafana.miraheze.org HTTPS on misc1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:02] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [17:20:08] PROBLEM - misc1 HTTPS on misc1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:52] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:54] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 4.990 second response time [17:23:22] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:25:17] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [17:29:31] PROBLEM - mw3 MediaWiki Rendering on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:29:51] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [17:30:24] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:25] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. mw2 [17:31:33] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 1 backends are down. 
mw2 [17:32:21] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlND [17:32:23] [02miraheze/puppet] 07paladox 03fd8d36c - icinga: Switch back to db4 [17:32:24] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 5.049 second response time [17:32:24] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [17:32:26] [02puppet] 07paladox opened pull request 03#1238: icinga: Switch back to db4 - 13https://git.io/JvlNy [17:32:50] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlN9 [17:32:52] [02miraheze/puppet] 07paladox 0331c17e5 - Update init.pp [17:32:53] [02puppet] 07paladox synchronize pull request 03#1238: icinga: Switch back to db4 - 13https://git.io/JvlNy [17:33:07] [02puppet] 07paladox closed pull request 03#1238: icinga: Switch back to db4 - 13https://git.io/JvlNy [17:33:08] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JvlNH [17:33:10] [02miraheze/puppet] 07paladox 03ade0ca0 - icinga: Switch back to db4 (#1238) * icinga: Switch back to db4 * Update init.pp [17:33:11] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [17:33:13] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [17:33:19] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:33:29] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlN7 [17:33:31] [02miraheze/puppet] 07paladox 03490e257 - grafana: Switch back to db4 [17:33:31] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [17:33:49] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlN5 [17:33:51] [02miraheze/puppet] 07paladox 030670314 - roundcubemail: Switch back to db4 [17:34:05] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 5 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 81.4.109.133/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [17:36:52] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:24] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [17:37:38] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. mw3 lizardfs6 [17:38:04] RECOVERY - mw3 MediaWiki Rendering on mw3 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 3.531 second response time [17:38:48] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 1.287 second response time [17:39:30] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [17:39:43] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [17:41:54] RECOVERY - misc1 HTTPS on misc1 is OK: HTTP OK: HTTP/1.1 200 OK - 29371 bytes in 0.025 second response time [17:44:15] RECOVERY - misc1 grafana.miraheze.org HTTPS on misc1 is OK: HTTP OK: HTTP/1.1 200 OK - 29340 bytes in 0.014 second response time [17:49:40] !log set global read_only=0; [17:49:45] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. lizardfs6 [17:50:56] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. 
mw3 lizardfs6 [17:52:33] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlAC [17:52:35] [02miraheze/puppet] 07paladox 03bbdd511 - Add mw4 to varnish (as a test) [17:52:36] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [17:52:38] [02puppet] 07paladox opened pull request 03#1239: Add mw4 to varnish (as a test) - 13https://git.io/JvlAW [17:53:33] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlA8 [17:53:35] [02miraheze/puppet] 07paladox 03ce81da0 - Update stunnel.conf [17:53:36] [02puppet] 07paladox synchronize pull request 03#1239: Add mw4 to varnish (as a test) - 13https://git.io/JvlAW [17:53:47] PROBLEM - mw2 MediaWiki Rendering on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:53] [02puppet] 07paladox closed pull request 03#1239: Add mw4 to varnish (as a test) - 13https://git.io/JvlAW [17:53:55] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JvlAB [17:53:56] [02miraheze/puppet] 07paladox 038ec290c - Add mw4 to varnish (as a test) (#1239) * Add mw4 to varnish (as a test) * Update stunnel.conf [17:53:58] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [17:53:59] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [17:59:12] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 5 backends are healthy [17:59:47] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [18:00:48] PROBLEM - cp4 Puppet on cp4 is WARNING: WARNING: Puppet is currently disabled, message: reason not specified, last run 5 minutes ago with 0 failures [18:02:13] RECOVERY - mw2 MediaWiki Rendering on mw2 is OK: HTTP OK: HTTP/1.1 200 OK - 35545 bytes in 0.407 second response time [18:03:04] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [18:04:40] paladox: working on notice regarding data drift now [18:04:45] RECOVERY - cp4 Puppet on cp4 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:04:50] SPF|Cloud ok [18:05:00] Phabricator is broken though [18:05:36] yeh i'll have to restore that [18:05:40] That one should be back online, as I'll be redirecting users there [18:05:59] Or do you want to receive requests for restore through e-mail? [18:06:57] SPF|Cloud i'm restoring phab [18:06:59] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 5 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 81.4.109.133/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [18:08:30] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 1351 bytes in 0.026 second response time [18:09:25] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlAb [18:09:27] [02miraheze/puppet] 07paladox 03d771542 - phabricator: Switch back to db4 [18:10:33] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19726 bytes in 7.175 second response time [18:10:37] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:47] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. 
mw3 [18:13:53] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [18:14:48] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 5.921 second response time [18:15:04] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:00] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19724 bytes in 0.390 second response time [18:18:15] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlxJ [18:18:17] [02miraheze/mw-config] 07paladox 03257d2d8 - Take all wikis out of read only [18:18:34] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/JvlxT [18:18:36] [02miraheze/mw-config] 07paladox 0310e19fe - Update LocalSettings.php [18:20:14] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online [18:21:32] PROBLEM - test1 MediaWiki Rendering on test1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:00] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 3 backends are down. mw2 mw3 lizardfs6 [18:23:00] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 2 backends are down. mw3 lizardfs6 [18:24:13] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 5 datacenters are down: 2400:6180:0:d0::403:f001/cpweb, 81.4.109.133/cpweb, 2a00:d880:5:8ea::ebc7/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb [18:24:58] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 6 backends are healthy [18:24:58] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 6 backends are healthy [18:25:08] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvlxt [18:25:10] [02miraheze/puppet] 07paladox 03d895d71 - fix [18:25:36] RECOVERY - test1 MediaWiki Rendering on test1 is OK: HTTP OK: HTTP/1.1 200 OK - 19725 bytes in 2.788 second response time [18:26:46] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlx3 [18:26:47] [02miraheze/puppet] 07paladox 03b2a4e82 - Pool in mw[4567] [18:26:49] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS [18:26:50] [02puppet] 07paladox opened pull request 03#1240: Pool in mw[4567] - 13https://git.io/Jvlxs [18:27:48] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlxG [18:27:49] [02miraheze/puppet] 07paladox 03d5f5d06 - Update stunnel.conf [18:27:51] [02puppet] 07paladox synchronize pull request 03#1240: Pool in mw[4567] - 13https://git.io/Jvlxs [18:27:57] [02puppet] 07paladox edited pull request 03#1240: varnish: Pool in mw[4567] - 13https://git.io/Jvlxs [18:28:07] PROBLEM - mw1 MediaWiki Rendering on mw1 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 4212 bytes in 0.045 second response time [18:28:15] [02puppet] 07paladox closed pull request 03#1240: varnish: Pool in mw[4567] - 13https://git.io/Jvlxs [18:28:17] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±2] 13https://git.io/JvlxZ [18:28:18] [02miraheze/puppet] 07paladox 03f9e3506 - varnish: Pool in mw[4567] (#1240) * Pool in mw[4567] * Update stunnel.conf [18:28:20] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS [18:28:21] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10 [18:30:04] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online [18:30:11] RECOVERY - mw1 MediaWiki Rendering on 
[18:30:37] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[18:34:57] !log depool lizardfs6
[18:36:03] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. lizardfs6
[18:36:22] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlxa
[18:36:23] [02miraheze/puppet] 07paladox 03e427f60 - mw[4567] remove new_servers
[18:36:25] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS
[18:36:26] [02puppet] 07paladox opened pull request 03#1241: mw[4567] remove new_servers - 13https://git.io/Jvlxw
[18:36:42] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlxr
[18:36:43] [02miraheze/puppet] 07paladox 03091ccea - Update mw5.yaml
[18:36:45] [02puppet] 07paladox synchronize pull request 03#1241: mw[4567] remove new_servers - 13https://git.io/Jvlxw
[18:36:55] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlxo
[18:36:57] [02miraheze/puppet] 07paladox 039d4dea3 - Update mw6.yaml
[18:36:58] [02puppet] 07paladox synchronize pull request 03#1241: mw[4567] remove new_servers - 13https://git.io/Jvlxw
[18:37:07] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlxK
[18:37:08] [02miraheze/puppet] 07paladox 034159890 - Update mw7.yaml
[18:37:10] [02puppet] 07paladox synchronize pull request 03#1241: mw[4567] remove new_servers - 13https://git.io/Jvlxw
[18:37:17] [02puppet] 07paladox closed pull request 03#1241: mw[4567] remove new_servers - 13https://git.io/Jvlxw
[18:37:19] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±4] 13https://git.io/Jvlx6
[18:37:20] [02miraheze/puppet] 07paladox 0373e432f - mw[4567] remove new_servers (#1241) * mw[4567] remove new_servers * Update mw5.yaml * Update mw6.yaml * Update mw7.yaml
[18:37:22] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10
[18:37:23] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS
[18:39:01] paladox: have you updated phab?
[18:39:47] What do you mean by update?
[18:39:58] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 128.199.139.216/cpweb, 2a00:d880:5:8ea::ebc7/cpweb
[18:40:03] phabricator is back
[18:40:09] paladox: software
[18:40:16] no
[18:40:24] The look is more modern
[18:40:55] Subscribers used to be a text list and now it's showing everyone with profile pics next to their names
[18:41:00] !log repool cp4
[18:41:15] oh
[18:41:27] maybe i did when i switched over to phab1
[18:41:58] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[18:41:59] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 10 backends are healthy
[18:42:07] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master
[18:42:20] Since before the upgrade, I've been watching it
[18:44:05] since phab1 would have cloned master
[18:44:22] so it must have been newer than what I have at the moment on misc1
[18:45:25] https://phabricator.miraheze.org/T5233#99470 that was at the time of making this comment for example.
[18:45:28] [ ⚓ T5233 Can someone clear all the nonexistent user permissions from the group rights lists from the following wikis? ] - phabricator.miraheze.org
[18:46:58] (I'm just saying)
[18:59:15] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 1 datacenter is down: 2400:6180:0:d0::403:f001/cpweb
[19:04:08] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 81.4.109.133/cpweb
[19:04:58] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlpg
[19:05:00] [02miraheze/puppet] 07paladox 0395828d1 - Update mw4.yaml
[19:05:01] [02puppet] 07paladox created branch 03paladox-patch-10 - 13https://git.io/vbiAS
[19:05:03] [02puppet] 07paladox opened pull request 03#1242: Update mw4.yaml - 13https://git.io/Jvlp2
[19:05:04] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[19:05:15] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlpa
[19:05:16] [02miraheze/puppet] 07paladox 03e88554f - Update mw5.yaml
[19:05:18] [02puppet] 07paladox synchronize pull request 03#1242: Update mw4.yaml - 13https://git.io/Jvlp2
[19:05:26] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/JvlpV
[19:05:27] [02miraheze/puppet] 07paladox 0301a897f - Update mw6.yaml
[19:05:29] [02puppet] 07paladox synchronize pull request 03#1242: Update mw4.yaml - 13https://git.io/Jvlp2
[19:05:37] [02miraheze/puppet] 07paladox pushed 031 commit to 03paladox-patch-10 [+0/-0/±1] 13https://git.io/Jvlpw
[19:05:38] [02miraheze/puppet] 07paladox 03c84d49e - Update mw7.yaml
[19:05:40] [02puppet] 07paladox synchronize pull request 03#1242: Update mw4.yaml - 13https://git.io/Jvlp2
[19:05:56] [02puppet] 07paladox edited pull request 03#1242: Switch gluster to lizardfs6 gluster for mw[4567] - 13https://git.io/Jvlp2
[19:06:00] [02puppet] 07paladox closed pull request 03#1242: Switch gluster to lizardfs6 gluster for mw[4567] - 13https://git.io/Jvlp2
[19:06:01] [02miraheze/puppet] 07paladox pushed 031 commit to 03master [+0/-0/±4] 13https://git.io/Jvlpr
[19:06:03] [02miraheze/puppet] 07paladox 03030dec2 - Switch gluster to lizardfs6 gluster for mw[4567] (#1242) * Update mw4.yaml * Update mw5.yaml * Update mw6.yaml * Update mw7.yaml
[19:10:05] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[19:13:54] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 2 datacenters are down: 81.4.109.133/cpweb, 51.161.32.127/cpweb
[19:14:01] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 51.161.32.127/cpweb
[19:15:59] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[19:17:48] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[19:19:50] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 2 backends are down. mw4 mw7
[19:19:59] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 1 datacenter is down: 2400:6180:0:d0::403:f001/cpweb
[19:20:01] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 2 backends are down. mw6 mw7
[19:20:08] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 2 backends are down. mw6 mw7
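Pull request #1242 above points mw[4567] at the gluster volume exported by lizardfs6. On the MediaWiki side that shared storage usually only surfaces as the upload paths; the sketch below is a hedged illustration where the /mnt/mediawiki-static mount point is taken from the Puppet failure logged later at 20:45, while the per-wiki directory layout and the static URL are assumptions rather than Miraheze's real values.

```php
<?php
// Illustration under stated assumptions: MediaWiki only needs to know where
// the shared storage is mounted; which backend serves /mnt/mediawiki-static
// (the GlusterFS volume on lizardfs6 here) is handled by Puppet on mw[4567].
// Assumes $wgDBname is already set for the current wiki, as in a farm config.
$wgUploadDirectory = "/mnt/mediawiki-static/{$wgDBname}";       // per-wiki subdirectory is an assumption
$wgUploadPath      = "https://static.miraheze.org/{$wgDBname}"; // public URL layout is an assumption
```

The point of keeping the mount path stable is that the storage backend can be swapped (as in PR #1242) without touching the MediaWiki configuration at all.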
[19:21:58] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[19:28:03] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 3 datacenters are down: 128.199.139.216/cpweb, 51.161.32.127/cpweb, 2607:5300:205:200::17f6/cpweb
[19:28:41] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 1 datacenter is down: 51.161.32.127/cpweb
[19:30:01] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[19:33:58] PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 4 datacenters are down: 128.199.139.216/cpweb, 81.4.109.133/cpweb, 2a00:d880:5:8ea::ebc7/cpweb, 2607:5300:205:200::17f6/cpweb
[19:35:49] RECOVERY - cp4 Varnish Backends on cp4 is OK: All 10 backends are healthy
[19:35:58] RECOVERY - cp8 Varnish Backends on cp8 is OK: All 10 backends are healthy
[19:36:07] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 9 backends are healthy
[19:36:25] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[19:37:35] For what it's worth, I'm gone now. Haven't gotten enough rest the past 48 hours.
[19:38:32] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-4 [+0/-0/±1] 13https://git.io/Jvlh0
[19:38:33] [02miraheze/mw-config] 07paladox 036a0c777 - Update site notice
[19:38:35] [02mw-config] 07paladox created branch 03paladox-patch-4 - 13https://git.io/vbvb3
[19:38:36] [02mw-config] 07paladox opened pull request 03#2884: Update site notice - 13https://git.io/JvlhE
[19:38:48] SPF|Cloud: go rest.
[19:38:57] [02mw-config] 07paladox synchronize pull request 03#2884: Update site notice - 13https://git.io/JvlhE
[19:38:58] [02miraheze/mw-config] 07paladox pushed 031 commit to 03paladox-patch-4 [+0/-0/±1] 13https://git.io/Jvlhu
[19:39:00] [02miraheze/mw-config] 07paladox 037ec5dda - Update Sitenotice.php
[19:40:23] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 4 datacenters are down: 128.199.139.216/cpweb, 81.4.109.133/cpweb, 2a00:d880:5:8ea::ebc7/cpweb, 2607:5300:205:200::17f6/cpweb
[19:41:59] RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
[19:42:32] [ANNOUNCEMENT] Important: If you edited a Miraheze wiki over the last 24 hours, please read the instructions at https://phabricator.miraheze.org/maniphest/task/edit/form/15/ - Apologies for the disruption.
[19:43:47] PROBLEM - cp4 Varnish Backends on cp4 is CRITICAL: 1 backends are down. mw4
[19:43:53] PROBLEM - cp8 Varnish Backends on cp8 is CRITICAL: 1 backends are down. mw4
[19:44:15] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[19:48:03] !log depooled mw[4567]
[19:48:06] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 1 backends are down. mw7
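Pull request #2884 above and the 19:42 announcement are the user-facing side of the maintenance. As a sketch of the mechanism only (the real Sitenotice.php wording and markup are not reproduced here), a banner like that is driven by the core $wgSiteNotice setting, which accepts wikitext and is rendered at the top of every page:

```php
<?php
// Sketch only; not the actual Miraheze Sitenotice.php.
// $wgSiteNotice is core MediaWiki and takes wikitext, so the external link
// below renders as a clickable banner on every page of every wiki.
$wgSiteNotice = 'If you edited a Miraheze wiki over the last 24 hours, please read '
    . '[https://phabricator.miraheze.org/maniphest/task/edit/form/15/ these instructions].';
```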
[19:48:10] Logged the message at https://meta.miraheze.org/wiki/Tech:Server_admin_log, Master
[19:49:44] * RhinosF1 - out
[19:50:07] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 9 backends are healthy
[20:39:38] [02mw-config] 07paladox closed pull request 03#2884: Update site notice - 13https://git.io/JvlhE
[20:39:39] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jvljy
[20:39:41] [02miraheze/mw-config] 07paladox 0342451fd - Update site notice (#2884) * Update site notice * Update Sitenotice.php
[20:39:42] [02mw-config] 07paladox deleted branch 03paladox-patch-4 - 13https://git.io/vbvb3
[20:39:44] [02miraheze/mw-config] 07paladox deleted branch 03paladox-patch-4
[20:39:45] [02miraheze/puppet] 07paladox deleted branch 03paladox-patch-10
[20:39:47] [02puppet] 07paladox deleted branch 03paladox-patch-10 - 13https://git.io/vbiAS
[20:45:40] PROBLEM - mw2 Puppet on mw2 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/mnt/mediawiki-static]
[20:47:52] A bit confusing, the current sitenotice being 'Tables', paladox, excuse me
[20:49:15] [02miraheze/mw-config] 07paladox pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jv8ef
[20:49:16] [02miraheze/mw-config] 07paladox 035447988 - Fix
[20:49:17] hispano76 ^
[20:49:38] and I haven't looked at the mobile version
[20:49:49] ok
[20:51:57] RECOVERY - mw2 Puppet on mw2 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[22:44:15] PROBLEM - cp8 Current Load on cp8 is CRITICAL: CRITICAL - load average: 1.04, 2.39, 1.55
[22:46:19] PROBLEM - cp8 Current Load on cp8 is WARNING: WARNING - load average: 0.74, 1.85, 1.46
[22:48:22] RECOVERY - cp8 Current Load on cp8 is OK: OK - load average: 0.35, 1.32, 1.31
[22:55:11] [02miraheze/mw-config] 07paladox pushed 032 commits to 03master [+0/-0/±2] 13https://git.io/Jv8Jk
[22:55:13] [02miraheze/mw-config] 07paladox 035d06486 - Revert "disable createwiki for db migration" This reverts commit fb418d3a509c377642d5d0039fc7f75b625143ed.
[22:55:14] [02miraheze/mw-config] 07paladox 03aac552a - Revert "try to change $wgLocalVirtualHosts to db6 IP" This reverts commit 13df231200946ab5dd522fdd058ab82c716ff994.
[23:14:13] PROBLEM - misc1 GDNSD Datacenters on misc1 is CRITICAL: CRITICAL - 1 datacenter is down: 2607:5300:205:200::17f6/cpweb
[23:16:09] RECOVERY - misc1 GDNSD Datacenters on misc1 is OK: OK - all datacenters are online
[23:35:07] [02miraheze/services] 07MirahezeSSLBot pushed 031 commit to 03master [+0/-0/±1] 13https://git.io/Jv8UT
[23:35:09] [02miraheze/services] 07MirahezeSSLBot 035fe7a12 - BOT: Updating services config for wikis
[23:41:27] PROBLEM - lizardfs6 MediaWiki Rendering on lizardfs6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:41:43] PROBLEM - cp3 Varnish Backends on cp3 is CRITICAL: 1 backends are down. lizardfs6
[23:43:24] RECOVERY - lizardfs6 MediaWiki Rendering on lizardfs6 is OK: HTTP OK: HTTP/1.1 200 OK - 20532 bytes in 0.262 second response time
[23:43:43] RECOVERY - cp3 Varnish Backends on cp3 is OK: All 9 backends are healthy
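The two 22:55 reverts above roll back temporary workarounds from the migration: wiki creation is re-enabled and $wgLocalVirtualHosts is restored. $wgLocalVirtualHosts is the core MediaWiki setting listing hostnames that are served by the farm itself, so internal HTTP requests to them can stay on the local network instead of going back out through the cp* edge proxies. A minimal sketch, with placeholder entries rather than the values that were actually reverted:

```php
<?php
// Minimal sketch, assuming the usual farm setup: hostnames listed here are
// treated as locally served, so MediaWiki routes internal requests to them
// directly rather than via the public edge. Entries are placeholders, not
// Miraheze's real configuration.
$wgLocalVirtualHosts = [
    'meta.miraheze.org',
    // ...one entry per hosted wiki hostname
];
```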