[00:01:46] 10Operations, 10ops-eqiad, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10mobrovac) All of the instances have joined the ring (thnx @fgiunchedi!) and the latest version of RESTBase is in place, so we are good. There is one problem... [00:13:58] (03PS1) 10Bstorm: toolforge: update the version of php-cgi to 7.2 as well [puppet] - 10https://gerrit.wikimedia.org/r/485343 (https://phabricator.wikimedia.org/T213666) [00:15:21] (03CR) 10Bstorm: [C: 03+2] toolforge: update the version of php-cgi to 7.2 as well [puppet] - 10https://gerrit.wikimedia.org/r/485343 (https://phabricator.wikimedia.org/T213666) (owner: 10Bstorm) [01:25:07] (03CR) 10Krinkle: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/485106 (owner: 10Dzahn) [05:18:11] !log legoktm@deploy1001 Synchronized php-1.33.0-wmf.13/extensions/JsonConfig/includes/JCCache.php: Revert "JCCache: Explicit load the main slot to avoid API warnings" - T214179 (duration: 00m 58s) [05:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:15] T214179: mw.ext.data.get Lua call returns false - https://phabricator.wikimedia.org/T214179 [05:25:18] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@af21320]: bump discovery analytics to latest [05:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:35] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@af21320]: bump discovery analytics to latest (duration: 00m 17s) [05:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:03] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step [05:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:17] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step (duration: 00m 13s) [05:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:02] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step [05:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:18] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step (duration: 00m 15s) [05:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:32] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step [05:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:46] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@af21320]: test swapping venv build to scap fetch/script step (duration: 00m 14s) [05:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:44] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2972.64 seconds [06:46:58] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3506.21 seconds [06:47:08] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 64129.49 seconds [06:47:16] PROBLEM - MariaDB Slave 
Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 178588.03 seconds [06:47:18] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table frwiki.echo_notification: Cant find record in echo_notification, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000314, end_log_pos 800908454 [06:47:26] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5744.24 seconds [06:47:30] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7771.10 seconds [07:13:50] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:14:54] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [07:24:54] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:28:55] (03Abandoned) 10MGChecker: Reduce Codesniffer exclusions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467104 (owner: 10MGChecker) [07:33:24] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [07:36:44] !log restart pdfrender on scb1004 [07:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:12] !log Fixing dbstore1002 x1 replication T213670 [08:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:15] T213670: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 [08:45:14] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:58:08] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.16 seconds [08:58:14] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 997.00 seconds [08:58:20] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 961.38 seconds [09:02:24] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:24] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:24] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:24] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:28] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:30] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:38] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:42] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:42] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:44] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:44] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:50] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:52] PROBLEM - MariaDB Slave IO: s1 on 
dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:54] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:03:00] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:03:02] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:03:08] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:03:12] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:03:14] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:03:18] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:03:34] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:03:34] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:06:20] elukey: I think it has now crashed because of the alter? [09:11:30] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:11:36] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:10] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:10] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:10] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:10] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:10] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:14] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:14] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:14] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:16] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:18] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:22] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:22] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:24] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:26] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:26] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:26] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:38] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:46] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:46] PROBLEM - MariaDB Slave IO: s4 
on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:50] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:50] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:50] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:50] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:31:54] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:31:56] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:31:56] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:32:04] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:32:04] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [09:32:08] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:32:08] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:32:10] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [09:32:10] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [09:42:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) [09:42:26] marostegui: :( [09:43:37] I am tailing /srv/sqldata/dbstore1002.err, I am seeing the recovery steps [09:43:40] sigh [09:44:09] yeah, let's not alter it anymore [09:44:17] check the update on the task (the alters task) [09:45:16] yep I saw it, makes sense [09:45:27] what a nightmare :( [09:45:32] the main problem is if people will keep writing to dbstore1002's staging [09:46:08] yeah, what I suggested for Monday is just a PoC [09:46:13] To see if it works fine [09:46:17] is mysql still bootstrapping? [09:46:20] yes [09:46:24] it will take a while [09:46:53] Once we are ready to fully migrate staging users from dbstore1002, we can do the final mysqldump+alter on the final host [09:47:21] Which still reminds me that we need to decide where you want to place the current staging db [09:47:29] we as in analytics :) [09:48:13] in any of the dbstores [09:48:16] no preference [09:48:42] yeah, but on which section? [09:49:01] https://phabricator.wikimedia.org/T210478 [09:49:25] I thought it was on a separate db not belonging to any section [09:49:27] no? [09:49:33] maybe I am still missing some stuff [09:51:20] anyway, Manuel and I have to run an errand, and mysql is bootstrapping [09:51:45] I should be back in a couple of hours at most to see if everything is ok and slaves can be restarted [09:52:13] there's not much that we can do now :( [09:52:27] will update this chan later on! (unless anybody else beats me :P) [10:03:46] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [10:18:12] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [10:18:44] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [10:18:46] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [10:18:58] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [10:19:02] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [10:19:16] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [10:29:52] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:06:50] back [12:07:37] ok so mysql on dbstore1002 seems running fine [12:08:04] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:06] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:06] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:14] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:16] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:18] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:19] !log run 'start all slaves' on dbstore1002 after crash [12:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:24] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:26] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:32] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:34] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:08:34] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:36] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:56] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:08:56] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:09:02] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:09:04] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [12:09:06] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:11:44] will recheck later :) [12:34:06] elukey: :) [12:34:43] !log pool maps1003 - stretch migration is complete T198622 [12:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:46] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [12:36:01] elukey: x1 replication is broken ,I will check how many rows are missing and if I can fix it quickly or we should just reimport it [12:39:48] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:16:40] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:44] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 
81018 bytes in 0.149 second response time [13:18:00] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.04 seconds [13:18:04] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.41 seconds [13:18:10] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.49 seconds [13:18:12] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.63 seconds [13:18:14] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.82 seconds [13:18:24] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.65 seconds [13:18:42] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.62 seconds [13:18:42] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.27 seconds [13:18:58] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.36 seconds [13:21:07] marostegui: sounds like you fixed it right?? [13:22:09] elukey: I have had to fix lots and lots of rows [13:22:23] It is still catching up, down from 60k seconds to 8k seconds [13:24:17] another one failed [13:25:45] :( [13:28:22] elukey: x1 caught up [13:28:46] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [13:31:02] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 268.50 seconds [13:34:59] niceee [13:37:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) After all the crashes, MySQL was able to start at around 10:18:11 (UTC). @elukey start replication on all slaves at around 12:07:56 (UTC). x1 replication w... 
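The recovery summarized in the Phabricator comment above (mysqld on dbstore1002 coming back around 10:18, "start all slaves" at 12:08, and the repair of the x1 channel that had stopped on error 1032, HA_ERR_KEY_NOT_FOUND, on frwiki.echo_notification) is driven by MariaDB's multi-source replication commands. The exact statements used are not recorded in this log; a minimal sketch of that kind of session, assuming a standard MariaDB 10.x multi-source setup with named connections (s1-s8, x1, m2, m3), would be:

    -- Illustrative only: check every replication connection after the crash.
    SHOW ALL SLAVES STATUS\G

    -- Restart everything, as logged at 12:08 ("start all slaves").
    START ALL SLAVES;

    -- Inspect and restart a single broken connection such as x1 by name.
    SHOW SLAVE 'x1' STATUS\G
    STOP SLAVE 'x1';
    -- Error 1032 (HA_ERR_KEY_NOT_FOUND) means the row being updated is missing on the
    -- replica, so the durable fix is to repair the data first. Only as a last resort,
    -- and only on a non-authoritative copy, skip the offending event:
    SET @@default_master_connection = 'x1';
    SET GLOBAL sql_slave_skip_counter = 1;
    START SLAVE 'x1';

Per the chat above, the actual x1 fix was repairing the missing rows by hand ("I have had to fix lots and lots of rows"); the skip-counter step is shown only as the last-resort alternative, since it silently drops the replicated event.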
[13:57:22] RECOVERY - MariaDB Slave Lag: s7 on db2077 is OK: OK slave_sql_lag Replication lag: 1.16 seconds [13:57:26] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:57:32] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:57:36] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:57:38] RECOVERY - MariaDB Slave Lag: s7 on db2047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:57:46] RECOVERY - MariaDB Slave Lag: s7 on db2086 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:58:04] RECOVERY - MariaDB Slave Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:58:04] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:58:20] RECOVERY - MariaDB Slave Lag: s7 on db2087 is OK: OK slave_sql_lag Replication lag: 0.01 seconds [14:01:42] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 291.00 seconds [14:36:48] (03PS5) 10Giuseppe Lavagetto: Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [14:39:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix the logic of the FSM to account for the fact we allow pulling [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485219 (owner: 10Giuseppe Lavagetto) [14:40:00] (03CR) 10jenkins-bot: Fix the logic of the FSM to account for the fact we allow pulling [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/485219 (owner: 10Giuseppe Lavagetto) [14:40:27] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [14:40:58] (03CR) 10jenkins-bot: Log docker build output [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475779 (owner: 10Hashar) [14:55:27] (03PS1) 10Marostegui: [WIP] dbstore_multiinstance: Add stanging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 [15:01:20] (03PS2) 10Marostegui: [WIP] dbstore_multiinstance: Add stanging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 [15:06:46] (03PS3) 10Marostegui: [WIP] dbstore_multiinstance: Add stanging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 [15:09:03] (03PS4) 10Marostegui: [WIP] dbstore_multiinstance: Add stanging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 [15:10:14] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/14398/" [puppet] - 10https://gerrit.wikimedia.org/r/485367 (owner: 10Marostegui) [15:14:40] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 341.18 seconds [15:14:42] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 341.94 seconds [15:14:46] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.48 seconds [15:14:48] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.12 seconds [15:15:04] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.16 seconds [15:15:30] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.60 seconds [15:15:32] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.36 seconds [15:15:46] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: 
CRITICAL slave_sql_lag Replication lag: 365.52 seconds [15:34:12] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.66 seconds [15:34:20] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.20 seconds [15:34:20] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.28 seconds [15:34:24] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.12 seconds [15:34:28] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.65 seconds [15:34:44] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.10 seconds [15:53:14] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.04 seconds [16:03:14] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.01 seconds [16:13:54] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 56.29 seconds [16:13:56] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 56.18 seconds [16:14:14] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 59.90 seconds [16:15:45] actor migration? ^^ [16:15:52] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 24.03 seconds [16:15:54] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 24.26 seconds [16:16:08] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 28.30 seconds [16:16:16] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 21.12 seconds [16:16:16] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 18.37 seconds [16:19:46] PROBLEM - puppet last run on kafka-jumbo1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:32] (03PS5) 10ArielGlenn: do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [16:51:06] RECOVERY - puppet last run on kafka-jumbo1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:05:51] Hauskatze: yes [17:06:08] marostegui: do you have a minute?
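The morning discussion about where the analytics staging schema should live once it leaves dbstore1002 (T210478), together with the dbstore_multiinstance "Add staging db" change (gerrit 485367) iterated through patch sets above and below, amounts to provisioning a writable staging database on one of the multi-instance dbstore hosts. The real provisioning is done via that Puppet change; purely as an illustration of the intended end state on whichever instance is picked, and with an invented account name, host pattern and grant list, the SQL equivalent would be roughly:

    -- Hypothetical sketch; the account, host pattern and grants below are invented.
    CREATE DATABASE IF NOT EXISTS staging;

    -- A writable account for analysts, limited to the staging schema.
    CREATE USER IF NOT EXISTS 'staging_user'@'10.%' IDENTIFIED BY '<password>';
    GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, INDEX
        ON staging.* TO 'staging_user'@'10.%';

Which section's instance ends up hosting it is exactly the open question in the chat ("yeah, but on which section?") and in T210478.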
[17:08:41] (03PS5) 10Marostegui: dbstore_multiinstance: Add stanging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 (https://phabricator.wikimedia.org/T210478) [17:11:57] Hauskatze: Yes :) [17:12:04] (03PS6) 10Marostegui: dbstore_multiinstance: Add staging db [puppet] - 10https://gerrit.wikimedia.org/r/485367 (https://phabricator.wikimedia.org/T210478) [18:02:00] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.55 seconds [18:02:02] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.29 seconds [18:02:12] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.70 seconds [18:02:20] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.11 seconds [18:02:22] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.69 seconds [18:02:32] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.40 seconds [18:02:32] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.51 seconds [18:02:50] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.37 seconds [18:51:12] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 266.61 seconds [19:31:22] (03PS3) 10GTirloni: labstore - Allow multiple bdsync jobs per host [puppet] - 10https://gerrit.wikimedia.org/r/485200 (https://phabricator.wikimedia.org/T209527) [19:33:55] (03CR) 10GTirloni: [C: 03+2] labstore - Allow multiple bdsync jobs per host [puppet] - 10https://gerrit.wikimedia.org/r/485200 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [19:39:06] (03PS1) 10BryanDavis: toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) [19:39:55] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) (owner: 10BryanDavis) [19:42:09] (03PS2) 10BryanDavis: toolforge: Prometheus replacement for sge.py diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/485372 (https://phabricator.wikimedia.org/T211684) [20:34:26] !log upgraded and rebooted labstore200{3,4} [20:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:16] (03PS1) 10GTirloni: wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) [20:39:46] (03CR) 10jerkins-bot: [V: 04-1] wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [20:42:38] (03PS4) 10ArielGlenn: option to skip siteinfo header, mw footer for recompressing files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/442774 (https://phabricator.wikimedia.org/T213200) [20:42:40] (03PS4) 10ArielGlenn: options for writeuptopageid to skip writing header or footer [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/442775 (https://phabricator.wikimedia.org/T213200) [20:42:42] (03PS2) 10ArielGlenn: version 0.0.9 [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/482861 (https://phabricator.wikimedia.org/T213200) [20:44:45] 10Operations,
10Wikimedia-Mailing-lists: lost administrator login password for Wikies-l mail list - https://phabricator.wikimedia.org/T214249 (10JorgeGG) [20:45:22] (03Abandoned) 10ArielGlenn: fix up iohandlers to write separate streams for header and footer again [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/485240 (owner: 10ArielGlenn) [20:47:39] (03PS2) 10GTirloni: wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) [20:47:48] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] move iohandler code for compression/decompression out to a separate file [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/441484 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [20:49:26] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] use iohandlers for recompressxml input and output [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/441485 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [20:50:25] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] option to skip siteinfo header, mw footer for recompressing files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/442774 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [20:51:30] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] options for writeuptopageid to skip writing header or footer [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/442775 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [20:52:22] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] version 0.0.9 [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/482861 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [20:55:00] (03PS3) 10GTirloni: wmcs::nfs::misc - Backup for misc server (cloudstore1008) [puppet] - 10https://gerrit.wikimedia.org/r/485375 (https://phabricator.wikimedia.org/T209527) [21:09:12] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] version 0.0.9 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/483077 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [21:17:34] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.68 seconds [21:17:42] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.79 seconds [21:17:48] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.65 seconds [21:17:48] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 335.71 seconds [21:17:54] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.83 seconds [21:17:56] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 340.39 seconds [21:18:10] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 347.08 seconds [21:18:26] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 352.09 seconds [21:18:46] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.73 seconds [21:26:13] (03PS2) 10ArielGlenn: specify output file explicitly for recompress dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/482870 (https://phabricator.wikimedia.org/T213200) [21:26:15] (03PS10) 10ArielGlenn: write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) [21:26:17] (03PS6) 10ArielGlenn: do multistream dumps in 
parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) [21:57:49] (03CR) 10ArielGlenn: [C: 03+2] specify output file explicitly for recompress dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/482870 (https://phabricator.wikimedia.org/T213200) (owner: 10ArielGlenn) [22:05:39] (03CR) 10ArielGlenn: [C: 03+2] write header/body/footer of xml gz files as separate streams [dumps] - 10https://gerrit.wikimedia.org/r/484505 (https://phabricator.wikimedia.org/T182572) (owner: 10ArielGlenn) [22:10:47] (03CR) 10ArielGlenn: [C: 03+2] do multistream dumps in parallel and recombine for big wikis [dumps] - 10https://gerrit.wikimedia.org/r/484754 (https://phabricator.wikimedia.org/T213912) (owner: 10ArielGlenn) [22:12:22] !log ariel@deploy1001 Started deploy [dumps/dumps@ab79bbb]: multistream dumps in parallel, recombine gz and multistream without decompression [22:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:26] !log ariel@deploy1001 Finished deploy [dumps/dumps@ab79bbb]: multistream dumps in parallel, recombine gz and multistream without decompression (duration: 00m 03s) [22:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:24] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [22:30:42] (03PS1) 10ArielGlenn: dumps: recombine multiple page content multistream files, if produced [puppet] - 10https://gerrit.wikimedia.org/r/485477 (https://phabricator.wikimedia.org/T213912) [22:34:27] tfw you just pushed a bunch of stuff late on a Saturday and suddenly your screen is full of PROBLEM... and then you realize a) it's known b) it's unrelated :-) :-) [22:41:26] (03CR) 10ArielGlenn: [C: 03+2] dumps: recombine multiple page content multistream files, if produced [puppet] - 10https://gerrit.wikimedia.org/r/485477 (https://phabricator.wikimedia.org/T213912) (owner: 10ArielGlenn) [22:42:58] 10Operations, 10Wikimedia-Mailing-lists: Reset list admin password for Wikies-l mailing list - https://phabricator.wikimedia.org/T214249 (10Peachey88) [22:50:25] all set for tomorrow's xml/sql dump run now. Which, by my clock, is actually later today! [23:09:49] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:54] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time [23:25:26] PROBLEM - WDQS HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [23:35:13] ACKNOWLEDGEMENT - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% GTirloni Stuck after reboot