[00:10:46] !log Decommissioning restbase1006.eqiad.wmnet : T95253
[00:10:48] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[00:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:37] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 64 failures
[01:55:44] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[02:21:30] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[02:24:50] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:38] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time
[02:27:19] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:38:19] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.320 second response time
[02:39:58] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65950 bytes in 0.129 second response time
[02:41:19] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:36:28] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:58:28] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail
[04:01:30] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:25:28] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:48:53] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 6 failures
[05:49:50] (PS1) Glaisher: Add TranslationsUpdateJob to translate job runner group [puppet] - https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731)
[06:30:44] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:31:25] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:15] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:34] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:43] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:55] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:25] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:43] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:43] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:54] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:39] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:47:39] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:49:19] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:51:58] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: puppet fail
[09:15:05] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:18:54] PROBLEM - NTP on mw1115 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:19:34] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:37:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table langcomwiki.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM langcomwiki.hitcounter
[10:39:16] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:42:54] Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212755 (Southparkfan) @Andrew mw1138 is not depooled (anymore), its CPU and network graphs show it is serving traffic. Looking at http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=mw1...
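The recurring "puppet last run" PROBLEM/RECOVERY lines above come from a check that inspects each agent's most recent Puppet run. Below is a minimal sketch of that kind of check, assuming the stock last_run_summary.yaml location and PyYAML; it is not the production plugin, and the path and staleness threshold are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative stand-in for a 'puppet last run' style check: read the
agent's last-run summary and alert on failures or stale runs."""
import sys
import time

import yaml  # PyYAML

SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'  # assumed location
MAX_AGE = 2 * 3600  # seconds before a run counts as stale (assumed)


def main():
    try:
        with open(SUMMARY) as fh:
            summary = yaml.safe_load(fh) or {}
    except OSError as exc:
        print(f'CRITICAL: cannot read {SUMMARY} ({exc})')
        return 2  # Nagios exit code for CRITICAL

    failures = summary.get('events', {}).get('failure', 0)
    last_run = summary.get('time', {}).get('last_run', 0)
    age = int(time.time() - last_run)

    if failures:
        print(f'CRITICAL: Puppet has {failures} failures')
        return 2
    if age > MAX_AGE:
        print(f'CRITICAL: last run {age} seconds ago (stale)')
        return 2
    print(f'OK: last run {age} seconds ago with 0 failures')
    return 0


if __name__ == '__main__':
    sys.exit(main())
```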
[10:51:48] PROBLEM - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwikisource.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2018-bin.001861, end_log_pos 37498994
[10:52:17] PROBLEM - MariaDB Slave SQL: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2028-bin.001212, end_log_pos 740146404
[10:55:58] PROBLEM - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table fiwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2017-bin.001785, end_log_pos 264595760
[10:57:47] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.93 seconds
[10:58:27] PROBLEM - MariaDB Slave Lag: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 666.00 seconds
[11:02:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[11:02:23] PROBLEM - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 664.57 seconds
[11:07:22] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[11:16:53] RECOVERY - Disk space on mw1115 is OK: DISK OK
[11:22:53] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
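The dbstore2002 alerts above report the replica's SQL thread stopping on error 1032 (HA_ERR_KEY_NOT_FOUND: a row-based event referenced a row the replica does not have), with replication lag climbing as a consequence. A rough sketch of this style of replication check follows, assuming pymysql and placeholder host/credentials; the production checks are separate scripts.

```python
#!/usr/bin/env python3
"""Sketch of a 'MariaDB Slave SQL' / 'MariaDB Slave Lag' style check: run
SHOW SLAVE STATUS and report the SQL-thread state and replication lag."""
import sys

import pymysql

LAG_CRIT = 300.0  # seconds; the alerts above fire around this level


def check_replica(host):
    conn = pymysql.connect(host=host, user='monitor', password='...',  # placeholders
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW SLAVE STATUS')
            status = cur.fetchone()
    finally:
        conn.close()

    if not status:
        print('UNKNOWN: host is not configured as a replica')
        return 3
    if status['Slave_SQL_Running'] != 'Yes':
        # Errno 1032, as seen above, means the row the event wants to
        # change is missing on the replica (data drift).
        print('CRITICAL slave_sql_state Slave_SQL_Running: No, '
              f"Errno: {status['Last_SQL_Errno']}, Errmsg: {status['Last_SQL_Error']}")
        return 2
    lag = status['Seconds_Behind_Master']
    if lag is None or lag > LAG_CRIT:
        print(f'CRITICAL slave_sql_lag Replication lag: {lag} seconds')
        return 2
    print(f'OK slave_sql_lag Replication lag: {lag} seconds')
    return 0


if __name__ == '__main__':
    sys.exit(check_replica(sys.argv[1] if len(sys.argv) > 1 else 'localhost'))
```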
[11:25:22] RECOVERY - NTP on mw1115 is OK: NTP OK: Offset -0.2611320019 secs
[11:25:42] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[11:25:42] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up
[11:25:53] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[11:26:12] RECOVERY - DPKG on mw1115 is OK: All packages OK
[11:26:13] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212
[11:26:23] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient
[11:26:32] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm
[11:26:33] RECOVERY - Disk space on mw1115 is OK: DISK OK
[11:27:03] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[11:27:13] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:27:13] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 3 % full
[11:27:53] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.045 second response time
[11:28:44] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[11:29:02] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65576 bytes in 0.430 second response time
[12:19:39] (PS1) Mschon: changed double quotes to single quotes, now puppet-lint runs through [puppet/kafka] - https://gerrit.wikimedia.org/r/283853
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5589.56 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5828.36 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5835.04 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table fiwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2017-bin.001785, end_log_pos 264595760 Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwikisource.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2018-bin.001861, end_log_pos 37498994 Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2028-bin.001212, end_log_pos 740146404 Jcrespo https://phabricator.wikimedia.org/T130128
[12:29:52] huh
[12:29:57] oh ffs
[12:30:02] ?
[12:30:06] There we go.
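The burst of "PROCS OK" recoveries for mw1115 above (salt-minion, hhvm, dhclient, nutcracker) is the standard process-count style of NRPE check. The real checks use the stock check_procs plugin; the sketch below only illustrates the same idea by counting processes whose command line matches a regex while walking /proc.

```python
#!/usr/bin/env python3
"""Stand-in for a 'PROCS' style check: count processes whose command line
matches a regex by reading /proc/<pid>/cmdline."""
import os
import re
import sys


def count_matching(pattern):
    rx = re.compile(pattern)
    count = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open(f'/proc/{pid}/cmdline', 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ').decode('utf-8', 'replace').strip()
        except OSError:
            continue  # process exited while we were scanning
        if rx.search(cmdline):
            count += 1
    return count


if __name__ == '__main__':
    # Default to the same regex the salt-minion check above reports on.
    pattern = sys.argv[1] if len(sys.argv) > 1 else r'^/usr/bin/python /usr/bin/salt-minion'
    matches = count_matching(pattern)
    print(f'PROCS OK: {matches} process(es) with regex args {pattern}')
```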
[12:35:54] (PS1) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:41:13] (CR) JanZerebecki: [C: -1] added .gitreview file (1 comment) [puppet/kafka] - https://gerrit.wikimedia.org/r/283855 (owner: Mschon)
[12:50:35] (PS2) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:50:58] (PS3) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:53:06] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail
[12:53:18] (CR) JanZerebecki: [C: 1] added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855 (owner: Mschon)
[13:22:15] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:33:12] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2212826 (jcrespo) This looks great! Upload it "as is" to dbtools or mediawiki maintenance. I would like to parametrize the servers involved to be able to use the masters for s* or other es* servers and other timefram...
[13:54:31] (PS1) Mschon: fixed puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283856
[14:18:30] (PS1) Mschon: fix puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283857
[14:49:06] (Abandoned) Mschon: changed double quotes to single quotes, now puppet-lint runs through [puppet/kafka] - https://gerrit.wikimedia.org/r/283853 (owner: Mschon)
[15:03:40] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0]
[15:33:33] (PS2) Mschon: fixed puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283856
[15:36:58] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[18:22:58] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.82 seconds
[18:23:36] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.10 seconds
[18:25:28] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[18:26:56] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 30.60 seconds
[18:38:35] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.79 seconds
[18:42:34] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.20 seconds
[19:25:59] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.78 seconds
[19:27:58] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
[19:29:17] (PS1) Mschon: added spf record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930)
[19:37:22] (CR) Merlijn van Deen: "Mails forwarded by toolserver.org do not use envelope-from, so I *think* this should not cause any issues there." [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: Mschon)
[19:47:25] (PS1) Dereckson: Enable RC patrol on ta.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868)
[20:03:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:03:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:09:36] (PS3) Nicko: Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:14:47] (CR) Nicko: [C: 1] Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:15:02] (CR) Nicko: Improve robustness of es-tool (2 comments) [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:18:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:19:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:21:44] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.16 seconds
[21:21:53] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 338.19 seconds
[21:22:03] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.74 seconds
[21:22:03] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 351.15 seconds
[21:23:43] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[21:23:52] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[21:23:54] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[21:24:03] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[21:36:02] PROBLEM - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused
[21:43:21] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.89 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.77 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.82 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.87 seconds
[21:47:22] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused eevans Node has been decommissioned
[21:50:17] !log `systemctl mask cassandra' on restbase1006.eqiad.wmnet (node is decommissioned) : T95253
[21:50:18] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[21:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:51:28] !log Decommissioning restbase1005.eqiad.wmnet : T95253
[21:51:29] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[21:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
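The "HTTP 5xx reqs/min on graphite1001" alerts above fire when more than a given percentage of recent datapoints in a Graphite series sit above a threshold. Below is a sketch of that style of check; the Graphite host and metric name are placeholders (not the production targets), it uses the requests library, and it is not the plugin actually deployed.

```python
#!/usr/bin/env python3
"""Sketch of a Graphite threshold check: fetch a series over the render API
and go CRITICAL when too many recent datapoints exceed a threshold."""
import sys

import requests

GRAPHITE = 'http://graphite.example.org'  # placeholder host
TARGET = 'example.reqstats.5xx.rate'      # placeholder metric
CRIT_VALUE = 1000.0                       # reqs/min, as in the alert text
CRIT_PCT = 20.0                           # % of datapoints allowed above it


def main():
    resp = requests.get(f'{GRAPHITE}/render',
                        params={'target': TARGET, 'from': '-10min', 'format': 'json'},
                        timeout=10)
    resp.raise_for_status()
    series = resp.json()
    if not series:
        print('UNKNOWN: no series returned')
        return 3
    values = [v for v, _ts in series[0]['datapoints'] if v is not None]
    if not values:
        print('UNKNOWN: no datapoints')
        return 3
    above = 100.0 * sum(v > CRIT_VALUE for v in values) / len(values)
    if above >= CRIT_PCT:
        print(f'CRITICAL: {above:.2f}% of data above the critical threshold [{CRIT_VALUE}]')
        return 2
    print(f'OK: Less than {CRIT_PCT:.2f}% above the threshold [{CRIT_VALUE}]')
    return 0


if __name__ == '__main__':
    sys.exit(main())
```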
[21:53:32] PROBLEM - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:55:02] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.12 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 347.23 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.82 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 355.40 seconds
[21:57:02] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 6.61 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 6.69 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 6.80 seconds
[22:22:40] PROBLEM - puppet last run on restbase1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:45:22] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.49 seconds
[22:47:02] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.12 seconds
[22:47:22] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.76 seconds
[22:47:22] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.81 seconds
[22:54:52] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
[22:55:12] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[22:55:12] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.52 seconds
[22:56:09] Operations, Performance-Team, Wikimedia-General-or-Unknown: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#2213242 (aaron) Active memory still shows the sawtooth pattern, not sure if it's better or not...
[22:57:11] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
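The restbase1006 alerts above are the expected fallout of the decommission: `systemctl mask cassandra` symlinks the unit to /dev/null so it cannot be started, nothing listens on the CQL port, and the port check reports "Connection refused" (hence the acknowledgement). A minimal TCP probe in the spirit of the 'cassandra CQL 10.64.48.100:9042' check is sketched below; it is only an illustration, not the production plugin, with the host and port from the log used as defaults.

```python
#!/usr/bin/env python3
"""Minimal TCP port probe: CRITICAL when the connection is refused or times
out, OK when the port accepts a connection."""
import socket
import sys


def check_port(host, port, timeout=3.0):
    try:
        sock = socket.create_connection((host, port), timeout)
    except OSError as exc:
        print(f'CRITICAL: {exc}')  # e.g. "Connection refused"
        return 2
    sock.close()
    print(f'OK: TCP connection to {host}:{port} succeeded')
    return 0


if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else '10.64.48.100'
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 9042
    sys.exit(check_port(host, port))
```

Run as, for example, `./check_tcp_sketch.py 10.64.48.100 9042`; the script name is hypothetical.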