[00:10:46] !log Decommissioning restbase1006.eqiad.wmnet : T95253
[00:10:48] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[00:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:37] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 64 failures
[01:55:44] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[02:21:30] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[02:24:50] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:38] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.043 second response time
[02:27:19] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:38:19] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.320 second response time
[02:39:58] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 65950 bytes in 0.129 second response time
[02:41:19] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:36:28] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:58:28] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail
[04:01:30] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:25:28] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:48:53] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 6 failures
[05:49:50] (PS1) Glaisher: Add TranslationsUpdateJob to translate job runner group [puppet] - https://gerrit.wikimedia.org/r/283841 (https://phabricator.wikimedia.org/T53731)
[06:30:44] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:31:25] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:15] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:24] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:34] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:43] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:55] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:25] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:34] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:57:43] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:43] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:54] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:03] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:24] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:41:39] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:47:39] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:49:19] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:51:58] PROBLEM - puppet last run on mw2039 is CRITICAL: CRITICAL: puppet fail
[09:15:05] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:18:54] PROBLEM - NTP on mw1115 is CRITICAL: NTP CRITICAL: No response from NTP server
[09:19:34] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:37:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table langcomwiki.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM langcomwiki.hitcounter
[10:39:16] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:42:54] Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212755 (Southparkfan) @Andrew mw1138 is not depooled (anymore), its CPU and network graphs show it is serving traffic. Looking at http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=mw1...
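The recurring "puppet last run" PROBLEM/RECOVERY lines above come from a check that inspects each agent's most recent Puppet run. Below is a minimal sketch of that kind of check, assuming the stock last_run_summary.yaml location and PyYAML; it is not the production plugin, and the path and staleness threshold are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative stand-in for a 'puppet last run' style check: read the
agent's last-run summary and alert on failures or stale runs."""
import sys
import time

import yaml  # PyYAML

SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'  # assumed location
MAX_AGE = 2 * 3600  # seconds before a run counts as stale (assumed)


def main():
    try:
        with open(SUMMARY) as fh:
            summary = yaml.safe_load(fh) or {}
    except OSError as exc:
        print(f'CRITICAL: cannot read {SUMMARY} ({exc})')
        return 2  # Nagios exit code for CRITICAL

    failures = summary.get('events', {}).get('failure', 0)
    last_run = summary.get('time', {}).get('last_run', 0)
    age = int(time.time() - last_run)

    if failures:
        print(f'CRITICAL: Puppet has {failures} failures')
        return 2
    if age > MAX_AGE:
        print(f'CRITICAL: last run {age} seconds ago (stale)')
        return 2
    print(f'OK: last run {age} seconds ago with 0 failures')
    return 0


if __name__ == '__main__':
    sys.exit(main())
```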
[10:51:48] PROBLEM - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwikisource.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2018-bin.001861, end_log_pos 37498994
[10:52:17] PROBLEM - MariaDB Slave SQL: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2028-bin.001212, end_log_pos 740146404
[10:55:58] PROBLEM - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table fiwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2017-bin.001785, end_log_pos 264595760
[10:57:47] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.93 seconds
[10:58:27] PROBLEM - MariaDB Slave Lag: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 666.00 seconds
[11:02:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[11:02:23] PROBLEM - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 664.57 seconds
[11:07:22] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[11:16:53] RECOVERY - Disk space on mw1115 is OK: DISK OK
[11:22:53] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
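The dbstore2002 alerts above report the replica's SQL thread stopping on error 1032 (HA_ERR_KEY_NOT_FOUND: a row-based event referenced a row the replica does not have), with replication lag climbing as a consequence. A rough sketch of this style of replication check follows, assuming pymysql and placeholder host/credentials; the production checks are separate scripts.

```python
#!/usr/bin/env python3
"""Sketch of a 'MariaDB Slave SQL' / 'MariaDB Slave Lag' style check: run
SHOW SLAVE STATUS and report the SQL-thread state and replication lag."""
import sys

import pymysql

LAG_CRIT = 300.0  # seconds; the alerts above fire around this level


def check_replica(host):
    conn = pymysql.connect(host=host, user='monitor', password='...',  # placeholders
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW SLAVE STATUS')
            status = cur.fetchone()
    finally:
        conn.close()

    if not status:
        print('UNKNOWN: host is not configured as a replica')
        return 3
    if status['Slave_SQL_Running'] != 'Yes':
        # Errno 1032, as seen above, means the row the event wants to
        # change is missing on the replica (data drift).
        print('CRITICAL slave_sql_state Slave_SQL_Running: No, '
              f"Errno: {status['Last_SQL_Errno']}, Errmsg: {status['Last_SQL_Error']}")
        return 2
    lag = status['Seconds_Behind_Master']
    if lag is None or lag > LAG_CRIT:
        print(f'CRITICAL slave_sql_lag Replication lag: {lag} seconds')
        return 2
    print(f'OK slave_sql_lag Replication lag: {lag} seconds')
    return 0


if __name__ == '__main__':
    sys.exit(check_replica(sys.argv[1] if len(sys.argv) > 1 else 'localhost'))
```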
[11:25:22] RECOVERY - NTP on mw1115 is OK: NTP OK: Offset -0.2611320019 secs
[11:25:42] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[11:25:42] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up
[11:25:53] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[11:26:12] RECOVERY - DPKG on mw1115 is OK: All packages OK
[11:26:13] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212
[11:26:23] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient
[11:26:32] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm
[11:26:33] RECOVERY - Disk space on mw1115 is OK: DISK OK
[11:27:03] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[11:27:13] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:27:13] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 3 % full
[11:27:53] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.045 second response time
[11:28:44] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[11:29:02] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 65576 bytes in 0.430 second response time
[12:19:39] (PS1) Mschon: changed double quotes to single quotes, now puppet-lint runs through [puppet/kafka] - https://gerrit.wikimedia.org/r/283853
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5589.56 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5828.36 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5835.04 seconds Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s2 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table fiwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2017-bin.001785, end_log_pos 264595760 Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwikisource.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2018-bin.001861, end_log_pos 37498994 Jcrespo https://phabricator.wikimedia.org/T130128
[12:25:41] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 on dbstore2002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table frwiki.recentchanges: Cant find record in recentchanges, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2028-bin.001212, end_log_pos 740146404 Jcrespo https://phabricator.wikimedia.org/T130128
[12:29:52] huh
[12:29:57] oh ffs
[12:30:02] ?
[12:30:06] There we go.
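The burst of "PROCS OK" recoveries for mw1115 above (salt-minion, hhvm, dhclient, nutcracker) is the standard process-count style of NRPE check. The real checks use the stock check_procs plugin; the sketch below only illustrates the same idea by counting processes whose command line matches a regex while walking /proc.

```python
#!/usr/bin/env python3
"""Stand-in for a 'PROCS' style check: count processes whose command line
matches a regex by reading /proc/<pid>/cmdline."""
import os
import re
import sys


def count_matching(pattern):
    rx = re.compile(pattern)
    count = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open(f'/proc/{pid}/cmdline', 'rb') as fh:
                cmdline = fh.read().replace(b'\0', b' ').decode('utf-8', 'replace').strip()
        except OSError:
            continue  # process exited while we were scanning
        if rx.search(cmdline):
            count += 1
    return count


if __name__ == '__main__':
    # Default to the same regex the salt-minion check above reports on.
    pattern = sys.argv[1] if len(sys.argv) > 1 else r'^/usr/bin/python /usr/bin/salt-minion'
    matches = count_matching(pattern)
    print(f'PROCS OK: {matches} process(es) with regex args {pattern}')
```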
[12:35:54] (PS1) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:41:13] (CR) JanZerebecki: [C: -1] added .gitreview file (1 comment) [puppet/kafka] - https://gerrit.wikimedia.org/r/283855 (owner: Mschon)
[12:50:35] (PS2) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:50:58] (PS3) Mschon: added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855
[12:53:06] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail
[12:53:18] (CR) JanZerebecki: [C: 1] added .gitreview file [puppet/kafka] - https://gerrit.wikimedia.org/r/283855 (owner: Mschon)
[13:22:15] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:33:12] Operations, DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2212826 (jcrespo) This looks great! Upload it "as is" to dbtools or mediawiki maintenance. I would like to parametrize the servers involved to be able to use the masters for s* or other es* servers and other timefram...
[13:54:31] (PS1) Mschon: fixed puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283856
[14:18:30] (PS1) Mschon: fix puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283857
[14:49:06] (Abandoned) Mschon: changed double quotes to single quotes, now puppet-lint runs through [puppet/kafka] - https://gerrit.wikimedia.org/r/283853 (owner: Mschon)
[15:03:40] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0]
[15:33:33] (PS2) Mschon: fixed puppet-lint alignment [puppet/kafka] - https://gerrit.wikimedia.org/r/283856
[15:36:58] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[18:22:58] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.82 seconds
[18:23:36] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.10 seconds
[18:25:28] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[18:26:56] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 30.60 seconds
[18:38:35] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.79 seconds
[18:42:34] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.20 seconds
[19:25:59] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.78 seconds
[19:27:58] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.13 seconds
[19:29:17] (PS1) Mschon: added spf record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930)
[19:37:22] (CR) Merlijn van Deen: "Mails forwarded by toolserver.org do not use envelope-from, so I *think* this should not cause any issues there." [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: Mschon)
[19:47:25] (PS1) Dereckson: Enable RC patrol on ta.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/283873 (https://phabricator.wikimedia.org/T132868)
[20:03:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:03:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:09:36] (PS3) Nicko: Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:14:47] (CR) Nicko: [C: 1] Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:15:02] (CR) Nicko: Improve robustness of es-tool (2 comments) [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[20:18:22] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:19:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:21:44] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 343.16 seconds
[21:21:53] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 338.19 seconds
[21:22:03] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 353.74 seconds
[21:22:03] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 351.15 seconds
[21:23:43] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[21:23:52] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[21:23:54] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[21:24:03] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[21:36:02] PROBLEM - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused
[21:43:21] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.89 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.77 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.82 seconds
[21:43:52] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.87 seconds
[21:47:22] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused eevans Node has been decommissioned
[21:50:17] !log `systemctl mask cassandra' on restbase1006.eqiad.wmnet (node is decommissioned) : T95253
[21:50:18] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[21:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:51:28] !log Decommissioning restbase1005.eqiad.wmnet : T95253
[21:51:29] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253
[21:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
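The "HTTP 5xx reqs/min on graphite1001" alerts above fire when more than a given percentage of recent datapoints in a Graphite series sit above a threshold. Below is a sketch of that style of check; the Graphite host and metric name are placeholders (not the production targets), it uses the requests library, and it is not the plugin actually deployed.

```python
#!/usr/bin/env python3
"""Sketch of a Graphite threshold check: fetch a series over the render API
and go CRITICAL when too many recent datapoints exceed a threshold."""
import sys

import requests

GRAPHITE = 'http://graphite.example.org'  # placeholder host
TARGET = 'example.reqstats.5xx.rate'      # placeholder metric
CRIT_VALUE = 1000.0                       # reqs/min, as in the alert text
CRIT_PCT = 20.0                           # % of datapoints allowed above it


def main():
    resp = requests.get(f'{GRAPHITE}/render',
                        params={'target': TARGET, 'from': '-10min', 'format': 'json'},
                        timeout=10)
    resp.raise_for_status()
    series = resp.json()
    if not series:
        print('UNKNOWN: no series returned')
        return 3
    values = [v for v, _ts in series[0]['datapoints'] if v is not None]
    if not values:
        print('UNKNOWN: no datapoints')
        return 3
    above = 100.0 * sum(v > CRIT_VALUE for v in values) / len(values)
    if above >= CRIT_PCT:
        print(f'CRITICAL: {above:.2f}% of data above the critical threshold [{CRIT_VALUE}]')
        return 2
    print(f'OK: Less than {CRIT_PCT:.2f}% above the threshold [{CRIT_VALUE}]')
    return 0


if __name__ == '__main__':
    sys.exit(main())
```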
[21:53:32] PROBLEM - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:55:02] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.12 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 347.23 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.82 seconds
[21:55:33] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 355.40 seconds
[21:57:02] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.03 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 6.61 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 6.69 seconds
[21:57:33] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 6.80 seconds
[22:22:40] PROBLEM - puppet last run on restbase1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:45:22] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.49 seconds
[22:47:02] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 350.12 seconds
[22:47:22] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 363.76 seconds
[22:47:22] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.81 seconds
[22:54:52] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.47 seconds
[22:55:12] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[22:55:12] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.52 seconds
[22:56:09] Operations, Performance-Team, Wikimedia-General-or-Unknown: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#2213242 (aaron) Active memory still shows the sawtooth pattern, not sure if it's better or not...
[22:57:11] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.48 seconds
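The restbase1006 alerts above are the expected fallout of the decommission: `systemctl mask cassandra` symlinks the unit to /dev/null so it cannot be started, nothing listens on the CQL port, and the port check reports "Connection refused" (hence the acknowledgement). A minimal TCP probe in the spirit of the 'cassandra CQL 10.64.48.100:9042' check is sketched below; it is only an illustration, not the production plugin, with the host and port from the log used as defaults.

```python
#!/usr/bin/env python3
"""Minimal TCP port probe: CRITICAL when the connection is refused or times
out, OK when the port accepts a connection."""
import socket
import sys


def check_port(host, port, timeout=3.0):
    try:
        sock = socket.create_connection((host, port), timeout)
    except OSError as exc:
        print(f'CRITICAL: {exc}')  # e.g. "Connection refused"
        return 2
    sock.close()
    print(f'OK: TCP connection to {host}:{port} succeeded')
    return 0


if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else '10.64.48.100'
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 9042
    sys.exit(check_port(host, port))
```

Run as, for example, `./check_tcp_sketch.py 10.64.48.100 9042`; the script name is hypothetical.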