[00:02:37] (03CR) 10Subramanya Sastry: [C: 03+1] "I am okay with getting this in now." [puppet] - 10https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:13:04] (03PS1) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 [00:15:08] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 408.18 seconds [00:15:22] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.09 seconds [00:16:28] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [00:16:40] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [00:17:02] (03PS2) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 [00:19:02] (03PS3) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 (https://phabricator.wikimedia.org/T214068) [00:40:19] 10Operations, 10Gerrit, 10Icinga, 10monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (10Dzahn) @Paladox I think it's just different descriptions for the same thing when looking at the nrpe_command line only. So just one new check. [00:45:49] 10Operations, 10Analytics, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Dzahn) >>! In T212824#4922713, @elukey wrote: > @aborrero has already done a similar thing for the tool-forge hosts. Anything that we can share with pu... [00:58:26] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:59:22] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:03:42] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:03:54] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:04:02] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:04:14] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:05:02] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:05:32] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:10:20] !log powercycle mw1299 - can't ssh nor get a tty via console - racadm getsel shows "An OEM diagnostic event occurred." 
[01:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:44] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [01:12:46] this is a jobrunner --^ [01:27:14] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [01:54:13] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) Also note that even the old LRU algorithm... [02:06:48] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:07:00] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:07:18] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:08] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:26] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:38] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:30:02] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:57:26] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [03:32:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29598880 and 1 seconds [03:34:16] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 81200 and 33 seconds [03:35:44] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:53:10] PROBLEM - Apache HTTP on mw1344 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 8.349 second response time [03:54:20] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.060 second response time [03:58:56] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:02:12] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:20:25] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) >>! In T203786#4922942, @aaron wrote: > A... 
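Context for the mw1299 powercycle logged at 01:10 above: when a host stops answering SSH and no tty is available on the console, the remaining path is the out-of-band management controller. A minimal sketch of the kind of Dell iDRAC session involved, assuming racadm access over the management network; the hostname below is illustrative rather than taken from this log:

    # connect to the host's management controller (hostname illustrative)
    ssh root@mw1299.mgmt.eqiad.wmnet
    # dump the System Event Log; this is where "An OEM diagnostic event
    # occurred." was seen before the powercycle
    racadm getsel
    # force a power cycle once the OS is confirmed unreachable
    racadm serveraction powercycle

The !log line afterwards records the action in the Server Admin Log that the bot links.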
[04:26:24] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [04:31:12] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:40] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:44] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:44] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:32:36] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:40] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:42] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:46] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:32:46] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:32:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:00] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:02] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:06] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:10] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:14] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:33:14] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:16] hello dbstore1002 [04:33:18] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:32] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:34] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:38] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:44] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:48] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:48] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:34:03] mysql crashed, sigh [04:35:02] Not good [04:35:07] it is going through recovery, it will take a bit [04:35:22] the host is sadly old and keeps crashing [04:36:57] it is not a big deal for the moment, it is mainly used by analytics users [04:41:44] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:42:16] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:42:42] 
PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:42] (03PS2) 10Legoktm: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:46:57] (03CR) 10Legoktm: [C: 03+2] "cherry-picked to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:48:02] (03Merged) 10jenkins-bot: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:49:19] ummmm [04:49:25] legoktm@deploy1001:/srv/mediawiki-staging/wmf-config$ git log HEAD...origin/master --oneline [04:49:25] 53b3920d4 Removed WikibaseQuality from extensions-list [04:49:25] 154863b1e Revert "mariadb: Depool db1114" [04:49:25] 5576980c9 Document why ActiveAbstract is loaded in this way [04:49:25] 94c9ace0e In LocalSettings.php use a relative path to CommonSettings.php [04:51:02] ok, the bottom 2 are not MW changes [04:51:53] (03CR) 10jenkins-bot: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:53:12] somehow 154863b1e was already deployed [04:53:16] OK, all good. [04:55:20] !log legoktm@deploy1001 Synchronized wmf-config/extension-list: Remove WikibaseQuality from extensions-list (T208499) (duration: 00m 51s) [04:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:23] T208499: Stop branching & deploying WikibaseQuality extension - https://phabricator.wikimedia.org/T208499 [05:13:14] (they are boarding my flight, dbstore1002 will be handled by SRE later on, no big deal for the moment) [05:24:30] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [05:24:30] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4111.08 seconds [05:24:30] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [05:24:42] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 102506.30 seconds [05:24:44] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [05:24:48] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3297.12 seconds [05:25:00] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4256.91 seconds [05:25:08] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [05:25:12] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [05:25:14] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [05:25:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4275.75 seconds [05:25:34] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4247.97 seconds [05:25:38] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4249.53 seconds [05:29:08] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
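On the dbstore1002 crash above: the host is a single multi-source MariaDB replica carrying all of the sections seen in the alerts (s1–s8, x1, m2, m3), so one mysqld crash takes every replication channel down at once. After crash recovery completes, the per-section replication threads have to be checked and restarted by hand; a rough sketch of what that looks like, assuming MariaDB multi-source replication and local shell access (the exact invocation on the host may differ):

    # one status block per named replication connection (s1 ... s8, x1, m2, m3)
    sudo mysql -e "SHOW ALL SLAVES STATUS\G"
    # restart a single section once recovery has finished, e.g. s4 ...
    sudo mysql -e "START SLAVE 's4';"
    # ... or restart every connection in one go
    sudo mysql -e "START ALL SLAVES;"

Connections whose SQL thread stops again with an error, as happens later in the day with s3 and x1, need the underlying data or storage-engine problem fixed before replication can resume; that follow-up appears further down in the log.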
[05:30:28] (03PS2) 10Zoranzoki21: Add category at wgGettingStartedExcludedCategories for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482534 [05:31:18] (03PS2) 10Zoranzoki21: Add categories for other Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [05:36:24] (03PS3) 10Zoranzoki21: Add categories for all Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [05:56:34] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [06:11:20] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:15:18] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:17:54] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:23:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:32:54] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:43:58] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:51:48] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:54:24] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:59:18] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:59:22] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:02:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:15:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:25:42] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:26:44] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [07:27:00] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:30:54] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:33:28] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:39:56] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:46:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:47:44] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:54:14] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:56:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:16:22] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:21:38] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:24:14] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:28:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:29:12] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:46] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:34:36] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:35:56] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:39:48] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:45:02] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:56:28] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [09:12:14] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:16:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:28:04] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:33:14] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:54:02] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:56:36] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:58:58] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:00:28] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:01:44] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:26:26] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [10:35:42] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:38:16] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:43:28] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:44:46] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:04:48] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Mathew.onipe) [11:13:06] 10Operations, 10Security-Team, 10LDAP: Improve LDAP logging - https://phabricator.wikimedia.org/T214489 (10Peachey88) [11:44:37] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Mathew.onipe) p:05Triage→03High [12:30:04] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:56:06] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [14:01:16] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:02:32] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:11:36] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:15:32] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:19:30] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:26:06] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:58:38] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:28:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [17:02:16] sjoerddebruin: nice spam ^^ [17:02:52] It was, and it's fixed now [17:25:51] RECOVERY - ElasticSearch shard size check on search.svc.codfw.wmnet is OK: OK - All good! [17:29:35] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:31:23] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: mediawikiwiki. [Query snipped] [17:31:23] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:35] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:37] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:43] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:51] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:53] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:55] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:55] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:55] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:01] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:05] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:11] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:11] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:17] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:27] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:30] just started all the dbstore1002 slaves --^ [17:32:35] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:49] s3's replication is still broken, fixing it [17:35:43] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000330, end_log_pos 1006016474 [17:35:59] ah and also x1 [17:36:00] lovely [17:53:42] !log start all slaves on dbstore1002 (After a crash + recovery) + moved mediawikiwiki.revision_actor_temp to Innodb to unblock s3 slave replication (still broken though) [17:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [18:06:15] (still altering tables to innodb on dbstore1002 to re-enable s3's replication) [18:56:00] !log started a tmux session on dbstore1002 to 
migrate all the tokudb tables of mediawikiwiki to InnoDB - (s3 replication broken) [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:26:36] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [19:38:50] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 1739536s 1728000s). [19:51:32] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:04] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 175.23 seconds [20:12:44] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [20:16:52] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:17:18] fixed s3 --^ [20:21:22] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:23:42] ACKNOWLEDGEMENT - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] Mathew.onipe T215120 - The acknowledgement expires at: 2019-02-04 11:22:57. https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:23:58] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:25:54] !log powercycle mw1272 - no ssh, no tty available via com2 - DIMM correctable errors + OEM errors registered in getsel [20:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:46] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:29:00] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:31:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10elukey) Another crash, had to alter all the `mediawikiwiki` database's tables to InnoDB to restart s3 replication. x1 still broken due to a missi... [20:40:44] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:42:00] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:43:40] Quick question: is CSP now enforced on private wikis? [20:44:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) You did it in the last all hands! :-) I will walk you thru it so you can fix it yourself entirely! [20:51:28] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10elukey) >>! In T213670#4923477, @Marostegui wrote: > You did it in the last all hands! 
:-) > I will walk you thru it so you can fix it yourself e... [20:56:02] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [21:05:13] (03CR) 10Gilles: [C: 03+2] Use webp -exact option on Stretch [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:05:48] (03CR) 10Gilles: [V: 03+2 C: 03+2] Use webp -exact option on Stretch [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:06:06] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 152.49 seconds [21:07:20] (03CR) 10Gilles: [V: 03+2 C: 03+2] "Gerrit isn't giving me any "submit" option???" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:25:31] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Gehel) Looking at [[ https://grafana.wikimedia.org/d/000000305/maps-performances?panelId=8&fullscreen&orgId=1 | grafana ]], it looks like no tiles were generated on Feb 2, but generation started again on Feb 3. No id... [21:50:10] (03CR) 10Gilles: [V: 03+2 C: 03+2] "@hashar any idea what's happening here? Is this gerrit repo set up incorrectly?" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:55:50] revi, looks like it's report-only on otrs-wiki [22:02:08] (03PS1) 10Paladox: Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 [22:02:17] gilles https://gerrit.wikimedia.org/r/#/c/operations/software/thumbor-plugins/+/487783/ [22:02:39] (03CR) 10Gilles: [C: 03+2] Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 (owner: 10Paladox) [22:02:45] paladox: thanks! [22:03:02] your welcome :) [22:03:08] (needs v+2 and submit) [22:03:24] (03CR) 10Gilles: [V: 03+2 C: 03+2] Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 (owner: 10Paladox) [22:16:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [22:18:04] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [22:29:42] (03PS1) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:32:32] (03PS2) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:37:12] (03PS3) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:59:24] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
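On the s3 fix logged at 17:53 and 18:56 above: the SQL thread kept stopping because TokuDB returned errno 22 (invalid argument) on a mediawikiwiki query, so the approach was to rebuild the affected tables on InnoDB, starting with revision_actor_temp and then migrating the rest of that database's TokuDB tables from a tmux session, since the alters are slow. A rough sketch of that kind of migration, assuming local shell access to the MariaDB instance; the schema and table names follow the log, the invocation details are illustrative:

    # list the tables still on TokuDB in the affected database
    sudo mysql -e "SELECT table_name FROM information_schema.tables
                   WHERE table_schema='mediawikiwiki' AND engine='TokuDB';"
    # rebuild a table on InnoDB (repeat per table, or script it from the list above)
    sudo mysql -e "ALTER TABLE mediawikiwiki.revision_actor_temp ENGINE=InnoDB;"
    # once the engine errors stop, resume that section's replication
    sudo mysql -e "START SLAVE 's3';"

The x1 breakage at 17:35 is a different failure (a row missing on the replica, HA_ERR_KEY_NOT_FOUND), so it needs row-level repair or event skipping rather than an engine change, as noted in the Phabricator update at 20:31.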
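Separately, the missing "submit" option Gilles ran into at 21:07 on software/thumbor-plugins is an access-rights question, which is why Paladox's fix is a change against refs/meta/config: Gerrit stores per-project ACLs in a project.config file on that ref. A minimal illustration of the kind of stanza that grants the rights mentioned at 22:03 ("needs v+2 and submit"); the group name is purely hypothetical and the real content of change 487783 is not visible in this log:

    [access "refs/heads/*"]
        label-Verified = -2..+2 group thumbor-plugins-maintainers
        submit = group thumbor-plugins-maintainers

In practice, "no submit button" in Gerrit usually means the user lacks the Submit permission (or a satisfying Verified vote) on the target branch.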
[23:25:57] (03PS1) 10Gehel: elasticsearch: exit the JVM on OutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/487787 (https://phabricator.wikimedia.org/T76090) [23:26:36] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [23:26:43] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) Instead of monitoring this specific error, let's just configure the JVM to restart on memory errors. [23:26:58] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) [23:27:20] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) a:03Gehel
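The elasticsearch change Gehel uploads at 23:25 follows the reasoning in the T76090 comment above: instead of teaching Icinga to spot OutOfMemoryError in the logs, have the JVM terminate on the first OOM so the existing process and systemd monitoring notices the failure. A minimal sketch of that idea, assuming a HotSpot JVM and Elasticsearch's jvm.options mechanism; the path is illustrative and the exact flag used in the puppet patch is not shown in this log:

    # e.g. in /etc/elasticsearch/jvm.options (path illustrative):
    # exit the JVM on the first OutOfMemoryError instead of limping along degraded
    -XX:+ExitOnOutOfMemoryError

With a Restart= policy on the elasticsearch systemd unit, the node then comes back automatically, and the OOM surfaces as a unit restart rather than a silently broken JVM.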