[00:02:37] (03CR) 10Subramanya Sastry: [C: 03+1] "I am okay with getting this in now." [puppet] - 10https://gerrit.wikimedia.org/r/486423 (https://phabricator.wikimedia.org/T201366) (owner: 10Dzahn) [00:13:04] (03PS1) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 [00:15:08] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 408.18 seconds [00:15:22] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.09 seconds [00:16:28] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [00:16:40] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.05 seconds [00:17:02] (03PS2) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 [00:19:02] (03PS3) 10Paladox: gerrit: Set zuulUrl for plugin zuul-status [puppet] - 10https://gerrit.wikimedia.org/r/487619 (https://phabricator.wikimedia.org/T214068) [00:40:19] 10Operations, 10Gerrit, 10Icinga, 10monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (10Dzahn) @Paladox I think it's just different descriptions for the same thing when looking at the nrpe_command line only. So just one new check. [00:45:49] 10Operations, 10Analytics, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Dzahn) >>! In T212824#4922713, @elukey wrote: > @aborrero has already done a similar thing for the tool-forge hosts. Anything that we can share with pu... [00:58:26] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:59:22] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:03:42] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:03:54] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:04:02] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:04:14] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={GET,LIST,PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:05:02] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:05:32] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [01:10:20] !log powercycle mw1299 - can't ssh nor get a tty via console - racadm getsel shows "An OEM diagnostic event occurred." 
[01:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:44] RECOVERY - Host mw1299 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [01:12:46] this is a jobrunner --^ [01:27:14] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [01:54:13] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) Also note that even the old LRU algorithm... [02:06:48] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:07:00] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:07:18] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:08] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:26] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:08:38] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [02:30:02] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:57:26] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [03:32:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29598880 and 1 seconds [03:34:16] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 81200 and 33 seconds [03:35:44] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:53:10] PROBLEM - Apache HTTP on mw1344 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 8.349 second response time [03:54:20] RECOVERY - Apache HTTP on mw1344 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.060 second response time [03:58:56] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:02:12] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:20:25] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) >>! In T203786#4922942, @aaron wrote: > A... 
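Context for the mw1299 powercycle logged at 01:10 above: when a host stops answering SSH and no tty is available on the console, the remaining path is the out-of-band management controller. A minimal sketch of the kind of Dell iDRAC session involved, assuming racadm access over the management network; the hostname below is illustrative rather than taken from this log:

    # connect to the host's management controller (hostname illustrative)
    ssh root@mw1299.mgmt.eqiad.wmnet
    # dump the System Event Log; this is where "An OEM diagnostic event
    # occurred." was seen before the powercycle
    racadm getsel
    # force a power cycle once the OS is confirmed unreachable
    racadm serveraction powercycle

The !log line afterwards records the action in the Server Admin Log that the bot links.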
[04:26:24] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [04:31:12] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:40] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:44] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:31:44] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:32:36] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:40] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:42] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:46] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:32:46] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:32:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:32:58] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:00] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:02] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:06] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:10] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:14] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:33:14] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:16] hello dbstore1002 [04:33:18] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:32] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:34] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:38] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:33:44] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:48] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [04:33:48] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [04:34:03] mysql crashed, sigh [04:35:02] Not good [04:35:07] it is going through recovery, it will take a bit [04:35:22] the host is sadly old and keeps crashing [04:36:57] it is not a big deal for the moment, it is mainly used by analytics users [04:41:44] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:42:16] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:42:42] 
PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:42] (03PS2) 10Legoktm: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:46:57] (03CR) 10Legoktm: [C: 03+2] "cherry-picked to master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:48:02] (03Merged) 10jenkins-bot: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:49:19] ummmm [04:49:25] legoktm@deploy1001:/srv/mediawiki-staging/wmf-config$ git log HEAD...origin/master --oneline [04:49:25] 53b3920d4 Removed WikibaseQuality from extensions-list [04:49:25] 154863b1e Revert "mariadb: Depool db1114" [04:49:25] 5576980c9 Document why ActiveAbstract is loaded in this way [04:49:25] 94c9ace0e In LocalSettings.php use a relative path to CommonSettings.php [04:51:02] ok, the bottom 2 are not MW changes [04:51:53] (03CR) 10jenkins-bot: Removed WikibaseQuality from extensions-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486509 (https://phabricator.wikimedia.org/T208499) (owner: 10Zoranzoki21) [04:53:12] somehow 154863b1e was already deployed [04:53:16] OK, all good. [04:55:20] !log legoktm@deploy1001 Synchronized wmf-config/extension-list: Remove WikibaseQuality from extensions-list (T208499) (duration: 00m 51s) [04:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:23] T208499: Stop branching & deploying WikibaseQuality extension - https://phabricator.wikimedia.org/T208499 [05:13:14] (they are boarding my flight, dbstore1002 will be handled by SRE later on, no big deal for the moment) [05:24:30] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [05:24:30] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4111.08 seconds [05:24:30] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [05:24:42] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 102506.30 seconds [05:24:44] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [05:24:48] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3297.12 seconds [05:25:00] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4256.91 seconds [05:25:08] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [05:25:12] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [05:25:14] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [05:25:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4275.75 seconds [05:25:34] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4247.97 seconds [05:25:38] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4249.53 seconds [05:29:08] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
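On the dbstore1002 crash above: the host is a single multi-source MariaDB replica carrying all of the sections seen in the alerts (s1–s8, x1, m2, m3), so one mysqld crash takes every replication channel down at once. After crash recovery completes, the per-section replication threads have to be checked and restarted by hand; a rough sketch of what that looks like, assuming MariaDB multi-source replication and local shell access (the exact invocation on the host may differ):

    # one status block per named replication connection (s1 ... s8, x1, m2, m3)
    sudo mysql -e "SHOW ALL SLAVES STATUS\G"
    # restart a single section once recovery has finished, e.g. s4 ...
    sudo mysql -e "START SLAVE 's4';"
    # ... or restart every connection in one go
    sudo mysql -e "START ALL SLAVES;"

Connections whose SQL thread stops again with an error, as happens later in the day with s3 and x1, need the underlying data or storage-engine problem fixed before replication can resume; that follow-up appears further down in the log.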
[05:30:28] (03PS2) 10Zoranzoki21: Add category at wgGettingStartedExcludedCategories for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482534 [05:31:18] (03PS2) 10Zoranzoki21: Add categories for other Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [05:36:24] (03PS3) 10Zoranzoki21: Add categories for all Croatian projects at wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482548 [05:56:34] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [06:11:20] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:15:18] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:17:54] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:23:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:32:54] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R] [06:43:58] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:51:48] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:54:24] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [06:59:18] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:59:22] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:02:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:15:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:25:42] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:26:44] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [07:27:00] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:30:54] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:33:28] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:39:56] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:46:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:47:44] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:54:14] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [07:56:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:16:22] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:21:38] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:24:14] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:28:10] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] 
https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:29:12] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:46] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:34:36] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:35:56] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:39:48] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:45:02] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [08:56:28] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [09:12:14] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:16:12] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:28:04] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:33:14] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:54:02] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:56:36] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [09:58:58] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:00:28] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:01:44] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:26:26] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [10:35:42] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:38:16] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:43:28] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [10:44:46] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [11:04:48] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Mathew.onipe) [11:13:06] 10Operations, 10Security-Team, 10LDAP: Improve LDAP logging - https://phabricator.wikimedia.org/T214489 (10Peachey88) [11:44:37] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Mathew.onipe) p:05Triage→03High [12:30:04] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:56:06] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [14:01:16] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:02:32] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:11:36] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:15:32] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:19:30] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:26:06] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:58:38] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:28:49] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [17:02:16] sjoerddebruin: nice spam ^^ [17:02:52] It was, and it's fixed now [17:25:51] RECOVERY - ElasticSearch shard size check on search.svc.codfw.wmnet is OK: OK - All good! [17:29:35] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:31:23] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1030, Errmsg: Error Got error 22 Invalid argument from storage engine TokuDB on query. Default database: mediawikiwiki. [Query snipped] [17:31:23] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:35] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:37] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:43] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:51] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:53] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:31:55] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:55] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:31:55] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:01] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:05] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:11] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:11] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:17] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:27] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [17:32:30] just started all the dbstore1002 slaves --^ [17:32:35] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:32:49] s3's replication is still broken, fixing it [17:35:43] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1069-bin.000330, end_log_pos 1006016474 [17:35:59] ah and also x1 [17:36:00] lovely [17:53:42] !log start all slaves on dbstore1002 (After a crash + recovery) + moved mediawikiwiki.revision_actor_temp to Innodb to unblock s3 slave replication (still broken though) [17:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:57] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [18:06:15] (still altering tables to innodb on dbstore1002 to re-enable s3's replication) [18:56:00] !log started a tmux session on dbstore1002 to 
migrate all the tokudb tables of mediawikiwiki to InnoDB - (s3 replication broken) [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:26] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:26:36] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [19:38:50] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: otto PID: 26051, 1739536s 1728000s). [19:51:32] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:04] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 175.23 seconds [20:12:44] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [20:16:52] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:17:18] fixed s3 --^ [20:21:22] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:23:42] ACKNOWLEDGEMENT - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] Mathew.onipe T215120 - The acknowledgement expires at: 2019-02-04 11:22:57. https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:23:58] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:25:54] !log powercycle mw1272 - no ssh, no tty available via com2 - DIMM correctable errors + OEM errors registered in getsel [20:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:46] RECOVERY - Host mw1272 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [20:29:00] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:31:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10elukey) Another crash, had to alter all the `mediawikiwiki` database's tables to InnoDB to restart s3 replication. x1 still broken due to a missi... [20:40:44] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:42:00] RECOVERY - Maps tiles generation on icinga1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:43:40] Quick question: is CSP now enforced on private wikis? [20:44:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) You did it in the last all hands! :-) I will walk you thru it so you can fix it yourself entirely! [20:51:28] 10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10elukey) >>! In T213670#4923477, @Marostegui wrote: > You did it in the last all hands! 
:-) > I will walk you thru it so you can fix it yourself e... [20:56:02] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [21:05:13] (03CR) 10Gilles: [C: 03+2] Use webp -exact option on Stretch [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:05:48] (03CR) 10Gilles: [V: 03+2 C: 03+2] Use webp -exact option on Stretch [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:06:06] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 152.49 seconds [21:07:20] (03CR) 10Gilles: [V: 03+2 C: 03+2] "Gerrit isn't giving me any "submit" option???" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:25:31] 10Operations, 10Maps: Map tile generation error - https://phabricator.wikimedia.org/T215120 (10Gehel) Looking at [[ https://grafana.wikimedia.org/d/000000305/maps-performances?panelId=8&fullscreen&orgId=1 | grafana ]], it looks like no tiles were generated on Feb 2, but generation started again on Feb 3. No id... [21:50:10] (03CR) 10Gilles: [V: 03+2 C: 03+2] "@hashar any idea what's happening here? Is this gerrit repo set up incorrectly?" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) (owner: 10Gilles) [21:55:50] revi, looks like it's report-only on otrs-wiki [22:02:08] (03PS1) 10Paladox: Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 [22:02:17] gilles https://gerrit.wikimedia.org/r/#/c/operations/software/thumbor-plugins/+/487783/ [22:02:39] (03CR) 10Gilles: [C: 03+2] Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 (owner: 10Paladox) [22:02:45] paladox: thanks! [22:03:02] your welcome :) [22:03:08] (needs v+2 and submit) [22:03:24] (03CR) 10Gilles: [V: 03+2 C: 03+2] Modify access rules [software/thumbor-plugins] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/487783 (owner: 10Paladox) [22:16:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [22:18:04] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [22:29:42] (03PS1) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:32:32] (03PS2) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:37:12] (03PS3) 10Gilles: Fix PNG transparency for more cases [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/487785 (https://phabricator.wikimedia.org/T198370) [22:59:24] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
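On the s3 fix logged at 17:53 and 18:56 above: the SQL thread kept stopping because TokuDB returned errno 22 (invalid argument) on a mediawikiwiki query, so the approach was to rebuild the affected tables on InnoDB, starting with revision_actor_temp and then migrating the rest of that database's TokuDB tables from a tmux session, since the alters are slow. A rough sketch of that kind of migration, assuming local shell access to the MariaDB instance; the schema and table names follow the log, the invocation details are illustrative:

    # list the tables still on TokuDB in the affected database
    sudo mysql -e "SELECT table_name FROM information_schema.tables
                   WHERE table_schema='mediawikiwiki' AND engine='TokuDB';"
    # rebuild a table on InnoDB (repeat per table, or script it from the list above)
    sudo mysql -e "ALTER TABLE mediawikiwiki.revision_actor_temp ENGINE=InnoDB;"
    # once the engine errors stop, resume that section's replication
    sudo mysql -e "START SLAVE 's3';"

The x1 breakage at 17:35 is a different failure (a row missing on the replica, HA_ERR_KEY_NOT_FOUND), so it needs row-level repair or event skipping rather than an engine change, as noted in the Phabricator update at 20:31.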
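Separately, the missing "submit" option Gilles ran into at 21:07 on software/thumbor-plugins is an access-rights question, which is why Paladox's fix is a change against refs/meta/config: Gerrit stores per-project ACLs in a project.config file on that ref. A minimal illustration of the kind of stanza that grants the rights mentioned at 22:03 ("needs v+2 and submit"); the group name is purely hypothetical and the real content of change 487783 is not visible in this log:

    [access "refs/heads/*"]
        label-Verified = -2..+2 group thumbor-plugins-maintainers
        submit = group thumbor-plugins-maintainers

In practice, "no submit button" in Gerrit usually means the user lacks the Submit permission (or a satisfying Verified vote) on the target branch.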
[23:25:57] (03PS1) 10Gehel: elasticsearch: exit the JVM on OutOfMemoryError [puppet] - 10https://gerrit.wikimedia.org/r/487787 (https://phabricator.wikimedia.org/T76090) [23:26:36] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [23:26:43] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) Instead of monitoring this specific error, let's just configure the JVM to restart on memory errors. [23:26:58] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) [23:27:20] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10monitoring, and 2 others: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090 (10Gehel) a:03Gehel
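The elasticsearch change Gehel uploads at 23:25 follows the reasoning in the T76090 comment above: instead of teaching Icinga to spot OutOfMemoryError in the logs, have the JVM terminate on the first OOM so the existing process and systemd monitoring notices the failure. A minimal sketch of that idea, assuming a HotSpot JVM and Elasticsearch's jvm.options mechanism; the path is illustrative and the exact flag used in the puppet patch is not shown in this log:

    # e.g. in /etc/elasticsearch/jvm.options (path illustrative):
    # exit the JVM on the first OutOfMemoryError instead of limping along degraded
    -XX:+ExitOnOutOfMemoryError

With a Restart= policy on the elasticsearch systemd unit, the node then comes back automatically, and the OOM surfaces as a unit restart rather than a silently broken JVM.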