[00:14:50] (03PS3) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 [00:14:56] legoktm ^^ works now! [00:17:39] and i just learned that you can use many of the polygerrit components. [00:21:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:21:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:27:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:28:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:37] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1014.49 seconds [00:30:49] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1031.93 seconds [00:31:39] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:47] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:03] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:33:19] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:33:45] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:45] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:53] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:59] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could 
not connect [00:34:01] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:01] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:17] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:31] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:31] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:35] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:39] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:39] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:39] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:45] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:45] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:45] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:43:02] (03PS4) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 [00:43:46] (03PS5) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [00:43:47] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:43:49] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:43:51] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:55:50] (03PS6) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [00:56:07] (03PS7) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [01:13:11] PROBLEM - Disk space on elastic1017 is CRITICAL: 
DISK CRITICAL - free space: /srv 27447 MB (5% inode=99%) [01:21:49] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4068.57 seconds [01:21:49] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [01:21:51] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4089.65 seconds [01:21:59] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:21:59] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3992.56 seconds [01:21:59] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [01:21:59] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:22:01] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4053.47 seconds [01:22:01] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3060.47 seconds [01:22:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4108.03 seconds [01:22:23] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [01:22:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [01:22:55] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4096.79 seconds [01:35:17] RECOVERY - Disk space on elastic1017 is OK: DISK OK [01:52:11] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:11] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:11] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:26] I am fixing dbstore1002 [01:52:31] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:43] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:43] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:09] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:13] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:53:13] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:15] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:25] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:25] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:53:25] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:55:58] 
10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) And after 4 days trying to alter `mep_word_persistence` dbstore1002 crashed again (T213706#4917915) [01:56:01] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:05:05] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 184.21 seconds [02:14:23] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.12 seconds [02:16:31] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 86.43 seconds [02:22:01] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 276.03 seconds [02:22:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:22:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:34:51] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 47.61 seconds [02:44:21] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 134.17 seconds [03:03:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:03:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:10:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:10:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:22:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:24:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:39:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:40:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:48:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:53:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:32:33] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Dzahn) >>! In T197624#4916933, @Aklapper wrote: > @Dzahn: Which exact type of notifications are you referring to? If it's from Phab itself: `@Phabricator_maintenance` activity itself shou... [06:05:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:16:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:29:11] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:29:33] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:38:41] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [07:00:59] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:58:54] (03PS1) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 [08:08:30] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! 
In T203786#4916848, @elukey wrote: > @... [08:28:41] 10Operations, 10Cloud-VPS, 10Discovery-Search, 10cloud-services-team: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10Mathew.onipe) [08:29:11] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10Discovery-Search (Current work): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10Mathew.onipe) [08:30:44] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Mathew.onipe) [08:44:38] !log stop, upgrade and restart db2090 [08:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:17] !log stop, upgrade and restart db2051, this will cause some lag on s4-codfw [08:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:50] !log stop, upgrade and restart db2089 (s5/s6) [09:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:22] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10jcrespo) db2089 failed once when rebooted into `4.9.0-8-amd64`, worked a second time. Worried because it maybe a random thing? [09:56:28] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests: New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10Ladsgroup) [09:56:45] 10Operations, 10ORES, 10Scoring-platform-team: Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (10Ladsgroup) [10:03:38] !log stop, upgrade and restart db2052, this will cause some lag on s5-codfw [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:48] !log stop, upgrade and restart db2039, this will cause some lag on s6-codfw [10:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:54] !log stop, upgrade and restart db2079 [10:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: 10Ladsgroup) [11:20:29] !log stop, upgrade and restart db2045, this will cause some lag on s8-codfw [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:39:03] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:44:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:44:21] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:52:35] (03PS1) 10Mathew.onipe: cloudelastic: Add cloudelastic 
configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [11:53:07] (03CR) 10jerkins-bot: [V: 04-1] cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:05:10] (03PS2) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [13:53:58] !log stop, upgrade and restart db2069 [13:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:41] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:06:11] PROBLEM - Nginx local proxy to videoscaler on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.009 second response time [14:06:25] PROBLEM - Nginx local proxy to jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [14:06:59] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 271 bytes in 0.081 second response time [14:07:29] RECOVERY - Nginx local proxy to videoscaler on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.008 second response time [14:07:43] RECOVERY - Nginx local proxy to jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.010 second response time [14:13:49] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[puppet] [14:13:52] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) I had a chat with Moritz about this and he was not too sure it would be a kernel thing itself as in something really wrong with the kernel or maybe some sort of hardware thing or... [14:15:37] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10jcrespo) db2085 and db2089 likely come from the same batch, and no other batch showed those issues, so it may be happening only on those hosts. [14:27:49] !log stop, upgrade and restart db2034, this will cause some lag on x1-codfw [14:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:51] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [14:32:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [14:38:55] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [14:41:31] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [14:43:40] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10MoritzMuehlenhoff) We could narrow this down further by enabling debug flags for the initrd, I don't remember the specific options off the top of my head, but we can look into this next...
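A pattern worth noting in the log above: each "!log stop, upgrade and restart db20xx ... this will cause some lag on sN-codfw" entry is followed a while later by a batch of "MariaDB Slave Lag" CRITICALs and then recoveries as the replica catches up. For readers unfamiliar with what those checks measure, the sketch below shows the core of such a lag probe, reading Seconds_Behind_Master from SHOW SLAVE STATUS. It is only an illustration, not the Icinga plugin actually used here; pymysql, the host name, the credentials and the thresholds are all assumptions.

```python
"""Minimal replication-lag probe in the spirit of the "MariaDB Slave Lag"
checks above. Illustrative sketch only: not the plugin actually deployed,
and the host, credentials and thresholds are placeholders."""
from typing import Optional
import sys

import pymysql

WARN_SECONDS = 60
CRIT_SECONDS = 300


def replica_lag(host: str, user: str, password: str) -> Optional[float]:
    """Return Seconds_Behind_Master, or None if the host is not a replica
    (or the replication threads are stopped and lag is unknown)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if not status:
        return None  # empty result set: the host is not replicating
    lag = status.get("Seconds_Behind_Master")
    return None if lag is None else float(lag)


if __name__ == "__main__":
    lag = replica_lag("db2088.codfw.wmnet", "monitoring_user", "secret")
    if lag is None:
        print("UNKNOWN: not a replica or replication threads stopped")
        sys.exit(3)
    state = "OK" if lag < WARN_SECONDS else "WARNING" if lag < CRIT_SECONDS else "CRITICAL"
    print(f"{state} slave_sql_lag Replication lag: {lag:.2f} seconds")
    sys.exit({"OK": 0, "WARNING": 1, "CRITICAL": 2}[state])
```

A host that is not replicating returns an empty SHOW SLAVE STATUS result, which is why some of the recoveries above simply read "not a slave".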
[14:55:50] Jan 30 14:38:22 proton1002 puppet-agent[20704]: (/Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]) Could not evaluate: Cannot allocate memory - fork(2) [14:55:57] memory issues on proton1002 [14:56:45] ^ akosiaris mobrovac [15:01:29] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:17:22] I'm guessing some memory issues [15:23:45] I think that happened last week [15:23:53] There's a task for it somewhere [15:24:27] !log stop, upgrade and restart db2042 [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:35] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [15:32:51] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [15:52:43] !log stop, upgrade and restart db2037 [15:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.85 seconds [16:16:59] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.26 seconds [16:17:07] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.31 seconds [16:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.49 seconds [16:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.52 seconds [16:17:15] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.56 seconds [16:17:19] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.16 seconds [16:17:49] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.96 seconds [16:17:55] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.56 seconds [16:17:55] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.75 seconds [16:18:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10aborrero) There is a netbox entry for this host: https://netbox.wikimedia.org/dcim/devices/1638/ CC T214499 We may want to delete? I'm not sure about... [16:19:05] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [16:19:25] jynus: ok noted.
I see OOM has already shown up cleaning chromium 4 times in the last couple of hours [16:19:42] mobrovac: ^ [16:19:48] !log restart proton on proton1002 [16:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:19] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:20:35] let's see if it recovers and stays stable with the restart [16:24:28] I see the pattern shifted to proton1002 so it's traffic based [16:24:40] some requests trigger that [16:24:42] mobrovac: ^ [16:25:05] I'll have a closer look after moving to the allhands space [16:42:31] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:00:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:03] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:55] (03PS8) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [17:28:44] !log restart pdfrender on scb1003, scb1004 [17:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] !log deactivate/activate cr2-esams:xe-0/1/3 [17:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:41] (03CR) 10Urbanecm: "Is there consensus for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [17:42:35] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:43:25] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) [17:43:42] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) p:05Triage→03High [17:46:13] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) [17:49:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.012 second response time [17:49:23] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [18:03:19] !log reducing innodb consistency options for db2048 T188327 [18:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:23] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [18:10:25] I'm going to deploy one security update for ores [18:11:10] (03CR) 10Gehel: [C: 04-1] "See comments inline. Some of those will require more discussion."
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [18:11:17] Amir1: Quick deploy while everyone is distracted by victoria's speech!! :P [18:11:36] akosiaris said he's around :D [18:12:15] revid to rollback 9253bebd358a6afa6fd70cce03548fa464559bcb [18:12:41] bawolff_: yeah I am around [18:12:42] !log ladsgroup@deploy1001 Started deploy [ores/deploy@ad160b0]: (no justification provided) [18:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:44] :P [18:12:49] Me too :) [18:13:15] I think quite a few people in this crowd are on laptops [18:13:29] judging by the room yeah [18:13:34] I've seen these slides a few times ^_^ [18:17:01] Best part of this meeting is everyone important is named Br[yi][ao]n, so I can pretend they are just talking about me! [18:18:09] Being part of the security team gives you lots of importance IMO, it's super fancy :D [18:18:31] you fight the bad guys while we break things :P [18:25:28] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@ad160b0]: (no justification provided) (duration: 12m 46s) [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:47] Amir1, I don't see requests being handled in CODFW yet [18:28:24] it seems change prop is down in codfw [18:28:24] Never mind! [18:28:32] Oh yeah. That's what I was looking at. [18:28:53] the reason precache in eqiad is non-zero is because of mediawiki precaching [18:29:05] we need to talk to services [18:29:45] Gotcha. [18:30:44] I'm struggling to get grafana to load right now. [18:31:55] PROBLEM - graphite.wikimedia.org on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [18:32:27] ^ That explains things. [18:32:32] :( [18:32:37] So, is ORES OK? [18:32:47] Grafana is back! [18:33:13] RECOVERY - graphite.wikimedia.org on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.006 second response time [18:33:36] Eqiad precache continues, but at 1/3rd the rate. [18:35:20] Looks like we can finally confirm our celery queue logging. There's data now because we had a brief overload when rebooting celery workers! [18:35:25] akosiaris, ^ [18:36:03] halfak: ah nice! good to know [18:37:19] So I'm kind of concerned about the status of ORES. I can't explain the graphs using just precaching being down. [18:38:30] We were at 2.7k requests/min, then something happened at 17:57 and that dropped to 2k [18:38:41] Oh! Looks like something came back to life. [18:39:10] At 18:36, we went back up to 2.7k requests/min in eqiad. [18:39:41] CODFW looks good too. Did change prop get kicked? [18:40:36] Either way, I feel better about the status post deploy. I'm going to file a task to look into what happened here because it doesn't look quite right. [18:41:35] FWIW, there's no blip in "external scores" which is really what our users see.
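The exchange above ends with the observation that "external scores", what end users actually see, showed no blip during the ORES security deploy. A quick way to spot-check that from the outside is to ask the public scoring API for a score; the sketch below does that for the "damaging" model. It is illustrative only: the revision ID is a placeholder and the response layout reflects the ORES v3 API as generally documented, not anything verified in this log.

```python
"""Spot-check of the public ORES scoring API, in the spirit of the
post-deploy sanity checks discussed above. Illustrative sketch only:
the revision ID is a placeholder and the response layout is an assumption
based on the public v3 API."""
import requests

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def damaging_probability(wiki: str, revid: int, timeout: float = 10.0) -> float:
    """Return P(damaging) for one revision from the public ORES v3 API."""
    resp = requests.get(
        f"{ORES_BASE}/{wiki}/",
        params={"models": "damaging", "revids": revid},
        timeout=timeout,
    )
    resp.raise_for_status()
    data = resp.json()
    # v3 responses are keyed by wiki -> "scores" -> revid -> model.
    score = data[wiki]["scores"][str(revid)]["damaging"]["score"]
    return score["probability"]["true"]


if __name__ == "__main__":
    # Placeholder revision ID; any recent enwiki revision would do.
    print(f"P(damaging) = {damaging_probability('enwiki', 123456789):.3f}")
```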
[18:42:33] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 10.17 seconds [18:42:35] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 11.28 seconds [18:42:43] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:43:15] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [18:43:17] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [18:43:19] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:43:37] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [18:43:39] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:44:35] (03PS1) 10MarcoAurelio: Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 [18:45:08] (03CR) 10jerkins-bot: [V: 04-1] Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 (owner: 10MarcoAurelio) [18:47:03] (03PS2) 10MarcoAurelio: Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) [18:59:53] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:31:15] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [19:33:03] (03CR) 10Alexandros Kosiaris: "> However for memory I think we should have the segments in the metric name instead of labels, the reason being that in choosing metric na" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [19:35:45] (03PS3) 10Alexandros Kosiaris: mathoid: Update prometheus-stats.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 [19:37:51] (03CR) 10Alexandros Kosiaris: "I 've moved the "segment" (aka "total" heap, "used" heap, "rss") part I introduced in the previous patchset to the name of the metrics. Th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [19:41:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) I submitted the cable error finding to HPE and will see if they can send me new cables. When they came to replace all the parts they sent t... 
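The mathoid prometheus-stats review quoted above is about whether memory "segments" (total heap, used heap, RSS) belong in the metric name or in a label. The sketch below shows both shapes using the Python prometheus_client, purely to make the trade-off concrete; the real change is a service configuration in deployment-charts, and the metric names here are invented for illustration.

```python
"""Illustration of the naming question in the mathoid prometheus-stats
review above: memory segments as separate metric names versus as values
of a label. The metric names are made up for this example."""
from prometheus_client import Gauge

# Shape described in the review comment: one metric per segment, so each
# name is a single well-defined quantity and aggregations never mix
# unlike things.
heap_total_bytes = Gauge("service_heap_total_bytes", "Total heap size in bytes")
heap_used_bytes = Gauge("service_heap_used_bytes", "Used heap size in bytes")
rss_bytes = Gauge("service_rss_bytes", "Resident set size in bytes")
heap_total_bytes.set(128 * 1024 * 1024)
heap_used_bytes.set(64 * 1024 * 1024)
rss_bytes.set(200 * 1024 * 1024)

# The alternative shape from the earlier patch set: one metric with a
# "segment" label. Convenient to query, but sum()/avg() across the label
# can silently combine quantities that should not be added together.
memory_bytes = Gauge("service_memory_bytes", "Process memory in bytes", ["segment"])
memory_bytes.labels(segment="heap_total").set(128 * 1024 * 1024)
memory_bytes.labels(segment="heap_used").set(64 * 1024 * 1024)
memory_bytes.labels(segment="rss").set(200 * 1024 * 1024)
```

Putting the segment into the name keeps each time series a single quantity, which is the direction the review comment describes moving toward.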
[19:44:57] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @faidon, I have a spare battery for cloudvirt1020 and will look at this when I return so you can compare the 2 [19:51:41] 10Operations, 10Discovery: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Gehel) [19:52:19] 10Operations, 10Discovery, 10Discovery-Search: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Gehel) [19:53:22] 10Operations, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Gehel) [20:03:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:09] (03PS2) 10D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [20:25:48] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (10Cmjohnson) This is a HP server, while the f/w can probably be updated remotely it would be best if I did the update on-site with the service pack and can update everything else at the same time. [20:28:37] !log reset 2FA@wikitech for [[User:deigo]] [20:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:24] Whoops, i actually did diego, not deigo [20:38:51] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:44:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:53:39] PROBLEM - Backup of s7 in eqiad on db1115 is CRITICAL: Backup for s7 at eqiad taken more than 8 days ago: Most recent backup 2019-01-22 20:45:17 [21:07:27] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10Smalyshev) [21:55:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) @jcrespo I can update f/w if you need but you will need to depool the host again [21:57:01] 10Operations, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Gehel) a:03EBernhardson [21:58:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10Cmjohnson) @aborrero can this server be re-installed.....there is a risk that removing /dev/sda will kill the OS. 
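Several of the tickets above concern degraded RAID arrays and SMART/disk errors (cloudvirt1019, ms-be1020, dbstore1002, cloudelastic1004). As background on what such a disk-health probe boils down to, here is a minimal sketch that shells out to smartctl. It assumes smartmontools is installed, uses a placeholder device path, and is not the monitoring check actually deployed; hosts behind hardware RAID controllers usually need controller-specific -d options as well.

```python
"""Minimal SMART health probe, in the spirit of the degraded-RAID and
"SMART/disk error" tickets above. Sketch only: it shells out to smartctl
(smartmontools must be installed) and the device path is a placeholder."""
import subprocess
import sys

DEVICE = "/dev/sda"  # placeholder; adjust per host / RAID controller


def smart_healthy(device: str) -> bool:
    # `smartctl -H` prints an overall-health self-assessment; a non-zero
    # exit status or a failing verdict in the output means trouble.
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return result.returncode == 0 and "FAILED" not in result.stdout


if __name__ == "__main__":
    if smart_healthy(DEVICE):
        print(f"OK: {DEVICE} passed the SMART overall-health self-assessment")
        sys.exit(0)
    print(f"CRITICAL: {DEVICE} reports SMART problems")
    sys.exit(2)
```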
[22:01:56] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) [22:01:59] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) [22:03:04] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) [22:03:08] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) [22:04:21] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) 05duplicate→03Open [22:07:15] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team Backlog (Watching / External), and 2 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Cmjohnson) a:03Joe removing ops-eqiad since this has moved past the data center need. Assigned to @Joe [22:08:54] 10Operations, 10Discovery-Search, 10Reading-Infrastructure-Team-Backlog, 10Maps (Tilerator): Log slow queries on postgresql / maps - https://phabricator.wikimedia.org/T204106 (10Gehel) 05Open→03Resolved Slow queries are already logged, we might want to revisit this if the threshold isn't what we need,... [22:08:57] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Gehel) [22:09:33] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Cmjohnson) @Marostegui no, I reseated the disk but I see it's optimal. I am resolving the task. If it comes back we can open it again. [22:10:18] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Cmjohnson) 05Open→03Resolved [22:10:52] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10Cmjohnson) 05Open→03Resolved The fuse for the PDU on A2 has been replaced all power is restored. [22:12:45] 10Operations, 10Discovery-Search, 10Elasticsearch: Modify elasticsearch_shard_size_check plugin to display only indices and shard size - https://phabricator.wikimedia.org/T204363 (10Gehel) 05Open→03Resolved This has been done already. [22:13:58] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Cmjohnson) [22:15:16] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps, and 2 others: Cleanup/Improve elasticsearch/maps/wdqs doc in wikitech - https://phabricator.wikimedia.org/T213665 (10Gehel) 05Open→03Invalid This is the kind of task which does not have a definition of done, but which is the background work tha... [22:17:05] (03CR) 10Zoranzoki21: "> Is there consensus for this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [22:18:30] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson This server was decommissioned in https://phabricator.wikimedia.org/T201522 [22:18:33] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090 (10Cmjohnson) [22:19:23] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) @cwdent is this server installed? I like to remove it from workboard. Thanks! [22:19:49] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) [22:20:55] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) [22:23:26] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Moving this to blocked on my workboard until I can rack and setup the ex4200's for mgmt. [22:23:44] 10Operations, 10Puppet, 10Discovery-Search, 10Maps: Fix maps puppet to make sure apt-get update runs after configuration change - https://phabricator.wikimedia.org/T214073 (10Gehel) Looking at the code: * Apt repository is added by the main [[ https://github.com/wikimedia/puppet/blob/production/modules/ca... [22:24:58] 10Operations, 10ops-eqiad, 10netops: Replace eqiad mgmt switches with EX4200s - https://phabricator.wikimedia.org/T213128 (10Cmjohnson) These will not be able to have permanent console connections. I do not have enough available ports on the serial switches. [22:32:53] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Spoke with Arzhel and I am going to connect 1 camera to msw1-eqiad [22:37:43] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Marostegui) Thanks! [22:48:33] 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labsdb1004 and labsdb1005 some hard disks not healthy - https://phabricator.wikimedia.org/T194012 (10Cmjohnson) 05Open→03Resolved These are only reporting disks not healthy but have not actually failed. Disks can remain in this state for years. Cl... [22:49:09] 10Operations, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Cmjohnson) [22:49:40] (03PS1) 10Zoranzoki21: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 [22:51:05] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) Moving lvs1015 to a priority item [22:52:31] (03PS2) 10Zoranzoki21: dblists/s3.dblist: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 [22:57:19] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10Cmjohnson) @akosiaris Sorry for the really late response to this....the task got buried. No, I don't know why mgmt would not be working now unless it's disconnected or the cable is bad.... 
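One alert further up ("Backup of s7 in eqiad on db1115 ... taken more than 8 days ago") is a simple freshness check: compare the timestamp of the most recent backup against a maximum age. A minimal sketch of that comparison follows, reusing the timestamp and the 8-day limit from the alert text and deliberately leaving the data source (metadata database, file listing) abstract.

```python
"""Freshness check in the spirit of the "Backup of s7 in eqiad on db1115"
alert earlier in the log. Sketch only: where the timestamp comes from is
left out, and the threshold simply mirrors the alert wording."""
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=8)


def backup_is_fresh(most_recent: str, now: datetime) -> bool:
    """most_recent uses the alert's format, e.g. '2019-01-22 20:45:17'."""
    taken = datetime.strptime(most_recent, "%Y-%m-%d %H:%M:%S")
    return now - taken <= MAX_AGE


if __name__ == "__main__":
    most_recent = "2019-01-22 20:45:17"        # from the s7 alert above
    now = datetime(2019, 1, 30, 20, 53, 39)    # roughly when the alert fired
    if not backup_is_fresh(most_recent, now):
        print(f"Backup for s7 at eqiad taken more than {MAX_AGE.days} days ago: "
              f"Most recent backup {most_recent}")
```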
[22:58:52] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10cwdent) [22:58:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [22:59:55] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10cwdent) 05Open→03Resolved @Cmjohnson yep, sorry forgot about this ticket! Thanks for your help. [23:01:27] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) @robh @ayounsi Let's get the procurement items we need to move this task along please. [23:31:06] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) lvs1013 and lvs1014 still need to be connected. [23:50:28] 10Operations, 10Analytics, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Neil_P._Quinn_WMF)
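Memory pressure is a recurring theme in this day's log: proton1002 failing fork(2) with "Cannot allocate memory" and the OOM killer repeatedly reaping chromium earlier on, and the final task above about notebook/stat servers running out of memory. A minimal headroom probe of the kind one might run on such hosts is sketched below; psutil and the thresholds are assumptions, and this is not the check used in production.

```python
"""Minimal memory-headroom probe, related to the proton1002 fork(2) failures
and the notebook/stat "running out of memory" task above. Sketch only:
psutil is assumed to be available and the thresholds are placeholders."""
import sys

import psutil

WARN_PCT = 85.0
CRIT_PCT = 95.0

if __name__ == "__main__":
    mem = psutil.virtual_memory()
    used_pct = mem.percent
    avail_mib = mem.available / (1024 * 1024)
    state = ("OK" if used_pct < WARN_PCT
             else "WARNING" if used_pct < CRIT_PCT
             else "CRITICAL")
    print(f"{state}: memory {used_pct:.1f}% used, {avail_mib:.0f} MiB available")
    sys.exit({"OK": 0, "WARNING": 1, "CRITICAL": 2}[state])
```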