[00:14:50] (03PS3) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 [00:14:56] legoktm ^^ works now! [00:17:39] and i just learned that you can use many of the polygerrit components. [00:21:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:21:33] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:51] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:26:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:27:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:28:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:30:37] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1014.49 seconds [00:30:49] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1031.93 seconds [00:31:39] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:47] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:31:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:32:03] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:33:19] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:33:45] PROBLEM - MariaDB Slave SQL: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:45] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:53] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:33:59] PROBLEM - MariaDB Slave IO: s1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could 
not connect [00:34:01] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:01] PROBLEM - MariaDB Slave SQL: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s7 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:01] PROBLEM - MariaDB Slave IO: s8 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:17] PROBLEM - MariaDB Slave SQL: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:31] PROBLEM - MariaDB Slave IO: m3 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:31] PROBLEM - MariaDB Slave SQL: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:35] PROBLEM - MariaDB Slave IO: m2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:39] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:39] PROBLEM - MariaDB Slave IO: x1 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:39] PROBLEM - MariaDB Slave IO: s5 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:45] PROBLEM - MariaDB Slave IO: s2 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:34:45] PROBLEM - MariaDB Slave SQL: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state could not connect [00:34:45] PROBLEM - MariaDB Slave IO: s4 on dbstore1002 is CRITICAL: CRITICAL slave_io_state could not connect [00:43:02] (03PS4) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 [00:43:46] (03PS5) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [00:43:47] PROBLEM - MariaDB Slave Lag: m3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:43:49] PROBLEM - MariaDB Slave Lag: m2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:43:51] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag could not connect [00:55:50] (03PS6) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [00:56:07] (03PS7) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [01:13:11] PROBLEM - Disk space on elastic1017 is CRITICAL: 
DISK CRITICAL - free space: /srv 27447 MB (5% inode=99%) [01:21:49] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4068.57 seconds [01:21:49] RECOVERY - MariaDB Slave IO: m3 on dbstore1002 is OK: OK slave_io_state not a slave [01:21:51] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4089.65 seconds [01:21:59] RECOVERY - MariaDB Slave Lag: m3 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:21:59] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3992.56 seconds [01:21:59] RECOVERY - MariaDB Slave IO: m2 on dbstore1002 is OK: OK slave_io_state not a slave [01:21:59] RECOVERY - MariaDB Slave Lag: m2 on dbstore1002 is OK: OK slave_sql_lag not a slave [01:22:01] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4053.47 seconds [01:22:01] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3060.47 seconds [01:22:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4108.03 seconds [01:22:23] RECOVERY - MariaDB Slave SQL: m3 on dbstore1002 is OK: OK slave_sql_state not a slave [01:22:55] RECOVERY - MariaDB Slave SQL: m2 on dbstore1002 is OK: OK slave_sql_state not a slave [01:22:55] PROBLEM - MariaDB Slave Lag: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4096.79 seconds [01:35:17] RECOVERY - Disk space on elastic1017 is OK: DISK OK [01:52:11] RECOVERY - MariaDB Slave SQL: s8 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:11] RECOVERY - MariaDB Slave IO: s4 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:11] RECOVERY - MariaDB Slave IO: s2 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:26] I am fixing dbstore1002 [01:52:31] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:43] RECOVERY - MariaDB Slave IO: s1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:43] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s8 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s7 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave IO: s3 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:52:47] RECOVERY - MariaDB Slave SQL: s6 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:09] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:13] RECOVERY - MariaDB Slave IO: s6 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:53:13] RECOVERY - MariaDB Slave SQL: s7 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:15] RECOVERY - MariaDB Slave SQL: s4 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:25] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [01:53:25] RECOVERY - MariaDB Slave IO: x1 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:53:25] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [01:55:58] 
10Operations, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (10Marostegui) And after 4 days trying to alter `mep_word_persistence` dbstore1002 crashed again (T213706#4917915) [01:56:01] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [02:05:05] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 184.21 seconds [02:14:23] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.12 seconds [02:16:31] RECOVERY - MariaDB Slave Lag: s2 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 86.43 seconds [02:22:01] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 276.03 seconds [02:22:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:22:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:34:51] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 47.61 seconds [02:44:21] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 134.17 seconds [03:03:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:03:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:10:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:10:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:22:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:24:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:37:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:39:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 
0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:40:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:48:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:49:47] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:53:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:32:33] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Dzahn) >>! In T197624#4916933, @Aklapper wrote: > @Dzahn: Which exact type of notifications are you referring to? If it's from Phab itself: `@Phabricator_maintenance` activity itself shou... [06:05:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:16:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:29:11] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:29:33] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:38:41] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [07:00:59] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:58:54] (03PS1) 10Ammarpad: Set wgArticleCountMethod='any' for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487115 [08:08:30] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10aaron) >>! 
In T203786#4916848, @elukey wrote: > @... [08:28:41] 10Operations, 10Cloud-VPS, 10Discovery-Search, 10cloud-services-team: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10Mathew.onipe) [08:29:11] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10Discovery-Search (Current work): Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (10Mathew.onipe) [08:30:44] 10Operations, 10Cloud-VPS, 10SRE-Access-Requests, 10cloud-services-team, and 2 others: Create cloudelastic-root group - https://phabricator.wikimedia.org/T214922 (10Mathew.onipe) [08:44:38] !log stop, upgrade and restart db2090 [08:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:17] !log stop, upgrade and restart db2051, this will cause some lag on s4-codfw [08:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:50] !log stop, upgrade and restart db2089 (s5/s6) [09:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:22] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10jcrespo) db2089 failed once when rebooted into `4.9.0-8-amd64`, worked a second time. Worried because it maybe a random thing? [09:56:28] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests: New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10Ladsgroup) [09:56:45] 10Operations, 10ORES, 10Scoring-platform-team: Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (10Ladsgroup) [10:03:38] !log stop, upgrade and restart db2052, this will cause some lag on s5-codfw [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:48] !log stop, upgrade and restart db2039, this will cause some lag on s6-codfw [10:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:54] !log stop, upgrade and restart db2079 [10:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Populate wmgWikibaseRepoSpecialSiteLinkGroups for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487076 (https://phabricator.wikimedia.org/T213975) (owner: 10Ladsgroup) [11:20:29] !log stop, upgrade and restart db2045, this will cause some lag on s8-codfw [11:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:39:03] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:44:01] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:44:21] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:52:35] (03PS1) 10Mathew.onipe: cloudelastic: Add cloudelastic 
configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [11:53:07] (03CR) 10jerkins-bot: [V: 04-1] cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:05:10] (03PS2) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [13:53:58] !log stop, upgrade and restart db2069 [13:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:41] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:06:11] PROBLEM - Nginx local proxy to videoscaler on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.009 second response time [14:06:25] PROBLEM - Nginx local proxy to jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.008 second response time [14:06:59] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 271 bytes in 0.081 second response time [14:07:29] RECOVERY - Nginx local proxy to videoscaler on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.008 second response time [14:07:43] RECOVERY - Nginx local proxy to jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.010 second response time [14:13:49] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[puppet] [14:13:52] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10Marostegui) I had a chat with Moritz about this and he was not too sure it would be a kernel thing itself as in something really wrong with the kernel or maybe some sort of hardware thing or... [14:15:37] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10jcrespo) db2085 and db2089 likely come from the same batch, and no other batch showed those issues, so it may be happening only on those hosts. [14:27:49] !log stop, upgrade and restart db2034, this will cause some lag on x1-codfw [14:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:51] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [14:32:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [14:38:55] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [14:41:31] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [14:43:40] 10Operations, 10DBA, 10Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (10MoritzMuehlenhoff) We could narrow this down further by enabling debug flags for the initrd, I don't remember the specific options off the top of my head, but we can look into this next...
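A pattern worth noting in the log above: each "!log stop, upgrade and restart db20xx ... this will cause some lag on sN-codfw" entry is followed a while later by a batch of "MariaDB Slave Lag" CRITICALs and then recoveries as the replica catches up. For readers unfamiliar with what those checks measure, the sketch below shows the core of such a lag probe, reading Seconds_Behind_Master from SHOW SLAVE STATUS. It is only an illustration, not the Icinga plugin actually used here; pymysql, the host name, the credentials and the thresholds are all assumptions.

```python
"""Minimal replication-lag probe in the spirit of the "MariaDB Slave Lag"
checks above. Illustrative sketch only: not the plugin actually deployed,
and the host, credentials and thresholds are placeholders."""
from typing import Optional
import sys

import pymysql

WARN_SECONDS = 60
CRIT_SECONDS = 300


def replica_lag(host: str, user: str, password: str) -> Optional[float]:
    """Return Seconds_Behind_Master, or None if the host is not a replica
    (or the replication threads are stopped and lag is unknown)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    if not status:
        return None  # empty result set: the host is not replicating
    lag = status.get("Seconds_Behind_Master")
    return None if lag is None else float(lag)


if __name__ == "__main__":
    lag = replica_lag("db2088.codfw.wmnet", "monitoring_user", "secret")
    if lag is None:
        print("UNKNOWN: not a replica or replication threads stopped")
        sys.exit(3)
    state = "OK" if lag < WARN_SECONDS else "WARNING" if lag < CRIT_SECONDS else "CRITICAL"
    print(f"{state} slave_sql_lag Replication lag: {lag:.2f} seconds")
    sys.exit({"OK": 0, "WARNING": 1, "CRITICAL": 2}[state])
```

A host that is not replicating returns an empty SHOW SLAVE STATUS result, which is why some of the recoveries above simply read "not a slave".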
[14:55:50] Jan 30 14:38:22 proton1002 puppet-agent[20704]: (/Stage[main]/Nrpe/Base::Service_unit[nagios-nrpe-server]/Service[nagios-nrpe-server]) Could not evaluate: Cannot allocate memory - fork(2) [14:55:57] memory issues on proton1002 [14:56:45] ^ akosiaris mobrovac [15:01:29] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:17:22] I'm guessing some memory issues [15:23:45] I think that happened last week [15:23:53] There's a task for it somewhere [15:24:27] !log stop, upgrade and restart db2042 [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:35] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [15:32:51] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [15:52:43] !log stop, upgrade and restart db2037 [15:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.85 seconds [16:16:59] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.26 seconds [16:17:07] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.31 seconds [16:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.49 seconds [16:17:11] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.52 seconds [16:17:15] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.56 seconds [16:17:19] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.16 seconds [16:17:49] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.96 seconds [16:17:55] PROBLEM - MariaDB Slave Lag: s1 on db2092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.56 seconds [16:17:55] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 327.75 seconds [16:18:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10aborrero) There is a netbox entry for this host: https://netbox.wikimedia.org/dcim/devices/1638/ CC T214499 We may want to delete? I'm not sure about... [16:19:05] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [16:19:25] jynus: ok noted.
I see OOM has already shown up cleaning chromium 4 times in the last couple of hours [16:19:42] mobrovac: ^ [16:19:48] !log restart proton on proton1002 [16:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:19] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:20:35] let's see if it recovers and stays stable with the restart [16:24:28] I see the pattern shifted to proton1002 so it's traffic based [16:24:40] some requests trigger that [16:24:42] mobrovac: ^ [16:25:05] I'll have a closer look after moving to the allhands space [16:42:31] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:43] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:00:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:03] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:55] (03PS8) 10Paladox: Add support for "recheck" and "check experimental" as buttons in PolyGerrit's ui [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/487089 (https://phabricator.wikimedia.org/T214631) [17:28:44] !log restart pdfrender on scb1003, scb1004 [17:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] !log deactivate/activate cr2-esams:xe-0/1/3 [17:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:41] (03CR) 10Urbanecm: "Is there consensus for this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [17:42:35] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:43:25] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) [17:43:42] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) p:05Triage→03High [17:46:13] 10Operations, 10Proton: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (10akosiaris) [17:49:03] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.012 second response time [17:49:23] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [18:03:19] !log reducing innodb consistency options for db2048 T188327 [18:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:23] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [18:10:25] I'm going to deploy one security update for ores [18:11:10] (03CR) 10Gehel: [C: 04-1] "See comments inline. Some of those will require more discussion."
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [18:11:17] Amir1: Quick deploy while everyone is distracted by victoria's speech!! :P [18:11:36] akosiaris said he's around :D [18:12:15] revid to rollback 9253bebd358a6afa6fd70cce03548fa464559bcb [18:12:41] bawolff_: yeah I am around [18:12:42] !log ladsgroup@deploy1001 Started deploy [ores/deploy@ad160b0]: (no justification provided) [18:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:44] :P [18:12:49] Me too :) [18:13:15] I think quite a few people in this crowd are on laptops [18:13:29] judging by the room yeah [18:13:34] I've seen these slides a few times ^_^ [18:17:01] Best part of this meeting is everyone important is named Br[yi][ao]n, so I can pretend they are just talking about me! [18:18:09] Being part of the security team gives you lots of importance IMO, it's super fancy :D [18:18:31] you fight the bad guys while we break things :P [18:25:28] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@ad160b0]: (no justification provided) (duration: 12m 46s) [18:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:47] Amir1, I don't see requests being handled in CODFW yet [18:28:24] it seems change prop is down in codfw [18:28:24] Never mind! [18:28:32] Oh yeah. That's what I was looking at. [18:28:53] the reason precache in eqiad is non-zero is because of mediawiki precaching [18:29:05] we need to talk to services [18:29:45] Gotcha. [18:30:44] I'm struggling to get grafana to load right now. [18:31:55] PROBLEM - graphite.wikimedia.org on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [18:32:27] ^ That explains things. [18:32:32] :( [18:32:37] So, is ORES OK? [18:32:47] Grafana is back! [18:33:13] RECOVERY - graphite.wikimedia.org on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.006 second response time [18:33:36] Eqiad precache continues, but at 1/3rd the rate. [18:35:20] Looks like we can finally confirm our celery queue logging. There's data now because we had a brief overload when rebooting celery workers! [18:35:25] akosiaris, ^ [18:36:03] halfak: ah nice! good to know [18:37:19] So I'm kind of concerned about the status of ORES. I can't explain the graphs using just precaching being down. [18:38:30] We were at 2.7k requests/min, then something happened at 17:57 and that dropped to 2k [18:38:41] Oh! Looks like something came back to life. [18:39:10] At 18:36, we went back up to 2.7k requests/min in eqiad. [18:39:41] CODFW looks good too. Did change prop get kicked? [18:40:36] Either way, I feel better about the status post deploy. I'm going to file a task to look into what happened here because it doesn't look quite right. [18:41:35] FWIW, there's no blip in "external scores" which is really what our users see.
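The exchange above ends with the observation that "external scores", what end users actually see, showed no blip during the ORES security deploy. A quick way to spot-check that from the outside is to ask the public scoring API for a score; the sketch below does that for the "damaging" model. It is illustrative only: the revision ID is a placeholder and the response layout reflects the ORES v3 API as generally documented, not anything verified in this log.

```python
"""Spot-check of the public ORES scoring API, in the spirit of the
post-deploy sanity checks discussed above. Illustrative sketch only:
the revision ID is a placeholder and the response layout is an assumption
based on the public v3 API."""
import requests

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def damaging_probability(wiki: str, revid: int, timeout: float = 10.0) -> float:
    """Return P(damaging) for one revision from the public ORES v3 API."""
    resp = requests.get(
        f"{ORES_BASE}/{wiki}/",
        params={"models": "damaging", "revids": revid},
        timeout=timeout,
    )
    resp.raise_for_status()
    data = resp.json()
    # v3 responses are keyed by wiki -> "scores" -> revid -> model.
    score = data[wiki]["scores"][str(revid)]["damaging"]["score"]
    return score["probability"]["true"]


if __name__ == "__main__":
    # Placeholder revision ID; any recent enwiki revision would do.
    print(f"P(damaging) = {damaging_probability('enwiki', 123456789):.3f}")
```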
[18:42:33] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 10.17 seconds [18:42:35] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 11.28 seconds [18:42:43] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:43:15] RECOVERY - MariaDB Slave Lag: s1 on db2092 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [18:43:17] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [18:43:19] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:43:37] RECOVERY - MariaDB Slave Lag: s1 on db2088 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [18:43:39] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:44:35] (03PS1) 10MarcoAurelio: Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 [18:45:08] (03CR) 10jerkins-bot: [V: 04-1] Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 (owner: 10MarcoAurelio) [18:47:03] (03PS2) 10MarcoAurelio: Fix typo 'neccessary' [puppet] - 10https://gerrit.wikimedia.org/r/487157 (https://phabricator.wikimedia.org/T201491) [18:59:53] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [19:31:15] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [19:33:03] (03CR) 10Alexandros Kosiaris: "> However for memory I think we should have the segments in the metric name instead of labels, the reason being that in choosing metric na" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [19:35:45] (03PS3) 10Alexandros Kosiaris: mathoid: Update prometheus-stats.conf [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 [19:37:51] (03CR) 10Alexandros Kosiaris: "I 've moved the "segment" (aka "total" heap, "used" heap, "rss") part I introduced in the previous patchset to the name of the metrics. Th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/486396 (owner: 10Alexandros Kosiaris) [19:41:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) I submitted the cable error finding to HPE and will see if they can send me new cables. When they came to replace all the parts they sent t... 
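The mathoid prometheus-stats review quoted above is about whether memory "segments" (total heap, used heap, RSS) belong in the metric name or in a label. The sketch below shows both shapes using the Python prometheus_client, purely to make the trade-off concrete; the real change is a service configuration in deployment-charts, and the metric names here are invented for illustration.

```python
"""Illustration of the naming question in the mathoid prometheus-stats
review above: memory segments as separate metric names versus as values
of a label. The metric names are made up for this example."""
from prometheus_client import Gauge

# Shape described in the review comment: one metric per segment, so each
# name is a single well-defined quantity and aggregations never mix
# unlike things.
heap_total_bytes = Gauge("service_heap_total_bytes", "Total heap size in bytes")
heap_used_bytes = Gauge("service_heap_used_bytes", "Used heap size in bytes")
rss_bytes = Gauge("service_rss_bytes", "Resident set size in bytes")
heap_total_bytes.set(128 * 1024 * 1024)
heap_used_bytes.set(64 * 1024 * 1024)
rss_bytes.set(200 * 1024 * 1024)

# The alternative shape from the earlier patch set: one metric with a
# "segment" label. Convenient to query, but sum()/avg() across the label
# can silently combine quantities that should not be added together.
memory_bytes = Gauge("service_memory_bytes", "Process memory in bytes", ["segment"])
memory_bytes.labels(segment="heap_total").set(128 * 1024 * 1024)
memory_bytes.labels(segment="heap_used").set(64 * 1024 * 1024)
memory_bytes.labels(segment="rss").set(200 * 1024 * 1024)
```

Putting the segment into the name keeps each time series a single quantity, which is the direction the review comment describes moving toward.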
[19:44:57] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) @faidon, I have a spare battery for cloudvirt1020 and will look at this when I return so you can compare the 2 [19:51:41] 10Operations, 10Discovery: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Gehel) [19:52:19] 10Operations, 10Discovery, 10Discovery-Search: Create extra elasticsearch clusters in beta cluster - https://phabricator.wikimedia.org/T213940 (10Gehel) [19:53:22] 10Operations, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Gehel) [20:03:43] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:04:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:09] (03PS2) 10D3r1ck01: Stop NavPopups gadget conflict with PagePreviews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487007 (https://phabricator.wikimedia.org/T214878) [20:25:48] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (10Cmjohnson) This is a HP server, while the f/w can probably be updated remotely it would be best if I did the update on-site with the service pack and can update everything else at the same time. [20:28:37] !log reset 2FA@wikitech for [[User:deigo]] [20:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:24] Whoops, i actually did diego, not deigo [20:38:51] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:44:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:53:39] PROBLEM - Backup of s7 in eqiad on db1115 is CRITICAL: Backup for s7 at eqiad taken more than 8 days ago: Most recent backup 2019-01-22 20:45:17 [21:07:27] 10Operations, 10Discovery-Search, 10Datacenter-Switchover-2018: Warn when CirrusSearch is not configured to use local DC for an extended time - https://phabricator.wikimedia.org/T204135 (10Smalyshev) [21:55:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) @jcrespo I can update f/w if you need but you will need to depool the host again [21:57:01] 10Operations, 10Discovery-Search: Collect per-node latency statistics from each node separately - https://phabricator.wikimedia.org/T204982 (10Gehel) a:03EBernhardson [21:58:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10Cmjohnson) @aborrero can this server be re-installed.....there is a risk that removing /dev/sda will kill the OS. 
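Several of the tickets above concern degraded RAID arrays and SMART/disk errors (cloudvirt1019, ms-be1020, dbstore1002, cloudelastic1004). As background on what such a disk-health probe boils down to, here is a minimal sketch that shells out to smartctl. It assumes smartmontools is installed, uses a placeholder device path, and is not the monitoring check actually deployed; hosts behind hardware RAID controllers usually need controller-specific -d options as well.

```python
"""Minimal SMART health probe, in the spirit of the degraded-RAID and
"SMART/disk error" tickets above. Sketch only: it shells out to smartctl
(smartmontools must be installed) and the device path is a placeholder."""
import subprocess
import sys

DEVICE = "/dev/sda"  # placeholder; adjust per host / RAID controller


def smart_healthy(device: str) -> bool:
    # `smartctl -H` prints an overall-health self-assessment; a non-zero
    # exit status or a failing verdict in the output means trouble.
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return result.returncode == 0 and "FAILED" not in result.stdout


if __name__ == "__main__":
    if smart_healthy(DEVICE):
        print(f"OK: {DEVICE} passed the SMART overall-health self-assessment")
        sys.exit(0)
    print(f"CRITICAL: {DEVICE} reports SMART problems")
    sys.exit(2)
```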
[22:01:56] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) [22:01:59] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) [22:03:04] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) [22:03:08] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (10Cmjohnson) [22:04:21] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) 05duplicate→03Open [22:07:15] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Core Platform Team Backlog (Watching / External), and 2 others: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 (10Cmjohnson) a:03Joe removing ops-eqiad since this has moved past the data center need. Assigned to @Joe [22:08:54] 10Operations, 10Discovery-Search, 10Reading-Infrastructure-Team-Backlog, 10Maps (Tilerator): Log slow queries on postgresql / maps - https://phabricator.wikimedia.org/T204106 (10Gehel) 05Open→03Resolved Slow queries are already logged, we might want to revisit this if the threshold isn't what we need,... [22:08:57] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Gehel) [22:09:33] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Cmjohnson) @Marostegui no, I reseated the disk but I see it's optimal. I am resolving the task. If it comes back we can open it again. [22:10:18] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Cmjohnson) 05Open→03Resolved [22:10:52] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10Cmjohnson) 05Open→03Resolved The fuse for the PDU on A2 has been replaced all power is restored. [22:12:45] 10Operations, 10Discovery-Search, 10Elasticsearch: Modify elasticsearch_shard_size_check plugin to display only indices and shard size - https://phabricator.wikimedia.org/T204363 (10Gehel) 05Open→03Resolved This has been done already. [22:13:58] 10Operations, 10ops-eqiad, 10Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019 (10Cmjohnson) [22:15:16] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps, and 2 others: Cleanup/Improve elasticsearch/maps/wdqs doc in wikitech - https://phabricator.wikimedia.org/T213665 (10Gehel) 05Open→03Invalid This is the kind of task which does not have a definition of done, but which is the background work tha... [22:17:05] (03CR) 10Zoranzoki21: "> Is there consensus for this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/485903 (owner: 10Zoranzoki21) [22:18:30] 10Operations, 10ops-eqiad, 10Traffic: sda failure in hydrogen.wikimedia.org - https://phabricator.wikimedia.org/T192280 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson This server was decommissioned in https://phabricator.wikimedia.org/T201522 [22:18:33] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090 (10Cmjohnson) [22:19:23] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) @cwdent is this server installed? I like to remove it from workboard. Thanks! [22:19:49] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) [22:20:55] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10Cmjohnson) [22:23:26] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Moving this to blocked on my workboard until I can rack and setup the ex4200's for mgmt. [22:23:44] 10Operations, 10Puppet, 10Discovery-Search, 10Maps: Fix maps puppet to make sure apt-get update runs after configuration change - https://phabricator.wikimedia.org/T214073 (10Gehel) Looking at the code: * Apt repository is added by the main [[ https://github.com/wikimedia/puppet/blob/production/modules/ca... [22:24:58] 10Operations, 10ops-eqiad, 10netops: Replace eqiad mgmt switches with EX4200s - https://phabricator.wikimedia.org/T213128 (10Cmjohnson) These will not be able to have permanent console connections. I do not have enough available ports on the serial switches. [22:32:53] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10Cmjohnson) Spoke with Arzhel and I am going to connect 1 camera to msw1-eqiad [22:37:43] 10Operations, 10ops-eqiad, 10Analytics, 10Product-Analytics: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (10Marostegui) Thanks! [22:48:33] 10Operations, 10ops-eqiad, 10Cloud-Services, 10DC-Ops: labsdb1004 and labsdb1005 some hard disks not healthy - https://phabricator.wikimedia.org/T194012 (10Cmjohnson) 05Open→03Resolved These are only reporting disks not healthy but have not actually failed. Disks can remain in this state for years. Cl... [22:49:09] 10Operations, 10Traffic, 10Patch-For-Review: cp1075-90 - bnxt_en transmit hangs - https://phabricator.wikimedia.org/T203194 (10Cmjohnson) [22:49:40] (03PS1) 10Zoranzoki21: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 [22:51:05] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) Moving lvs1015 to a priority item [22:52:31] (03PS2) 10Zoranzoki21: dblists/s3.dblist: Fix sorting of list of wikis per alphabetical order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/487291 [22:57:19] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install backup1001 - https://phabricator.wikimedia.org/T196478 (10Cmjohnson) @akosiaris Sorry for the really late response to this....the task got buried. No, I don't know why mgmt would not be working now unless it's disconnected or the cable is bad.... 
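One alert further up ("Backup of s7 in eqiad on db1115 ... taken more than 8 days ago") is a simple freshness check: compare the timestamp of the most recent backup against a maximum age. A minimal sketch of that comparison follows, reusing the timestamp and the 8-day limit from the alert text and deliberately leaving the data source (metadata database, file listing) abstract.

```python
"""Freshness check in the spirit of the "Backup of s7 in eqiad on db1115"
alert earlier in the log. Sketch only: where the timestamp comes from is
left out, and the threshold simply mirrors the alert wording."""
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=8)


def backup_is_fresh(most_recent: str, now: datetime) -> bool:
    """most_recent uses the alert's format, e.g. '2019-01-22 20:45:17'."""
    taken = datetime.strptime(most_recent, "%Y-%m-%d %H:%M:%S")
    return now - taken <= MAX_AGE


if __name__ == "__main__":
    most_recent = "2019-01-22 20:45:17"        # from the s7 alert above
    now = datetime(2019, 1, 30, 20, 53, 39)    # roughly when the alert fired
    if not backup_is_fresh(most_recent, now):
        print(f"Backup for s7 at eqiad taken more than {MAX_AGE.days} days ago: "
              f"Most recent backup {most_recent}")
```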
[22:58:52] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10cwdent) [22:58:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Cmjohnson) [22:59:55] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073 (10cwdent) 05Open→03Resolved @Cmjohnson yep, sorry forgot about this ticket! Thanks for your help. [23:01:27] 10Operations, 10ops-eqiad, 10netops: upgrade row d to have 3 10G switches - https://phabricator.wikimedia.org/T196487 (10Cmjohnson) @robh @ayounsi Let's get the procurement items we need to move this task along please. [23:31:06] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) lvs1013 and lvs1014 still need to be connected. [23:50:28] 10Operations, 10Analytics, 10Product-Analytics, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10Neil_P._Quinn_WMF)
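Memory pressure is a recurring theme in this day's log: proton1002 failing fork(2) with "Cannot allocate memory" and the OOM killer repeatedly reaping chromium earlier on, and the final task above about notebook/stat servers running out of memory. A minimal headroom probe of the kind one might run on such hosts is sketched below; psutil and the thresholds are assumptions, and this is not the check used in production.

```python
"""Minimal memory-headroom probe, related to the proton1002 fork(2) failures
and the notebook/stat "running out of memory" task above. Sketch only:
psutil is assumed to be available and the thresholds are placeholders."""
import sys

import psutil

WARN_PCT = 85.0
CRIT_PCT = 95.0

if __name__ == "__main__":
    mem = psutil.virtual_memory()
    used_pct = mem.percent
    avail_mib = mem.available / (1024 * 1024)
    state = ("OK" if used_pct < WARN_PCT
             else "WARNING" if used_pct < CRIT_PCT
             else "CRITICAL")
    print(f"{state}: memory {used_pct:.1f}% used, {avail_mib:.0f} MiB available")
    sys.exit({"OK": 0, "WARNING": 1, "CRITICAL": 2}[state])
```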