[00:00:12] We can probably continue and file a bug. But I have pinged Tim in another channel [00:00:33] important thing to mention - this is a Minerva skin fix and Minerva is the only skin that has unit tests [00:01:03] some MWCore libraries expect the skin to be Vector and some tests might not work when the Minerva skin is applied (we had such an issue) [00:02:24] but now everything should be fixed and Minerva doesn't interfere with the Lua engine [00:02:57] raynor: https://github.com/wikimedia/mediawiki-extensions-Scribunto/commit/7418a571ac59cc25b682c681a9c2dd330c4a983a [00:05:22] Backporting to .16 and .17 [00:05:24] Won't take long [00:05:44] good find Reedy [00:06:54] that might be it. but the setfenv() call was already there. why does it fail when you pass a different param? [00:07:32] I guess the upstream change to the php extension https://gerrit.wikimedia.org/r/#/c/367935/ [00:09:18] !log reedy@tin Synchronized php-1.30.0-wmf.17/extensions/Scribunto/tests/phpunit/: Fix broken test (duration: 00m 50s) [00:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:43] !log reedy@tin Synchronized php-1.30.0-wmf.16/extensions/Scribunto/tests/phpunit/: Fix broken test (duration: 00m 49s) [00:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:37] !log reedy@tin Synchronized php-1.30.0-wmf.16/skins/MinervaNeue: (no justification provided) (duration: 00m 49s) [00:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:56] !log that was T174747 Adjust language icon color [00:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:09] T174747: [regression] language icon is darker than other icons - https://phabricator.wikimedia.org/T174747 [00:15:36] raynor: deployed [00:15:45] testing [00:15:47] which server? [00:16:42] it's on all servers [00:17:52] tested on mwdebug1002 - it's fixed [00:18:35] production fixed [00:19:02] Reedy: it works. Thanks for deploying that patch [00:19:08] cool, np :) [00:22:49] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:33:29] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 2 [00:47:19] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3582911 (10GWicke) Thanks for the update & clarity on the timeline, @ovasileva! It is much appreciated. [00:53:30] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 1058 [01:26:01] (03PS1) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 [01:26:33] (03PS2) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 [01:29:46] (03Abandoned) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 (owner: 10Kaldari) [01:37:54] (03CR) 10Chad: "I don't see why we can't land this already. We're already directing people to it on the wmfwiki website, and IIRC we're already receiving " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [02:01:39] could someone create a repo on github for me?
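An aside on the setfenv() question above: in Lua 5.1, setfenv( f, env ) replaces the environment of the function value f, while setfenv( 1, env ) replaces the environment of the calling function (a stack level), so passing a different first parameter changes which code sees the new globals — one way an existing setfenv() call can start failing after an upstream change to the sandbox. A minimal sketch using the php-luasandbox API (hypothetical code, not the actual Scribunto test fix; it assumes setfenv is exposed inside the sandbox, as it is in Scribunto's Lua 5.1 environment):

```php
<?php
// Run a Lua chunk under php-luasandbox and rebind one function's
// environment. Hypothetical sketch, not the Scribunto test code.
$sandbox = new LuaSandbox();
$sandbox->setMemoryLimit( 50 * 1024 * 1024 ); // bytes
$sandbox->setCPULimit( 10 ); // seconds

$lua = <<<'LUA'
local function probe()
    return marker  -- resolved through probe's environment
end
-- Variant 1: rebind a specific function value.
setfenv( probe, { marker = 'from-function-env' } )
-- Variant 2 (contrast): setfenv( 1, env ) would rebind *this* chunk
-- instead, leaving probe's environment untouched.
return probe()
LUA;

$chunk = $sandbox->loadString( $lua, '=sketch' );
$result = $chunk->call();
var_dump( $result[0] ); // string(17) "from-function-env"
```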
[02:04:08] davidwbarratt: sure, what do you need? [02:04:36] legoktm uhh. let's call it logstash-report [02:04:49] legoktm it's for https://phabricator.wikimedia.org/T174191#3582190 [02:05:05] and here's my github user: https://github.com/davidbarratt/ [02:05:39] legoktm or logstash-limiter-reporter if that's better [02:05:48] legoktm or just limiter-reporter [02:06:17] davidwbarratt: you and comm tech should have admin access on https://github.com/wikimedia/logstash-report feel free to rename it :) [02:06:30] legoktm perfect! thanks! [02:31:25] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 08m 18s) [02:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:01:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:07:11] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 14m 49s) [03:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:10:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:14:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 6 03:14:16 UTC 2017 (duration 7m 6s) [03:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:07] (03PS4) 10KartikMistry: Matxin MT service for ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/374706 [03:22:38] (03Abandoned) 10KartikMistry: Configurable mode_path for apertium [puppet] - 10https://gerrit.wikimedia.org/r/297350 (https://phabricator.wikimedia.org/T139330) (owner: 10KartikMistry) [03:28:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 744.58 seconds [03:51:01] (03PS1) 10Chad: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 [03:51:03] (03CR) 10Chad: [C: 032] Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:29] (03Merged) 10jenkins-bot: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:30] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:52:39] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:42] (03CR) 10jenkins-bot: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:49] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: 
CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:59] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:20] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:20] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:29] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:29] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:29] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:53:39] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:55:40] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [03:55:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [03:57:39] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:57:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:58:50] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:58:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:01:29] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:30] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:30] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:30] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:40] PROBLEM - MariaDB Slave 
SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:49] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:49] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:50] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:00] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:00] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:01] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:01] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:02] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:02] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:03] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:10] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:10] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:19] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:19] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:07:52] (03PS3) 10Chad: Remove $stdlogo entirely [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 (owner: 10Reedy) [04:08:59] !log demon@tin Synchronized wmf-config/throttle.php: no-op (duration: 00m 49s) [04:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 93.28 seconds [04:30:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [04:30:29] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [04:39:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:39:41] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:42:49] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89893.40 seconds [04:42:50] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK 
slave_io_state Slave_IO_Running: Yes [04:42:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:39] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:39] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:49] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:51] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:59] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:09] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:09] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:09] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:09] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:09] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:10] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:11] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:11] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:30] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:52:50] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:55:10] RECOVERY - MariaDB 
Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:55:11] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:55:11] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:55:11] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:55:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [04:58:59] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:00] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:19] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:20] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:20] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:21] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:32] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:32] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:39] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:40] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:51] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:51] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not 
connect [04:59:52] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:52] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:53] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:09:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:40] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:09:49] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:09:50] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:50] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:59] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [05:10:00] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 80447.93 seconds [05:10:00] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89112.93 seconds [05:10:00] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:01] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 4108.95 seconds [05:12:53] (03CR) 10MZMcBride: "I thought there was a previous comment to this effect, but I'm wary of a search default that includes private wikis. 
Inverting the argumen" [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy) [05:13:59] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:00] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:10] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:10] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:10] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:20] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:29] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:39] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:39] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:39] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:39] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:49] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:49] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:50] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:09] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:00] PROBLEM - MariaDB Slave Lag: 
s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:09] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:09] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:10] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:10] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:10] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:19] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:29] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:29] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:39] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:40] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:40] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:49] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:49] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:50] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:50] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:50] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:50] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:28:00] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:28:00] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:29:35] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3583074 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This is all good now, thanks a lot Chris! 
``` root@db1059:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Ta... [05:33:09] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:33:09] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:38:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:38:50] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:51] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:51] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:38:59] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:59] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:00] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:00] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:10] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:10] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:19] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:20] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:29] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:29] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:30] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:30] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:31] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:31] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:41:40] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87585.37 seconds [05:41:40] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:41:40] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK 
slave_io_state Slave_IO_Running: Yes [05:51:21] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:21] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:51:21] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:51:21] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:21] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:30] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:51:30] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:39] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [05:51:39] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86290.01 seconds [05:51:41] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 78347.02 seconds [05:51:41] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 4.03 seconds [05:52:09] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:52:09] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:52:09] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [05:53:09] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:53:10] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:06:29] (03PS1) 10Marostegui: mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) [06:06:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:08:38] (03PS2) 10Marostegui: mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) [06:14:02] (03PS1) 10Marostegui: s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) [06:20:13] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:20:57] (03Merged) 10jenkins-bot: s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:37:49] !log Truncate l10n_cache table across production - T150306 [06:38:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:02] T150306: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306 [06:46:31] !log installing php-luasandbox 2.0.14 on API canaries along with HHVM restart (T173705) [06:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:44] T173705: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705 [07:08:32] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) [07:09:58] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [07:11:38] (03CR) 10Marostegui: [C: 032] mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [07:11:40] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) [07:13:54] (03CR) 10Muehlenhoff: [C: 031] "That looks fine, but I would prefer if we could use the opportunity to rename the ferm::client (e.g. to swift-object-server-incoming or so" [puppet] - 10https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [07:15:04] !log Stop MySQL on db1049 to copy its content to db1100 - https://phabricator.wikimedia.org/T172679 [07:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:19] (03PS12) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [07:27:49] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/374169 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [07:34:14] (03CR) 10Muehlenhoff: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [07:34:19] (03PS5) 10Muehlenhoff: cumin: extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) [07:34:53] (03CR) 10Smalyshev: [C: 031] wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [07:43:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.l10n_cache doesnt exist on query. Default database: bawiktionary. [Query snipped] [07:44:00] (03CR) 10Muehlenhoff: [C: 032] cumin: extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [07:53:11] marostegui: have you seen this? 
^^^ [07:53:36] let me know if you need a hand [07:55:21] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 868.98 seconds [07:57:13] I am fixing that [07:57:13] (03PS1) 10Muehlenhoff: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/376187 [07:57:26] ah [07:57:27] you did [07:58:15] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/376187 (owner: 10Muehlenhoff) [07:58:27] (03CR) 10Muehlenhoff: [C: 032] Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/376187 (owner: 10Muehlenhoff) [07:58:41] marostegui: no, I didn't [07:58:50] maybe jynus did [07:59:11] (03PS3) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [08:00:03] going to review dbstore1002 and 1001 to see why the tables aren't there and create them empty [08:00:29] (03PS4) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [08:00:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:01:04] I did nothing, got woken by the alert [08:01:18] I fixed it, yes [08:01:20] did it page? [08:01:23] I didn't get anything [08:03:17] I bet it "paged" jynus aNag ;) [08:03:31] yesterday one of those tokudb tables got corrupted, but why was that missing? [08:03:58] it is missing because the whole bawiktionary is missing on the dbstore servers [08:04:19] maybe that is a deleted or unused wiki or something [08:04:22] let's see [08:05:08] yep, it is on the deleted list [08:05:40] I think people move things to the deleted lists but then keep writing [08:06:15] because they do not handle the config correctly [08:06:31] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 193.86 seconds [08:06:32] yeah, if it exists on the master it should exist everywhere on the replication chain (that is how I see it) [08:06:42] except on labs [08:06:55] yeah, that yes :) [08:07:32] it could be that someone dropped it from x1 [08:07:42] and it gets dropped overall [08:07:53] ah, true [08:07:56] could be [08:09:33] also, dbstore1002 is one of the few replicas that is not read-only [08:11:40] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89992.16 seconds [08:16:20] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3583333 (10elukey) [08:16:40] 10Operations, 10DBA, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3583335 (10jcrespo) Not yet, this is still in use. [08:17:50] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3463322 (10Verdy_p) The Module:Country version is no... [08:20:22] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3583339 (10Verdy_p) Note that the current "kludge" u...
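For context on the alert floods in the surrounding lines: the "MariaDB Slave SQL/IO/Lag" services are Icinga checks built around SHOW SLAVE STATUS, run once per replication channel on multi-source hosts like dbstore1001 (s1-s7, m2, m3, x1). A rough single-channel sketch of that logic (hypothetical code, not the production check script; host, credentials and the 600-second threshold are placeholders):

```php
<?php
// Hypothetical sketch of a replication health check.
$db = @new mysqli( 'dbstore1001.eqiad.wmnet', 'nagios', 'secret' );
if ( $db->connect_error ) {
	echo "CRITICAL slave_sql_state could not connect\n";
	exit( 2 ); // Icinga CRITICAL
}
$res = $db->query( 'SHOW SLAVE STATUS' );
$row = $res ? $res->fetch_assoc() : null;
if ( !$row ) {
	echo "OK slave_sql_state not a slave\n";
	exit( 0 );
}
if ( $row['Slave_IO_Running'] !== 'Yes' ) {
	echo "CRITICAL slave_io_state Slave_IO_Running: {$row['Slave_IO_Running']}\n";
	exit( 2 );
}
if ( $row['Slave_SQL_Running'] !== 'Yes' ) {
	echo "CRITICAL slave_sql_state Slave_SQL_Running: {$row['Slave_SQL_Running']}\n";
	exit( 2 );
}
$lag = (float)$row['Seconds_Behind_Master'];
$state = $lag > 600 ? 'CRITICAL' : 'OK';
echo "$state slave_sql_lag Replication lag: $lag seconds\n";
exit( $state === 'CRITICAL' ? 2 : 0 );
```

Note how the strings mirror the alert texts in the log: "could not connect" while the server is down or restarting, "not a slave" for intentionally unconfigured channels, and a lag figure otherwise.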
[08:33:38] (03PS1) 10Addshore: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) [08:36:12] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [08:37:19] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3583378 (10fgiunchedi) [08:38:01] (03CR) 10Addshore: [C: 032] Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:32] (03Merged) 10jenkins-bot: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:42] (03CR) 10jenkins-bot: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:54] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3583382 (10Gehel) @debt no more work to be done here, feel free to close. [08:43:03] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T110170 [[gerrit:364734|Enable Newsletter on mediawikiwiki]] (duration: 00m 51s) [08:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:17] T110170: Goal: Deploy Newsletter extension in Wikimedia - https://phabricator.wikimedia.org/T110170 [08:46:26] (03PS2) 10Addshore: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) [08:46:37] (03CR) 10Addshore: [C: 032] Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:48:07] (03Merged) 10jenkins-bot: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:48:19] (03CR) 10jenkins-bot: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:49:53] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T174948 [[gerrit:376191|Add WMDE log channel]] (duration: 00m 49s) [08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:08] T174948: Deploy 'hack' patch & logging for tracking user registrations and guided tour - https://phabricator.wikimedia.org/T174948 [08:51:15] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1009 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376195 (https://phabricator.wikimedia.org/T169939) [08:52:08] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1009 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376195 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [08:53:32] !log reimage restbase1009 with cassandra 3 - T169939 [08:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:47] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [08:55:42] (03PS3) 10Addshore: Move config 
variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [08:57:06] (03CR) 10Addshore: [C: 031] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [09:01:33] (03PS6) 10Jcrespo: mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 [09:02:24] (03CR) 10Hashar: [C: 031] "Indeed from the doc:" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [09:03:47] (03PS1) 10Muehlenhoff: Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:05:58] !log disabling puppet on most db hosts to merge firewall changes safely [09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] (03CR) 10Jcrespo: [C: 032] mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [09:08:19] (03CR) 10Volans: [C: 031] "LGTM. Nitpick on the commit message, prepend "cumin:"" [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:08:24] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3583445 (10Joe) >>! In T173710#3581849, @aaron wrote: > Those refreshLInks jobs (from wikibase) are the only ones that use multiple titles per job, so th... [09:09:54] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: refactor things to the profile [puppet] - 10https://gerrit.wikimedia.org/r/376020 [09:11:41] PROBLEM - Check systemd state on db1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:12:50] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3583455 (10fgiunchedi) >>! In T158837#3582514, @Krinkle wrote: >>>! In T158837#3497281, @fgiunchedi wrote: >> >> re: coal/coal-web it should be straightforward to... [09:15:41] PROBLEM - Check systemd state on labsdb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
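The "Add WMDE log channel" change merged and synced above (gerrit 376191) follows the usual two-part pattern: wmf-config maps a Monolog channel name to a minimum log level, and extension code fetches a PSR-3 logger for that channel by name. A sketch under assumptions — the exact wmgMonologChannels shape and the 'info' level are guesses from context, not the actual patch:

```php
<?php
// Hypothetical sketch of wiring and using a new log channel.
use MediaWiki\Logger\LoggerFactory;

// In wmf-config/InitialiseSettings.php (assumed shape):
// 'wmgMonologChannels' => [
//     'default' => [
//         'WMDE' => 'info',
//     ],
// ],

// In the code tracking user registrations and the guided tour (T174948):
$logger = LoggerFactory::getInstance( 'WMDE' );
$logger->info( 'Guided tour step reached', [ 'step' => 'welcome' ] );
```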
[09:16:03] (03PS2) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:16:37] (03PS3) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:17:26] (03CR) 10Alexandros Kosiaris: [C: 031] Matxin MT service for ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [09:18:17] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: refactor things to the profile [puppet] - 10https://gerrit.wikimedia.org/r/376020 (owner: 10Giuseppe Lavagetto) [09:19:26] (03PS1) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 [09:19:58] (03PS2) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 [09:20:12] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:20:18] (03PS4) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:20:33] (03CR) 10Jcrespo: [C: 032] mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 (owner: 10Jcrespo) [09:20:57] (03PS5) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:21:06] (03CR) 10Muehlenhoff: [V: 032 C: 032] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:22:06] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) [09:22:42] RECOVERY - Check systemd state on db1066 is OK: OK - running: The system is fully operational [09:22:42] RECOVERY - Check systemd state on labsdb1004 is OK: OK - running: The system is fully operational [09:28:50] !log installing libonig security updates [09:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:02] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3583488 (10elukey) @Papaul one thing that we could tell Dell is that we have, as far as I can see, mw2251->60 that are identical, so our software is almost surely not the problem. I put a summary in this task abou...
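The two "mariadb: Do not try to resolve IP addresses for ferm" patches (gerrit 376197 above and 376198 just below) address the same class of problem: when a firewall rule already contains a literal IP address, running it through DNS resolution is pointless and can fail. The real change lives in Puppet/ERB templates, but the guard itself looks roughly like this (illustrative PHP; the function name is hypothetical):

```php
<?php
// Only resolve hostnames; pass literal IPv4/IPv6 addresses through.
function hostsToIps( string $host ): array {
	if ( filter_var( $host, FILTER_VALIDATE_IP ) !== false ) {
		return [ $host ]; // already an IP, no DNS lookup needed
	}
	$ips = [];
	foreach ( dns_get_record( $host, DNS_A | DNS_AAAA ) ?: [] as $rec ) {
		$ips[] = $rec['ip'] ?? $rec['ipv6'];
	}
	return $ips;
}

var_dump( hostsToIps( '10.64.48.131' ) ); // array with just "10.64.48.131"
```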
[09:31:40] (03PS1) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 [09:31:52] (03PS2) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 [09:32:56] (03CR) 10Jcrespo: [C: 032] mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 (owner: 10Jcrespo) [09:39:49] !log installing libgd security updates [09:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:26] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k [puppet] - 10https://gerrit.wikimedia.org/r/376201 (https://phabricator.wikimedia.org/T174815) [09:46:24] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.131 and port 9042: Connection refused [09:47:14] PROBLEM - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:47:46] silencing ^ [09:48:29] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7732/" [puppet] - 10https://gerrit.wikimedia.org/r/376201 (https://phabricator.wikimedia.org/T174815) (owner: 10Elukey) [09:49:14] RECOVERY - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-c valid until 2018-08-17 16:11:04 +0000 (expires in 345 days) [09:50:24] RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.131 port 9042 [09:53:22] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [09:53:29] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) [09:56:46] (03PS1) 10Muehlenhoff: Also print amount of hosts not requiring a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376203 [10:00:30] (03PS1) 10Giuseppe Lavagetto: jobrunner: add missing newline in template [puppet] - 10https://gerrit.wikimedia.org/r/376204 [10:00:57] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: add missing newline in template [puppet] - 10https://gerrit.wikimedia.org/r/376204 (owner: 10Giuseppe Lavagetto) [10:04:42] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: relay requests to the local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376022 (https://phabricator.wikimedia.org/T174599) [10:05:47] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner_tls: relay requests to the local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376022 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [10:09:40] (03PS1) 10Muehlenhoff: Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 [10:10:38] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1010 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) [10:11:04] (03PS2) 10Filippo Giunchedi: cassandra: reprovision restbase1010 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) [10:12:12] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1010 with cassandra 3 
[puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [10:12:40] (03CR) 10Elukey: [C: 031] Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 (owner: 10Muehlenhoff) [10:12:57] !log reimage restbase1010 with cassandra 3 - T169939 [10:12:58] (03PS1) 10Giuseppe Lavagetto: jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 [10:13:10] (03PS2) 10Giuseppe Lavagetto: jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 [10:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:11] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [10:13:22] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3583895 (10ovasileva) [10:13:47] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 (owner: 10Giuseppe Lavagetto) [10:15:03] (03PS1) 10Muehlenhoff: Remove cache salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376212 [10:17:01] (03PS1) 10Muehlenhoff: Remove lvs salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376213 [10:19:31] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376023 [10:20:16] (03PS1) 10Muehlenhoff: Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 [10:20:50] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner_tls: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376023 (owner: 10Giuseppe Lavagetto) [10:24:00] (03PS1) 10Muehlenhoff: Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 [10:25:27] !log installing perl update from jessie 8.9 point release [10:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:17] !log installing perl update from stretch 9.1 point release [10:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:27] (03CR) 10Elukey: [C: 031] Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 (owner: 10Muehlenhoff) [10:34:54] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [10:39:04] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:40:48] (03CR) 10Jcrespo: [C: 031] Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 (owner: 10Muehlenhoff) [11:01:17] 10Operations: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3583945 (10MoritzMuehlenhoff) wtp1031/wtp1032 are not fully installed, it seems like the initial puppet run after the installation didn't happen? 
[11:12:30] (03PS1) 10Jcrespo: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) [11:15:27] !log Disable puppet on db1100 for mydumper/myloader [11:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:48] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:17:19] (03Merged) 10jenkins-bot: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:17:29] (03CR) 10jenkins-bot: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:18:25] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) [11:18:47] (03PS2) 10Filippo Giunchedi: cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) [11:19:38] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [11:20:17] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 [11:20:41] !log reimage restbase1008 with cassandra 3 - T169939 [11:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:56] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [11:20:58] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2040 (duration: 00m 49s) [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:38] !log installing gnutls update from jessie 8.9 and stretch 9.1 point updates [11:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:42] (03PS1) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:24:53] (03CR) 10Mobrovac: [C: 04-1] "AFAIK, this is an external service not controlled by us, so this should go into the config template in the deploy repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [11:24:57] !log temporarily raise kafka log4j authorizer verbosity to DEBUG on kafka1012 - T173493 [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:11] T173493: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493 [11:25:50] (03Abandoned) 10Ema: varnish: drop varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/376045 (owner: 10Ema) [11:28:35] (03CR) 10KartikMistry: "> AFAIK, this is an external service not controlled by us, so this" [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [11:29:28] (03PS2) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:29:53] (03CR) 10jerkins-bot: [V: 04-1] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [11:30:50] jouncebot: refresh [11:30:52] I refreshed my knowledge about deployments. [11:30:57] jouncebot: next [11:30:57] In 1 hour(s) and 29 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1300) [11:31:22] (03CR) 10Hashar: [C: 031] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [11:35:06] (03CR) 10Hashar: [C: 031] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [11:41:06] !log installing gtk+2.0 update from jessie 8.9 update [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:26] PROBLEM - dhclient process on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:16] PROBLEM - Check size of conntrack table on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:17] PROBLEM - puppet last run on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:26] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:07] PROBLEM - Check systemd state on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:46:07] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - salt-minion processes on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - cassandra-a service on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:38] going to silence that bad boy [11:49:48] (03PS3) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:50:10] (03CR) 10jerkins-bot: [V: 04-1] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [12:01:54] (03PS4) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [12:07:07] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [12:08:15] !log installing libapache2-mod-perl update from jessie 8.9 update [12:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:45] (03PS2) 10Gehel: wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) [12:09:05] moritzm: do your updates have anything to do with lvs1001's puppetfail above? [12:09:57] (03CR) 10Gehel: [C: 032] wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [12:10:24] (03CR) 10Mobrovac: [C: 04-1] "I guess you meant https://gerrit.wikimedia.org/r/374708, but my point is that the actual URI should go in that patch in scap/vars.yaml, no" [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [12:12:08] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457615 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1004.eqiad.wmnet']... [12:12:41] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584049 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1004.eqiad.wmnet']... [12:13:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584050 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [12:14:55] (03PS1) 10Muehlenhoff: yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 [12:15:10] (03PS5) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [12:15:33] (03CR) 10Ema: [V: 032 C: 032] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [12:15:49] ema: yeah, puppet tries to ensure that tzdata is installed and if another package update happens during that (like the point update deployments), puppet fails [12:16:48] moritzm: ok, just checking. Thanks! :) [12:17:06] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused [12:18:46] PROBLEM - cassandra CQL 10.64.32.178:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 9042: Connection refused [12:19:36] PROBLEM - cassandra SSL 10.64.32.178:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:20:15] (03CR) 10Muehlenhoff: [C: 032] yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 (owner: 10Muehlenhoff) [12:20:19] (03PS2) 10Muehlenhoff: yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 [12:20:36] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
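The tzdata failure moritzm explains above is a dpkg lock race: puppet's Package[tzdata] check and a fleet-wide point-release upgrade both need the dpkg database, and whichever starts second fails. A hedged sketch of the underlying mechanism (paths and retry numbers are illustrative, and this is not what puppet itself does):

    # Illustrative only: dpkg holds an fcntl lock on /var/lib/dpkg/lock while
    # any package operation runs; a second operation started meanwhile fails.
    import fcntl
    import time

    def dpkg_lock_free(path="/var/lib/dpkg/lock", retries=5, delay=10):
        """Poll (as root) until no apt/dpkg operation holds the lock."""
        for _ in range(retries):
            with open(path, "a") as fh:
                try:
                    fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    fcntl.lockf(fh, fcntl.LOCK_UN)
                    return True  # nobody else is mid-upgrade
                except OSError:
                    time.sleep(delay)  # e.g. a point-release update is running
        return False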
[12:20:36] PROBLEM - cassandra service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [12:23:16] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 18 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Package[cassandra/metrics-collector] [12:29:46] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3584079 (10elukey) [12:30:26] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:31:26] (03PS2) 10Muehlenhoff: Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 [12:33:06] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (10elukey) [12:34:47] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:35:26] (03CR) 10Muehlenhoff: [C: 032] Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 (owner: 10Muehlenhoff) [12:35:42] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584115 (10elukey) [12:37:44] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584144 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1004.eqiad.wmnet'] ``` and were **ALL** successful. [12:38:11] (03PS2) 10Muehlenhoff: Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 [12:38:11] ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data import inprogress [12:39:02] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584146 (10elukey) [12:39:33] (03CR) 10Muehlenhoff: [C: 032] Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 (owner: 10Muehlenhoff) [12:44:12] (03PS2) 10Muehlenhoff: Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 [12:45:07] (03CR) 10Muehlenhoff: [C: 032] Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 (owner: 10Muehlenhoff) [12:46:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584163 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [12:47:11] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
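For context on the long run of "Remove ... salt grains previously used by debdeploy" commits here: debdeploy used to target host groups through custom salt grains, and those grains are now dead weight. A sketch of what dropping one looks like via salt's Python API, to be run on the salt master (the grain name is taken from the commit subjects; the targeting glob is assumed):

    # Assumed example: delete a leftover debdeploy grouping grain fleet-wide.
    import salt.client

    local = salt.client.LocalClient()
    # Roughly equivalent CLI: salt 'kafka*' grains.delval debdeploy-kafka
    result = local.cmd("kafka*", "grains.delval", ["debdeploy-kafka"])
    print(result)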
[12:48:21] (03PS1) 10Muehlenhoff: Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 [12:49:44] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376203 (owner: 10Muehlenhoff) [12:50:40] (03PS1) 10Muehlenhoff: Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 [12:54:37] (03PS1) 10Muehlenhoff: Remove elasticsearch salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376241 [12:54:38] (03PS1) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [12:55:40] (03CR) 10Elukey: [C: 031] Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 (owner: 10Muehlenhoff) [12:57:16] (03PS1) 10Muehlenhoff: Remove releng-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376244 [12:59:56] (03PS1) 10Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376245 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1300). Please do the needful. [13:00:04] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] Present [13:00:19] o/ [13:00:40] Are you doing it hashar? [13:00:57] o/ [13:01:15] Reedy: feel free to handle it ? :D [13:01:21] (03PS1) 10Muehlenhoff: Remove ganeti salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376247 [13:01:23] lol, I don't mind either way [13:01:27] 10Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#3584199 (10akosiaris) [13:01:33] Reedy: please do so :] [13:01:39] 10Operations, 10Icinga: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#3584202 (10akosiaris) [13:01:41] 10Operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#3584203 (10akosiaris) [13:01:43] 10Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1882900 (10akosiaris) 05Open>03Resolved Has been done a long time now. Resolving [13:01:52] (03PS1) 10Elukey: Remove stat1003 traces for decom [puppet] - 10https://gerrit.wikimedia.org/r/376248 (https://phabricator.wikimedia.org/T152712) [13:02:17] o/ [13:02:34] (03PS2) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:02:35] hashar, Reedy, zeljkof: Who'll be the swatter?
:D [13:02:44] (03PS3) 10Reedy: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:02:47] (03CR) 10Reedy: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:02:59] I can swat, but I see Reedy already volunteered :) [13:03:06] (03CR) 10Elukey: [C: 032] Remove stat1003 traces for decom [puppet] - 10https://gerrit.wikimedia.org/r/376248 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [13:03:25] (03PS1) 10Muehlenhoff: Remove parsoid salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376249 [13:04:10] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:19] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:04:41] (03CR) 10Alexandros Kosiaris: "Let's do what mobrovac suggests. That is the current status quo, makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [13:05:10] (03CR) 10Gehel: [C: 032] "All good, we are ready to deploy!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:05:20] (03PS1) 10Muehlenhoff: Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 [13:05:24] (03CR) 10Gehel: [V: 032 C: 032] Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:05:50] (03PS1) 10Filippo Giunchedi: site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 [13:06:11] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:06:12] (03PS2) 10Filippo Giunchedi: site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 [13:06:41] PROBLEM - DPKG on restbase1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:06:56] (03CR) 10Filippo Giunchedi: [C: 032] site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 (owner: 10Filippo Giunchedi) [13:07:30] (03PS1) 10Alexandros Kosiaris: Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 (https://phabricator.wikimedia.org/T170111) [13:07:46] !log reedy@tin Synchronized wmf-config/throttle.php: Throttle exception T175113 (duration: 00m 49s) [13:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:57] T175113: Allow IP for creating account for school project for 14 days - https://phabricator.wikimedia.org/T175113 [13:08:12] (03PS4) 10Reedy: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:08:41] (03CR) 10Reedy: [C: 032] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:09:52] (03CR) 10Filippo Giunchedi: [C: 031] 
Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 (owner: 10Muehlenhoff) [13:10:42] (03Merged) 10jenkins-bot: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:10:49] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3473875 (10faidon) I know a bunch of work happened during the Wikimania hackathon, but what's the status of this? [13:10:52] (03CR) 10jenkins-bot: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:12:04] !log reedy@tin Synchronized wmf-config/Wikibase-labs.php: Move some wikidata config T174962 (duration: 00m 49s) [13:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:18] T174962: [Bug] Configuration on wikidata.beta.wmflabs.org is broken - https://phabricator.wikimedia.org/T174962 [13:12:27] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:08] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:16] !log reedy@tin Synchronized wmf-config/Wikibase-production.php: Move some wikidata config T174962 (duration: 00m 48s) [13:13:17] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 75168 bytes in 0.171 second response time [13:13:28] RECOVERY - Check systemd state on ganeti1008 is OK: OK - running: The system is fully operational [13:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [13:13:42] Done and done [13:15:53] Thanks [13:19:26] !log restarting and upgrading db2040 [13:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:29] Reedy: thx :) [13:21:31] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3584259 (10faidon) Am I right to understand that the current plan is 2 VMs? If so, yeah, that sounds absolutely fine :) [13:26:05] 10Operations, 10monitoring: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#3584272 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We've upgraded to diamond 4 in {T97635} and its TCP collector includes `gauges` config option, resolving. [13:30:11] (03PS3) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:30:34] _joe_: regarding refreshLinks jobs, they are now 50 pages / job, do you think making the batch size smaller would help? [13:31:18] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:31:55] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3584304 (10Ottomata) > Any downtime permanently affects the graphs. Just an uninformed idea: If you produce directly to graphite (and maybe prometheus too?) inste... 
[13:32:01] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584305 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [13:38:24] (03PS4) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:45:13] (03PS5) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:45:51] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:01] PROBLEM - Disk space on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:01] PROBLEM - cassandra-c SSL 10.64.32.196:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:46:10] PROBLEM - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.196 and port 9042: Connection refused [13:46:10] PROBLEM - DPKG on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:27] (03PS16) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [13:46:50] PROBLEM - MD RAID on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:50] PROBLEM - configured eth on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:47:06] (03PS1) 10Jcrespo: mariadb: Move db2040's MariaDB socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/376259 (https://phabricator.wikimedia.org/T148507) [13:47:09] (03PS17) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [13:47:40] PROBLEM - dhclient process on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:48:31] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.187 and port 9042: Connection refused [13:48:31] PROBLEM - puppet last run on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:48:51] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:21] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - salt-minion processes on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:49:31] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:31] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:40] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:50] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:50] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:50:05] checking stat1005 [13:50:24] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2040's MariaDB socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/376259 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:50:29] oom party [13:52:12] also known in some 
circles as oom mani padme hum [13:52:40] RECOVERY - DPKG on stat1005 is OK: All packages OK [13:52:45] <_joe_> Amir1: honestly, I have to review a few details [13:52:51] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [13:52:51] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [13:53:00] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [13:53:13] <_joe_> but if each job takes about 1 minute to execute on terbium, I don't know what will happen with jobrunners and their timeouts [13:53:30] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [13:53:30] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:53:34] <_joe_> I'll have to look into it a bit [13:53:40] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:27] thanks [13:55:37] what is interesting is that 99th percentile wait time is growing exponentially despite all the stuff we have done [13:57:21] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3584374 (10jcrespo) [13:57:23] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584372 (10jcrespo) 05Open>03Resolved I think the reboot and/or upgrade fixed it (db2040). [13:58:21] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584376 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1005.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['wdqs10... [13:58:30] RECOVERY - salt-minion processes on restbase1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:58:40] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:58:41] RECOVERY - dhclient process on restbase1008 is OK: PROCS OK: 0 processes with command name dhclient [13:58:51] RECOVERY - MD RAID on restbase1008 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [13:58:52] RECOVERY - configured eth on restbase1008 is OK: OK - interfaces up [13:59:01] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active [13:59:10] RECOVERY - Disk space on restbase1008 is OK: DISK OK [13:59:11] RECOVERY - cassandra-c SSL 10.64.32.196:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-c valid until 2018-08-17 16:11:00 +0000 (expires in 345 days) [13:59:11] RECOVERY - DPKG on restbase1008 is OK: All packages OK [13:59:31] RECOVERY - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-a valid until 2018-08-17 16:10:58 +0000 (expires in 345 days) [14:01:11] RECOVERY - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is OK: TCP OK - 0.000 second response time on 10.64.32.196 port 9042 [14:03:23] apparently my patch to reduce the size from 100 to 50 didn't help [14:04:50] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down!
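The refreshLinks thread above (50 pages per job, a halving that "didn't help", p99 wait growing fast) matches basic queueing behaviour: repacking pages into smaller jobs changes neither the page-level arrival rate nor the page-level service capacity, so once arrivals exceed capacity the backlog, and with it the wait time, grows without bound regardless of batch size. A toy illustration with invented rates:

    # Invented numbers; the point is that batch size cancels out.
    def backlog_pages(hours, enqueue_pps, drain_pps, batch):
        """Pages queued after `hours`, given pages/sec enqueued and drained."""
        jobs_in = enqueue_pps / batch   # jobs arriving per second
        jobs_out = drain_pps / batch    # jobs the runners can finish per second
        growth_jobs = max(0.0, jobs_in - jobs_out)
        return growth_jobs * batch * 3600 * hours

    for batch in (100, 50):
        # Same page rates either way, so the backlog grows identically:
        print(batch, backlog_pages(1, enqueue_pps=1200, drain_pps=1000, batch=batch))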
[14:05:01] ignore ^ [14:05:03] it's me playing [14:05:17] (03PS6) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:05:20] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [14:05:40] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [14:06:21] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:07:20] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:07:30] PROBLEM - PyBal IPVS diff check on lvs1009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:08:14] (03PS2) 10Muehlenhoff: Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 [14:08:30] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:08:58] (03CR) 10Muehlenhoff: [C: 032] Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 (owner: 10Muehlenhoff) [14:09:18] apergos: ping [14:09:21] (03PS18) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [14:09:37] (03PS19) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [14:09:41] RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.000 second response time on 10.64.32.187 port 9042 [14:09:51] (03PS2) 10Muehlenhoff: Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 [14:10:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:12:10] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2051665 [14:12:24] (03PS7) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:12:40] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:13:06] (03CR) 10Muehlenhoff: [C: 032] Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 (owner: 10Muehlenhoff) [14:14:30] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:15:00] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [14:16:21] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [14:17:04] !log wdqs1005 is coming up after a few reimaging issues, expect some icinga noise... 
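The "Could not depool server chlorine.eqiad.wmnet because of too many down!" lines above show pybal's depool safety valve: a backend failing its health checks is only removed while enough of the pool stays up. A rough sketch of that guard (parameter names are assumed, not pybal's actual code), which also shows why an effectively single-host pool like kubemaster can never be auto-depooled:

    # Assumed shape of the guard; pybal's real implementation differs.
    import math

    def can_depool(pooled_up, pool_size, depool_threshold=0.5):
        """Allow depooling only if >= threshold of the pool would stay up."""
        return pooled_up - 1 >= math.ceil(pool_size * depool_threshold)

    # chlorine is effectively the only kubemaster backend, so pybal refuses
    # to depool it even though its checks fail:
    print(can_depool(pooled_up=1, pool_size=1))  # False -> "too many down"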
[14:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [14:17:30] RECOVERY - PyBal IPVS diff check on lvs1009 is OK: OK: no difference between hosts in IPVS/PyBal [14:18:13] (03PS8) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:18:26] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [14:19:56] PROBLEM - Host restbase1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:02] (03PS1) 10Muehlenhoff: Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 [14:21:06] RECOVERY - Host restbase1010 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:22:20] (03Draft1) 10Paladox: Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 [14:22:22] (03PS2) 10Paladox: Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 [14:22:51] (03PS1) 10Muehlenhoff: Remove k8s salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376265 [14:23:18] (03PS1) 10Filippo Giunchedi: cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) [14:23:30] (03PS9) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:23:34] (03CR) 10jerkins-bot: [V: 04-1] cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:24:15] PROBLEM - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:24:25] (03PS2) 10Filippo Giunchedi: cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) [14:25:30] TheresNoTime: yes? [14:25:35] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:25:43] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:28:53] (03PS1) 10Ema: varnish::logging::statsd: instance_name future parser check [puppet] - 10https://gerrit.wikimedia.org/r/376269 [14:29:01] apergos: how's it going? 
I've been chatting to someone at OVH about them providing a mirror for the XML dumps (they seem interested), and I've looped `ops-dumps` into the latest email as they're asking some questions I obviously can't answer [14:29:26] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 [14:29:40] (03CR) 10Krinkle: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [14:29:42] it might be useful if we could provide answers to some of the questions on the [[Mirroring Wikimedia project XML dumps]] page (such as estimated traffic etc) [14:29:59] (03PS1) 10Muehlenhoff: Remove labtest salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376271 [14:30:57] TheresNoTime: I see your mail from about 20 minutes ago, I'm happy to carry on the conversation there [14:30:58] (03CR) 10Elukey: [C: 031] Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 (owner: 10Muehlenhoff) [14:31:29] apergos: please do, out of my depth in terms of what they're asking :) [14:32:02] 10Operations, 10Ops-Access-Requests: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3584447 (10schoenbaechler) [14:32:08] (03CR) 10Ema: [V: 032 C: 032] varnish::logging::statsd: instance_name future parser check [puppet] - 10https://gerrit.wikimedia.org/r/376269 (owner: 10Ema) [14:32:08] well I don't see the email where they ask for info, so you might have to forward that to me or something [14:32:18] or ask them to, whichever [14:32:31] I'm happy to respond with info once I see their questions [14:32:35] (03PS1) 10Muehlenhoff: Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 [14:33:35] (03CR) 10Krinkle: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [14:34:11] (03PS10) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:35:32] (03PS1) 10Muehlenhoff: Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 [14:37:49] (03CR) 10ArielGlenn: [C: 031] "Fine by me."
[puppet] - 10https://gerrit.wikimedia.org/r/376273 (owner: 10Muehlenhoff) [14:38:14] (03PS1) 10Muehlenhoff: Remove NFS salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376274 [14:38:28] (03PS2) 10Alexandros Kosiaris: kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 [14:38:52] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 (owner: 10Alexandros Kosiaris) [14:41:10] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3584495 (10jcrespo) [14:41:13] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584493 (10jcrespo) 05Resolved>03Open checking es1019 [14:41:35] (03PS1) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) [14:41:41] (03PS1) 10Muehlenhoff: Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376277 [14:42:33] (03PS11) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:42:55] (03CR) 10jerkins-bot: [V: 04-1] varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 (owner: 10Ema) [14:43:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "Small nitpicky request of better logging, but LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [14:43:45] <_joe_> mobrovac: /go mobrovac [14:43:48] <_joe_> augh [14:43:59] <_joe_> I decided to write you in private after all [14:44:03] haha [14:44:10] <_joe_> but yeah, see my comment [14:46:18] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584505 (10Krinkle) >>! In T173710#3583445, @Joe wrote: > As a side comment: this is one of the cases where I would've loved to have an elastic environme... [14:46:22] (03PS12) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:48:17] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/376277 (owner: 10Muehlenhoff) [14:48:36] 10Operations, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541#2725758 (10mark) @fgiunchedi: Could you elaborate why the SNMP exporter to prometheus didn't wor... 
[14:51:01] (03PS13) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:52:12] (03CR) 10Muehlenhoff: "The monitoring hosts already have a carte blanche via the 'monitoring-all' ferm::rules in modules/base/manifests/firewall.pp, so that can " [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [14:53:00] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:53:09] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 (owner: 10Jcrespo) [14:53:42] (03PS2) 10Muehlenhoff: Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 [14:54:15] (03CR) 10Muehlenhoff: [C: 032] Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 (owner: 10Muehlenhoff) [14:54:36] (03Merged) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:54:48] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 (owner: 10Jcrespo) [14:55:20] (03PS2) 10Muehlenhoff: Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 [14:56:29] (03CR) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:56:34] (03CR) 10Muehlenhoff: [C: 032] Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 (owner: 10Muehlenhoff) [14:56:39] (03PS6) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [14:57:12] (03CR) 10Mobrovac: [C: 031] JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:01:10] (03PS20) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:03:59] (03CR) 10Giuseppe Lavagetto: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:04:11] <_joe_> mobrovac: another small correction, sorry :P [15:04:19] huh kk [15:04:23] <_joe_> but then we can just merge and test it [15:04:29] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584606 (10jcrespo) > Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). If I could add to the ideal scenario... 
[15:04:36] <_joe_> the endpoint on LVS should already work [15:05:16] duh, good point _joe_, i wanted to write $match[1] but ended up with $value lol [15:06:17] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 49s) [15:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:47] (03PS7) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [15:07:08] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584613 (10Joe) >>! In T173710#3584505, @Krinkle wrote: >>>! In T173710#3583445, @Joe wrote: >> As a side comment: this is one of the cases where I would... [15:07:15] (03CR) 10Mobrovac: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:07:36] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, let's merge this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:07:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1019 (duration: 00m 49s) [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:01] _joe_: in a meeting, let's merge/sync 20 mins from now? [15:08:08] <_joe_> ok [15:08:13] <_joe_> ping me when you're done [15:11:55] !log cp1049 - restart varnish backend (mailbox lag) [15:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:52] (03PS14) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [15:13:32] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3584649 (10Papaul) @elukey I spoke today with one of the Dell managers on this case. He assured me that he will personally follow this case with the engineer working with me. He asked that I go ahead and update the firmw... [15:16:38] 10Operations, 10Diamond, 10Traffic, 10monitoring, 10Prometheus-metrics-monitoring: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3584652 (10faidon) a:03akosiaris [15:17:08] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational [15:20:51] (03PS1) 10Andrew Bogott: openstack: allow primary glance server to rsync to secondary [puppet] - 10https://gerrit.wikimedia.org/r/376280 [15:21:18] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3584658 (10Anomie) >>! In T171392#3583337, @Verdy_p... [15:21:33] (03CR) 10Andrew Bogott: [C: 032] openstack: allow primary glance server to rsync to secondary [puppet] - 10https://gerrit.wikimedia.org/r/376280 (owner: 10Andrew Bogott) [15:22:17] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [15:25:35] (03CR) 10Jforrester: [C: 04-2] "> I don't see why we can't land this already.
We're already directing people to it on the wmfwiki website, and IIRC we're already receivin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [15:26:21] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3584666 (10zhuyifei1999) >>! In T171392#3583339, @Ve... [15:27:22] (03CR) 10Reedy: "I do note 1st October is a sunday. And we don't deploy on a sunday..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [15:28:08] !log firmware upgrade on mw2256 [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] (03PS1) 10Cmjohnson: Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 [15:30:08] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:23] mw2256 ? [15:30:29] * akosiaris looking [15:31:29] damn... never read backlog... my bad [15:31:53] (03CR) 10Cmjohnson: [C: 032] Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 (owner: 10Cmjohnson) [15:32:29] (03PS2) 10Cmjohnson: Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 [15:33:05] (03CR) 10Cmjohnson: [V: 032 C: 032] Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 (owner: 10Cmjohnson) [15:35:18] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [15:36:13] <_joe_> akosiaris: did you power it up again? 
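The cp1049 restart logged above ("mailbox lag"), like the cp1062/cp1072 ones later, is the usual fix for a stalled Varnish expiry thread: objects "mailed" to it for expiry pile up unhandled, and the lag counter climbs into the millions. A sketch of how such a check can be derived from varnishstat counters; this is an assumption about how the WMF check works, not a copy of it:

    # Assumed check logic; MAIN.exp_mailed/MAIN.exp_received exist in
    # Varnish 4, but the JSON layout differs between versions.
    import json
    import subprocess

    def expiry_mailbox_lag():
        stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
        mailed = stats["MAIN.exp_mailed"]["value"]
        received = stats["MAIN.exp_received"]["value"]
        return mailed - received  # grows unbounded when the expiry thread stalls

    lag = expiry_mailbox_lag()
    print("CRITICAL" if lag > 2000000 else "OK", "expiry mailbox lag is", lag)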
[15:36:28] _joe_: it's papaul upgrading the firmware [15:36:31] <_joe_> oh, ok, yes [15:36:44] at least I am not the only one guilty of not reading the backlog [15:36:46] <_joe_> because there was a request from elukey not to power it back up [15:37:27] PROBLEM - dhclient process on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:27] PROBLEM - salt-minion processes on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:28] PROBLEM - MD RAID on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:37] PROBLEM - configured eth on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:38] PROBLEM - Disk space on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:48] PROBLEM - nutcracker process on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:49] 10Operations, 10Operations-Software-Development, 10monitoring, 10User-fgiunchedi: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#3584730 (10fgiunchedi) [15:37:51] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3584733 (10fgiunchedi) [15:37:57] PROBLEM - Apache HTTP on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 80: Connection refused [15:37:57] PROBLEM - HHVM rendering on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 80: Connection refused [15:37:58] PROBLEM - Check whether ferm is active by checking the default input chain on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:58] PROBLEM - SSH on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 22: Connection refused [15:37:58] PROBLEM - Check size of conntrack table on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:04] the downtime might have expired, I'll completely silence the host sorry [15:38:06] 10Operations, 10Operations-Software-Development, 10monitoring, 10User-fgiunchedi: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#971260 (10fgiunchedi) Folding into parent task as duplicate [15:38:07] PROBLEM - puppet last run on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:07] PROBLEM - nutcracker port on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:08] PROBLEM - Check systemd state on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:17] PROBLEM - Nginx local proxy to apache on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 443: Connection refused [15:38:18] PROBLEM - HHVM processes on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:27] PROBLEM - DPKG on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:53] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584737 (10jcrespo) es1019 seems to have rebroken T155691. I have depooled it, but it will take days to become effective (because backups do not res... [15:42:29] (03CR) 10KartikMistry: "> Let's do what mobrovac suggests.
That is the current status quo," [15:44:18] (03PS1) 10Volans: Cluster management: add some roles from neodymium [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) [15:44:52] (03PS21) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:44:52] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:18] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:28] (03PS8) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [15:45:34] <_joe_> uh? [15:45:35] uh, lvs3001 [15:45:38] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:42] ouch [15:45:48] <_joe_> ema: why is bgp not switching? [15:45:50] it was having a degraded raid... [15:46:24] <_joe_> mobrovac: wait a sec please [15:46:27] console is unresponsive [15:46:34] what has degraded raid? 3001? [15:46:34] <_joe_> lvs3003 is unreachable too [15:46:40] k [15:46:42] <_joe_> nah scratch that [15:46:43] <_joe_> that [15:46:45] <_joe_> 's me [15:46:56] <_joe_> still it should've caught up by now [15:47:01] bblack: yes, it has had a degraded raid for a while T166965 [15:47:02] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [15:47:13] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:20] IWUT? [15:47:21] ouch [15:47:29] sites down in the EU? [15:47:32] yes [15:47:37] ok let's depool esams ? [15:47:39] yes let's depool for now, the problem looks tricky [15:47:45] <_joe_> can someone depool esams? [15:47:48] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:47:53] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:24] (03PS1) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 [15:48:41] (03CR) 10Alexandros Kosiaris: [C: 031] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:48:45] (03CR) 10Jcrespo: [C: 031] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:49:14] (03CR) 10BBlack: [V: 032 C: 032] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:49:44] (03CR) 10Mobrovac: [C: 031] Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 (owner: 10Muehlenhoff) [15:49:53] so when I switched my bastion to bast3001, bast3001 was reporting dns errors looking up lvs hostnames [15:50:03] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:51:16] (03CR) 10Mobrovac: [C: 031] Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 (owner: 10Muehlenhoff) [15:53:33] !log powercycling lvs3001 [15:53:43] PROBLEM - Check systemd state on kafka-jumbo1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
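The "depool esams" commit above is a change to the authoritative DNS configuration: European clients normally get esams addresses for text-lb via a geo map, and the depool drops esams from that map so resolvers fall back to another datacenter as their cached answers expire. A very rough model of the lookup (the map contents and gdnsd's actual config format are not reproduced here):

    # Toy model only; the real gdnsd geo-map config looks nothing like this.
    GEO_MAP = {
        "EU": ["esams", "eqiad"],   # preference order per region (assumed)
        "NA": ["eqiad", "codfw"],
    }
    DEPOOLED = {"esams"}            # what the depool commit effectively flips

    def resolve_text_lb(region):
        for dc in GEO_MAP.get(region, ["eqiad"]):
            if dc not in DEPOOLED:
                return "text-lb.%s.wikimedia.org" % dc
        return "text-lb.eqiad.wikimedia.org"

    print(resolve_text_lb("EU"))    # -> text-lb.eqiad.wikimedia.org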
[15:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] RECOVERY - Check systemd state on kafka-jumbo1006 is OK: OK - running: The system is fully operational [15:55:40] (03PS1) 10Gehel: Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 [15:56:54] Wikipedia down? (Germany) [15:57:00] yes [15:57:13] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [15:57:28] Any possibility to circumvent it? ^^ [15:57:56] Guest12334_: refresh, your dns may be cached [15:58:18] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [15:58:23] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms [15:58:37] Huh, now it works again [15:58:58] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16877 bytes in 0.488 second response time [15:59:01] <_joe_> Nemo_bis: it should've been working already for some time [15:59:04] yeah. people fixed it ;) [15:59:19] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16879 bytes in 0.529 second response time [16:00:04] PROBLEM - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:00:15] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175168 [16:00:18] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175168#3584827 (10ops-monitoring-bot) [16:01:04] _joe_: yeah it worked for me when guest12334 asked [16:01:12] +1 [16:01:15] Nemo_bis: depending on isp/browser/etc. it could have a delay on seeing it up again [16:01:30] <_joe_> depending on broken DNS caching chains :) [16:01:34] yep [16:01:38] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175168#3584835 (10Volans) [16:01:38] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3584837 (10Volans) [16:01:51] it should be seconds, but we cannot control anything beyond the infrastructure :-) [16:02:22] well [16:02:28] we do control our TTLs, and they're 10 minutes [16:02:51] so when we do the depool, the expectation (set by us) is anywhere from 0-10m randomly for broken users to see things work again [16:02:54] <_joe_> bblack: we do, but some isps dns recursors don't respect cache TTL [16:03:05] yeah but even those that do, there's no expectation of an instant fix for all [16:03:08] oh, so large? [16:03:21] there are tradeoffs [16:03:27] yeah, I know [16:03:36] especially if you cannot predict it [16:03:54] https://phabricator.wikimedia.org/T140365 [16:04:03] ^ ticket about dropping the TTL from 10m -> 5m [16:05:33] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2284738 [16:06:14] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:54] mw2256 is starting to get annoying [16:07:42] jynus: I silenced all the alarms except the host one since we do need to know when it goes up and down [16:08:42] (03CR) 10Hoo man: [C: 04-1] "Looks fine, briefly tested the changes to the WD dump script on snapshot1007. -1 for the wrong file header."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) (owner: 10ArielGlenn) [16:08:51] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3584855 (10Eevans) [16:08:55] 10Operations, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3584854 (10Eevans) 05Open>03Resolved [16:09:13] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 2100772 [16:12:01] <_joe_> mobrovac: you can go on now, sorry [16:12:31] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3584863 (10Papaul) In the process of updating the firmware on the server, the server got again in a frozen state. nothing on the monitor and no keyboard response as well. [16:16:06] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:20:09] (03PS2) 10Gehel: Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 [16:20:35] elukey: ah, things were happening [16:20:54] I thought it was stalled, hence my annoyance [16:21:27] jynus: yeah I know :( we are trying to figure out why it randomly freezes [16:23:49] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3584905 (10elukey) All hosts up with OS installed and puppet/salt running. [16:24:35] (03PS13) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [16:24:36] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3584927 (10elukey) [16:25:04] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10elukey) [16:25:16] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.87 ms [16:25:23] _joe_: eh now i have a meeting in 5 mins, we'll have to postpone, tomorrow morning?
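A back-of-the-envelope sketch of the TTL discussion above: a well-behaved resolver that cached the record at a uniformly random point in the TTL window keeps serving the stale answer for the residual TTL, so with the 10m TTL the expected wait after a depool is about 5 minutes and the worst case 10 (ignoring the broken recursors _joe_ mentions). The numbers are illustrative only:

    import random

    TTL = 600  # the 10-minute TTL mentioned above, in seconds

    def residual_wait(ttl: float) -> float:
        """Seconds until one well-behaved resolver expires the stale record."""
        return random.uniform(0, ttl)

    waits = [residual_wait(TTL) for _ in range(100_000)]
    print(f"mean ~{sum(waits) / len(waits) / 60:.1f} min, worst case {TTL / 60:.0f} min")
    # dropping the TTL to 5m (T140365) halves both figures, at the cost of
    # roughly doubling the query load on the authoritative DNS servers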
[16:25:31] (03CR) 10Phedenskog: Make values stackable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [16:26:40] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3584940 (10Eevans) [16:31:02] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) (owner: 10Volans) [16:32:54] <_joe_> mobrovac: ok [16:33:00] <_joe_> mobrovac: you have too many meetings [16:33:07] tell me something i don't know [16:33:12] :P [16:37:17] (03CR) 10Volans: [C: 032] "Noop on neodymium as expected, change on sarin: https://puppet-compiler.wmflabs.org/compiler02/7755/" [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) (owner: 10Volans) [16:40:08] !log cp1062 - varnish backend restart (mailbox lag) [16:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] !log cp1072 - varnish backend restart (mailbox lag) [16:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:10] !log disable puppet for cloud things to test changes [16:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:28] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK [16:43:31] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175174 [16:43:34] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175174#3585012 (10ops-monitoring-bot) [16:45:09] (03PS22) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:45:11] godog: at least this time the handler worked... I'll close it as a duplicate [16:45:27] volans: heheh indeed, thanks! [16:45:37] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [16:45:43] godog: why did it alarm again btw? [16:46:01] volans: working on it [16:46:03] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3585025 (10Volans) [16:46:05] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175174#3585023 (10Volans) [16:46:10] papaul: ah ok, thanks! 
[16:46:22] (03CR) 10Rush: [C: 032] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:46:46] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) [16:49:17] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0 [16:49:17] (03CR) 10DCausse: [C: 032] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [16:49:18] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:57] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:07] RECOVERY - Check systemd state on lvs3001 is OK: OK - running: The system is fully operational [16:51:36] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/Flow/includes/: I284b5aa (duration: 01m 01s) [16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:19] (03PS14) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [16:53:21] (03CR) 10Muehlenhoff: [C: 031] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [16:57:07] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:31] (03CR) 10Chad: [C: 032] Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374663 (owner: 10Chad) [16:57:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:57:58] PROBLEM - DPKG on labtestneutron2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:58:58] RECOVERY - DPKG on labtestneutron2001 is OK: All packages OK [16:59:09] (03Merged) 10jenkins-bot: Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374663 (owner: 10Chad) [16:59:19] (03PS1) 10Rush: openstack: correct key path for labtest settings [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) [16:59:56] (03CR) 10DCausse: [V: 032 C: 032] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [17:01:28] PROBLEM - Check Varnish expiry mailbox lag on cp1048 is CRITICAL: CRITICAL: expiry mailbox lag is 2192370 [17:01:41] (03PS2) 10Rush: openstack: correct key paths for profiles [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) [17:02:34] (03CR) 10Rush: [C: 032] openstack: correct key paths for profiles [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:02:58] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:04:07] 10Operations, 10Ops-Access-Requests: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3584447 
(10Dzahn) I just checked this and yea pivot.wikimedia.org is using LDAP auth and one of the groups "wmf" or "nda" are enough to be granted access. Adding you to "wmf" isn't a... [17:05:18] (03PS1) 10Rush: openstack: amend keypaths for labtest nova settings [puppet] - 10https://gerrit.wikimedia.org/r/376293 (https://phabricator.wikimedia.org/T171494) [17:05:49] (03CR) 10Rush: [C: 032] openstack: amend keypaths for labtest nova settings [puppet] - 10https://gerrit.wikimedia.org/r/376293 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:07:00] (03PS1) 10Ema: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/376294 [17:08:11] (03CR) 10Ema: [V: 032 C: 032] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/376294 (owner: 10Ema) [17:08:18] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:09:41] (03PS1) 10Dzahn: admins: add rschoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) [17:10:34] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585203 (10Dzahn) cn: Schoenbaechler mail: rschoenbaechler@wikimedia.org uid: schoenbaechler note to others: cn/uid differ, watch out [17:11:57] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [17:12:45] (03PS2) 10Dzahn: admins: add schoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) [17:13:28] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [17:15:44] (03CR) 10Dzahn: [C: 032] admins: add schoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) (owner: 10Dzahn) [17:16:20] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3585248 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi Disk replacement complete. Below please see information for return package. {F9360309} [17:16:34] !log modify-ldap-group on terbium is broken [17:16:40] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team-Backlog (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3585250 (10awight) [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:18] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:19:22] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585265 (10Verdy_p) >>! In T171392#3584666, @zhuyife... 
[17:20:04] !log added LDAP user schoenbaechler to WMF group (T175156) [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:16] T175156: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156 [17:21:32] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585279 (10Dzahn) Hi Robin @schoenbaechler, you have been added to the relevant group. You should be able to login now, using your wikitech.wikimedia.org / LDAP... [17:21:46] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585280 (10Dzahn) 05Open>03Resolved a:03Dzahn [17:23:12] !log demon@tin Synchronized wmf-config/FeaturedFeedsWMF.php: code cleanup (duration: 00m 49s) [17:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:40] (03CR) 10Rush: [C: 032] openstack: refactor corrections for labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/376301 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:25:43] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3585349 (10Krinkle) [17:25:57] (03CR) 10Filippo Giunchedi: [C: 031] Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 (owner: 10Muehlenhoff) [17:27:42] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:28:36] (03PS1) 10Rush: openstack: rabbit_user keypath nova specific [puppet] - 10https://gerrit.wikimedia.org/r/376302 (https://phabricator.wikimedia.org/T171494) [17:28:45] (03PS2) 10Rush: openstack: rabbit_user keypath nova specific [puppet] - 10https://gerrit.wikimedia.org/r/376302 (https://phabricator.wikimedia.org/T171494) [17:28:58] 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3585418 (10Dzahn) [17:29:04] 10Operations, 10Security-Team, 10vm-requests: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3585416 (10Dzahn) 05declined>03Open @EddieGP Maybe, not sure. I'll take it and reopen to figure it out. [17:31:32] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3585440 (10Dzahn) a:03Dzahn Oh, thanks @Aklapper :) yep [17:31:42] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:31:55] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:43] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3585446 (10madhuvishy) @Robh @Cmjohnson I'm able to log in to both machines with their .wikimedia.org hostnames and run puppet fine. However, when I hop into the serial console, they bo...
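For context, adding a member to an LDAP group (as !logged above) boils down to a single LDAP modify operation appending a value to the group's member attribute. A minimal python-ldap sketch with hypothetical DNs and server; this is not the actual modify-ldap-group tool on terbium (reported broken above):

    import ldap

    conn = ldap.initialize("ldaps://ldap.example.org")  # hypothetical server
    conn.simple_bind_s("cn=admin,dc=example,dc=org", "secret")
    # append one value to the group's member attribute
    conn.modify_s(
        "cn=wmf,ou=groups,dc=example,dc=org",
        [(ldap.MOD_ADD, "member", [b"uid=schoenbaechler,ou=people,dc=example,dc=org"])],
    )
    conn.unbind_s()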
[17:33:08] (03PS1) 10Rush: openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) [17:33:11] (03PS10) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [17:33:12] (03PS2) 10Rush: openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) [17:34:09] (03CR) 10Rush: [C: 032] openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:34:26] (03PS4) 10Paladox: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 [17:34:26] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3585453 (10Dzahn) p:05Triage>03Normal [17:34:29] (03PS8) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [17:34:49] (03PS21) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [17:35:52] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:37:52] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: decommission cp3001 & cp3002 - https://phabricator.wikimedia.org/T94215#3585506 (10Dzahn) @Robh does this need the decom template (after the fact)? [17:40:24] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3585536 (10Cmjohnson) Probably could use a bios update. [17:40:33] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3194210 (10Dzahn) Given all the work that has gone into this single host and it still being dead after all this.. i suggest we just give up on it and permanently decom it. It probably costs us less in the end that way. [17:44:49] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585563 (10Anomie) >>! In T171392#3585265, @Verdy_p... 
[17:45:40] (03PS1) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:45:52] (03PS2) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:45:56] (03CR) 10jerkins-bot: [V: 04-1] openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:46:47] (03PS3) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:47:13] (03CR) 10Phedenskog: Make values stackable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:47:19] (03CR) 10Rush: [C: 032] openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:47:39] (03PS2) 10Paladox: planet: add Wikimedia Readers blog [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:48:07] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3585595 (10Cmjohnson) Created the bootable img using the HP utility provided in the iso. It is a Windows software and had to borrow from a family member. Booted the Service pack and... [17:49:12] (03CR) 10Dzahn: [C: 032] "thanks Paladox for adding the feed in both places (new for rawdog, upcoming replacement of planet-venus on stretch)" [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:49:26] (03CR) 10Paladox: "You're welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:49:40] (03PS1) 10Chad: group1 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376305 [17:49:52] (03PS3) 10Dzahn: planet: add Wikimedia Readers blog [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:52:43] (03PS15) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [17:53:00] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585627 (10Jarekt) I can look through c:Module:Fallb... [17:53:09] !log cp1048 - varnish backend restart for mailbox lag...
[17:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:40] (03PS1) 10Rush: openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) [17:54:19] (03CR) 10Phedenskog: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:54:23] (03PS2) 10Rush: openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) [17:54:46] (03PS2) 10BBlack: VCL: fix keep values at 7d [puppet] - 10https://gerrit.wikimedia.org/r/364605 [17:55:01] (03CR) 10Rush: [C: 032] openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:56:17] (03PS1) 10BBlack: browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) [17:56:19] (03PS1) 10BBlack: browsersec: bump to 14% 2017-09-07 [puppet] - 10https://gerrit.wikimedia.org/r/376310 (https://phabricator.wikimedia.org/T163251) [17:56:21] (03PS1) 10BBlack: browsersec: bump to 17% 2017-09-14 [puppet] - 10https://gerrit.wikimedia.org/r/376311 (https://phabricator.wikimedia.org/T163251) [17:56:23] (03PS1) 10BBlack: browsersec: bump to 20% 2017-09-21 [puppet] - 10https://gerrit.wikimedia.org/r/376312 (https://phabricator.wikimedia.org/T163251) [17:56:25] (03PS1) 10BBlack: browsersec: bump to 23% 2017-09-28 [puppet] - 10https://gerrit.wikimedia.org/r/376313 (https://phabricator.wikimedia.org/T163251) [17:56:27] (03PS1) 10BBlack: browsersec: bump to 26% 2017-10-05 [puppet] - 10https://gerrit.wikimedia.org/r/376314 (https://phabricator.wikimedia.org/T163251) [17:56:29] (03PS1) 10BBlack: browsersec: bump to 29% 2017-10-12 [puppet] - 10https://gerrit.wikimedia.org/r/376315 (https://phabricator.wikimedia.org/T163251) [17:56:31] (03PS1) 10BBlack: browsersec: bump to 100% 2017-10-17 [puppet] - 10https://gerrit.wikimedia.org/r/376316 (https://phabricator.wikimedia.org/T163251) [17:56:33] (03CR) 10Rush: [C: 04-1] "I backported this from the ping online but it appears it wasn't merged. This is now part of https://gerrit.wikimedia.org/r/#/c/376026/ wh" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [17:57:11] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2143201 [17:57:54] (03PS1) 10Hoo man: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1800). [18:00:04] jan_drewniak and MaxSem: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
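The browsersec patch series above schedules a ramp from 14% up to 100% over six weeks. One common way to implement such a percentage rollout is deterministic bucketing, so that a given client stays enrolled as the percentage grows rather than flapping in and out; a generic sketch, not necessarily what the VCL in those patches actually does:

    import hashlib

    def in_rollout(client_id: str, pct: int) -> bool:
        """Hash the client into one of 100 stable buckets; enroll the first pct."""
        bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
        return bucket < pct

    # the schedule above: 14% on 09-07, 17% on 09-14, ... 100% on 10-17
    for pct in (14, 17, 20, 23, 26, 29, 100):
        print(pct, in_rollout("192.0.2.1", pct))  # hypothetical client key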
[18:00:06] I'll swat [18:00:16] o/ [18:00:38] * hoo just added https://gerrit.wikimedia.org/r/376317 to SWAT [18:00:43] a quick review would be nice [18:00:49] (03PS2) 10MaxSem: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:00:54] (03CR) 10MaxSem: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:01:11] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:40] RECOVERY - Check Varnish expiry mailbox lag on cp1048 is OK: OK: expiry mailbox lag is 0 [18:01:46] (03PS3) 10BBlack: VCL: fixed keep values: 7d def, 1d for text [puppet] - 10https://gerrit.wikimedia.org/r/364605 [18:01:47] Amir1: ^ [18:02:32] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:03:15] ema: so I've amended my lingering https://gerrit.wikimedia.org/r/#/c/364605/ patch that gets rid of keep-relative-to-TTL to also fix text down to 1d static, so that we don't risk ugly problems with MW's bad-304 [18:03:39] ema: it could stand to be better thought out or dealt with, but maybe if we're lucky this reduces the mailbox lag rate :P [18:04:29] anomie, Warning: Using deprecated fallback handling for comment rev_comment [Called from CommentStore::getCommentInternal in /srv/mediawiki/php-1.30.0- [18:04:29] wmf.17/includes/CommentStore.php at line 200] in /srv/mediawiki/php-1.30.0-wmf.17/includes/debug/MWDebug.php on line 309 [18:04:31] (03CR) 10BBlack: [C: 032] VCL: fixed keep values: 7d def, 1d for text [puppet] - 10https://gerrit.wikimedia.org/r/364605 (owner: 10BBlack) [18:04:50] heh apparently I split that over two channels, oh well [18:05:57] jan_drewniak, pulled on mwdebug1002 [18:06:40] MaxSem: looks good [18:08:11] MaxSem: I talked about that with no_justification in #mediawiki-core yesterday and earlier today. no_justification: Re those "Using deprecated fallback handling for comment" warnings, I found backtraces in error.log on mwlog1001. There seem to be three. Flow should be fixed by backporting https://gerrit.wikimedia.org/r/#/c/374861/. MobileFrontend has SpecialMobileHistory and SpecialMobileContributions, for which T175161 exists. [18:08:12] T175161: Special:MobileHistory warning: Using deprecated fallback handling for comment rev_comment [Called from CommentStore::getCommentInternal in /Users/jrobson/git/core/includes/CommentStore.php at line 200] - https://phabricator.wikimedia.org/T175161 [18:08:33] anomie: The Flow one I backported and sync'd already [18:08:46] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 50s) [18:08:50] Also, fwiw, I don't see the other ones in the log anymore tbh [18:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:08] Wait. Yes I do. 
group1 [18:09:09] Not 0 [18:09:12] * no_justification sighs [18:09:37] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 49s) [18:09:41] jan_drewniak, ^ [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:22] (03PS4) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [18:10:27] (03PS2) 10BBlack: browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) [18:10:30] (03PS1) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:10:35] MaxSem: looks good in prod, thanks! [18:10:42] (03PS2) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:10:44] (03CR) 10BBlack: [V: 032 C: 032] browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [18:11:13] hoo, you're next [18:11:18] (03PS3) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:11:21] (03PS2) 10MaxSem: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:11:57] (03CR) 10MaxSem: [C: 032] Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:12:01] (03CR) 10Rush: [C: 032] openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:12:12] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3585730 (10Dzahn) a:03Dzahn [18:13:32] (03Merged) 10jenkins-bot: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:14:09] hoo, pulled on mwdebug1002 [18:14:21] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:14:39] looks good [18:15:09] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3523998 (10herron) Also, FWIW, https://www.icinga.com/docs/icinga1/latest/en/tuning.html item 11 outlines a similar approach. [18:16:30] hoo, first -production, then the other file? 
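On bblack's keep-values change above: a Varnish object can still be revalidated (and so must be retained) for ttl + grace + keep, so deriving keep from the TTL lets long-TTL objects linger for the expiry thread, which is the suspected feeder of the recurring mailbox-lag alerts in this log. A toy comparison with made-up grace and ratio values:

    DAY = 86400

    def retention(ttl: float, grace: float, keep: float) -> float:
        """Seconds an object stays around for possible IMS revalidation."""
        return ttl + grace + keep

    ttl, grace = 7 * DAY, 1 * DAY                # hypothetical long-lived object
    print(retention(ttl, grace, ttl) / DAY)      # keep tied to TTL: 15 days
    print(retention(ttl, grace, 1 * DAY) / DAY)  # fixed 1d keep for text: 9 days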
[18:16:40] PROBLEM - Check whether ferm is active by checking the default input chain on labcontrol1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [18:17:10] Both are fine, given the value in Wikibase.php is overwritten in -prod anyway [18:17:40] RECOVERY - Check whether ferm is active by checking the default input chain on labcontrol1002 is OK: OK ferm input default policy is set [18:18:16] !log maxsem@tin Synchronized wmf-config/Wikibase-production.php: https://gerrit.wikimedia.org/r/#/c/376317/2 (duration: 00m 48s) [18:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:37] Looks fine, thanks [18:18:39] sjoerddebruin: ^ [18:18:43] <3 [18:18:59] Working as intended now. [18:19:26] !log maxsem@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/376317/2 (duration: 00m 48s) [18:19:32] hoo, ^ [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:24] Looks good still [18:21:29] (03PS1) 10Rush: openstack: @resolve for saltcleaningcert rule [puppet] - 10https://gerrit.wikimedia.org/r/376320 (https://phabricator.wikimedia.org/T171494) [18:22:22] (03CR) 10Rush: [C: 032] openstack: @resolve for saltcleaningcert rule [puppet] - 10https://gerrit.wikimedia.org/r/376320 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:22:25] (03PS2) 10MaxSem: labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 [18:22:36] (03CR) 10MaxSem: [C: 032] labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 (owner: 10MaxSem) [18:23:57] (03Merged) 10jenkins-bot: labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 (owner: 10MaxSem) [18:25:22] !log maxsem@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/370291/2 (duration: 00m 49s) [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:02] (03PS9) 10Dzahn: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:27:03] (03PS3) 10MaxSem: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 [18:27:32] (03CR) 10MaxSem: [C: 032] Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 (owner: 10MaxSem) [18:28:59] (03Merged) 10jenkins-bot: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 (owner: 10MaxSem) [18:30:04] (03PS1) 10Hashar: package_builder: typo: s/output/result/ directory [puppet] - 10https://gerrit.wikimedia.org/r/376322 [18:30:30] (03CR) 10Hashar: "That was confusing me :]" [puppet] - 10https://gerrit.wikimedia.org/r/376322 (owner: 10Hashar) [18:30:32] (03CR) 10Dzahn: [C: 032] Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:31:50] hoo: what's up?
[18:31:54] I just got here [18:33:45] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/370292/3 (duration: 00m 49s) [18:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:28] Amir1: https://phabricator.wikimedia.org/T174962#3585663 [18:34:30] all solved by now [18:34:36] Oh, thanks [18:35:02] (03PS3) 10MaxSem: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 [18:35:09] (03CR) 10MaxSem: [C: 032] Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 (owner: 10MaxSem) [18:38:36] (03CR) 10Phedenskog: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [18:38:59] (03Merged) 10jenkins-bot: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 (owner: 10MaxSem) [18:42:22] (03PS1) 10MaxSem: Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 [18:43:10] (03CR) 10MaxSem: [C: 032] Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 (owner: 10MaxSem) [18:44:41] (03Merged) 10jenkins-bot: Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 (owner: 10MaxSem) [18:45:10] MaxSem: why revert? [18:45:21] legoktm: php sucks [18:45:24] went boom [18:45:40] ugh [18:45:47] (03PS1) 10MaxSem: Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 [18:45:49] (03CR) 10MaxSem: [C: 032] Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 (owner: 10MaxSem) [18:46:05] will figure out later [18:46:42] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3585913 (10Jalexander) FTR this can get held off for now (or even just closed as rejected). We're transitioning away from Mailman for this list. [18:46:53] (03CR) 10MaxSem: [V: 032 C: 032] Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 (owner: 10MaxSem) [18:48:17] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: revert (duration: 00m 49s) [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:47] (03CR) 10Paladox: "Works on 2.14" [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:48:50] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:37] (03PS5) 10Paladox: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 [18:50:51] !log maxsem@tin Synchronized wmf-config/: just to make sure... 
(duration: 00m 50s) [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] ok, we're done here [18:54:13] * MaxSem hides [18:57:55] (03PS1) 10Rush: openstack: clean up common and nova common packages [puppet] - 10https://gerrit.wikimedia.org/r/376329 (https://phabricator.wikimedia.org/T171494) [18:58:38] (03CR) 10Rush: [C: 032] openstack: clean up common and nova common packages [puppet] - 10https://gerrit.wikimedia.org/r/376329 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1900). [19:04:00] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:09:50] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10Jdforrester-WMF) >>! In T98813#3365586, @bd808 wrote: >>>! In T98813#3135116, @greg wrote: >> Added T161553 as a subtask per above comments. > > I removed... [19:12:32] (03PS1) 10Rush: openstack: preserve hiera settings for new virt role [puppet] - 10https://gerrit.wikimedia.org/r/376331 (https://phabricator.wikimedia.org/T171494) [19:12:49] Reedy, no_justification: The only remaining blocker to T31902 is T104148 – worth getting done? [19:12:50] T31902: Tidy up wmf-config CommonSettings.php and InitialiseSettings.php - https://phabricator.wikimedia.org/T31902 [19:12:50] T104148: Change Squid references in Wikimedia configuration files - https://phabricator.wikimedia.org/T104148 [19:13:05] 2011 bugs FTW. [19:13:14] (03CR) 10Rush: [C: 032] openstack: preserve hiera settings for new virt role [puppet] - 10https://gerrit.wikimedia.org/r/376331 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:15:10] James_F: Meh. I honestly don't care enough. [19:15:30] * James_F grins. [19:15:49] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3586004 (10Dzahn) ok, thanks @Jalexander ! @eliza told me about it and i was about to set it to stalled for that reason [19:16:00] I mean by all means do it, but I've got 99 problems and legacy references to squid ain't one [19:19:11] (03Restored) 10Dzahn: Cleanup: squid.php → ReverseProxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:19:21] James_F: Seems to be a relatively easy jfdi if someone wants to deploy it [19:19:31] Deploy newfilename.php first [19:19:36] Then CommonSettings [19:19:43] Then sync-dir the lot to get rid of the old [19:20:01] Yeah. but no_justification abandoned the change a few days ago (as it'd sat there untouched for months). 
[19:20:13] he's the worst [19:20:35] (03CR) 10Dzahn: "per IRC today - this is still wanted as it's the last blocker for also closing https://phabricator.wikimedia.org/T31902 entirely - but nee" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:21:22] first thing that needs a manual rebase [19:21:28] which will be fun when it's a year old [19:21:30] At least no_justification gave justification :p [19:22:17] "old" :) [19:22:23] mutante: it might be easiest just to do it from scratch [19:22:28] Rather than faff with a rebase [19:22:48] Unless it'll rebase fine on normal git [19:22:48] mutante: I dropped basically every patch for mw-config that had sat untouched for a year-ish [19:22:50] whereas jgit sucks [19:23:30] Only conflict is commonsettings [19:23:32] In local rebase [19:23:35] yea, right.. they are getting harder the older they get [19:23:46] ah :) [19:23:46] It wasn't even that they're harder to rebase or anything [19:23:53] It's just that they clearly aren't important and nobody cares [19:24:25] Actually. It's probably deceptive [19:24:29] New file, deleted file [19:24:32] It's mostly renames [19:24:34] Yeahhhhh [19:24:38] Should redo by hand [19:24:57] The new files will be the old new files from 2016 [19:25:04] On a rebase [19:25:18] Oh, maybe not [19:25:19] nvm [19:25:20] Misread [19:25:21] Still [19:25:23] W/e [19:25:24] Go ahead [19:25:35] fuck that shit up [19:26:37] you know what is also in there... [19:26:41] string "labs" heh [19:27:00] People love filenames way more than I [19:27:11] * no_justification goes back to not giving any fucks :) [19:28:25] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3586066 (10Gilles) [19:28:51] mutante: Indeed. There's a task for that too. [19:29:05] mutante: Though it's hard-coded as a realm, IIUC, so… [19:29:24] Everyone hates having to make fixes in ops/puppet. ;-) [19:29:27] I had a patch [19:29:31] I abandoned it recently [19:29:32] heh, yea, is it $realm cloud yet ?:) [19:29:37] Reedy: You /always/ have a patch. [19:29:44] ALL OF THE PATCHES [19:29:46] mutante: s/cloud/staging/, surely. [19:30:16] eh.. ok :) [19:30:42] but all the $realm checks should be replaced with Hiera ..afaict [19:31:09] * James_F leaves that to people that know what they're doing. [19:33:41] beta [19:33:45] deployment-prep! 
[19:33:51] ;-) [19:34:15] otherwikisweshouldcareaboutbutdontasmuchasweshould [19:36:16] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586081 (10BBlack) [19:36:24] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586097 (10BBlack) p:05Triage>03High [19:36:46] 10Operations: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651#973429 (10BBlack) [19:36:48] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586081 (10BBlack) [19:37:51] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10RHo) [19:38:06] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3586116 (10debt) 05Open>03Resolved Thanks! [19:39:37] : Everyone hates having to make fixes in ops/puppet. ;-) [19:39:39] {{cn}} [19:39:46] "Everyone" is awfully broad ;-) [19:40:31] (03PS1) 10Ottomata: Apply kafka::jumbo::broker on new kafka-jumbo100* hosts [puppet] - 10https://gerrit.wikimedia.org/r/376336 (https://phabricator.wikimedia.org/T167992) [19:40:36] (03PS1) 10Dzahn: copy squid.php->reverse-proxy.php, squid-labs->reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) [19:41:05] James_F: Thoughts on T175080 btw? [19:41:06] T175080: Flow fails to load content when running CirrusSearchLinksUpdate jobs - https://phabricator.wikimedia.org/T175080 [19:41:12] (03CR) 10Ottomata: [C: 032] Apply kafka::jumbo::broker on new kafka-jumbo100* hosts [puppet] - 10https://gerrit.wikimedia.org/r/376336 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:41:16] Was super spammy on group0 yesterday. Pretty sure it's not Cirrus' fault ultimately [19:41:32] no_justification: Eurgh. [19:41:48] It's probably harmless. But it was louddddddddd [19:41:52] So I'm holding group1 for the moment. [19:41:57] Hopefully not til COB [19:42:14] no_justification: That sounds like it's caused by the fix for the bug Ariel asked for in the dumps for Flow. [19:42:40] no_justification: Previously Flow was loading content for some call where WikiPage doesn't load content. [19:42:53] ok, https://gerrit.wikimedia.org/r/#/c/376337/1 but i'm not also fixing the _content_ of the files , heh [19:42:57] no_justification: Sounds like Cirrus is for some reason depending on it loading content. [19:43:05] e.g. "# our text squid in beta labs gets forwarded requests [19:43:07] * James_F hunts. [19:43:59] lol, "text squid in beta labs" is actually triple combo or so :) [19:44:57] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10Reedy) Can you get your manager to sign off too? [19:45:01] no_justification: Hmm. No, that was https://phabricator.wikimedia.org/T172025 but it wasn't merged. https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/Flow+is:merged doesn't show anything confusing. [19:45:11] s/confusing/suggestive.
[19:45:28] (03PS16) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [19:45:42] (03Abandoned) 10Dzahn: Cleanup: squid.php → ReverseProxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:45:48] (03PS1) 10Ottomata: Un-apply kafka role -- these should be stretch, not jessie! :/ [puppet] - 10https://gerrit.wikimedia.org/r/376339 (https://phabricator.wikimedia.org/T167992) [19:46:02] (03CR) 10Ottomata: [V: 032 C: 032] Un-apply kafka role -- these should be stretch, not jessie! :/ [puppet] - 10https://gerrit.wikimedia.org/r/376339 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:46:40] PROBLEM - puppet last run on kafka-jumbo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:47:38] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586162 (10RHo) Sure - adding @Nirzar for sign-off [19:48:50] !log reboot labvirt1018 [19:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:05] (03PS1) 10Ottomata: Install kafka-jumbo as Stretch [puppet] - 10https://gerrit.wikimedia.org/r/376340 (https://phabricator.wikimedia.org/T167992) [19:49:39] (03CR) 10Ottomata: [C: 032] Install kafka-jumbo as Stretch [puppet] - 10https://gerrit.wikimedia.org/r/376340 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:50:59] (03CR) 10Jforrester: "After this:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [19:51:21] PROBLEM - puppet last run on kafka-jumbo1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:30] PROBLEM - puppet last run on kafka-jumbo1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:30] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 annual plan program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3586179 (10GWicke) [19:53:07] !log reimaging kafka-jumbo* with stretch [19:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:28] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['kafka-jumbo1001... [19:54:29] James_F: Yeah. It's bothersome because it's /definitely/ new to wmf.17 [19:54:56] no_justification: I'm assuming it was there before anomie's fix for the comment thing? [19:55:05] Yeah it was [19:55:06] no_justification: That's the only recent code in Flow. [19:55:09] Hmm. OK. 
[19:55:14] I think, at least [19:55:48] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 annual plan program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3586225 (10GWicke) [19:57:31] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 7 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:58:06] (03PS1) 10Rush: openstack: correction to 376331 [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:58:48] (03PS2) 10Rush: openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:58:56] (03PS3) 10Rush: openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:59:40] (03CR) 10Rush: [C: 032] openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:00:06] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2000). Please do the needful. [20:00:15] Nothing for ORES. [20:01:12] arlo will do a parsoid deploy [20:02:43] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10Pchelolo) [20:06:52] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586325 (10GWicke) [20:07:41] RECOVERY - puppet last run on kafka-jumbo1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:07:50] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:07:50] RECOVERY - puppet last run on kafka-jumbo1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:08:01] RECOVERY - puppet last run on kafka-jumbo1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:11:31] (03PS1) 10Ottomata: Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 [20:11:38] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 (owner: 10Ottomata) [20:11:42] (03PS2) 10Ottomata: Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 [20:11:44] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 (owner: 10Ottomata) [20:13:48] (03PS1) 10Ottomata: Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/"" [puppet] - 10https://gerrit.wikimedia.org/r/376350 [20:14:00] (03CR) 10Ottomata: [V: 032 C: 032] "Ah, these are still reimaging... 
:0" [puppet] - 10https://gerrit.wikimedia.org/r/376350 (owner: 10Ottomata) [20:14:17] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586366 (10GWicke) [20:16:47] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3586375 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka-jumbo1001.eqiad.wmnet', 'kafka-jumbo1002.eqiad.wmnet', 'k... [20:16:57] (03PS1) 10Ottomata: Revert "Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/""" [puppet] - 10https://gerrit.wikimedia.org/r/376371 [20:17:17] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/""" [puppet] - 10https://gerrit.wikimedia.org/r/376371 (owner: 10Ottomata) [20:18:42] !log arlolra@tin Started deploy [parsoid/deploy@f07ac8c]: Updating Parsoid to f9d367ea [20:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:12] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586379 (10GWicke) [20:21:48] (03PS1) 10Ottomata: Don't use ganglia on new kafka-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/376376 [20:22:41] (03CR) 10Ottomata: [C: 032] Don't use ganglia on new kafka-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/376376 (owner: 10Ottomata) [20:27:10] !log arlolra@tin Finished deploy [parsoid/deploy@f07ac8c]: Updating Parsoid to f9d367ea (duration: 08m 27s) [20:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] (03PS1) 10Ottomata: Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) [20:29:34] (03PS1) 10Hashar: package_builder: test -nt differs in bash vs dash [puppet] - 10https://gerrit.wikimedia.org/r/376378 [20:30:00] (03CR) 10jerkins-bot: [V: 04-1] Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:31:09] (03CR) 10Ottomata: [V: 032 C: 032] Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:33:50] (03PS1) 10Ottomata: Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) [20:34:14] !log Updated Parsoid to f9d367ea (T169342) [20:34:14] (03PS2) 10Ottomata: Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) [20:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:27] T169342: Gallery output for missing images is not consistent with PHP parser and is missing data - https://phabricator.wikimedia.org/T169342 [20:34:43] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586434 (10Nirzar) Looks good [20:35:05] (03CR) 10Ottomata: [C: 032] Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) (owner: 
10Ottomata) [20:37:21] (03CR) 10Hashar: "Only happens when buildresult/Packages is missing. dash just skip it :]" [puppet] - 10https://gerrit.wikimedia.org/r/376378 (owner: 10Hashar) [20:39:43] (03PS1) 10Ottomata: Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 [20:40:20] (03CR) 10jerkins-bot: [V: 04-1] Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 (owner: 10Ottomata) [20:40:57] (03CR) 10Ottomata: [V: 032 C: 032] Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 (owner: 10Ottomata) [20:43:42] (03PS1) 10Ottomata: /etc/kafka/mirror should require confluent-kafka package [puppet] - 10https://gerrit.wikimedia.org/r/376395 (https://phabricator.wikimedia.org/T376379) [20:44:19] (03CR) 10Ottomata: [C: 032] /etc/kafka/mirror should require confluent-kafka package [puppet] - 10https://gerrit.wikimedia.org/r/376395 (https://phabricator.wikimedia.org/T376379) (owner: 10Ottomata) [20:45:41] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3586455 (10RobH) IIRC the bios is already the newest version. I flashed the bios and the ilom when I installed them. [20:53:31] (03PS1) 10Ottomata: Allow new kafka-jumbo hosts to talk to zookeeper on conf* [puppet] - 10https://gerrit.wikimedia.org/r/376407 (https://phabricator.wikimedia.org/T167992) [20:56:53] (03CR) 10Ottomata: [C: 032] Allow new kafka-jumbo hosts to talk to zookeeper on conf* [puppet] - 10https://gerrit.wikimedia.org/r/376407 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:57:13] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [20:57:46] checking [20:57:47] could be related [20:58:43] PROBLEM - nova-api process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [20:58:52] PROBLEM - nova-api http on labnet1002 is CRITICAL: connect to address 10.64.20.25 and port 8774: Connection refused [21:00:04] kaldari, MaxSem, and Niharika: Dear anthropoid, the time has come. Please deploy ArticleCreationWorkflow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2100). [21:00:57] (03PS1) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:01:02] (03PS2) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:01:27] (03Draft1) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:01:28] (03PS2) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:01:47] MatmaRex i finally fixed the issue you described :) [21:02:20] kaldari, what's the battle plan? [21:02:24] MaxSem: kaldari: What do we need to do for ^? [21:02:36] (03PS3) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:02:44] I think he made a patch for it. 
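Hashar's dash-vs-bash note above is reproducible: with file a present and b missing, bash's [ a -nt b ] is true (bash treats "file1 exists, file2 does not" as newer) while dash's is false, so the dash branch is silently skipped. A small sketch, assuming both shells are installed:

    import os
    import subprocess
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "a"), "w").close()  # create a; leave b missing
        for sh in ("bash", "dash"):
            r = subprocess.run(
                [sh, "-c", "[ a -nt b ] && echo true || echo false"],
                cwd=d, capture_output=True, text=True,
            )
            print(sh, r.stdout.strip())  # expected: bash true, dash false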
[21:02:46] (03PS4) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:03:26] (03PS1) 10Ottomata: Add monitoring hostgroup for jumbo_kafka_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376424 [21:03:42] (03CR) 10Ottomata: [V: 032 C: 032] Add monitoring hostgroup for jumbo_kafka_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376424 (owner: 10Ottomata) [21:04:19] (03CR) 10Rush: [C: 032] openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:04:25] (03PS5) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:04:31] (03PS3) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:04:41] MaxSem: We're on our own. :P I wanna do this. [21:04:55] dooooo it [21:05:08] MaxSem: Let's see. For beta cluster... [21:05:33] (03CR) 10Paladox: [C: 04-1] "This fixes a security issue described in task. This needs chad +1 and for us to be on 2.14 and to be running change https://gerrit-review." [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) (owner: 10Paladox) [21:08:42] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:35] (03PS1) 10Ottomata: Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) [21:10:23] (03CR) 10Ottomata: [C: 032] Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [21:10:29] (03PS2) 10Ottomata: Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) [21:10:31] (03CR) 10Ottomata: [V: 032 C: 032] Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [21:10:54] (03PS1) 10Rush: openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) [21:11:33] (03PS2) 10Rush: openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) [21:12:20] (03CR) 10Rush: [C: 032] openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:13:07] (03PS4) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:16:34] (03PS1) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:16:41] MaxSem: ^ [21:17:18] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [21:17:23] MaxSem: Some extensions, like LoginNotify and ULS, don't have a wfLoadExtension in there. When do I need to add it and when not?
[21:17:57] !log rebooting kafka-jumbo1004 [21:18:05] (03CR) 10jerkins-bot: [V: 04-1] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:20] Niharika, LN is loaded from the main CS.php [21:19:59] (03PS2) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:20:16] MaxSem: So for beta cluster we load both prod and labs CS files? That doesn't make sense. [21:20:18] (03CR) 10MaxSem: Configure ACW for Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:20:23] (03PS1) 10Rush: openstack: pass in network_public_ip for nova network [puppet] - 10https://gerrit.wikimedia.org/r/376433 (https://phabricator.wikimedia.org/T171494) [21:20:45] makes perfect sense: labs is a copy of prod [21:21:41] (03CR) 10Rush: [C: 032] openstack: pass in network_public_ip for nova network [puppet] - 10https://gerrit.wikimedia.org/r/376433 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:23:00] (03PS3) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:23:50] (03PS1) 10Ottomata: Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 [21:24:18] (03PS2) 10Ottomata: Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 [21:24:36] (03CR) 10MaxSem: [C: 04-1] Configure ACW for Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:25:17] (03CR) 10Ottomata: [C: 032] Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 (owner: 10Ottomata) [21:25:34] MaxSem: Then how do I exclude multiple rights? :| (And why the heck is this still not documented) [21:26:24] Where's the config.txt? [21:26:30] what do you mean by not documented? https://github.com/wikimedia/mediawiki-extensions-ArticleCreationWorkflow/blob/master/doc/config.txt [21:27:10] MaxSem: Doh, didn't see that. We can't exclude multiple rights? Okay. ¯\_(ツ)_/¯ [21:27:37] (03PS4) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:33:04] MaxSem: ^ [21:33:35] (03CR) 10MaxSem: [C: 031] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:33:50] MaxSem: Who's gonna +2? :P [21:34:10] you said you wanted to do it yourself? [21:34:30] MaxSem: You can +2? [21:34:40] (03CR) 10MaxSem: [C: 032] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:34:44] pfft:P [21:34:59] self merge ftw [21:35:02] test in production ftw [21:35:11] :P [21:35:34] bonus points for enwiki [21:35:46] MaxSem wins today [21:36:08] hope you guys don't mind if I do a Node.js service deploy, too. Should not affect any of the MW stuff you are doing. 
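[Editor's note: the exchange above turns on how the cluster's MediaWiki configuration is layered. The beta cluster loads the production config files first and then the "-labs" variants on top ("labs is a copy of prod"), so an extension such as LoginNotify, already loaded from the main CommonSettings.php, needs no second wfLoadExtension call in CommonSettings-labs.php; only beta-specific settings belong in the labs files. A minimal PHP sketch of that layering follows. The file names are real, but the contents and the realm check are paraphrased, and the ArticleCreationWorkflow line is illustrative rather than the actual patch.]

    <?php
    // CommonSettings.php (production) -- heavily simplified sketch.
    // Anything loaded here is in effect on the beta cluster as well.
    wfLoadExtension( 'LoginNotify' );

    // Near the end of the production file, beta layers its overrides on
    // top; production hosts skip this (realm check paraphrased):
    if ( $wmfRealm === 'labs' ) {
        require __DIR__ . '/CommonSettings-labs.php';
    }

    // CommonSettings-labs.php -- beta cluster only. Holds just the
    // beta-specific additions, e.g. an extension being trialled on beta
    // ahead of any production rollout (illustrative):
    //     wfLoadExtension( 'ArticleCreationWorkflow' );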
[21:36:10] (03Merged) 10jenkins-bot: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:37:16] MaxSem: Steps same as for prod and a full scap, right? [21:37:24] o_0 [21:37:27] whyyyyyy? [21:37:31] Niharika: just sync-file [21:37:41] Ah okay. It's already there. [21:37:42] so tin/noc is up to date [21:37:50] Gotcha. [21:39:29] !log niharika29@tin Synchronized wmf-config/CommonSettings-labs.php: Config for ArticleCreationWorkflow T175054 (duration: 00m 50s) [21:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:42] T175054: Test ArticleCreationWorkflow on the Beta Cluster - https://phabricator.wikimedia.org/T175054 [21:40:26] "21:39:19 Check 'Logstash Error rate for mw1265.eqiad.wmnet' failed: ERROR: 7% OVER_THRESHOLD (Avg. Error rate: Before: 0.37, After: 4.00, Threshold: 3.71)" [21:47:33] (03PS1) 10Rush: openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) [21:48:29] (03CR) 10Andrew Bogott: [C: 031] openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:49:00] (03PS2) 10Rush: openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) [21:50:21] (03CR) 10Rush: [C: 032] openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:55:50] (03PS1) 10Rush: openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) [21:56:09] (03PS2) 10Rush: openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) [21:57:02] (03CR) 10Rush: [C: 032] openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:59:42] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:02:49] Hi! [22:03:10] channel: does anyone know how long it takes for cloaks to be assigned?
I opened my request around 4 weeks ago [22:03:35] wrong ops :) [22:03:42] see #wikimedia-ops [22:03:44] :) [22:03:55] hehe [22:04:06] greg-g: thanks [22:04:12] (03PS1) 10Rush: openstack: preserve limiting SSH listen IP for nova network host [puppet] - 10https://gerrit.wikimedia.org/r/376442 (https://phabricator.wikimedia.org/T171494) [22:04:29] np [22:05:27] (03CR) 10Rush: [C: 032] openstack: preserve limiting SSH listen IP for nova network host [puppet] - 10https://gerrit.wikimedia.org/r/376442 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:06:32] !log bsitzmann@tin Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) [22:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:50] T164033: Test size of "Reading stripped" HTML vs non-stripped HTML - https://phabricator.wikimedia.org/T164033 [22:06:51] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848 [22:06:51] T169277: Investigate missing page in specific "On this day" event - https://phabricator.wikimedia.org/T169277 [22:06:51] T169274: Expand "On this day" endpoint language support - https://phabricator.wikimedia.org/T169274 [22:06:51] T167921: Support Lazy loading of page content not needed for first paint - https://phabricator.wikimedia.org/T167921 [22:06:51] T174698: Parenthetical stripping is too aggressive - https://phabricator.wikimedia.org/T174698 [22:06:52] T174808: Add swagger spec for summary endpoint - https://phabricator.wikimedia.org/T174808 [22:06:52] T162179: Extract HTML Compatibility Layer from MCS Mobile Sections API - https://phabricator.wikimedia.org/T162179 [22:11:25] !log bsitzmann@tin Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s) [22:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:05] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3523998 (10Dzahn) I read that item 11 and noticed the very end of it "//**Another option would be to use a faster plugin (i.e. check_fping) as the host_check_command instead of check_ping.**//". How about that on...
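[Editor's note: the canary check quoted at 21:40 above is worth unpacking, since it resurfaces during the evening SWAT. Its numbers are self-consistent if the failure threshold is ten times the pre-sync average error rate (10 × 0.371 ≈ 3.71) and the reported percentage is how far the post-sync rate exceeds that threshold. The PHP below is an editorial reconstruction of that arithmetic, not scap's actual implementation.]

    <?php
    // Reconstruction of the Logstash canary error-rate check quoted at
    // 21:40. Assumed rule (not taken from scap source): fail when the
    // post-sync rate exceeds ten times the pre-sync rate.
    function canaryCheck( $before, $after, $factor = 10.0 ) {
        $threshold = $before * $factor;
        if ( $after <= $threshold ) {
            return 'OK';
        }
        $overPct = ( $after - $threshold ) / $threshold * 100;
        return sprintf(
            'ERROR: %d%% OVER_THRESHOLD (Avg. Error rate: Before: %.2f, After: %.2f, Threshold: %.2f)',
            (int)$overPct, $before, $after, $threshold
        );
    }

    // Reproduces the quoted line: 4.00 is about 7.8% above 3.71,
    // truncated to 7% in the report.
    echo canaryCheck( 0.371, 4.00 ), "\n";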
[22:17:49] (03CR) 10Chad: [V: 032 C: 032] Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 (owner: 10Paladox) [22:18:04] (03PS1) 10MaxSem: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 [22:18:16] !log demon@tin Started deploy [gerrit/gerrit@d4f9a77]: (no justification provided) [22:18:24] !log demon@tin Finished deploy [gerrit/gerrit@d4f9a77]: (no justification provided) (duration: 00m 07s) [22:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:52] !log prior deploy was no-op [22:18:59] (03CR) 10jerkins-bot: [V: 04-1] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:05] (03CR) 10MaxSem: [C: 032] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:20:03] (03CR) 10jerkins-bot: [V: 04-1] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:20:13] 10Operations, 10Analytics, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3586807 (10Dzahn) [22:20:41] (03CR) 10Chad: "Ah, forgot to read the task ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [22:20:49] (03PS2) 10MaxSem: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 [22:21:18] (03CR) 10MaxSem: [C: 032] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:22:16] (03CR) 10Dzahn: "i'm sure mail to the old address will be forwarded for month :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [22:22:46] (03Merged) 10jenkins-bot: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:24:47] (03CR) 10Chad: [C: 031] Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [22:24:59] (03CR) 10Dzahn: [C: 032] contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:26:07] (03CR) 10Greg Grossmeier: "I don't think this is needed any more? Moritz was going to put the trusty php5.5 package in the production apt repo for us (see ops list t" [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:27:47] (03CR) 10Dzahn: "oh, ok, thanks Greg! I guess the dependency https://gerrit.wikimedia.org/r/#/c/374837/ might still be wanted though.. will see list" [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:29:34] (03CR) 10Dzahn: "would this change be desired even if PHP5.5 packages are added to apt.wikimedia.org? to generally support https.. or would we stop using a" [puppet] - 10https://gerrit.wikimedia.org/r/374837 (owner: 10Hashar) [22:30:45] no_justification: does your +1 on https://gerrit.wikimedia.org/r/#/c/368196/ also mean "anytime" or "now" ? :) [22:31:02] or better with maintenance.. [22:31:22] i think i remember "PITA to revert" [22:31:58] Just a general +1. 
I would say today/now but I've had a cruddy day and I've got a headache [22:32:31] ok, earlier in the day it is then [22:32:56] Eh, not so much that it's late it's just /today/ has been terribly shitty [22:33:33] Murphy has it out for me today [22:33:52] ugh, sure! get better soon. [22:34:55] (03CR) 10Dzahn: [C: 032] Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:35:11] (03PS6) 10Dzahn: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:35:14] doing the one that can land anytime :) [22:35:21] Yeah [22:38:05] * legoktm hugs no_justification [22:39:56] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10Dzahn) Hi @RHo please create a new SSH key (that isn't the same as one used for labs or something else before) and attach it to this ticket. Here is how to... [22:40:57] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:40:57] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586878 (10Dzahn) a:03ema [22:41:05] (03CR) 10Paladox: "Thanks." [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 (owner: 10Paladox) [22:42:01] 10Operations, 10Icinga, 10monitoring: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3586880 (10Dzahn) [22:44:33] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3586884 (10Dzahn) 05Open>03stalled [22:44:55] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3015614 (10Dzahn) p:05Normal>03Low [22:46:39] 10Operations, 10Phabricator, 10Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3586887 (10Dzahn) @paladox is this resolved now? did i fix it? [22:46:57] 10Operations, 10Phabricator, 10Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3586888 (10Paladox) 05Open>03Resolved @Dzahn yep :). [22:50:11] (03CR) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [22:58:27] (03PS14) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:58:36] (03PS15) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:58:43] (03PS14) 10Paladox: Gerrit: Upgrading gerrit to 2.14.4-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2300). Please do the needful. [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] I can swat [23:01:02] just me today :) [23:01:18] should be easy enough.
Probably ship the config patch first but should work in either order [23:02:39] (03PS16) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [23:03:01] (03CR) 1020after4: [C: 032] Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:03:11] (03PS15) 10Paladox: Gerrit: Upgrading gerrit to 2.14.4-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [23:03:34] (03PS2) 10Krinkle: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) [23:04:24] (03PS4) 1020after4: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:05:14] ebernhardson: the config change had a merge conflict, rebased but I'm not sure if I need to wait for jenkins or just v+2 it https://gerrit.wikimedia.org/r/#/c/374655/ [23:05:31] doesn't look like jenkins is gonna pick it up [23:06:23] fun! removing +2 and re-adding it should kick jenkins into gear [23:09:38] syncing config change [23:10:12] !log twentyafterfour@tin Synchronized wmf-config/: deploy config for CirrusSearch human relevancy survey. Change-Id: I272c69e5a3bb6e833fca59282142d6b237fd9e60 Bug: T174106 (duration: 00m 52s) [23:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:27] T174106: Search Relevance Survey test #3: action items - https://phabricator.wikimedia.org/T174106 [23:10:39] uhm, error rate for mw1277 96% over threshold? wtf [23:11:15] twentyafterfour: looks like it wanted to sync the files one at a time [23:11:48] uhm no [23:12:07] misspelled? [23:12:27] !log twentyafterfour@tin Synchronized wmf-config/: roll-back due to huge error rate spike (duration: 00m 51s) [23:12:37] wmgWMESearchRelevancePages .... [23:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:40] maybe it was just a race [23:12:43] hmm [23:12:58] twentyafterfour: i think so, just double checked and it's spelled the same in InitialiseSettings.php and CirrusSearch-common.php [23:12:58] I don't get why the canary failed but it went ahead with the deploy [23:13:03] that's a bug in scap for sure [23:13:16] It also allowed two parallel scaps earlier this week. [23:13:24] Did something change in scap's checks recently? [23:13:25] :-o [23:13:42] James_F: I'm not sure, there was a new scap release recently [23:13:59] twentyafterfour: Hmm. I'm suspicious. [23:14:03] ebernhardson: I'll try again, this time syncing just one file at a time [23:14:16] James_F: me too, I'll look into that after swat [23:15:20] twentyafterfour: yes, it's not atomic [23:15:29] twentyafterfour: sure. Should be the new file first, then initialise settings, then cirrussearch-common [23:15:29] so if you look at the error message on fatalmonitor [23:15:36] you'll see some file wanted a variable that isn't there yet [23:15:51] i.e. it synced the file using the variable BEFORE the file providing the variable [23:15:55] I really want to change the way that stuff is initialized [23:15:57] that's why order matters [23:16:08] Dereckson: yeah... [23:16:17] gotta love globals [23:17:00] some day we could set up repo auth mode and get true atomic deploys. but so much work to get mediawiki into a place where that would work ...
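[Editor's note: the diagnosis above deserves a concrete illustration. wmf-config is a set of plain PHP files synced to the web servers one at a time, so a server can briefly hold a new InitialiseSettings.php that includes a file it has not received yet (the "File not found" warning quoted at 23:18 below) while a consumer file reads a global that was never defined. Hence the safe order worked out in the log: the brand-new file first, then the file that includes it, then the file that consumes the new setting. The sketch uses file and variable names from the log with illustrative contents; the guarded include at the end is one defensive pattern, not what wmf-config actually did.]

    <?php
    // InitialiseSettings.php (new revision), near line 19417. If this
    // revision reaches a server before CirrusSearch-rel-survey.php does,
    // PHP warns "File not found" and the global stays unset:
    include __DIR__ . '/CirrusSearch-rel-survey.php'; // defines $wmgWMESearchRelevancePages

    // CirrusSearch-common.php -- synced last because it reads the new
    // global (variable name from the log, assignment illustrative):
    $surveyPages = isset( $wmgWMESearchRelevancePages )
        ? $wmgWMESearchRelevancePages
        : array();

    // Safe file-by-file sync order, as worked out below:
    //   1. CirrusSearch-rel-survey.php  (the brand-new file)
    //   2. InitialiseSettings.php       (the file that includes it)
    //   3. CirrusSearch-common.php      (the file that consumes it)

    // Defensive variant (illustrative): tolerate the half-synced window
    // instead of emitting a warning:
    if ( is_readable( __DIR__ . '/CirrusSearch-rel-survey.php' ) ) {
        require __DIR__ . '/CirrusSearch-rel-survey.php';
    }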
[23:17:08] !log twentyafterfour@tin scap failed: average error rate on 9/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [23:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:26] still no good even just syncing the initializesettings file [23:18:00] twentyafterfour: did you sync the new file first? i'm seeing Warning: include(/srv/mediawiki/wmf-config/CirrusSearch-rel-survey.php): File not found in /srv/mediawiki/wmf-config/InitialiseSettings.php on line 19417 [23:18:48] why is IS including other files? o_0 [23:18:56] Reedy: because that file has 18k lines [23:19:35] (03PS3) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:19:58] !log twentyafterfour@tin Synchronized wmf-config/CirrusSearch-rel-survey.php: sync the added file first (duration: 00m 49s) [23:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:07] sorry these are rookie mistakes. I don't do config changes often enough (I already learned these lessons... shame on me) [23:21:19] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings.php: Now sync initializesettings (duration: 00m 49s) [23:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:11] !log twentyafterfour@tin Synchronized wmf-config/CirrusSearch-common.php: CirrusSearch-common.php goes last (duration: 00m 48s) [23:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:51] ebernhardson: ok config change sync'd without blowing up. Now I'll sync the extension, wmf.17 first [23:28:26] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/: Sync Change-Id: I7ae522155e67610d25b5857d7b3918559bce8bc7 to wmf.17 refs T174387 (duration: 00m 49s) [23:28:37] ebernhardson: can you confirm that it's deployed and working as expected? Or do I need to sync to wmf.16 first? [23:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:39] T174387: relevance survey: develop backend infrastructure to support lots of queries and lots of results per query - https://phabricator.wikimedia.org/T174387 [23:29:42] twentyafterfour: would need to at least pull wmf.16 to mwdebug1001, as it's only configured on enwiki [23:29:52] twentyafterfour: i suppose if i had thought of it i could have configured a page or two on testwiki [23:31:47] (03CR) 10Aaron Schulz: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [23:31:57] (fwiw it will eventually be used on other languages, but this is still a test deployment) [23:32:57] (03PS4) 10GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) [23:33:48] ebernhardson: I ran scap pull on mwdebug1001 [23:37:28] ebernhardson: everything look ok? [23:38:41] twentyafterfour: yup it looks sane [23:39:16] ok thanks!
syncing to all webservers [23:41:01] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.16/extensions/WikimediaEvents/: Sync Change-Id: I7ae522155e67610d25b5857d7b3918559bce8bc7 to all webservers refs T174387 (duration: 00m 49s) [23:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:14] T174387: relevance survey: develop backend infrastructure to support lots of queries and lots of results per query - https://phabricator.wikimedia.org/T174387 [23:42:27] sweet, thanks! [23:42:51] :) [23:47:30] (03CR) 10Dzahn: [C: 032] jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:48:47] (03CR) 10Dzahn: [C: 032] webperf: Decom webperf::ve service [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:49:58] (03PS2) 10Dzahn: webperf: Decom webperf::ve service [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:50:37] 10Operations: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3587007 (10diego) [23:51:15] (03CR) 10Dzahn: [C: 032] "doing , per https://phabricator.wikimedia.org/T175083#3582325" [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:52:32] (03PS4) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:52:36] 10Operations, 10Traffic: Lower geodns TTLs from 600 to 300 - https://phabricator.wikimedia.org/T140365#2462333 (10herron) It would also be good to know that a single server will handle the increased load when degraded. Adjusting the TTL before adding redundancy/capacity may be advantageous in that it could hi... [23:54:00] (03CR) 10Dzahn: "Krinkle: note this means you are losing shell access with this merge, let us know if you still need to save any data from osmium" [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:54:38] PROBLEM - jmxtrans on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:38] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:54:48] PROBLEM - jmxtrans on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:48] PROBLEM - jmxtrans on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:59] PROBLEM - jmxtrans on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:55:08] PROBLEM - jmxtrans on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:55:08] PROBLEM - salt-minion processes on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:55:09] PROBLEM - jmxtrans on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:56:47] looking! 
[23:57:20] ottomata: i ran puppet on jumbo1001 - it is the failed puppet service [23:57:26] but it shouldn't run as a service [23:57:34] ? [23:57:56] well, it said systemd status is degraded [23:58:04] and the failed unit is puppet.service [23:58:16] but that seems maybe unrelated to the java procs above [23:58:24] huh [23:59:32] mutante: hmm yeah [23:59:36] ok, i'm going to ack and look tomorrow [23:59:52] it can't be related to the VE service, right :) [23:59:57] ottomata: ok