[00:00:12] We can probably continue and file a bug. But I have pinged Tim in another channel [00:00:33] important thing to mention - this is a Minerva skin fix and Minerva is the only skin that has unit tests [00:01:03] some MWCore libraries expect the skin to be Vector and some tests might not work when the Minerva skin is applied (we had such an issue) [00:02:24] but now everything should be fixed and Minerva doesn't interfere with the Lua engine [00:02:57] raynor: https://github.com/wikimedia/mediawiki-extensions-Scribunto/commit/7418a571ac59cc25b682c681a9c2dd330c4a983a [00:05:22] Backporting to .16 and .17 [00:05:24] Won't take long [00:05:44] good find Reedy [00:06:54] that might be it. but the setfenv() call was already there. why does it fail when you pass a different param? [00:07:32] I guess the upstream change to the php extension https://gerrit.wikimedia.org/r/#/c/367935/ [00:09:18] !log reedy@tin Synchronized php-1.30.0-wmf.17/extensions/Scribunto/tests/phpunit/: Fix broken test (duration: 00m 50s) [00:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:43] !log reedy@tin Synchronized php-1.30.0-wmf.16/extensions/Scribunto/tests/phpunit/: Fix broken test (duration: 00m 49s) [00:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:37] !log reedy@tin Synchronized php-1.30.0-wmf.16/skins/MinervaNeue: (no justification provided) (duration: 00m 49s) [00:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:56] !log that was T174747 Adjust language icon color [00:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:09] T174747: [regression] language icon is darker than other icons - https://phabricator.wikimedia.org/T174747 [00:15:36] raynor: deployed [00:15:45] testing [00:15:47] which server? [00:16:42] it's on all servers [00:17:52] tested on mwdebug1002 - it's fixed [00:18:35] production fixed [00:19:02] Reedy: it works. Thanks for deploying that patch [00:19:08] cool, np :) [00:22:49] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:33:29] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 2 [00:47:19] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3582911 (10GWicke) Thanks for the update & clarity on the timeline, @ovasileva! It is much appreciated. [00:53:30] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 1058 [01:26:01] (03PS1) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 [01:26:33] (03PS2) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 [01:29:46] (03Abandoned) 10Kaldari: Limit ArticleCreationWorkflow to just simplewiki to troubleshoot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376164 (owner: 10Kaldari) [01:37:54] (03CR) 10Chad: "I don't see why we can't land this already. We're already directing people to it on the wmfwiki website, and IIRC we're already receiving " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [02:01:39] could someone create a repo on github for me?
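An aside on the setfenv() question above: in Lua 5.1, setfenv( f, env ) replaces the environment of the function value f, while setfenv( 1, env ) replaces the environment of the calling function (a stack level), so passing a different first parameter changes which code sees the new globals — one way an existing setfenv() call can start failing after an upstream change to the sandbox. A minimal sketch using the php-luasandbox API (hypothetical code, not the actual Scribunto test fix; it assumes setfenv is exposed inside the sandbox, as it is in Scribunto's Lua 5.1 environment):

```php
<?php
// Run a Lua chunk under php-luasandbox and rebind one function's
// environment. Hypothetical sketch, not the Scribunto test code.
$sandbox = new LuaSandbox();
$sandbox->setMemoryLimit( 50 * 1024 * 1024 ); // bytes
$sandbox->setCPULimit( 10 ); // seconds

$lua = <<<'LUA'
local function probe()
    return marker  -- resolved through probe's environment
end
-- Variant 1: rebind a specific function value.
setfenv( probe, { marker = 'from-function-env' } )
-- Variant 2 (contrast): setfenv( 1, env ) would rebind *this* chunk
-- instead, leaving probe's environment untouched.
return probe()
LUA;

$chunk = $sandbox->loadString( $lua, '=sketch' );
$result = $chunk->call();
var_dump( $result[0] ); // string(17) "from-function-env"
```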
[02:04:08] davidwbarratt: sure, what do you need? [02:04:36] legoktm uhh. let's call it logstash-report [02:04:49] legoktm it's for https://phabricator.wikimedia.org/T174191#3582190 [02:05:05] and here's my github user: https://github.com/davidbarratt/ [02:05:39] legoktm or logstash-limiter-reporter if that's better [02:05:48] legoktm or just limiter-reporter [02:06:17] davidwbarratt: you and comm tech should have admin access on https://github.com/wikimedia/logstash-report feel free to rename it :) [02:06:30] legoktm perfect! thanks! [02:31:25] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 08m 18s) [02:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:59:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [03:01:39] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [03:07:11] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 14m 49s) [03:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:10:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:10:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:14:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 6 03:14:16 UTC 2017 (duration 7m 6s) [03:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:07] (03PS4) 10KartikMistry: Matxin MT service for ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/374706 [03:22:38] (03Abandoned) 10KartikMistry: Configurable mode_path for apertium [puppet] - 10https://gerrit.wikimedia.org/r/297350 (https://phabricator.wikimedia.org/T139330) (owner: 10KartikMistry) [03:28:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 744.58 seconds [03:51:01] (03PS1) 10Chad: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 [03:51:03] (03CR) 10Chad: [C: 032] Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:29] (03Merged) 10jenkins-bot: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:30] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:52:39] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:42] (03CR) 10jenkins-bot: Remove spurious newline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376166 (owner: 10Chad) [03:52:49] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:52:49] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: 
CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:52:59] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:19] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:20] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:20] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:29] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [03:53:29] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:53:29] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [03:53:39] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [03:55:40] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [03:55:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [03:57:39] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:57:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [03:58:50] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:58:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:01:29] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:29] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:30] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:30] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:01:30] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:40] PROBLEM - MariaDB Slave 
SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:49] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:49] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:50] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:01:59] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:00] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:00] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:01] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:01] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:02] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:02] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:03] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:10] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:10] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:19] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:02:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:02:19] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:07:52] (03PS3) 10Chad: Remove $stdlogo entirely [mediawiki-config] - 10https://gerrit.wikimedia.org/r/359037 (owner: 10Reedy) [04:08:59] !log demon@tin Synchronized wmf-config/throttle.php: no-op (duration: 00m 49s) [04:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 93.28 seconds [04:30:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [04:30:29] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [04:39:41] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:39:41] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:42:49] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89893.40 seconds [04:42:50] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK 
slave_io_state Slave_IO_Running: Yes [04:42:50] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:46:39] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:39] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:40] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:49] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:51] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:46:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:46:59] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:09] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:09] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:09] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:09] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:09] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:10] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:47:11] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:11] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:29] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:30] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:47:39] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:52:50] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:55:10] RECOVERY - MariaDB 
Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:55:11] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:55:11] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [04:55:11] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [04:55:11] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [04:58:59] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:00] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:00] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:19] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:19] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:20] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:20] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:21] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:32] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:32] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:39] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:40] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [04:59:49] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:50] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:51] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:51] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not 
connect [04:59:52] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [04:59:52] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [04:59:53] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:09:30] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:40] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:40] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:09:49] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:09:50] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:50] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:50] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:09:59] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [05:10:00] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:00] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 80447.93 seconds [05:10:00] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89112.93 seconds [05:10:00] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:10:01] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 4108.95 seconds [05:12:53] (03CR) 10MZMcBride: "I thought there was a previous comment to this effect, but I'm wary of a search default that includes private wikis. 
Inverting the argumen" [puppet] - 10https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: 10Reedy) [05:13:59] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:13:59] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:00] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:09] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:10] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:10] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:10] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:20] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:29] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:30] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:39] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:39] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:39] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:14:39] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:49] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:49] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:50] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:09] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:20:10] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:00] PROBLEM - MariaDB Slave Lag: 
s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:09] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:09] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:10] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:10] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:10] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:19] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:29] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:29] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:39] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:40] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:40] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:40] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:49] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:49] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:50] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:50] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:50] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:50] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:28:00] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:28:00] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:29:35] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3583074 (10Marostegui) 05Open>03Resolved a:03Cmjohnson This is all good now, thanks a lot Chris! 
``` root@db1059:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Ta... [05:33:09] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:33:09] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:38:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:38:50] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:51] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:51] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:38:59] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:38:59] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:00] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:00] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:00] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:00] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:10] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:10] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:19] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:20] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:39:29] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:29] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:29] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:30] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:30] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:31] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:39:31] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:39:32] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:41:40] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87585.37 seconds [05:41:40] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:41:40] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK 
slave_io_state Slave_IO_Running: Yes [05:51:21] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:21] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:51:21] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:51:21] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:21] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:30] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [05:51:30] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:51:39] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:39] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [05:51:39] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:51:40] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86290.01 seconds [05:51:41] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 78347.02 seconds [05:51:41] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 4.03 seconds [05:52:09] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:52:09] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:52:09] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [05:53:09] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:53:10] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [06:06:29] (03PS1) 10Marostegui: mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) [06:06:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:08:38] (03PS2) 10Marostegui: mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) [06:14:02] (03PS1) 10Marostegui: s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) [06:20:13] (03CR) 10Marostegui: [C: 032] s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:20:57] (03Merged) 10jenkins-bot: s5.hosts: Add db1100 [software] - 10https://gerrit.wikimedia.org/r/376179 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [06:37:49] !log Truncate l10n_cache table across production - T150306 [06:38:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:02] T150306: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306 [06:46:31] !log installing php-luasandbox 2.0.14 on API canaries along with HHVM restart (T173705) [06:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:44] T173705: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705 [07:08:32] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) [07:09:58] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [07:11:38] (03CR) 10Marostegui: [C: 032] mariadb: Add db1100 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/376178 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui) [07:11:40] (03PS2) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) [07:13:54] (03CR) 10Muehlenhoff: [C: 031] "That looks fine, but I would prefer if we could use the opportunity to rename the ferm::client (e.g. to swift-object-server-incoming or so" [puppet] - 10https://gerrit.wikimedia.org/r/374170 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [07:15:04] !log Stop MySQL on db1049 to copy its content to db1100 - https://phabricator.wikimedia.org/T172679 [07:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:19] (03PS12) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [07:27:49] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/374169 (https://phabricator.wikimedia.org/T173731) (owner: 10Filippo Giunchedi) [07:34:14] (03CR) 10Muehlenhoff: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [07:34:19] (03PS5) 10Muehlenhoff: cumin: extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) [07:34:53] (03CR) 10Smalyshev: [C: 031] wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [07:43:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.l10n_cache doesnt exist on query. Default database: bawiktionary. [Query snipped] [07:44:00] (03CR) 10Muehlenhoff: [C: 032] cumin: extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376029 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [07:53:11] marostegui: have you seen this? 
^^^ [07:53:36] let me know if you need a hand [07:55:21] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 868.98 seconds [07:57:13] I am fixing that [07:57:13] (03PS1) 10Muehlenhoff: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/376187 [07:57:26] ah [07:57:27] you did [07:58:15] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/376187 (owner: 10Muehlenhoff) [07:58:27] (03CR) 10Muehlenhoff: [C: 032] Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/376187 (owner: 10Muehlenhoff) [07:58:41] marostegui: no, I didn't [07:58:50] maybe jynus did [07:59:11] (03PS3) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [08:00:03] going to review dbstore1002 and 1001 to see why the tables aren't there and create them empty [08:00:29] (03PS4) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [08:00:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:01:04] I did nothing, got woken by the alert [08:01:18] I fixed it, yes [08:01:20] did it page? [08:01:23] I didn't get anything [08:03:17] I bet it "paged" jynus aNag ;) [08:03:31] yesterday one of those tokudb tables got corrupted, but why was that missing? [08:03:58] it is missing because the whole bawiktionary is missing on the dbstore servers [08:04:19] maybe that is a deleted or unused wiki or something [08:04:22] let's see [08:05:08] yep, it is on the deleted list [08:05:40] I think people move things to the deleted lists but then keep writing [08:06:15] because they do not handle the config correctly [08:06:31] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 193.86 seconds [08:06:32] yeah, if it exists on the master it should exist everywhere on the replication chain (that is how I see it) [08:06:42] except on labs [08:06:55] yeah, that yes :) [08:07:32] it could be that someone dropped it from x1 [08:07:42] and it gets dropped overall [08:07:53] ah, true [08:07:56] could be [08:09:33] also, dbstore1002 is one of the few replicas that is not read-only [08:11:40] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89992.16 seconds [08:16:20] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3583333 (10elukey) [08:16:40] 10Operations, 10DBA, 10Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3583335 (10jcrespo) Not yet, this is still in use. [08:17:50] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3463322 (10Verdy_p) The Module:Country version is no... [08:20:22] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3583339 (10Verdy_p) Note that the current "kludge" u...
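For context on the alert floods in the surrounding lines: the "MariaDB Slave SQL/IO/Lag" services are Icinga checks built around SHOW SLAVE STATUS, run once per replication channel on multi-source hosts like dbstore1001 (s1-s7, m2, m3, x1). A rough single-channel sketch of that logic (hypothetical code, not the production check script; host, credentials and the 600-second threshold are placeholders):

```php
<?php
// Hypothetical sketch of a replication health check.
$db = @new mysqli( 'dbstore1001.eqiad.wmnet', 'nagios', 'secret' );
if ( $db->connect_error ) {
	echo "CRITICAL slave_sql_state could not connect\n";
	exit( 2 ); // Icinga CRITICAL
}
$res = $db->query( 'SHOW SLAVE STATUS' );
$row = $res ? $res->fetch_assoc() : null;
if ( !$row ) {
	echo "OK slave_sql_state not a slave\n";
	exit( 0 );
}
if ( $row['Slave_IO_Running'] !== 'Yes' ) {
	echo "CRITICAL slave_io_state Slave_IO_Running: {$row['Slave_IO_Running']}\n";
	exit( 2 );
}
if ( $row['Slave_SQL_Running'] !== 'Yes' ) {
	echo "CRITICAL slave_sql_state Slave_SQL_Running: {$row['Slave_SQL_Running']}\n";
	exit( 2 );
}
$lag = (float)$row['Seconds_Behind_Master'];
$state = $lag > 600 ? 'CRITICAL' : 'OK';
echo "$state slave_sql_lag Replication lag: $lag seconds\n";
exit( $state === 'CRITICAL' ? 2 : 0 );
```

Note how the strings mirror the alert texts in the log: "could not connect" while the server is down or restarting, "not a slave" for intentionally unconfigured channels, and a lag figure otherwise.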
[08:33:38] (03PS1) 10Addshore: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) [08:36:12] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376043 (https://phabricator.wikimedia.org/T144479) (owner: 10Gilles) [08:37:19] 10Operations, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3583378 (10fgiunchedi) [08:38:01] (03CR) 10Addshore: [C: 032] Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:32] (03Merged) 10jenkins-bot: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:42] (03CR) 10jenkins-bot: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [08:39:54] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3583382 (10Gehel) @debt no more work to be done here, feel free to close. [08:43:03] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T110170 [[gerrit:364734|Enable Newsletter on mediawikiwiki]] (duration: 00m 51s) [08:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:17] T110170: Goal: Deploy Newsletter extension in Wikimedia - https://phabricator.wikimedia.org/T110170 [08:46:26] (03PS2) 10Addshore: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) [08:46:37] (03CR) 10Addshore: [C: 032] Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:48:07] (03Merged) 10jenkins-bot: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:48:19] (03CR) 10jenkins-bot: Add WMDE log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376191 (https://phabricator.wikimedia.org/T174948) (owner: 10Addshore) [08:49:53] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T174948 [[gerrit:376191|Add WMDE log channel]] (duration: 00m 49s) [08:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:08] T174948: Deploy 'hack' patch & logging for tracking user registrations and guided tour - https://phabricator.wikimedia.org/T174948 [08:51:15] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1009 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376195 (https://phabricator.wikimedia.org/T169939) [08:52:08] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1009 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376195 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [08:53:32] !log reimage restbase1009 with cassandra 3 - T169939 [08:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:47] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [08:55:42] (03PS3) 10Addshore: Move config 
variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [08:57:06] (03CR) 10Addshore: [C: 031] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [09:01:33] (03PS6) 10Jcrespo: mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 [09:02:24] (03CR) 10Hashar: [C: 031] "Indeed from the doc:" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [09:03:47] (03PS1) 10Muehlenhoff: Extend aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:05:58] !log disabling puppet on most db hosts to merge firewall changes safely [09:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:16] (03CR) 10Jcrespo: [C: 032] mariadb: Add cluster manager hosts to allowed admin port users [puppet] - 10https://gerrit.wikimedia.org/r/362217 (owner: 10Jcrespo) [09:08:19] (03CR) 10Volans: [C: 031] "LGTM. Nitpick on the commit message, prepend "cumin:"" [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:08:24] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3583445 (10Joe) >>! In T173710#3581849, @aaron wrote: > Those refreshLInks jobs (from wikibase) are the only ones that use multiple titles per job, so th... [09:09:54] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: refactor things to the profile [puppet] - 10https://gerrit.wikimedia.org/r/376020 [09:11:41] PROBLEM - Check systemd state on db1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:12:50] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3583455 (10fgiunchedi) >>! In T158837#3582514, @Krinkle wrote: >>>! In T158837#3497281, @fgiunchedi wrote: >> >> re: coal/coal-web it should be straightforward to... [09:15:41] PROBLEM - Check systemd state on labsdb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
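The "Add WMDE log channel" change merged and synced above (gerrit 376191) follows the usual two-part pattern: wmf-config maps a Monolog channel name to a minimum log level, and extension code fetches a PSR-3 logger for that channel by name. A sketch under assumptions — the exact wmgMonologChannels shape and the 'info' level are guesses from context, not the actual patch:

```php
<?php
// Hypothetical sketch of wiring and using a new log channel.
use MediaWiki\Logger\LoggerFactory;

// In wmf-config/InitialiseSettings.php (assumed shape):
// 'wmgMonologChannels' => [
//     'default' => [
//         'WMDE' => 'info',
//     ],
// ],

// In the code tracking user registrations and the guided tour (T174948):
$logger = LoggerFactory::getInstance( 'WMDE' );
$logger->info( 'Guided tour step reached', [ 'step' => 'welcome' ] );
```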
[09:16:03] (03PS2) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:16:37] (03PS3) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:17:26] (03CR) 10Alexandros Kosiaris: [C: 031] Matxin MT service for ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [09:18:17] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: refactor things to the profile [puppet] - 10https://gerrit.wikimedia.org/r/376020 (owner: 10Giuseppe Lavagetto) [09:19:26] (03PS1) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 [09:19:58] (03PS2) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 [09:20:12] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:20:18] (03PS4) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:20:33] (03CR) 10Jcrespo: [C: 032] mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376197 (owner: 10Jcrespo) [09:20:57] (03PS5) 10Muehlenhoff: Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 [09:21:06] (03CR) 10Muehlenhoff: [V: 032 C: 032] Extend Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/376196 (owner: 10Muehlenhoff) [09:22:06] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) [09:22:42] RECOVERY - Check systemd state on db1066 is OK: OK - running: The system is fully operational [09:22:42] RECOVERY - Check systemd state on labsdb1004 is OK: OK - running: The system is fully operational [09:28:50] !log installing libonig security updates [09:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:02] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3583488 (10elukey) @Papaul one thing that we could tell Dell is that we have, as far as I can see, mw2251->60 that are identical, so our software is almost surely not the problem. I put a summary in this task abou...
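The two "mariadb: Do not try to resolve IP addresses for ferm" patches (gerrit 376197 above and 376198 just below) address the same class of problem: when a firewall rule already contains a literal IP address, running it through DNS resolution is pointless and can fail. The real change lives in Puppet/ERB templates, but the guard itself looks roughly like this (illustrative PHP; the function name is hypothetical):

```php
<?php
// Only resolve hostnames; pass literal IPv4/IPv6 addresses through.
function hostsToIps( string $host ): array {
	if ( filter_var( $host, FILTER_VALIDATE_IP ) !== false ) {
		return [ $host ]; // already an IP, no DNS lookup needed
	}
	$ips = [];
	foreach ( dns_get_record( $host, DNS_A | DNS_AAAA ) ?: [] as $rec ) {
		$ips[] = $rec['ip'] ?? $rec['ipv6'];
	}
	return $ips;
}

var_dump( hostsToIps( '10.64.48.131' ) ); // array with just "10.64.48.131"
```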
[09:31:40] (03PS1) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 [09:31:52] (03PS2) 10Jcrespo: mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 [09:32:56] (03CR) 10Jcrespo: [C: 032] mariadb: Do not try to resolve IP addresses for ferm [puppet] - 10https://gerrit.wikimedia.org/r/376198 (owner: 10Jcrespo) [09:39:49] !log installing libgd security updates [09:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:26] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k [puppet] - 10https://gerrit.wikimedia.org/r/376201 (https://phabricator.wikimedia.org/T174815) [09:46:24] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.131 and port 9042: Connection refused [09:47:14] PROBLEM - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:47:46] silencing ^ [09:48:29] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7732/" [puppet] - 10https://gerrit.wikimedia.org/r/376201 (https://phabricator.wikimedia.org/T174815) (owner: 10Elukey) [09:49:14] RECOVERY - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-c valid until 2018-08-17 16:11:04 +0000 (expires in 345 days) [09:50:24] RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.131 port 9042 [09:53:22] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [09:53:29] (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: Add local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376021 (https://phabricator.wikimedia.org/T174599) [09:56:46] (03PS1) 10Muehlenhoff: Also print amount of hosts not requiring a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376203 [10:00:30] (03PS1) 10Giuseppe Lavagetto: jobrunner: add missing newline in template [puppet] - 10https://gerrit.wikimedia.org/r/376204 [10:00:57] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: add missing newline in template [puppet] - 10https://gerrit.wikimedia.org/r/376204 (owner: 10Giuseppe Lavagetto) [10:04:42] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: relay requests to the local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376022 (https://phabricator.wikimedia.org/T174599) [10:05:47] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner_tls: relay requests to the local-only port [puppet] - 10https://gerrit.wikimedia.org/r/376022 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [10:09:40] (03PS1) 10Muehlenhoff: Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 [10:10:38] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1010 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) [10:11:04] (03PS2) 10Filippo Giunchedi: cassandra: reprovision restbase1010 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) [10:12:12] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1010 with cassandra 3 
[puppet] - 10https://gerrit.wikimedia.org/r/376209 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [10:12:40] (03CR) 10Elukey: [C: 031] Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 (owner: 10Muehlenhoff) [10:12:57] !log reimage restbase1010 with cassandra 3 - T169939 [10:12:58] (03PS1) 10Giuseppe Lavagetto: jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 [10:13:10] (03PS2) 10Giuseppe Lavagetto: jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 [10:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:11] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [10:13:22] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3583895 (10ovasileva) [10:13:47] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: fix ProxyPass directives for LVS vhost [puppet] - 10https://gerrit.wikimedia.org/r/376211 (owner: 10Giuseppe Lavagetto) [10:15:03] (03PS1) 10Muehlenhoff: Remove cache salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376212 [10:17:01] (03PS1) 10Muehlenhoff: Remove lvs salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376213 [10:19:31] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376023 [10:20:16] (03PS1) 10Muehlenhoff: Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 [10:20:50] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner_tls: add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376023 (owner: 10Giuseppe Lavagetto) [10:24:00] (03PS1) 10Muehlenhoff: Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 [10:25:27] !log installing perl update from jessie 8.9 point release [10:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:17] !log installing perl update from stretch 9.1 point release [10:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:27] (03CR) 10Elukey: [C: 031] Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 (owner: 10Muehlenhoff) [10:34:54] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [10:39:04] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:40:48] (03CR) 10Jcrespo: [C: 031] Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 (owner: 10Muehlenhoff) [11:01:17] 10Operations: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3583945 (10MoritzMuehlenhoff) wtp1031/wtp1032 are not fully installed, it seems like the initial puppet run after the installation didn't happen? 
[11:12:30] (03PS1) 10Jcrespo: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) [11:15:27] !log Disable puppet on db1100 for mydumper/myloader [11:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:48] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:17:19] (03Merged) 10jenkins-bot: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:17:29] (03CR) 10jenkins-bot: mariadb: Depool db2040 for reboot and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376218 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [11:18:25] (03PS1) 10Filippo Giunchedi: cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) [11:18:47] (03PS2) 10Filippo Giunchedi: cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) [11:19:38] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1008 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/376219 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [11:20:17] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 [11:20:41] !log reimage restbase1008 with cassandra 3 - T169939 [11:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:56] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [11:20:58] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2040 (duration: 00m 49s) [11:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:38] !log installing gnutls update from jessie 8.9 and stretch 9.1 point updates [11:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:42] (03PS1) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:24:53] (03CR) 10Mobrovac: [C: 04-1] "AFAIK, this is an external service not controlled by us, so this should go into the config template in the deploy repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [11:24:57] !log temporarily raise kafka log4j authorizer verbosity to DEBUG on kafka1012 - T173493 [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:11] T173493: Tune Kafka logs to register clients connected - https://phabricator.wikimedia.org/T173493 [11:25:50] (03Abandoned) 10Ema: varnish: drop varnish::wikimedia_vcl [puppet] - 10https://gerrit.wikimedia.org/r/376045 (owner: 10Ema) [11:28:35] (03CR) 10KartikMistry: "> AFAIK, this is an external service not controlled by us, so this" [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [11:29:28] (03PS2) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:29:53] (03CR) 10jerkins-bot: [V: 04-1] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [11:30:50] jouncebot: refresh [11:30:52] I refreshed my knowledge about deployments. [11:30:57] jouncebot: next [11:30:57] In 1 hour(s) and 29 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1300) [11:31:22] (03CR) 10Hashar: [C: 031] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [11:35:06] (03CR) 10Hashar: [C: 031] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [11:41:06] !log installing gtk+2.0 update from jessie 8.9 update [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:26] PROBLEM - dhclient process on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:16] PROBLEM - Check size of conntrack table on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:17] PROBLEM - puppet last run on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:45:26] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:07] PROBLEM - Check systemd state on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:46:07] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - Check the NTP synchronisation status of timesyncd on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - salt-minion processes on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:06] PROBLEM - cassandra-a service on restbase1010 is CRITICAL: Return code of 255 is out of bounds [11:47:38] going to silence that bad boy [11:49:48] (03PS3) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [11:50:10] (03CR) 10jerkins-bot: [V: 04-1] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [12:01:54] (03PS4) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [12:07:07] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [12:08:15] !log installing libapache2-mod-perl update from jessie 8.9 update [12:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:45] (03PS2) 10Gehel: wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) [12:09:05] moritzm: do your updates have anything to do with lvs1001's puppetfail above? [12:09:57] (03CR) 10Gehel: [C: 032] wdqs - activate wdqs100[45] as wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/376025 (https://phabricator.wikimedia.org/T171210) (owner: 10Gehel) [12:10:24] (03CR) 10Mobrovac: [C: 04-1] "I guess you meant https://gerrit.wikimedia.org/r/374708, but my point is that the actual URI should go in that patch in scap/vars.yaml, no" [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [12:12:08] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3457615 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1004.eqiad.wmnet']... [12:12:41] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584049 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1004.eqiad.wmnet']... [12:13:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584050 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [12:14:55] (03PS1) 10Muehlenhoff: yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 [12:15:10] (03PS5) 10Ema: varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 [12:15:33] (03CR) 10Ema: [V: 032 C: 032] varnish: use varnish::wikimedia_vcl for all files [puppet] - 10https://gerrit.wikimedia.org/r/376221 (owner: 10Ema) [12:15:49] ema: yeah, puppet tries to ensure that tzdata is installed and if another package update happens during that (like the point update deployments), puppet fails [12:16:48] moritzm: ok, just checking. Thanks! :) [12:17:06] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused [12:18:46] PROBLEM - cassandra CQL 10.64.32.178:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 9042: Connection refused [12:19:36] PROBLEM - cassandra SSL 10.64.32.178:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:20:15] (03CR) 10Muehlenhoff: [C: 032] yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 (owner: 10Muehlenhoff) [12:20:19] (03PS2) 10Muehlenhoff: yubiauth: Remove unused salt grains [puppet] - 10https://gerrit.wikimedia.org/r/376231 [12:20:36] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
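The tzdata failure moritzm explains above is a dpkg lock race: puppet's Package[tzdata] check and a fleet-wide point-release upgrade both need the dpkg database, and whichever starts second fails. A hedged sketch of the underlying mechanism (paths and retry numbers are illustrative, and this is not what puppet itself does):

    # Illustrative only: dpkg holds an fcntl lock on /var/lib/dpkg/lock while
    # any package operation runs; a second operation started meanwhile fails.
    import fcntl
    import time

    def dpkg_lock_free(path="/var/lib/dpkg/lock", retries=5, delay=10):
        """Poll (as root) until no apt/dpkg operation holds the lock."""
        for _ in range(retries):
            with open(path, "a") as fh:
                try:
                    fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    fcntl.lockf(fh, fcntl.LOCK_UN)
                    return True  # nobody else is mid-upgrade
                except OSError:
                    time.sleep(delay)  # e.g. a point-release update is running
        return False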
[12:20:36] PROBLEM - cassandra service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [12:23:16] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 18 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[restbase/deploy],Package[cassandra/metrics-collector] [12:29:46] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3584079 (10elukey) [12:30:26] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:31:26] (03PS2) 10Muehlenhoff: Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 [12:33:06] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584101 (10elukey) [12:34:47] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:35:26] (03CR) 10Muehlenhoff: [C: 032] Remove hadoop salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376215 (owner: 10Muehlenhoff) [12:35:42] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584115 (10elukey) [12:37:44] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584144 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1004.eqiad.wmnet'] ``` and were **ALL** successful. [12:38:11] (03PS2) 10Muehlenhoff: Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 [12:38:11] ACKNOWLEDGEMENT - Check systemd state on wdqs1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel data import inprogress [12:39:02] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3584146 (10elukey) [12:39:33] (03CR) 10Muehlenhoff: [C: 032] Remove debdeploy salt grains previously used for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376208 (owner: 10Muehlenhoff) [12:44:12] (03PS2) 10Muehlenhoff: Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 [12:45:07] (03CR) 10Muehlenhoff: [C: 032] Remove db salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376214 (owner: 10Muehlenhoff) [12:46:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584163 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [12:47:11] PROBLEM - Check systemd state on ganeti1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
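For context on the long run of "Remove ... salt grains previously used by debdeploy" commits here: debdeploy used to target host groups through custom salt grains, and those grains are now dead weight. A sketch of what dropping one looks like via salt's Python API, to be run on the salt master (the grain name is taken from the commit subjects; the targeting glob is assumed):

    # Assumed example: delete a leftover debdeploy grouping grain fleet-wide.
    import salt.client

    local = salt.client.LocalClient()
    # Roughly equivalent CLI: salt 'kafka*' grains.delval debdeploy-kafka
    result = local.cmd("kafka*", "grains.delval", ["debdeploy-kafka"])
    print(result)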
[12:48:21] (03PS1) 10Muehlenhoff: Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 [12:49:44] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376203 (owner: 10Muehlenhoff) [12:50:40] (03PS1) 10Muehlenhoff: Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 [12:54:37] (03PS1) 10Muehlenhoff: Remove elasticsearch salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376241 [12:54:38] (03PS1) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [12:55:40] (03CR) 10Elukey: [C: 031] Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 (owner: 10Muehlenhoff) [12:57:16] (03PS1) 10Muehlenhoff: Remove releng-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376244 [12:59:56] (03PS1) 10Muehlenhoff: Remove WMCS-related salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376245 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1300). Please do the needful. [13:00:04] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] Present [13:00:19] o/ [13:00:40] Are you doing it hashar? [13:00:57] o/ [13:01:15] Reedy: feel free to handle it ? :D [13:01:21] (03PS1) 10Muehlenhoff: Remove ganeti salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376247 [13:01:23] lol, I don't mind either way [13:01:27] 10Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#3584199 (10akosiaris) [13:01:33] Reedy: please do so :] [13:01:39] 10Operations, 10Icinga: decom neon (shutdown neon (icinga) after it has been replaced ) - https://phabricator.wikimedia.org/T125023#3584202 (10akosiaris) [13:01:41] 10Operations, 10hardware-requests: EQIAD/CODFW: 2 hardware access request for monitoring - https://phabricator.wikimedia.org/T120842#3584203 (10akosiaris) [13:01:43] 10Operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1882900 (10akosiaris) 05Open>03Resolved Has been done a long time now. Resolving [13:01:52] (03PS1) 10Elukey: Remove stat1003 traces for decom [puppet] - 10https://gerrit.wikimedia.org/r/376248 (https://phabricator.wikimedia.org/T152712) [13:02:17] o/ [13:02:34] (03PS2) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:02:35] hashar, Reedy, zeljkof: Who'll be the swatter?
:D [13:02:44] (03PS3) 10Reedy: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:02:47] (03CR) 10Reedy: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:02:59] I can swat, but I see Reedy already volunteered :) [13:03:06] (03CR) 10Elukey: [C: 032] Remove stat1003 traces for decom [puppet] - 10https://gerrit.wikimedia.org/r/376248 (https://phabricator.wikimedia.org/T152712) (owner: 10Elukey) [13:03:25] (03PS1) 10Muehlenhoff: Remove parsoid salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376249 [13:04:10] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:04:19] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:04:41] (03CR) 10Alexandros Kosiaris: "Let's do what mobrovac suggests. That is the current status quo, makes sense." [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry) [13:05:10] (03CR) 10Gehel: [C: 032] "All good, we are ready to deploy!" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:05:20] (03PS1) 10Muehlenhoff: Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 [13:05:24] (03CR) 10Gehel: [V: 032 C: 032] Switch this repo to a deb package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/352170 (https://phabricator.wikimedia.org/T158560) (owner: 10DCausse) [13:05:50] (03PS1) 10Filippo Giunchedi: site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 [13:06:11] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376181 (https://phabricator.wikimedia.org/T175113) (owner: 10Urbanecm) [13:06:12] (03PS2) 10Filippo Giunchedi: site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 [13:06:41] PROBLEM - DPKG on restbase1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:06:56] (03CR) 10Filippo Giunchedi: [C: 032] site: use cassandra 3 for restbase1008 / restbase1010 [puppet] - 10https://gerrit.wikimedia.org/r/376252 (owner: 10Filippo Giunchedi) [13:07:30] (03PS1) 10Alexandros Kosiaris: Document in-datastore calico configuration [puppet] - 10https://gerrit.wikimedia.org/r/376254 (https://phabricator.wikimedia.org/T170111) [13:07:46] !log reedy@tin Synchronized wmf-config/throttle.php: Throttle exception T175113 (duration: 00m 49s) [13:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:57] T175113: Allow IP for creating account for school project for 14 days - https://phabricator.wikimedia.org/T175113 [13:08:12] (03PS4) 10Reedy: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:08:41] (03CR) 10Reedy: [C: 032] Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:09:52] (03CR) 10Filippo Giunchedi: [C: 031] 
Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 (owner: 10Muehlenhoff) [13:10:42] (03Merged) 10jenkins-bot: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:10:49] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3473875 (10faidon) I know a bunch of work happened during the Wikimania hackathon, but what's the status of this? [13:10:52] (03CR) 10jenkins-bot: Move config variables from the extension to config repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376017 (https://phabricator.wikimedia.org/T174962) (owner: 10Ladsgroup) [13:12:04] !log reedy@tin Synchronized wmf-config/Wikibase-labs.php: Move some wikidata config T174962 (duration: 00m 49s) [13:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:18] T174962: [Bug] Configuration on wikidata.beta.wmflabs.org is broken - https://phabricator.wikimedia.org/T174962 [13:12:27] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:08] PROBLEM - Host ganeti1008 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:16] !log reedy@tin Synchronized wmf-config/Wikibase-production.php: Move some wikidata config T174962 (duration: 00m 48s) [13:13:17] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 75168 bytes in 0.171 second response time [13:13:28] RECOVERY - Check systemd state on ganeti1008 is OK: OK - running: The system is fully operational [13:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:37] RECOVERY - Host ganeti1008 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [13:13:42] Done and done [13:15:53] Thanks [13:19:26] !log restarting and upgrading db2040 [13:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:29] Reedy: thx :) [13:21:31] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3584259 (10faidon) Am I right to understand that the current plan is 2 VMs? If so, yeah, that sounds absolutely fine :) [13:26:05] 10Operations, 10monitoring: diamond: certain counters always calculated as 0 - https://phabricator.wikimedia.org/T138758#3584272 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi We've upgraded to diamond 4 in {T97635} and its TCP collector includes `gauges` config option, resolving. [13:30:11] (03PS3) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:30:34] _joe_: regarding refreshLinks jobs, they are now 50 pages / job, do you think making the batch size smaller would help? [13:31:18] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:31:55] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3584304 (10Ottomata) > Any downtime permanently affects the graphs. Just an uninformed idea: If you produce directly to graphite (and maybe prometheus too?) inste... 
[13:32:01] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584305 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [13:38:24] (03PS4) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:45:13] (03PS5) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [13:45:51] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:01] PROBLEM - Disk space on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:01] PROBLEM - cassandra-c SSL 10.64.32.196:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:46:10] PROBLEM - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.196 and port 9042: Connection refused [13:46:10] PROBLEM - DPKG on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:27] (03PS16) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [13:46:50] PROBLEM - MD RAID on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:46:50] PROBLEM - configured eth on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:47:06] (03PS1) 10Jcrespo: mariadb: Move db2040's MariaDB socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/376259 (https://phabricator.wikimedia.org/T148507) [13:47:09] (03PS17) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [13:47:40] PROBLEM - dhclient process on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:48:31] PROBLEM - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is CRITICAL: connect to address 10.64.32.187 and port 9042: Connection refused [13:48:31] PROBLEM - puppet last run on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:48:51] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:21] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - salt-minion processes on restbase1008 is CRITICAL: Return code of 255 is out of bounds [13:49:30] PROBLEM - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:49:31] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:31] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:40] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:50] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:49:50] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [13:50:05] checking stat1005 [13:50:24] (03CR) 10Jcrespo: [C: 032] mariadb: Move db2040's MariaDB socket to /run [puppet] - 10https://gerrit.wikimedia.org/r/376259 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:50:29] oom party [13:52:12] also known in some 
circles as oom mani padme hum [13:52:40] RECOVERY - DPKG on stat1005 is OK: All packages OK [13:52:45] <_joe_> Amir1: honestly, I have to review a few details [13:52:51] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [13:52:51] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [13:53:00] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [13:53:13] <_joe_> but if each job takes about 1 minute to execute on terbium, I don't know what will happen with jobrunners and their timeouts [13:53:30] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [13:53:30] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [13:53:34] <_joe_> I'll have to look into it a bit [13:53:40] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:27] thanks [13:55:37] what is interesting is that 99th percentile wait time is growing exponentially despite all the stuff we have done [13:57:21] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3584374 (10jcrespo) [13:57:23] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584372 (10jcrespo) 05Open>03Resolved I think the reboot and/or upgrade fixed it (db2040). [13:58:21] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3584376 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1005.eqiad.wmnet'] ``` Of which those **FAILED**: ``` set(['wdqs10... [13:58:30] RECOVERY - salt-minion processes on restbase1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:58:40] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:58:41] RECOVERY - dhclient process on restbase1008 is OK: PROCS OK: 0 processes with command name dhclient [13:58:51] RECOVERY - MD RAID on restbase1008 is OK: OK: Active: 15, Working: 15, Failed: 0, Spare: 0 [13:58:52] RECOVERY - configured eth on restbase1008 is OK: OK - interfaces up [13:59:01] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active [13:59:10] RECOVERY - Disk space on restbase1008 is OK: DISK OK [13:59:11] RECOVERY - cassandra-c SSL 10.64.32.196:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-c valid until 2018-08-17 16:11:00 +0000 (expires in 345 days) [13:59:11] RECOVERY - DPKG on restbase1008 is OK: All packages OK [13:59:31] RECOVERY - cassandra-a SSL 10.64.32.187:7001 on restbase1008 is OK: SSL OK - Certificate restbase1008-a valid until 2018-08-17 16:10:58 +0000 (expires in 345 days) [14:01:11] RECOVERY - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is OK: TCP OK - 0.000 second response time on 10.64.32.196 port 9042 [14:03:23] apparently my patch to reduce the size from 100 to 50 didn't help [14:04:50] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down!
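The refreshLinks thread above (50 pages per job, a halving that "didn't help", p99 wait growing fast) matches basic queueing behaviour: repacking pages into smaller jobs changes neither the page-level arrival rate nor the page-level service capacity, so once arrivals exceed capacity the backlog, and with it the wait time, grows without bound regardless of batch size. A toy illustration with invented rates:

    # Invented numbers; the point is that batch size cancels out.
    def backlog_pages(hours, enqueue_pps, drain_pps, batch):
        """Pages queued after `hours`, given pages/sec enqueued and drained."""
        jobs_in = enqueue_pps / batch   # jobs arriving per second
        jobs_out = drain_pps / batch    # jobs the runners can finish per second
        growth_jobs = max(0.0, jobs_in - jobs_out)
        return growth_jobs * batch * 3600 * hours

    for batch in (100, 50):
        # Same page rates either way, so the backlog grows identically:
        print(batch, backlog_pages(1, enqueue_pps=1200, drain_pps=1000, batch=batch))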
[14:05:01] ignore ^ [14:05:03] it's me playing [14:05:17] (03PS6) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:05:20] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [14:05:40] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - kubemaster_6443 - Could not depool server chlorine.eqiad.wmnet because of too many down! [14:06:21] PROBLEM - PyBal IPVS diff check on lvs1003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:07:20] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:07:30] PROBLEM - PyBal IPVS diff check on lvs1009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:08:14] (03PS2) 10Muehlenhoff: Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 [14:08:30] PROBLEM - PyBal IPVS diff check on lvs1010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([chlorine.eqiad.wmnet]) [14:08:58] (03CR) 10Muehlenhoff: [C: 032] Remove swift salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376234 (owner: 10Muehlenhoff) [14:09:18] apergos: ping [14:09:21] (03PS18) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [14:09:37] (03PS19) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [14:09:41] RECOVERY - cassandra-a CQL 10.64.32.187:9042 on restbase1008 is OK: TCP OK - 0.000 second response time on 10.64.32.187 port 9042 [14:09:51] (03PS2) 10Muehlenhoff: Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 [14:10:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:12:10] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2051665 [14:12:24] (03PS7) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:12:40] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [14:13:06] (03CR) 10Muehlenhoff: [C: 032] Remove kafka salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376237 (owner: 10Muehlenhoff) [14:14:30] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:15:00] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [14:16:21] RECOVERY - PyBal IPVS diff check on lvs1003 is OK: OK: no difference between hosts in IPVS/PyBal [14:17:04] !log wdqs1005 is coming up after a few reimaging issues, expect some icinga noise... 
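The "Could not depool server chlorine.eqiad.wmnet because of too many down!" lines above show pybal's depool safety valve: a backend failing its health checks is only removed while enough of the pool stays up. A rough sketch of that guard (parameter names are assumed, not pybal's actual code), which also shows why an effectively single-host pool like kubemaster can never be auto-depooled:

    # Assumed shape of the guard; pybal's real implementation differs.
    import math

    def can_depool(pooled_up, pool_size, depool_threshold=0.5):
        """Allow depooling only if >= threshold of the pool would stay up."""
        return pooled_up - 1 >= math.ceil(pool_size * depool_threshold)

    # chlorine is effectively the only kubemaster backend, so pybal refuses
    # to depool it even though its checks fail:
    print(can_depool(pooled_up=1, pool_size=1))  # False -> "too many down"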
[14:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:21] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [14:17:30] RECOVERY - PyBal IPVS diff check on lvs1009 is OK: OK: no difference between hosts in IPVS/PyBal [14:18:13] (03PS8) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:18:26] RECOVERY - PyBal IPVS diff check on lvs1010 is OK: OK: no difference between hosts in IPVS/PyBal [14:19:56] PROBLEM - Host restbase1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:02] (03PS1) 10Muehlenhoff: Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 [14:21:06] RECOVERY - Host restbase1010 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:22:20] (03Draft1) 10Paladox: Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 [14:22:22] (03PS2) 10Paladox: Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 [14:22:51] (03PS1) 10Muehlenhoff: Remove k8s salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376265 [14:23:18] (03PS1) 10Filippo Giunchedi: cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) [14:23:30] (03PS9) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:23:34] (03CR) 10jerkins-bot: [V: 04-1] cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:24:15] PROBLEM - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:24:25] (03PS2) 10Filippo Giunchedi: cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) [14:25:30] TheresNoTime: yes? [14:25:35] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:25:43] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: enable jmx_exporter for cassandra 3 cluster [puppet] - 10https://gerrit.wikimedia.org/r/376267 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:28:53] (03PS1) 10Ema: varnish::logging::statsd: instance_name future parser check [puppet] - 10https://gerrit.wikimedia.org/r/376269 [14:29:01] apergos: how's it going? 
I've been chatting to someone at OVH about them providing a mirror for the XML dumps (they seem interested), and I've looped `ops-dumps` into the latest email as they're asking some questions I obviously can't answer [14:29:26] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 [14:29:40] (03CR) 10Krinkle: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [14:29:42] it might be useful if we could provide answers to some of the questions on the [[Mirroring Wikimedia project XML dumps]] page (such as estimated traffic etc) [14:29:59] (03PS1) 10Muehlenhoff: Remove labtest salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376271 [14:30:57] TheresNoTime: I see your mail from about 20 minutes ago, I'm happy to carry on the conversation there [14:30:58] (03CR) 10Elukey: [C: 031] Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 (owner: 10Muehlenhoff) [14:31:29] apergos: please do, out of my depth in terms of what they're asking :) [14:32:02] 10Operations, 10Ops-Access-Requests: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3584447 (10schoenbaechler) [14:32:08] (03CR) 10Ema: [V: 032 C: 032] varnish::logging::statsd: instance_name future parser check [puppet] - 10https://gerrit.wikimedia.org/r/376269 (owner: 10Ema) [14:32:08] well I don't see the email where they ask for info, so you might have to forward that to me or something [14:32:18] or ask them to, whichever [14:32:31] I'm happy to respond with info once I see their questions [14:32:35] (03PS1) 10Muehlenhoff: Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 [14:33:35] (03CR) 10Krinkle: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [14:34:11] (03PS10) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:35:32] (03PS1) 10Muehlenhoff: Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 [14:37:49] (03CR) 10ArielGlenn: [C: 031] "Fine by me."
[puppet] - 10https://gerrit.wikimedia.org/r/376273 (owner: 10Muehlenhoff) [14:38:14] (03PS1) 10Muehlenhoff: Remove NFS salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376274 [14:38:28] (03PS2) 10Alexandros Kosiaris: kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 [14:38:52] (03CR) 10Alexandros Kosiaris: [C: 032] kubernetes: Allow switching storage backend versions [puppet] - 10https://gerrit.wikimedia.org/r/376270 (owner: 10Alexandros Kosiaris) [14:41:10] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3584495 (10jcrespo) [14:41:13] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584493 (10jcrespo) 05Resolved>03Open checking es1019 [14:41:35] (03PS1) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) [14:41:41] (03PS1) 10Muehlenhoff: Remove puppetmaster/puppetdb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376277 [14:42:33] (03PS11) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:42:55] (03CR) 10jerkins-bot: [V: 04-1] varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 (owner: 10Ema) [14:43:26] (03CR) 10Giuseppe Lavagetto: [C: 031] "Small nitpicky request of better logging, but LGTM" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [14:43:45] <_joe_> mobrovac: /go mobrovac [14:43:48] <_joe_> augh [14:43:59] <_joe_> I decided to write you in private after all [14:44:03] haha [14:44:10] <_joe_> but yeah, see my comment [14:46:18] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584505 (10Krinkle) >>! In T173710#3583445, @Joe wrote: > As a side comment: this is one of the cases where I would've loved to have an elastic environme... [14:46:22] (03PS12) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:48:17] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/376277 (owner: 10Muehlenhoff) [14:48:36] 10Operations, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring - https://phabricator.wikimedia.org/T148541#2725758 (10mark) @fgiunchedi: Could you elaborate why the SNMP exporter to prometheus didn't wor... 
[14:51:01] (03PS13) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [14:52:12] (03CR) 10Muehlenhoff: "The monitoring hosts already have a carte blanche via the 'monitoring-all' ferm::rules in modules/base/manifests/firewall.pp, so that can " [puppet] - 10https://gerrit.wikimedia.org/r/376024 (owner: 10Giuseppe Lavagetto) [14:53:00] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:53:09] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 (owner: 10Jcrespo) [14:53:42] (03PS2) 10Muehlenhoff: Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 [14:54:15] (03CR) 10Muehlenhoff: [C: 032] Remove dumps/snapshot salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376273 (owner: 10Muehlenhoff) [14:54:36] (03Merged) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:54:48] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 (owner: 10Jcrespo) [14:55:20] (03PS2) 10Muehlenhoff: Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 [14:56:29] (03CR) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376276 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [14:56:34] (03CR) 10Muehlenhoff: [C: 032] Remove analytics salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376263 (owner: 10Muehlenhoff) [14:56:39] (03PS6) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [14:57:12] (03CR) 10Mobrovac: [C: 031] JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:01:10] (03PS20) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:03:59] (03CR) 10Giuseppe Lavagetto: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:04:11] <_joe_> mobrovac: another small correction, sorry :P [15:04:19] huh kk [15:04:23] <_joe_> but then we can just merge and test it [15:04:29] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584606 (10jcrespo) > Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). If I could add to the ideal scenario... 
[15:04:36] <_joe_> the endpoint on LVS should already work [15:05:16] duh, good point _joe_, i wanted to write $match[1] but ended up with $value lol [15:06:17] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 49s) [15:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:47] (03PS7) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [15:07:08] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3584613 (10Joe) >>! In T173710#3584505, @Krinkle wrote: >>>! In T173710#3583445, @Joe wrote: >> As a side comment: this is one of the cases where I would... [15:07:15] (03CR) 10Mobrovac: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:07:36] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, let's merge this!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [15:07:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1019 (duration: 00m 49s) [15:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:01] _joe_: in a meeting, let's merge/sync 20 mins from now? [15:08:08] <_joe_> ok [15:08:13] <_joe_> ping me when you're done [15:11:55] !log cp1049 - restart varnish backend (mailbox lag) [15:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:52] (03PS14) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [15:13:32] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3584649 (10Papaul) @elukey I spoke today with one of the Dell managers on this case. He assured me that he will personally follow this case with the engineer working with me. He asked that I go ahead and update the firmw... [15:16:38] 10Operations, 10Diamond, 10Traffic, 10monitoring, 10Prometheus-metrics-monitoring: Enable diamond PowerDNSRecursor collector on dnsrecursors - https://phabricator.wikimedia.org/T169600#3584652 (10faidon) a:03akosiaris [15:17:08] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational [15:20:51] (03PS1) 10Andrew Bogott: openstack: allow primary glance server to rsync to secondary [puppet] - 10https://gerrit.wikimedia.org/r/376280 [15:21:18] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3584658 (10Anomie) >>! In T171392#3583337, @Verdy_p... [15:21:33] (03CR) 10Andrew Bogott: [C: 032] openstack: allow primary glance server to rsync to secondary [puppet] - 10https://gerrit.wikimedia.org/r/376280 (owner: 10Andrew Bogott) [15:22:17] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [15:25:35] (03CR) 10Jforrester: [C: 04-2] "> I don't see why we can't land this already.
We're already directing people to it on the wmfwiki website, and IIRC we're already receivin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [15:26:21] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3584666 (10zhuyifei1999) >>! In T171392#3583339, @Ve... [15:27:22] (03CR) 10Reedy: "I do note 1st October is a sunday. And we don't deploy on a sunday..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [15:28:08] !log firmware upgrade on mw2256 [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:00] (03PS1) 10Cmjohnson: Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 [15:30:08] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [15:30:23] mw2256 ? [15:30:29] * akosiaris looking [15:31:29] damn... never read backlog... my bad [15:31:53] (03CR) 10Cmjohnson: [C: 032] Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 (owner: 10Cmjohnson) [15:32:29] (03PS2) 10Cmjohnson: Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 [15:33:05] (03CR) 10Cmjohnson: [V: 032 C: 032] Fixing kafka-jumbo1005 production dns located in wrong vlan, and adding asset tags to mgmt. [dns] - 10https://gerrit.wikimedia.org/r/376283 (owner: 10Cmjohnson) [15:35:18] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [15:36:13] <_joe_> akosiaris: did you power it up again? 
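The cp1049 restart logged above ("mailbox lag"), like the cp1062/cp1072 ones later, is the usual fix for a stalled Varnish expiry thread: objects "mailed" to it for expiry pile up unhandled, and the lag counter climbs into the millions. A sketch of how such a check can be derived from varnishstat counters; this is an assumption about how the WMF check works, not a copy of it:

    # Assumed check logic; MAIN.exp_mailed/MAIN.exp_received exist in
    # Varnish 4, but the JSON layout differs between versions.
    import json
    import subprocess

    def expiry_mailbox_lag():
        stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
        mailed = stats["MAIN.exp_mailed"]["value"]
        received = stats["MAIN.exp_received"]["value"]
        return mailed - received  # grows unbounded when the expiry thread stalls

    lag = expiry_mailbox_lag()
    print("CRITICAL" if lag > 2000000 else "OK", "expiry mailbox lag is", lag)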
[15:36:28] _joe_: it's papaul upgrading the firmware [15:36:31] <_joe_> oh, ok, yes [15:36:44] at least I am not the only one guilty of not reading the backlog [15:36:46] <_joe_> because there was a request from elukey not to power it back up [15:37:27] PROBLEM - dhclient process on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:27] PROBLEM - salt-minion processes on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:28] PROBLEM - MD RAID on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:37] PROBLEM - configured eth on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:38] PROBLEM - Disk space on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:48] PROBLEM - nutcracker process on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:49] 10Operations, 10Operations-Software-Development, 10monitoring, 10User-fgiunchedi: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#3584730 (10fgiunchedi) [15:37:51] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3584733 (10fgiunchedi) [15:37:57] PROBLEM - Apache HTTP on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 80: Connection refused [15:37:57] PROBLEM - HHVM rendering on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 80: Connection refused [15:37:58] PROBLEM - Check whether ferm is active by checking the default input chain on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:37:58] PROBLEM - SSH on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 22: Connection refused [15:37:58] PROBLEM - Check size of conntrack table on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:04] the downtime might have expired, I'll completely silence the host sorry [15:38:06] 10Operations, 10Operations-Software-Development, 10monitoring, 10User-fgiunchedi: monitor SSD wear levels - https://phabricator.wikimedia.org/T86556#971260 (10fgiunchedi) Folding into parent task as duplicate [15:38:07] PROBLEM - puppet last run on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:07] PROBLEM - nutcracker port on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:08] PROBLEM - Check systemd state on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:17] PROBLEM - Nginx local proxy to apache on mw2256 is CRITICAL: connect to address 10.192.16.55 and port 443: Connection refused [15:38:18] PROBLEM - HHVM processes on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:27] PROBLEM - DPKG on mw2256 is CRITICAL: Return code of 255 is out of bounds [15:38:53] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3584737 (10jcrespo) es1019 seems to have rebroken T155691. I have depooled it, but it will take days to become effective (because backups do not res... [15:42:29] (03CR) 10KartikMistry: "> Let's do what mobrovac suggests.
That is the current status quo," [15:44:18] (03PS1) 10Volans: Cluster management: add some roles from neodymium [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) [15:44:52] (03PS21) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [15:44:52] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:18] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:28] (03PS8) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 [15:45:34] <_joe_> uh? [15:45:35] uh, lvs3001 [15:45:38] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:42] ouch [15:45:48] <_joe_> ema: why is bgp not switching? [15:45:50] it was having a degraded raid... [15:46:24] <_joe_> mobrovac: wait a sec please [15:46:27] console is unresponsive [15:46:34] what has degraded raid? 3001? [15:46:34] <_joe_> lvs3003 is unreachable too [15:46:40] k [15:46:42] <_joe_> nah scratch that [15:46:43] <_joe_> that [15:46:45] <_joe_> 's me [15:46:56] <_joe_> still it should've caught up by now [15:47:01] bblack: yes, it has had a degraded raid for a while T166965 [15:47:02] T166965: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T166965 [15:47:13] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:20] IWUT? [15:47:21] ouch [15:47:29] sites down in the EU? [15:47:32] yes [15:47:37] ok let's depool esams ? [15:47:39] yes let's depool for now, the problem looks tricky [15:47:45] <_joe_> can someone depool esams? [15:47:48] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:47:53] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:24] (03PS1) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 [15:48:41] (03CR) 10Alexandros Kosiaris: [C: 031] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:48:45] (03CR) 10Jcrespo: [C: 031] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:49:14] (03CR) 10BBlack: [V: 032 C: 032] depool esams [dns] - 10https://gerrit.wikimedia.org/r/376285 (owner: 10BBlack) [15:49:44] (03CR) 10Mobrovac: [C: 031] Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 (owner: 10Muehlenhoff) [15:49:53] so when I switched my bastion to bast3001, bast3001 was reporting dns errors looking up lvs hostnames [15:50:03] PROBLEM - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:51:16] (03CR) 10Mobrovac: [C: 031] Remove sca/scb salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376272 (owner: 10Muehlenhoff) [15:53:33] !log powercycling lvs3001 [15:53:43] PROBLEM - Check systemd state on kafka-jumbo1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
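The "depool esams" commit above is a change to the authoritative DNS configuration: European clients normally get esams addresses for text-lb via a geo map, and the depool drops esams from that map so resolvers fall back to another datacenter as their cached answers expire. A very rough model of the lookup (the map contents and gdnsd's actual config format are not reproduced here):

    # Toy model only; the real gdnsd geo-map config looks nothing like this.
    GEO_MAP = {
        "EU": ["esams", "eqiad"],   # preference order per region (assumed)
        "NA": ["eqiad", "codfw"],
    }
    DEPOOLED = {"esams"}            # what the depool commit effectively flips

    def resolve_text_lb(region):
        for dc in GEO_MAP.get(region, ["eqiad"]):
            if dc not in DEPOOLED:
                return "text-lb.%s.wikimedia.org" % dc
        return "text-lb.eqiad.wikimedia.org"

    print(resolve_text_lb("EU"))    # -> text-lb.eqiad.wikimedia.org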
[15:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:43] RECOVERY - Check systemd state on kafka-jumbo1006 is OK: OK - running: The system is fully operational [15:55:40] (03PS1) 10Gehel: Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 [15:56:54] Wikipedia down? (Germany) [15:57:00] yes [15:57:13] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [15:57:28] Any possibility to circumvent it? ^^ [15:57:56] Guest12334_: refresh, your dns may be cached [15:58:18] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.74 ms [15:58:23] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms [15:58:37] Huh, now it works again [15:58:58] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16877 bytes in 0.488 second response time [15:59:01] <_joe_> Nemo_bis: it should've been working already for some time [15:59:04] yeah. people fixed it ;) [15:59:19] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16879 bytes in 0.529 second response time [16:00:04] PROBLEM - Check systemd state on lvs3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:00:15] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175168 [16:00:18] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175168#3584827 (10ops-monitoring-bot) [16:01:04] _joe_: yeah it worked for me when guest12334 asked [16:01:12] +1 [16:01:15] Nemo_bis: depending on isp/browser/etc. it could have a delay on seeing it up again [16:01:30] <_joe_> depending on broken DNS caching chains :) [16:01:34] yep [16:01:38] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T175168#3584835 (10Volans) [16:01:38] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T168619#3584837 (10Volans) [16:01:51] it should be seconds, but we cannot control anything beyond the infrastructure :-) [16:02:22] well [16:02:28] we do control our TTLs, and they're 10 minutes [16:02:51] so when we do the depool, the expectation (set by us) is anywhere from 0-10m randomly for broken users to see things work again [16:02:54] <_joe_> bblack: we do, but some isps dns recursors don't respect cache TTL [16:03:05] yeah but even those that do, there's no expectation of an instant fix for all [16:03:08] oh, so large? [16:03:21] there are tradeoffs [16:03:27] yeah, I know [16:03:36] especially if you cannot predict it [16:03:54] https://phabricator.wikimedia.org/T140365 [16:04:03] ^ ticket about dropping the TTL from 10m -> 5m [16:05:33] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2284738 [16:06:14] PROBLEM - Host mw2256 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:54] mw2256 is starting to get annoying [16:07:42] jynus: I silenced all the alarms except the host one since we do need to know when it goes up and down [16:08:42] (03CR) 10Hoo man: [C: 04-1] "Looks fine, briefly tested the changes to the WD dump script on snapshot1007. -1 for the wrong file header."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) (owner: 10ArielGlenn) [16:08:51] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3584855 (10Eevans) [16:08:55] 10Operations, 10Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3584854 (10Eevans) 05Open>03Resolved [16:09:13] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 2100772 [16:12:01] <_joe_> mobrovac: you can go on now, sorry [16:12:31] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3584863 (10Papaul) In the process of updating the firmware on the server, the server got again in a frozen state. nothing on the monitor and no keyboard response as well. [16:16:06] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [16:20:09] (03PS2) 10Gehel: Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 [16:20:35] elukey: ah, things were happening [16:20:54] I thought it was stalled, hence my annoyance [16:21:27] jynus: yeah I know :( we are trying to figure out why it randomly freezes [16:23:49] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3584905 (10elukey) All hosts up with OS installed and puppet/salt running. [16:24:35] (03PS13) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [16:24:36] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3584927 (10elukey) [16:25:04] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10elukey) [16:25:16] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.87 ms [16:25:23] _joe_: eh now i have a meeting in 5 mins, we'll have to postpone, tomorrow morning?
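A back-of-the-envelope sketch of the TTL discussion above: a well-behaved resolver that cached the record at a uniformly random point in the TTL window keeps serving the stale answer for the residual TTL, so with the 10m TTL the expected wait after a depool is about 5 minutes and the worst case 10 (ignoring the broken recursors _joe_ mentions). The numbers are illustrative only:

    import random

    TTL = 600  # the 10-minute TTL mentioned above, in seconds

    def residual_wait(ttl: float) -> float:
        """Seconds until one well-behaved resolver expires the stale record."""
        return random.uniform(0, ttl)

    waits = [residual_wait(TTL) for _ in range(100_000)]
    print(f"mean ~{sum(waits) / len(waits) / 60:.1f} min, worst case {TTL / 60:.0f} min")
    # dropping the TTL to 5m (T140365) halves both figures, at the cost of
    # roughly doubling the query load on the authoritative DNS servers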
[16:25:31] (03CR) 10Phedenskog: Make values stackable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [16:26:40] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3584940 (10Eevans) [16:31:02] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) (owner: 10Volans) [16:32:54] <_joe_> mobrovac: ok [16:33:00] <_joe_> mobrovac: you have too many meetings [16:33:07] tell me something i don't know [16:33:12] :P [16:37:17] (03CR) 10Volans: [C: 032] "Noop on neodymium as expected, change on sarin: https://puppet-compiler.wmflabs.org/compiler02/7755/" [puppet] - 10https://gerrit.wikimedia.org/r/376284 (https://phabricator.wikimedia.org/T166300) (owner: 10Volans) [16:40:08] !log cp1062 - varnish backend restart (mailbox lag) [16:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] !log cp1072 - varnish backend restart (mailbox lag) [16:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:10] !log disable puppet for cloud things to test changes [16:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:28] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK [16:43:31] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175174 [16:43:34] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175174#3585012 (10ops-monitoring-bot) [16:45:09] (03PS22) 10Rush: openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) [16:45:11] godog: at least this time the handler worked... I'll close it as a duplicate [16:45:27] volans: heheh indeed, thanks! [16:45:37] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [16:45:43] godog: why did it alarm again btw? [16:46:01] volans: working on it [16:46:03] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3585025 (10Volans) [16:46:05] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175174#3585023 (10Volans) [16:46:10] papaul: ah ok, thanks! 
[16:46:22] (03CR) 10Rush: [C: 032] openstack: nova components for module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376026 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:46:46] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) [16:49:17] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0 [16:49:17] (03CR) 10DCausse: [C: 032] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [16:49:18] PROBLEM - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:50:57] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:07] RECOVERY - Check systemd state on lvs3001 is OK: OK - running: The system is fully operational [16:51:36] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/Flow/includes/: I284b5aa (duration: 01m 01s) [16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:19] (03PS14) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [16:53:21] (03CR) 10Muehlenhoff: [C: 031] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [16:57:07] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:57:31] (03CR) 10Chad: [C: 032] Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374663 (owner: 10Chad) [16:57:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [16:57:58] PROBLEM - DPKG on labtestneutron2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:58:58] RECOVERY - DPKG on labtestneutron2001 is OK: All packages OK [16:59:09] (03Merged) 10jenkins-bot: Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374663 (owner: 10Chad) [16:59:19] (03PS1) 10Rush: openstack: correct key path for labtest settings [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) [16:59:56] (03CR) 10DCausse: [V: 032 C: 032] Add a "Section" to the package metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376287 (owner: 10Gehel) [17:01:28] PROBLEM - Check Varnish expiry mailbox lag on cp1048 is CRITICAL: CRITICAL: expiry mailbox lag is 2192370 [17:01:41] (03PS2) 10Rush: openstack: correct key paths for profiles [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) [17:02:34] (03CR) 10Rush: [C: 032] openstack: correct key paths for profiles [puppet] - 10https://gerrit.wikimedia.org/r/376292 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:02:58] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [17:04:07] 10Operations, 10Ops-Access-Requests: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3584447 
(10Dzahn) I just checked this and yea pivot.wikimedia.org is using LDAP auth and one of the groups "wmf" or "nda" are enough to be granted access. Adding you to "wmf" isn't a... [17:05:18] (03PS1) 10Rush: openstack: amend keypaths for labtest nova settings [puppet] - 10https://gerrit.wikimedia.org/r/376293 (https://phabricator.wikimedia.org/T171494) [17:05:49] (03CR) 10Rush: [C: 032] openstack: amend keypaths for labtest nova settings [puppet] - 10https://gerrit.wikimedia.org/r/376293 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:07:00] (03PS1) 10Ema: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/376294 [17:08:11] (03CR) 10Ema: [V: 032 C: 032] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/376294 (owner: 10Ema) [17:08:18] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:09:41] (03PS1) 10Dzahn: admins: add rschoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) [17:10:34] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585203 (10Dzahn) cn: Schoenbaechler mail: rschoenbaechler@wikimedia.org uid: schoenbaechler note to others: cn/uid differ, watch out [17:11:57] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [17:12:45] (03PS2) 10Dzahn: admins: add schoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) [17:13:28] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [17:15:44] (03CR) 10Dzahn: [C: 032] admins: add schoenbaechler to LDAP-only WMF users [puppet] - 10https://gerrit.wikimedia.org/r/376295 (https://phabricator.wikimedia.org/T175156) (owner: 10Dzahn) [17:16:20] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3585248 (10Papaul) a:05Papaul>03fgiunchedi @fgiunchedi Disk replacement complete. Below please see information for return package. {F9360309} [17:16:34] !log modify-ldap-group on terbium is broken [17:16:40] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team-Backlog (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3585250 (10awight) [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:18] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:19:22] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585265 (10Verdy_p) >>! In T171392#3584666, @zhuyife... 
[17:20:04] !log added LDAP user schoenbaechler to WMF group (T175156) [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:16] T175156: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156 [17:21:32] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585279 (10Dzahn) Hi Robin @schoenbaechler, you have been added to the relevant group. You should be able to login now, using your wikitech.wikimedia.org / LDAP... [17:21:46] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3585280 (10Dzahn) 05Open>03Resolved a:03Dzahn [17:23:12] !log demon@tin Synchronized wmf-config/FeaturedFeedsWMF.php: code cleanup (duration: 00m 49s) [17:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:40] (03CR) 10Rush: [C: 032] openstack: refactor corrections for labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/376301 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:25:43] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3585349 (10Krinkle) [17:25:57] (03CR) 10Filippo Giunchedi: [C: 031] Remove restbase salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376250 (owner: 10Muehlenhoff) [17:27:42] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:28:36] (03PS1) 10Rush: openstack: rabbit_user keypath nova specific [puppet] - 10https://gerrit.wikimedia.org/r/376302 (https://phabricator.wikimedia.org/T171494) [17:28:45] (03PS2) 10Rush: openstack: rabbit_user keypath nova specific [puppet] - 10https://gerrit.wikimedia.org/r/376302 (https://phabricator.wikimedia.org/T171494) [17:28:58] 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3585418 (10Dzahn) [17:29:04] 10Operations, 10Security-Team, 10vm-requests: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3585416 (10Dzahn) 05declined>03Open @EddieGP Maybe, not sure. I'll take it and reopen to figure it out. [17:31:32] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3585440 (10Dzahn) a:03Dzahn Oh, thanks @Aklapper :) yep [17:31:42] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:31:55] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:32:43] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3585446 (10madhuvishy) @Robh @Cmjohnson I'm able to log in to both machines with their .wikimedia.org hostnames and run puppet fine. However, when I hop into the serial console, they bo...
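For context, adding a member to an LDAP group (as !logged above) boils down to a single LDAP modify operation appending a value to the group's member attribute. A minimal python-ldap sketch with hypothetical DNs and server; this is not the actual modify-ldap-group tool on terbium (reported broken above):

    import ldap

    conn = ldap.initialize("ldaps://ldap.example.org")  # hypothetical server
    conn.simple_bind_s("cn=admin,dc=example,dc=org", "secret")
    # append one value to the group's member attribute
    conn.modify_s(
        "cn=wmf,ou=groups,dc=example,dc=org",
        [(ldap.MOD_ADD, "member", [b"uid=schoenbaechler,ou=people,dc=example,dc=org"])],
    )
    conn.unbind_s()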
[17:33:08] (03PS1) 10Rush: openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) [17:33:11] (03PS10) 10Paladox: Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 [17:33:12] (03PS2) 10Rush: openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) [17:34:09] (03CR) 10Rush: [C: 032] openstack: key path correction for labtest nova network [puppet] - 10https://gerrit.wikimedia.org/r/376303 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:34:26] (03PS4) 10Paladox: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 [17:34:26] 10Operations, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#3585453 (10Dzahn) p:05Triage>03Normal [17:34:29] (03PS8) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [17:34:49] (03PS21) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [17:35:52] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:37:52] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: decommission cp3001 & cp3002 - https://phabricator.wikimedia.org/T94215#3585506 (10Dzahn) @Robh does this need the decom template (after the fact)? [17:40:24] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3585536 (10Cmjohnson) Probably could use a bios update. [17:40:33] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3194210 (10Dzahn) Given all the work that has gone into this single host and it still being dead after all this.. i suggest we just give up on it and permanently decom it. It probably costs us less in the end that way. [17:44:49] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585563 (10Anomie) >>! In T171392#3585265, @Verdy_p... 
[17:45:40] (03PS1) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:45:52] (03PS2) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:45:56] (03CR) 10jerkins-bot: [V: 04-1] openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:46:47] (03PS3) 10Rush: openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) [17:47:13] (03CR) 10Phedenskog: Make values stackable (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:47:19] (03CR) 10Rush: [C: 032] openstack: set spice_hostname [puppet] - 10https://gerrit.wikimedia.org/r/376304 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:47:39] (03PS2) 10Paladox: planet: add Wikimedia Readers blog [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:48:07] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3585595 (10Cmjohnson) Created the bootable img using the HP utility provided in the iso. It is a Windows software and had to borrow from a family member. Booted the Service pack and... [17:49:12] (03CR) 10Dzahn: [C: 032] "thanks Paladox for adding the feed in both places (new for rawdog, upcoming replacement of planet-venus on stretch)" [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:49:26] (03CR) 10Paladox: "You're welcome :)" [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:49:40] (03PS1) 10Chad: group1 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376305 [17:49:52] (03PS3) 10Dzahn: planet: add Wikimedia Readers blog [puppet] - 10https://gerrit.wikimedia.org/r/375085 (owner: 10BryanDavis) [17:52:43] (03PS15) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [17:53:00] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3585627 (10Jarekt) I can look through c:Module:Fallb... [17:53:09] !log cp1048 - varnish backend restart for mailbox lag...
[17:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:40] (03PS1) 10Rush: openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) [17:54:19] (03CR) 10Phedenskog: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [17:54:23] (03PS2) 10Rush: openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) [17:54:46] (03PS2) 10BBlack: VCL: fix keep values at 7d [puppet] - 10https://gerrit.wikimedia.org/r/364605 [17:55:01] (03CR) 10Rush: [C: 032] openstack: set labspice to fqdns [puppet] - 10https://gerrit.wikimedia.org/r/376308 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:56:17] (03PS1) 10BBlack: browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) [17:56:19] (03PS1) 10BBlack: browsersec: bump to 14% 2017-09-07 [puppet] - 10https://gerrit.wikimedia.org/r/376310 (https://phabricator.wikimedia.org/T163251) [17:56:21] (03PS1) 10BBlack: browsersec: bump to 17% 2017-09-14 [puppet] - 10https://gerrit.wikimedia.org/r/376311 (https://phabricator.wikimedia.org/T163251) [17:56:23] (03PS1) 10BBlack: browsersec: bump to 20% 2017-09-21 [puppet] - 10https://gerrit.wikimedia.org/r/376312 (https://phabricator.wikimedia.org/T163251) [17:56:25] (03PS1) 10BBlack: browsersec: bump to 23% 2017-09-28 [puppet] - 10https://gerrit.wikimedia.org/r/376313 (https://phabricator.wikimedia.org/T163251) [17:56:27] (03PS1) 10BBlack: browsersec: bump to 26% 2017-10-05 [puppet] - 10https://gerrit.wikimedia.org/r/376314 (https://phabricator.wikimedia.org/T163251) [17:56:29] (03PS1) 10BBlack: browsersec: bump to 29% 2017-10-12 [puppet] - 10https://gerrit.wikimedia.org/r/376315 (https://phabricator.wikimedia.org/T163251) [17:56:31] (03PS1) 10BBlack: browsersec: bump to 100% 2017-10-17 [puppet] - 10https://gerrit.wikimedia.org/r/376316 (https://phabricator.wikimedia.org/T163251) [17:56:33] (03CR) 10Rush: [C: 04-1] "I backported this from the ping online but it appears it wasn't merged. This is now part of https://gerrit.wikimedia.org/r/#/c/376026/ wh" [puppet] - 10https://gerrit.wikimedia.org/r/375939 (https://phabricator.wikimedia.org/T170447) (owner: 10Andrew Bogott) [17:57:11] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2143201 [17:57:54] (03PS1) 10Hoo man: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1800). [18:00:04] jan_drewniak and MaxSem: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
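The browsersec patch series above schedules a ramp from 14% up to 100% over six weeks. One common way to implement such a percentage rollout is deterministic bucketing, so that a given client stays enrolled as the percentage grows rather than flapping in and out; a generic sketch, not necessarily what the VCL in those patches actually does:

    import hashlib

    def in_rollout(client_id: str, pct: int) -> bool:
        """Hash the client into one of 100 stable buckets; enroll the first pct."""
        bucket = int(hashlib.sha256(client_id.encode()).hexdigest(), 16) % 100
        return bucket < pct

    # the schedule above: 14% on 09-07, 17% on 09-14, ... 100% on 10-17
    for pct in (14, 17, 20, 23, 26, 29, 100):
        print(pct, in_rollout("192.0.2.1", pct))  # hypothetical client key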
[18:00:06] I'll swat [18:00:16] o/ [18:00:38] * hoo just added https://gerrit.wikimedia.org/r/376317 to SWAT [18:00:43] a quick review would be nice [18:00:49] (03PS2) 10MaxSem: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:00:54] (03CR) 10MaxSem: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:01:11] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:01:40] RECOVERY - Check Varnish expiry mailbox lag on cp1048 is OK: OK: expiry mailbox lag is 0 [18:01:46] (03PS3) 10BBlack: VCL: fixed keep values: 7d def, 1d for text [puppet] - 10https://gerrit.wikimedia.org/r/364605 [18:01:47] Amir1: ^ [18:02:32] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [18:03:15] ema: so I've amended my lingering https://gerrit.wikimedia.org/r/#/c/364605/ patch that gets rid of keep-relative-to-TTL to also fix text down to 1d static, so that we don't risk ugly problems with MW's bad-304 [18:03:39] ema: it could stand to be better thought out or dealt with, but maybe if we're lucky this reduces the mailbox lag rate :P [18:04:29] anomie, Warning: Using deprecated fallback handling for comment rev_comment [Called from CommentStore::getCommentInternal in /srv/mediawiki/php-1.30.0- [18:04:29] wmf.17/includes/CommentStore.php at line 200] in /srv/mediawiki/php-1.30.0-wmf.17/includes/debug/MWDebug.php on line 309 [18:04:31] (03CR) 10BBlack: [C: 032] VCL: fixed keep values: 7d def, 1d for text [puppet] - 10https://gerrit.wikimedia.org/r/364605 (owner: 10BBlack) [18:04:50] heh apparently I split that over two channels, oh well [18:05:57] jan_drewniak, pulled on mwdebug1002 [18:06:40] MaxSem: looks good [18:08:11] MaxSem: I talked about that with no_justification in #mediawiki-core yesterday and earlier today. no_justification: Re those "Using deprecated fallback handling for comment" warnings, I found backtraces in error.log on mwlog1001. There seem to be three. Flow should be fixed by backporting https://gerrit.wikimedia.org/r/#/c/374861/. MobileFrontend has SpecialMobileHistory and SpecialMobileContributions, for which T175161 exists. [18:08:12] T175161: Special:MobileHistory warning: Using deprecated fallback handling for comment rev_comment [Called from CommentStore::getCommentInternal in /Users/jrobson/git/core/includes/CommentStore.php at line 200] - https://phabricator.wikimedia.org/T175161 [18:08:33] anomie: The Flow one I backported and sync'd already [18:08:46] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 50s) [18:08:50] Also, fwiw, I don't see the other ones in the log anymore tbh [18:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:08] Wait. Yes I do. 
group1 [18:09:09] Not 0 [18:09:12] * no_justification sighs [18:09:37] !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 49s) [18:09:41] jan_drewniak, ^ [18:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:22] (03PS4) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [18:10:27] (03PS2) 10BBlack: browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) [18:10:30] (03PS1) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:10:35] MaxSem: looks good in prod, thanks! [18:10:42] (03PS2) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:10:44] (03CR) 10BBlack: [V: 032 C: 032] browsersec: affect API calls and non-GET as well [puppet] - 10https://gerrit.wikimedia.org/r/376309 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [18:11:13] hoo, you're next [18:11:18] (03PS3) 10Rush: openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) [18:11:21] (03PS2) 10MaxSem: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:11:57] (03CR) 10MaxSem: [C: 032] Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:12:01] (03CR) 10Rush: [C: 032] openstack: cease managing nova files via openstack/common.pp [puppet] - 10https://gerrit.wikimedia.org/r/376318 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:12:12] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3585730 (10Dzahn) a:03Dzahn [18:13:32] (03Merged) 10jenkins-bot: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man) [18:14:09] hoo, pulled on mwdebug1002 [18:14:21] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:14:39] looks good [18:15:09] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3523998 (10herron) Also, FWIW, https://www.icinga.com/docs/icinga1/latest/en/tuning.html item 11 outlines a similar approach. [18:16:30] hoo, first -production, then the other file? 
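On bblack's keep-values change above: a Varnish object can still be revalidated (and so must be retained) for ttl + grace + keep, so deriving keep from the TTL lets long-TTL objects linger for the expiry thread, which is the suspected feeder of the recurring mailbox-lag alerts in this log. A toy comparison with made-up grace and ratio values:

    DAY = 86400

    def retention(ttl: float, grace: float, keep: float) -> float:
        """Seconds an object stays around for possible IMS revalidation."""
        return ttl + grace + keep

    ttl, grace = 7 * DAY, 1 * DAY                # hypothetical long-lived object
    print(retention(ttl, grace, ttl) / DAY)      # keep tied to TTL: 15 days
    print(retention(ttl, grace, 1 * DAY) / DAY)  # fixed 1d keep for text: 9 days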
[18:16:40] PROBLEM - Check whether ferm is active by checking the default input chain on labcontrol1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [18:17:10] Both are fine, given the value in Wikibase.php is overwritten in -prod anyway [18:17:40] RECOVERY - Check whether ferm is active by checking the default input chain on labcontrol1002 is OK: OK ferm input default policy is set [18:18:16] !log maxsem@tin Synchronized wmf-config/Wikibase-production.php: https://gerrit.wikimedia.org/r/#/c/376317/2 (duration: 00m 48s) [18:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:37] Looks fine, thanks [18:18:39] sjoerddebruin: ^ [18:18:43] <3 [18:18:59] Working as intended now. [18:19:26] !log maxsem@tin Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/376317/2 (duration: 00m 48s) [18:19:32] hoo, ^ [18:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:24] Looks good still [18:21:29] (03PS1) 10Rush: openstack: @resolve for saltcleaningcert rule [puppet] - 10https://gerrit.wikimedia.org/r/376320 (https://phabricator.wikimedia.org/T171494) [18:22:22] (03CR) 10Rush: [C: 032] openstack: @resolve for saltcleaningcert rule [puppet] - 10https://gerrit.wikimedia.org/r/376320 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:22:25] (03PS2) 10MaxSem: labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 [18:22:36] (03CR) 10MaxSem: [C: 032] labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 (owner: 10MaxSem) [18:23:57] (03Merged) 10jenkins-bot: labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 (owner: 10MaxSem) [18:25:22] !log maxsem@tin Synchronized wmf-config/CommonSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/370291/2 (duration: 00m 49s) [18:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:02] (03PS9) 10Dzahn: Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:27:03] (03PS3) 10MaxSem: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 [18:27:32] (03CR) 10MaxSem: [C: 032] Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 (owner: 10MaxSem) [18:28:59] (03Merged) 10jenkins-bot: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 (owner: 10MaxSem) [18:30:04] (03PS1) 10Hashar: package_builder: typo: s/output/result/ directory [puppet] - 10https://gerrit.wikimedia.org/r/376322 [18:30:30] (03CR) 10Hashar: "That was confusing me :]" [puppet] - 10https://gerrit.wikimedia.org/r/376322 (owner: 10Hashar) [18:30:32] (03CR) 10Dzahn: [C: 032] Gerrit: Reveal the author in the title of the email [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:31:50] hoo: what's up?
[18:31:54] I just got here [18:33:45] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/370292/3 (duration: 00m 49s) [18:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:28] Amir1: https://phabricator.wikimedia.org/T174962#3585663 [18:34:30] all solved by now [18:34:36] Oh, thanks [18:35:02] (03PS3) 10MaxSem: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 [18:35:09] (03CR) 10MaxSem: [C: 032] Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 (owner: 10MaxSem) [18:38:36] (03CR) 10Phedenskog: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [18:38:59] (03Merged) 10jenkins-bot: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 (owner: 10MaxSem) [18:42:22] (03PS1) 10MaxSem: Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 [18:43:10] (03CR) 10MaxSem: [C: 032] Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 (owner: 10MaxSem) [18:44:41] (03Merged) 10jenkins-bot: Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 (owner: 10MaxSem) [18:45:10] MaxSem: why revert? [18:45:21] legoktm: php sucks [18:45:24] went boom [18:45:40] ugh [18:45:47] (03PS1) 10MaxSem: Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 [18:45:49] (03CR) 10MaxSem: [C: 032] Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 (owner: 10MaxSem) [18:46:05] will figure out later [18:46:42] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3585913 (10Jalexander) FTR this can get held off for now (or even just closed as rejected). We're transitioning away from Mailman for this list. [18:46:53] (03CR) 10MaxSem: [V: 032 C: 032] Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 (owner: 10MaxSem) [18:48:17] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: revert (duration: 00m 49s) [18:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:47] (03CR) 10Paladox: "Works on 2.14" [puppet] - 10https://gerrit.wikimedia.org/r/356645 (https://phabricator.wikimedia.org/T43608) (owner: 10Paladox) [18:48:50] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:37] (03PS5) 10Paladox: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 [18:50:51] !log maxsem@tin Synchronized wmf-config/: just to make sure... 
(duration: 00m 50s) [18:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] ok, we're done here [18:54:13] * MaxSem hides [18:57:55] (03PS1) 10Rush: openstack: clean up common and nova common packages [puppet] - 10https://gerrit.wikimedia.org/r/376329 (https://phabricator.wikimedia.org/T171494) [18:58:38] (03CR) 10Rush: [C: 032] openstack: clean up common and nova common packages [puppet] - 10https://gerrit.wikimedia.org/r/376329 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T1900). [19:04:00] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:09:50] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10Jdforrester-WMF) >>! In T98813#3365586, @bd808 wrote: >>>! In T98813#3135116, @greg wrote: >> Added T161553 as a subtask per above comments. > > I removed... [19:12:32] (03PS1) 10Rush: openstack: preserve hiera settings for new virt role [puppet] - 10https://gerrit.wikimedia.org/r/376331 (https://phabricator.wikimedia.org/T171494) [19:12:49] Reedy, no_justification: The only remaining blocker to T31902 is T104148 – worth getting done? [19:12:50] T31902: Tidy up wmf-config CommonSettings.php and InitialiseSettings.php - https://phabricator.wikimedia.org/T31902 [19:12:50] T104148: Change Squid references in Wikimedia configuration files - https://phabricator.wikimedia.org/T104148 [19:13:05] 2011 bugs FTW. [19:13:14] (03CR) 10Rush: [C: 032] openstack: preserve hiera settings for new virt role [puppet] - 10https://gerrit.wikimedia.org/r/376331 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:15:10] James_F: Meh. I honestly don't care enough. [19:15:30] * James_F grins. [19:15:49] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3586004 (10Dzahn) ok, thanks @Jalexander ! @eliza told me about it and i was about to set it to stalled for that reason [19:16:00] I mean by all means do it, but I've got 99 problems and legacy references to squid ain't one [19:19:11] (03Restored) 10Dzahn: Cleanup: squid.php → ReverseProxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:19:21] James_F: Seems to be a relatively easy jfdi if someone wants to deploy it [19:19:31] Deploy newfilename.php first [19:19:36] Then CommonSettings [19:19:43] Then sync-dir the lot to get rid of the old [19:20:01] Yeah. but no_justification abandoned the change a few days ago (as it'd sat there untouched for months). 
[19:20:13] he's the worst [19:20:35] (03CR) 10Dzahn: "per IRC today - this is still wanted as it's the last blocker for also closing https://phabricator.wikimedia.org/T31902 entirely - but nee" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:21:22] first thing that needs a manual rebase [19:21:28] which will be fun when it's a year old [19:21:30] At least no_justification gave justification :p [19:22:17] "old" :) [19:22:23] mutante: it might be easiest just to do it from scratch [19:22:28] Rather than faff with a rebase [19:22:48] Unless it'll rebase fine on normal git [19:22:48] mutante: I dropped basically every patch for mw-config that had sat untouched for a year-ish [19:22:50] whereas jgit sucks [19:23:30] Only conflict is commonsettings [19:23:32] In local rebase [19:23:35] yea, right.. they are getting harder the older they get [19:23:46] ah :) [19:23:46] It wasn't even that they're harder to rebase or anything [19:23:53] It's just that they clearly aren't important and nobody cares [19:24:25] Actually. It's probably deceptive [19:24:29] New file, deleted file [19:24:32] It's mostly renames [19:24:34] Yeahhhhh [19:24:38] Should redo by hand [19:24:57] The new files will be the old new files from 2016 [19:25:04] On a rebase [19:25:18] Oh, maybe not [19:25:19] nvm [19:25:20] Misread [19:25:21] Still [19:25:23] W/e [19:25:24] Go ahead [19:25:35] fuck that shit up [19:26:37] you know what is also in there... [19:26:41] string "labs" heh [19:27:00] People love filenames way more than I [19:27:11] * no_justification goes back to not giving any fucks :) [19:28:25] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3586066 (10Gilles) [19:28:51] mutante: Indeed. There's a task for that too. [19:29:05] mutante: Though it's hard-coded as a realm, IIUC, so… [19:29:24] Everyone hates having to make fixes in ops/puppet. ;-) [19:29:27] I had a patch [19:29:31] I abandoned it recently [19:29:32] heh, yea, is it $realm cloud yet ?:) [19:29:37] Reedy: You /always/ have a patch. [19:29:44] ALL OF THE PATCHES [19:29:46] mutante: s/cloud/staging/, surely. [19:30:16] eh.. ok :) [19:30:42] but all the $realm checks should be replaced with Hiera ..afaict [19:31:09] * James_F leaves that to people that know what they're doing. [19:33:41] beta [19:33:45] deployment-prep! 
[19:33:51] ;-) [19:34:15] otherwikisweshouldcareaboutbutdontasmuchasweshould [19:36:16] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586081 (10BBlack) [19:36:24] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586097 (10BBlack) p:05Triage>03High [19:36:46] 10Operations: Fix LVS "sh" shortcomings - https://phabricator.wikimedia.org/T86651#973429 (10BBlack) [19:36:48] 10Operations, 10Pybal, 10Traffic: Implement stateless TCP balancing in our LVS servers - https://phabricator.wikimedia.org/T175203#3586081 (10BBlack) [19:37:51] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10RHo) [19:38:06] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3586116 (10debt) 05Open>03Resolved Thanks! [19:39:37] : Everyone hates having to make fixes in ops/puppet. ;-) [19:39:39] {{cn}} [19:39:46] "Everyone" is awfully broad ;-) [19:40:31] (03PS1) 10Ottomata: Apply kafka::jumbo::broker on new kafka-jumbo100* hosts [puppet] - 10https://gerrit.wikimedia.org/r/376336 (https://phabricator.wikimedia.org/T167992) [19:40:36] (03PS1) 10Dzahn: copy squid.php->reverse-proxy.php, squid-labs->reverse-proxy-staging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) [19:41:05] James_F: Thoughts on T175080 btw? [19:41:06] T175080: Flow fails to load content when running CirrusSearchLinksUpdate jobs - https://phabricator.wikimedia.org/T175080 [19:41:12] (03CR) 10Ottomata: [C: 032] Apply kafka::jumbo::broker on new kafka-jumbo100* hosts [puppet] - 10https://gerrit.wikimedia.org/r/376336 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:41:16] Was super spammy on group0 yesterday. Pretty sure it's not Cirrus' fault ultimately [19:41:32] no_justification: Eurgh. [19:41:48] It's probably harmless. But it was louddddddddd [19:41:52] So I'm holding group1 for the moment. [19:41:57] Hopefully not til COB [19:42:14] no_justification: That sounds like it's caused by the fix for the bug Ariel asked for in the dumps for Flow. [19:42:40] no_justification: Previously Flow was loading content for some call where WikiPage doesn't load content. [19:42:53] ok, https://gerrit.wikimedia.org/r/#/c/376337/1 but i'm not also fixing the _content_ of the files , heh [19:42:57] no_justification: Sounds like Cirrus is for some reason depending on it loading content. [19:43:05] e.g. "# our text squid in beta labs gets forwarded requests [19:43:07] * James_F hunts. [19:43:59] lol, "text squid in beta labs" is actually triple combo or so :) [19:44:57] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10Reedy) Can you get your manager to sign off too? [19:45:01] no_justification: Hmm. No, that was https://phabricator.wikimedia.org/T172025 but it wasn't merged. https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/Flow+is:merged doesn't show anything confusing. [19:45:11] s/confusing/suggestive.
[19:45:28] (03PS16) 10Phedenskog: Make values stackable [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) [19:45:42] (03Abandoned) 10Dzahn: Cleanup: squid.php → ReverseProxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309742 (https://phabricator.wikimedia.org/T104148) (owner: 10Dereckson) [19:45:48] (03PS1) 10Ottomata: Un-apply kafka role -- these should be stretch, not jessie! :/ [puppet] - 10https://gerrit.wikimedia.org/r/376339 (https://phabricator.wikimedia.org/T167992) [19:46:02] (03CR) 10Ottomata: [V: 032 C: 032] Un-apply kafka role -- these should be stretch, not jessie! :/ [puppet] - 10https://gerrit.wikimedia.org/r/376339 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:46:40] PROBLEM - puppet last run on kafka-jumbo1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:47:38] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586162 (10RHo) Sure - adding @Nirzar for sign-off [19:48:50] !log reboot labvirt1018 [19:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:05] (03PS1) 10Ottomata: Install kafka-jumbo as Stretch [puppet] - 10https://gerrit.wikimedia.org/r/376340 (https://phabricator.wikimedia.org/T167992) [19:49:39] (03CR) 10Ottomata: [C: 032] Install kafka-jumbo as Stretch [puppet] - 10https://gerrit.wikimedia.org/r/376340 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [19:50:59] (03CR) 10Jforrester: "After this:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [19:51:21] PROBLEM - puppet last run on kafka-jumbo1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:30] PROBLEM - puppet last run on kafka-jumbo1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:51:30] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 annual plan program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3586179 (10GWicke) [19:53:07] !log reimaging kafka-jumbo* with stretch [19:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:28] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['kafka-jumbo1001... [19:54:29] James_F: Yeah. It's bothersome because it's /definitely/ new to wmf.17 [19:54:56] no_justification: I'm assuming it was there before anomie's fix for the comment thing? [19:55:05] Yeah it was [19:55:06] no_justification: That's the only recent code in Flow. [19:55:09] Hmm. OK. 
[19:55:14] I think, at least [19:55:48] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 annual plan program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3586225 (10GWicke) [19:57:31] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 7 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:58:06] (03PS1) 10Rush: openstack: correction to 376331 [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:58:48] (03PS2) 10Rush: openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:58:56] (03PS3) 10Rush: openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) [19:59:40] (03CR) 10Rush: [C: 032] openstack: move virt settings under role/eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376345 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:00:06] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2000). Please do the needful. [20:00:15] Nothing for ORES. [20:01:12] arlo will do a parsoid deploy [20:02:43] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10Pchelolo) [20:06:52] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586325 (10GWicke) [20:07:41] RECOVERY - puppet last run on kafka-jumbo1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:07:50] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:07:50] RECOVERY - puppet last run on kafka-jumbo1006 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:08:01] RECOVERY - puppet last run on kafka-jumbo1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:11:31] (03PS1) 10Ottomata: Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 [20:11:38] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 (owner: 10Ottomata) [20:11:42] (03PS2) 10Ottomata: Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 [20:11:44] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Un-apply kafka role -- these should be stretch, not jessie! :/" [puppet] - 10https://gerrit.wikimedia.org/r/376347 (owner: 10Ottomata) [20:13:48] (03PS1) 10Ottomata: Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/"" [puppet] - 10https://gerrit.wikimedia.org/r/376350 [20:14:00] (03CR) 10Ottomata: [V: 032 C: 032] "Ah, these are still reimaging... 
:0" [puppet] - 10https://gerrit.wikimedia.org/r/376350 (owner: 10Ottomata) [20:14:17] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586366 (10GWicke) [20:16:47] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3586375 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka-jumbo1001.eqiad.wmnet', 'kafka-jumbo1002.eqiad.wmnet', 'k... [20:16:57] (03PS1) 10Ottomata: Revert "Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/""" [puppet] - 10https://gerrit.wikimedia.org/r/376371 [20:17:17] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Revert "Revert "Un-apply kafka role -- these should be stretch, not jessie! :/""" [puppet] - 10https://gerrit.wikimedia.org/r/376371 (owner: 10Ottomata) [20:18:42] !log arlolra@tin Started deploy [parsoid/deploy@f07ac8c]: Updating Parsoid to f9d367ea [20:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:12] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213#3586379 (10GWicke) [20:21:48] (03PS1) 10Ottomata: Don't use ganglia on new kafka-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/376376 [20:22:41] (03CR) 10Ottomata: [C: 032] Don't use ganglia on new kafka-jumbo hosts [puppet] - 10https://gerrit.wikimedia.org/r/376376 (owner: 10Ottomata) [20:27:10] !log arlolra@tin Finished deploy [parsoid/deploy@f07ac8c]: Updating Parsoid to f9d367ea (duration: 08m 27s) [20:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] (03PS1) 10Ottomata: Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) [20:29:34] (03PS1) 10Hashar: package_builder: test -nt differs in bash vs dash [puppet] - 10https://gerrit.wikimedia.org/r/376378 [20:30:00] (03CR) 10jerkins-bot: [V: 04-1] Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:31:09] (03CR) 10Ottomata: [V: 032 C: 032] Add debug notifies to figure out error message in prod [puppet] - 10https://gerrit.wikimedia.org/r/376377 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:33:50] (03PS1) 10Ottomata: Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) [20:34:14] !log Updated Parsoid to f9d367ea (T169342) [20:34:14] (03PS2) 10Ottomata: Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) [20:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:27] T169342: Gallery output for missing images is not consistent with PHP parser and is missing data - https://phabricator.wikimedia.org/T169342 [20:34:43] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586434 (10Nirzar) Looks good [20:35:05] (03CR) 10Ottomata: [C: 032] Debugging [puppet] - 10https://gerrit.wikimedia.org/r/376379 (https://phabricator.wikimedia.org/T167992) (owner: 
10Ottomata) [20:37:21] (03CR) 10Hashar: "Only happens when buildresult/Packages is missing. dash just skip it :]" [puppet] - 10https://gerrit.wikimedia.org/r/376378 (owner: 10Hashar) [20:39:43] (03PS1) 10Ottomata: Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 [20:40:20] (03CR) 10jerkins-bot: [V: 04-1] Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 (owner: 10Ottomata) [20:40:57] (03CR) 10Ottomata: [V: 032 C: 032] Use fq +profile::kafka::broker::kafka_cluster_name when configuring a broker [puppet] - 10https://gerrit.wikimedia.org/r/376388 (owner: 10Ottomata) [20:43:42] (03PS1) 10Ottomata: /etc/kafka/mirror should require confluent-kafka package [puppet] - 10https://gerrit.wikimedia.org/r/376395 (https://phabricator.wikimedia.org/T376379) [20:44:19] (03CR) 10Ottomata: [C: 032] /etc/kafka/mirror should require confluent-kafka package [puppet] - 10https://gerrit.wikimedia.org/r/376395 (https://phabricator.wikimedia.org/T376379) (owner: 10Ottomata) [20:45:41] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3586455 (10RobH) IIRC the bios is already the newest version. I flashed the bios and the ilom when I installed them. [20:53:31] (03PS1) 10Ottomata: Allow new kafka-jumbo hosts to talk to zookeeper on conf* [puppet] - 10https://gerrit.wikimedia.org/r/376407 (https://phabricator.wikimedia.org/T167992) [20:56:53] (03CR) 10Ottomata: [C: 032] Allow new kafka-jumbo hosts to talk to zookeeper on conf* [puppet] - 10https://gerrit.wikimedia.org/r/376407 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [20:57:13] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [20:57:46] checking [20:57:47] could be related [20:58:43] PROBLEM - nova-api process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [20:58:52] PROBLEM - nova-api http on labnet1002 is CRITICAL: connect to address 10.64.20.25 and port 8774: Connection refused [21:00:04] kaldari, MaxSem, and Niharika: Dear anthropoid, the time has come. Please deploy ArticleCreationWorkflow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2100). [21:00:57] (03PS1) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:01:02] (03PS2) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:01:27] (03Draft1) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:01:28] (03PS2) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:01:47] MatmaRex i finally fixed the issue you described :) [21:02:20] kaldari, what's the battle plan? [21:02:24] MaxSem: kaldari: What do we need to do for ^? [21:02:36] (03PS3) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:02:44] I think he made a patch for it. 
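Hashar's dash-vs-bash note above is reproducible: with file a present and b missing, bash's [ a -nt b ] is true (bash treats "file1 exists, file2 does not" as newer) while dash's is false, so the dash branch is silently skipped. A small sketch, assuming both shells are installed:

    import os
    import subprocess
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "a"), "w").close()  # create a; leave b missing
        for sh in ("bash", "dash"):
            r = subprocess.run(
                [sh, "-c", "[ a -nt b ] && echo true || echo false"],
                cwd=d, capture_output=True, text=True,
            )
            print(sh, r.stdout.strip())  # expected: bash true, dash false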
[21:02:46] (03PS4) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:03:26] (03PS1) 10Ottomata: Add monitoring hostgroup for jumbo_kafka_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376424 [21:03:42] (03CR) 10Ottomata: [V: 032 C: 032] Add monitoring hostgroup for jumbo_kafka_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/376424 (owner: 10Ottomata) [21:04:19] (03CR) 10Rush: [C: 032] openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:04:25] (03PS5) 10Rush: openstack: port hiera settings to new openstack::net role [puppet] - 10https://gerrit.wikimedia.org/r/376420 (https://phabricator.wikimedia.org/T171494) [21:04:31] (03PS3) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:04:41] MaxSem: We're on our own. :P I wanna do this. [21:04:55] dooooo it [21:05:08] MaxSem: Let's see. For beta cluster... [21:05:33] (03CR) 10Paladox: [C: 04-1] "This fixes a security issue described in task. This needs chad +1 and for us to be on 2.14 and to be running change https://gerrit-review." [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) (owner: 10Paladox) [21:08:42] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:09:35] (03PS1) 10Ottomata: Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) [21:10:23] (03CR) 10Ottomata: [C: 032] Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [21:10:29] (03PS2) 10Ottomata: Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) [21:10:31] (03CR) 10Ottomata: [V: 032 C: 032] Add kafka rack (row) awareness configs [puppet] - 10https://gerrit.wikimedia.org/r/376428 (https://phabricator.wikimedia.org/T167992) (owner: 10Ottomata) [21:10:54] (03PS1) 10Rush: openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) [21:11:33] (03PS2) 10Rush: openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) [21:12:20] (03CR) 10Rush: [C: 032] openstack: correct key paths for nova/network hiera [puppet] - 10https://gerrit.wikimedia.org/r/376429 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:13:07] (03PS4) 10Paladox: Gerrit: Convert its base templates to soy (closure template) [puppet] - 10https://gerrit.wikimedia.org/r/376406 (https://phabricator.wikimedia.org/T140366) [21:16:34] (03PS1) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:16:41] MaxSem: ^ [21:17:18] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [21:17:23] MaxSem: Some extensions, like LoginNotify and ULS, don't have a wfLoadExtension in there. When do I need to add it and when not?
[21:17:57] !log rebooting kafka-jumbo1004 [21:18:05] (03CR) 10jerkins-bot: [V: 04-1] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:20] Niharika, LN is loaded from the main CS.php [21:19:59] (03PS2) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:20:16] MaxSem: So for beta cluster we load both prod and labs CS files? That doesn't make sense. [21:20:18] (03CR) 10MaxSem: Configure ACW for Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:20:23] (03PS1) 10Rush: openstack: pass in network_public_ip for nova network [puppet] - 10https://gerrit.wikimedia.org/r/376433 (https://phabricator.wikimedia.org/T171494) [21:20:45] makes perfect sense: labs is a copy of prod [21:21:41] (03CR) 10Rush: [C: 032] openstack: pass in network_public_ip for nova network [puppet] - 10https://gerrit.wikimedia.org/r/376433 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:23:00] (03PS3) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:23:50] (03PS1) 10Ottomata: Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 [21:24:18] (03PS2) 10Ottomata: Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 [21:24:36] (03CR) 10MaxSem: [C: 04-1] Configure ACW for Beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:25:17] (03CR) 10Ottomata: [C: 032] Enable topic deletion for kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/376436 (owner: 10Ottomata) [21:25:34] MaxSem: Then how do I exclude multiple rights? :| (And why the heck is this still not documented) [21:26:24] Where's the config.txt? [21:26:30] what do you mean by not documented? https://github.com/wikimedia/mediawiki-extensions-ArticleCreationWorkflow/blob/master/doc/config.txt [21:27:10] MaxSem: Doh, didn't see that. We can't exclude multiple rights? Okay. ¯\_(ツ)_/¯ [21:27:37] (03PS4) 10Niharika29: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) [21:33:04] MaxSem: ^ [21:33:35] (03CR) 10MaxSem: [C: 031] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:33:50] MaxSem: Who's gonna +2? :P [21:34:10] you said you wanted to do it yourself? [21:34:30] MaxSem: You can +2? [21:34:40] (03CR) 10MaxSem: [C: 032] Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:34:44] pfft:P [21:34:59] self merge ftw [21:35:02] test in production ftw [21:35:11] :P [21:35:34] bonus points for enwiki [21:35:46] MaxSem wins today [21:36:08] hope you guys don't mind if I do a Node.js service deploy, too. Should not affect any of the MW stuff you are doing. 
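[Editor's note: the exchange above turns on how the cluster's MediaWiki configuration is layered. The beta cluster loads the production config files first and then the "-labs" variants on top ("labs is a copy of prod"), so an extension such as LoginNotify, already loaded from the main CommonSettings.php, needs no second wfLoadExtension call in CommonSettings-labs.php; only beta-specific settings belong in the labs files. A minimal PHP sketch of that layering follows. The file names are real, but the contents and the realm check are paraphrased, and the ArticleCreationWorkflow line is illustrative rather than the actual patch.]

    <?php
    // CommonSettings.php (production) -- heavily simplified sketch.
    // Anything loaded here is in effect on the beta cluster as well.
    wfLoadExtension( 'LoginNotify' );

    // Near the end of the production file, beta layers its overrides on
    // top; production hosts skip this (realm check paraphrased):
    if ( $wmfRealm === 'labs' ) {
        require __DIR__ . '/CommonSettings-labs.php';
    }

    // CommonSettings-labs.php -- beta cluster only. Holds just the
    // beta-specific additions, e.g. an extension being trialled on beta
    // ahead of any production rollout (illustrative):
    //     wfLoadExtension( 'ArticleCreationWorkflow' );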
[21:36:10] (03Merged) 10jenkins-bot: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29) [21:37:16] MaxSem: Steps same as for prod and a full scap, right? [21:37:24] o_0 [21:37:27] whyyyyyy? [21:37:31] Niharika: just sync-file [21:37:41] Ah okay. It's already there. [21:37:42] so tin/noc is up to date [21:37:50] Gotcha. [21:39:29] !log niharika29@tin Synchronized wmf-config/CommonSettings-labs.php: Config for ArticleCreationWorkflow T175054 (duration: 00m 50s) [21:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:42] T175054: Test ArticleCreationWorkflow on the Beta Cluster - https://phabricator.wikimedia.org/T175054 [21:40:26] "21:39:19 Check 'Logstash Error rate for mw1265.eqiad.wmnet' failed: ERROR: 7% OVER_THRESHOLD (Avg. Error rate: Before: 0.37, After: 4.00, Threshold: 3.71)" [21:47:33] (03PS1) 10Rush: openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) [21:48:29] (03CR) 10Andrew Bogott: [C: 031] openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:49:00] (03PS2) 10Rush: openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) [21:50:21] (03CR) 10Rush: [C: 032] openstack: bare_metal is a hash and set main accordingly [puppet] - 10https://gerrit.wikimedia.org/r/376437 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:55:50] (03PS1) 10Rush: openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) [21:56:09] (03PS2) 10Rush: openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) [21:57:02] (03CR) 10Rush: [C: 032] openstack: add nova-fullstack upstart template to openstack2 [puppet] - 10https://gerrit.wikimedia.org/r/376439 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:59:42] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:02:49] Hi! [22:03:10] channel: does anyone know how long it takes for cloaks to be assigned?
I opened my request around 4 weeks ago [22:03:35] wrong ops :) [22:03:42] see #wikimedia-ops [22:03:44] :) [22:03:55] hehe [22:04:06] greg-g: thanks [22:04:12] (03PS1) 10Rush: openstack: preserve limiting SSH listen IP for nova network host [puppet] - 10https://gerrit.wikimedia.org/r/376442 (https://phabricator.wikimedia.org/T171494) [22:04:29] np [22:05:27] (03CR) 10Rush: [C: 032] openstack: preserve limiting SSH listen IP for nova network host [puppet] - 10https://gerrit.wikimedia.org/r/376442 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:06:32] !log bsitzmann@tin Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) [22:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:50] T164033: Test size of "Reading stripped" HTML vs non-stripped HTML - https://phabricator.wikimedia.org/T164033 [22:06:51] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848 [22:06:51] T169277: Investigate missing page in specific "On this day" event - https://phabricator.wikimedia.org/T169277 [22:06:51] T169274: Expand "On this day" endpoint language support - https://phabricator.wikimedia.org/T169274 [22:06:51] T167921: Support Lazy loading of page content not needed for first paint - https://phabricator.wikimedia.org/T167921 [22:06:51] T174698: Parenthetical stripping is too aggressive - https://phabricator.wikimedia.org/T174698 [22:06:52] T174808: Add swagger spec for summary endpoint - https://phabricator.wikimedia.org/T174808 [22:06:52] T162179: Extract HTML Compatibility Layer from MCS Mobile Sections API - https://phabricator.wikimedia.org/T162179 [22:11:25] !log bsitzmann@tin Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s) [22:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:05] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3523998 (10Dzahn) I read that item 11 and noticed the very end of it "//**Another option would be to use a faster plugin (i.e. check_fping) as the host_check_command instead of check_ping.**//". How about that on...
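[Editor's note: the canary check quoted at 21:40 above is worth unpacking, since it resurfaces during the evening SWAT. Its numbers are self-consistent if the failure threshold is ten times the pre-sync average error rate (10 × 0.371 ≈ 3.71) and the reported percentage is how far the post-sync rate exceeds that threshold. The PHP below is an editorial reconstruction of that arithmetic, not scap's actual implementation.]

    <?php
    // Reconstruction of the Logstash canary error-rate check quoted at
    // 21:40. Assumed rule (not taken from scap source): fail when the
    // post-sync rate exceeds ten times the pre-sync rate.
    function canaryCheck( $before, $after, $factor = 10.0 ) {
        $threshold = $before * $factor;
        if ( $after <= $threshold ) {
            return 'OK';
        }
        $overPct = ( $after - $threshold ) / $threshold * 100;
        return sprintf(
            'ERROR: %d%% OVER_THRESHOLD (Avg. Error rate: Before: %.2f, After: %.2f, Threshold: %.2f)',
            (int)$overPct, $before, $after, $threshold
        );
    }

    // Reproduces the quoted line: 4.00 is about 7.8% above 3.71,
    // truncated to 7% in the report.
    echo canaryCheck( 0.371, 4.00 ), "\n";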
[22:17:49] (03CR) 10Chad: [V: 032 C: 032] Use keyholder_key in scap/scap.cfg [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 (owner: 10Paladox) [22:18:04] (03PS1) 10MaxSem: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 [22:18:16] !log demon@tin Started deploy [gerrit/gerrit@d4f9a77]: (no justification provided) [22:18:24] !log demon@tin Finished deploy [gerrit/gerrit@d4f9a77]: (no justification provided) (duration: 00m 07s) [22:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:52] !log prior deploy was no-op [22:18:59] (03CR) 10jerkins-bot: [V: 04-1] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:05] (03CR) 10MaxSem: [C: 032] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:20:03] (03CR) 10jerkins-bot: [V: 04-1] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:20:13] 10Operations, 10Analytics, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3586807 (10Dzahn) [22:20:41] (03CR) 10Chad: "Ah, forgot to read the task ;-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [22:20:49] (03PS2) 10MaxSem: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 [22:21:18] (03CR) 10MaxSem: [C: 032] Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:22:16] (03CR) 10Dzahn: "i'm sure mail to the old address will be forwarded for month :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372824 (https://phabricator.wikimedia.org/T173684) (owner: 10Urbanecm) [22:22:46] (03Merged) 10jenkins-bot: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem) [22:24:47] (03CR) 10Chad: [C: 031] Gerrit: Set auth.userNameToLowerCase [puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [22:24:59] (03CR) 10Dzahn: [C: 032] contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:26:07] (03CR) 10Greg Grossmeier: "I don't think this is needed any more? Moritz was going to put the trusty php5.5 package in the production apt repo for us (see ops list t" [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:27:47] (03CR) 10Dzahn: "oh, ok, thanks Greg! I guess the dependency https://gerrit.wikimedia.org/r/#/c/374837/ might still be wanted though.. will see list" [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [22:29:34] (03CR) 10Dzahn: "would this change be desired even if PHP5.5 packages are added to apt.wikimedia.org? to generally support https.. or would we stop using a" [puppet] - 10https://gerrit.wikimedia.org/r/374837 (owner: 10Hashar) [22:30:45] no_justification: does your +1 on https://gerrit.wikimedia.org/r/#/c/368196/ also mean "anytime" or "now" ? :) [22:31:02] or better with maintenance.. [22:31:22] i think i remember "PITA to revert" [22:31:58] Just a general +1. 
I would say today/now but I've had a cruddy day and I've got a headache [22:32:31] ok, earlier in the day it is then [22:32:56] Eh, not so much that it's late it's just /today/ has been terribly shitty [22:33:33] Murphy has it out for me today [22:33:52] ugh, sure! get better soon. [22:34:55] (03CR) 10Dzahn: [C: 032] Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:35:11] (03PS6) 10Dzahn: Gerrit: Set base url for commitlink [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:35:14] doing the one that can land anytime :) [22:35:21] Yeah [22:38:05] * legoktm hugs no_justification [22:39:56] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10Dzahn) Hi @RHo please create a new SSH key (that isn't the same as one used for labs or something else before) and attach it to this ticket. Here is how to... [22:40:57] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/375922 (owner: 10Paladox) [22:40:57] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586878 (10Dzahn) a:03ema [22:41:05] (03CR) 10Paladox: "Thanks." [software/gerrit] - 10https://gerrit.wikimedia.org/r/376264 (owner: 10Paladox) [22:42:01] 10Operations, 10Icinga, 10monitoring: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3586880 (10Dzahn) [22:44:33] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3586884 (10Dzahn) 05Open>03stalled [22:44:55] 10Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3015614 (10Dzahn) p:05Normal>03Low [22:46:39] 10Operations, 10Phabricator, 10Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3586887 (10Dzahn) @paladox is this resolved now? did i fix it? [22:46:57] 10Operations, 10Phabricator, 10Patch-For-Review: Only allow Phabricator weekly project changes cron job on production, not labs - https://phabricator.wikimedia.org/T173297#3586888 (10Paladox) 05Open>03Resolved @Dzahn yep :). [22:50:11] (03CR) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [22:58:27] (03PS14) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:58:36] (03PS15) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:58:43] (03PS14) 10Paladox: Gerrit: Upgrading gerrit to 2.14.4-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170906T2300). Please do the needful. [23:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:27] I can swat [23:01:02] just me today :) [23:01:18] should be easy enough.
Probably ship the config patch first but should work in either order [23:02:39] (03PS16) 10Paladox: gerrit: Ajust scap files (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [23:03:01] (03CR) 1020after4: [C: 032] Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:03:11] (03PS15) 10Paladox: Gerrit: Upgrading gerrit to 2.14.4-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [23:03:34] (03PS2) 10Krinkle: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) [23:04:24] (03PS4) 1020after4: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson) [23:05:14] ebernhardson: the config change had a merge conflict, rebased but I'm not sure if I need to wait for jenkins or just v+2 it https://gerrit.wikimedia.org/r/#/c/374655/ [23:05:31] doesn't look like jenkins is gonna pick it up [23:06:23] fun! removing +2 and re-adding it should kick jenkins into gear [23:09:38] syncing config change [23:10:12] !log twentyafterfour@tin Synchronized wmf-config/: deploy config for CirrusSearch human relevancy survey. Change-Id: I272c69e5a3bb6e833fca59282142d6b237fd9e60 Bug: T174106 (duration: 00m 52s) [23:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:27] T174106: Search Relevance Survey test #3: action items - https://phabricator.wikimedia.org/T174106 [23:10:39] uhm, error rate for mw1277 96% over threshold? wtf [23:11:15] twentyafterfour: looks like it wanted to sync the files one at a time [23:11:48] uhm no [23:12:07] misspelled? [23:12:27] !log twentyafterfour@tin Synchronized wmf-config/: roll-back due to huge error rate spike (duration: 00m 51s) [23:12:37] wmgWMESearchRelevancePages .... [23:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:40] maybe it was just a race [23:12:43] hmm [23:12:58] twentyafterfour: i think so, just double checked and it's spelled the same in InitialiseSettings.php and CirrusSearch-common.php [23:12:58] I don't get why the canary failed but it went ahead with the deploy [23:13:03] that's a bug in scap for sure [23:13:16] It also allowed two parallel scaps earlier this week. [23:13:24] Did something change in scap's checks recently? [23:13:25] :-o [23:13:42] James_F: I'm not sure, there was a new scap release recently [23:13:59] twentyafterfour: Hmm. I'm suspicious. [23:14:03] ebernhardson: I'll try again, this time syncing just one file at a time [23:14:16] James_F: me too, I'll look into that after swat [23:15:20] twentyafterfour: yes, it's not atomic [23:15:29] twentyafterfour: sure. Should be the new file first, then initialise settings, then cirrussearch-common [23:15:29] so if you look at the error message on fatalmonitor [23:15:36] you'll see some file wanted a variable that isn't there yet [23:15:51] i.e. it synced the file using the variable BEFORE the file providing the variable [23:15:55] I really want to change the way that stuff is initialized [23:15:57] that's why order matters [23:16:08] Dereckson: yeah... [23:16:17] gotta love globals [23:17:00] some day we could set up repo auth mode and get true atomic deploys. but so much work to get mediawiki into a place where that would work ...
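[Editor's note: the diagnosis above deserves a concrete illustration. wmf-config is a set of plain PHP files synced to the web servers one at a time, so a server can briefly hold a new InitialiseSettings.php that includes a file it has not received yet (the "File not found" warning quoted at 23:18 below) while a consumer file reads a global that was never defined. Hence the safe order worked out in the log: the brand-new file first, then the file that includes it, then the file that consumes the new setting. The sketch uses file and variable names from the log with illustrative contents; the guarded include at the end is one defensive pattern, not what wmf-config actually did.]

    <?php
    // InitialiseSettings.php (new revision), near line 19417. If this
    // revision reaches a server before CirrusSearch-rel-survey.php does,
    // PHP warns "File not found" and the global stays unset:
    include __DIR__ . '/CirrusSearch-rel-survey.php'; // defines $wmgWMESearchRelevancePages

    // CirrusSearch-common.php -- synced last because it reads the new
    // global (variable name from the log, assignment illustrative):
    $surveyPages = isset( $wmgWMESearchRelevancePages )
        ? $wmgWMESearchRelevancePages
        : array();

    // Safe file-by-file sync order, as worked out below:
    //   1. CirrusSearch-rel-survey.php  (the brand-new file)
    //   2. InitialiseSettings.php       (the file that includes it)
    //   3. CirrusSearch-common.php      (the file that consumes it)

    // Defensive variant (illustrative): tolerate the half-synced window
    // instead of emitting a warning:
    if ( is_readable( __DIR__ . '/CirrusSearch-rel-survey.php' ) ) {
        require __DIR__ . '/CirrusSearch-rel-survey.php';
    }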
[23:17:08] !log twentyafterfour@tin scap failed: average error rate on 9/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [23:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:26] still no good even just syncing the initializesettings file [23:18:00] twentyafterfour: did you sync the new file first? i'm seeing Warning: include(/srv/mediawiki/wmf-config/CirrusSearch-rel-survey.php): File not found in /srv/mediawiki/wmf-config/InitialiseSettings.php on line 19417 [23:18:48] why is IS including other files? o_0 [23:18:56] Reedy: because that file has 18k lines [23:19:35] (03PS3) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:19:58] !log twentyafterfour@tin Synchronized wmf-config/CirrusSearch-rel-survey.php: sync the added file first (duration: 00m 49s) [23:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:07] sorry these are rookie mistakes. I don't do config changes often enough (I already learned these lessons... shame on me) [23:21:19] !log twentyafterfour@tin Synchronized wmf-config/InitialiseSettings.php: Now sync initializesettings (duration: 00m 49s) [23:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:11] !log twentyafterfour@tin Synchronized wmf-config/CirrusSearch-common.php: CirrusSearch-common.php goes last (duration: 00m 48s) [23:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:51] ebernhardson: ok config change sync'd without blowing up. Now I'll sync the extension, wmf.17 first [23:28:26] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/: Sync Change-Id: I7ae522155e67610d25b5857d7b3918559bce8bc7 to wmf.17 refs T174387 (duration: 00m 49s) [23:28:37] ebernhardson: can you confirm that it's deployed and working as expected? Or do I need to sync to wmf.16 first? [23:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:39] T174387: relevance survey: develop backend infrastructure to support lots of queries and lots of results per query - https://phabricator.wikimedia.org/T174387 [23:29:42] twentyafterfour: would need to at least pull wmf.16 to mwdebug1001, as it's only configured on enwiki [23:29:52] twentyafterfour: i suppose if i had thought of it i could have configured a page or two on testwiki [23:31:47] (03CR) 10Aaron Schulz: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac) [23:31:57] (fwiw it will eventually be used on other languages, but this is still a test deployment) [23:32:57] (03PS4) 10GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) [23:33:48] ebernhardson: I ran scap pull on mwdebug1001 [23:37:28] ebernhardson: everything look ok? [23:38:41] twentyafterfour: yup it looks sane [23:39:16] ok thanks!
syncing to all webservers [23:41:01] !log twentyafterfour@tin Synchronized php-1.30.0-wmf.16/extensions/WikimediaEvents/: Sync Change-Id: I7ae522155e67610d25b5857d7b3918559bce8bc7 to all webservers refs T174387 (duration: 00m 49s) [23:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:14] T174387: relevance survey: develop backend infrastructure to support lots of queries and lots of results per query - https://phabricator.wikimedia.org/T174387 [23:42:27] sweet, thanks! [23:42:51] :) [23:47:30] (03CR) 10Dzahn: [C: 032] jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:48:47] (03CR) 10Dzahn: [C: 032] webperf: Decom webperf::ve service [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:49:58] (03PS2) 10Dzahn: webperf: Decom webperf::ve service [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:50:37] 10Operations: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3587007 (10diego) [23:51:15] (03CR) 10Dzahn: [C: 032] "doing , per https://phabricator.wikimedia.org/T175083#3582325" [puppet] - 10https://gerrit.wikimedia.org/r/376146 (https://phabricator.wikimedia.org/T175083) (owner: 10Krinkle) [23:52:32] (03PS4) 10Dzahn: jsbench: Prep osmium for decom and remove 've' and 'jsbench' roles [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:52:36] 10Operations, 10Traffic: Lower geodns TTLs from 600 to 300 - https://phabricator.wikimedia.org/T140365#2462333 (10herron) It would also be good to know that a single server will handle the increased load when degraded. Adjusting the TTL before adding redundancy/capacity may be advantageous in that it could hi... [23:54:00] (03CR) 10Dzahn: "Krinkle: note this means you are losing shell access with this merge, let us know if you still need to save any data from osmium" [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle) [23:54:38] PROBLEM - jmxtrans on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:38] PROBLEM - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:54:48] PROBLEM - jmxtrans on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:48] PROBLEM - jmxtrans on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:54:59] PROBLEM - jmxtrans on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:55:08] PROBLEM - jmxtrans on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:55:08] PROBLEM - salt-minion processes on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:55:09] PROBLEM - jmxtrans on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar [23:56:47] looking! 
[23:57:20] ottomata: i ran puppet on jumbo1001 - it is the failed puppet service [23:57:26] but it shouldn't run as a service [23:57:34] ? [23:57:56] well, it said systemd status is degraded [23:58:04] and the failed unit is puppet.service [23:58:16] but that seems maybe unrelated to the java procs above [23:58:24] huh [23:59:32] mutante: hmm yeah [23:59:36] ok, i'm going to ack and look tomorrow [23:59:52] it can't be related to the VE service, right :) [23:59:57] ottomata: ok