[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T0000).
[00:00:05] RoanKattouw and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:08:53] Sorry for the delay, I will SWAT
[00:10:28] (PS3) Catrope: Enable ORES filters on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414886 (https://phabricator.wikimedia.org/T130279)
[00:10:32] (CR) Catrope: [C: 2] Enable ORES filters on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414886 (https://phabricator.wikimedia.org/T130279) (owner: Catrope)
[00:11:01] (CR) Catrope: [C: 2] beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:11:14] (PS2) Catrope: beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:11:16] (CR) Catrope: beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:11:18] (CR) Catrope: [C: 2] beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:11:27] (PS2) Catrope: beta: remove $wgSecureLogin [mediawiki-config] - https://gerrit.wikimedia.org/r/414010 (owner: MaxSem)
[00:11:31] (CR) Catrope: [C: 2] beta: remove $wgSecureLogin [mediawiki-config] - https://gerrit.wikimedia.org/r/414010 (owner: MaxSem)
[00:11:53] (Merged) jenkins-bot: Enable ORES filters on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414886
(https://phabricator.wikimedia.org/T130279) (owner: Catrope)
[00:11:56] (PS2) Catrope: beta: remove $wgStructuredChangeFiltersShowPreference [mediawiki-config] - https://gerrit.wikimedia.org/r/414011 (owner: MaxSem)
[00:11:58] (CR) Catrope: [C: 2] beta: remove $wgStructuredChangeFiltersShowPreference [mediawiki-config] - https://gerrit.wikimedia.org/r/414011 (owner: MaxSem)
[00:12:08] (CR) jenkins-bot: Enable ORES filters on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414886 (https://phabricator.wikimedia.org/T130279) (owner: Catrope)
[00:12:12] (PS2) Catrope: beta: remove $wmgUseTimeless [mediawiki-config] - https://gerrit.wikimedia.org/r/414012 (owner: MaxSem)
[00:12:15] (CR) Catrope: [C: 2] beta: remove $wmgUseTimeless [mediawiki-config] - https://gerrit.wikimedia.org/r/414012 (owner: MaxSem)
[00:12:25] (Merged) jenkins-bot: beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:13:00] (Merged) jenkins-bot: beta: remove $wgSecureLogin [mediawiki-config] - https://gerrit.wikimedia.org/r/414010 (owner: MaxSem)
[00:13:11] (Merged) jenkins-bot: beta: remove $wgStructuredChangeFiltersShowPreference [mediawiki-config] - https://gerrit.wikimedia.org/r/414011 (owner: MaxSem)
[00:13:37] (Merged) jenkins-bot: beta: remove $wmgUseTimeless [mediawiki-config] - https://gerrit.wikimedia.org/r/414012 (owner: MaxSem)
[00:16:45] (CR) jenkins-bot: beta: remove $wgFragmentMode, matches prod now [mediawiki-config] - https://gerrit.wikimedia.org/r/414009 (owner: MaxSem)
[00:16:49] (CR) jenkins-bot: beta: remove $wgSecureLogin [mediawiki-config] - https://gerrit.wikimedia.org/r/414010 (owner: MaxSem)
[00:16:54] (CR) jenkins-bot: beta: remove $wgStructuredChangeFiltersShowPreference [mediawiki-config] - https://gerrit.wikimedia.org/r/414011 (owner: MaxSem)
[00:16:59] (CR) jenkins-bot: beta:
remove $wmgUseTimeless [mediawiki-config] - https://gerrit.wikimedia.org/r/414012 (owner: MaxSem)
[00:17:12] (PS2) Catrope: Enable ORES filters on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/414889 (https://phabricator.wikimedia.org/T145394)
[00:17:15] (CR) Catrope: [C: 2] Enable ORES filters on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/414889 (https://phabricator.wikimedia.org/T145394) (owner: Catrope)
[00:17:58] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES filters on eswiki (T130279) (duration: 01m 14s)
[00:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:12] T130279: Deploy ORES filters to Spanish Wikipedia - https://phabricator.wikimedia.org/T130279
[00:18:25] (Merged) jenkins-bot: Enable ORES filters on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/414889 (https://phabricator.wikimedia.org/T145394) (owner: Catrope)
[00:18:54] (PS2) Catrope: beta: remove $wmgUse3d [mediawiki-config] - https://gerrit.wikimedia.org/r/414013 (owner: MaxSem)
[00:18:58] (CR) Catrope: [C: 2] beta: remove $wmgUse3d [mediawiki-config] - https://gerrit.wikimedia.org/r/414013 (owner: MaxSem)
[00:20:13] (Merged) jenkins-bot: beta: remove $wmgUse3d [mediawiki-config] - https://gerrit.wikimedia.org/r/414013 (owner: MaxSem)
[00:23:33] (CR) jenkins-bot: Enable ORES filters on eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/414889 (https://phabricator.wikimedia.org/T145394) (owner: Catrope)
[00:27:42] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable ORES filters on eswikibooks (T145394) (duration: 01m 13s)
[00:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:56] T145394: Deploy ORES filters in es.wikibooks - https://phabricator.wikimedia.org/T145394
[00:30:51] (Merged) jenkins-bot: Remove $wgUsejQueryThree
[mediawiki-config] - https://gerrit.wikimedia.org/r/414014 (owner: MaxSem)
[00:36:25] (PS2) Catrope: Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:36:27] (CR) Catrope: [C: 2] Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:36:29] (CR) jerkins-bot: [V: -1] Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:36:38] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Remove $wgUsejQueryThree (duration: 01m 14s)
[00:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:21] (CR) jerkins-bot: [V: -1] Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:37:38] rebasing
[00:39:32] (PS3) Catrope: Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:40:30] (CR) Catrope: [C: 2] Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:40:40] (PS3) Catrope: beta: remove $wmgMinervaNeue [mediawiki-config] - https://gerrit.wikimedia.org/r/414016 (owner: MaxSem)
[00:40:43] (CR) Catrope: beta: remove $wmgMinervaNeue [mediawiki-config] - https://gerrit.wikimedia.org/r/414016 (owner: MaxSem)
[00:40:47] (CR) Catrope: [C: 2] beta: remove $wmgMinervaNeue [mediawiki-config] - https://gerrit.wikimedia.org/r/414016 (owner: MaxSem)
[00:41:03] (PS3) Catrope: beta: remove $wgReadingListsCentralWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414017 (owner: MaxSem)
[00:41:05] (CR) Catrope: [C: 2] beta: remove $wgReadingListsCentralWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414017 (owner: MaxSem)
[00:41:56] (PS3) Catrope: beta: remove $wmgUseReadingLists
[mediawiki-config] - https://gerrit.wikimedia.org/r/414018 (owner: MaxSem)
[00:41:58] (CR) Catrope: [C: 2] beta: remove $wmgUseReadingLists [mediawiki-config] - https://gerrit.wikimedia.org/r/414018 (owner: MaxSem)
[00:42:00] (Merged) jenkins-bot: Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[00:43:04] (Merged) jenkins-bot: beta: remove $wmgMinervaNeue [mediawiki-config] - https://gerrit.wikimedia.org/r/414016 (owner: MaxSem)
[00:43:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Clean up $wgEchoPerUserBlacklist setting (duration: 01m 14s)
[00:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:41] PROBLEM - puppet last run on nitrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:59:41] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T0100).
[01:00:04] No GERRIT patches in the queue for this window AFAICS.
[01:00:09] Puppetdb
[01:00:11] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:00:13] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:21] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:31] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:01:31] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[01:02:01] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:01] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:22] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:22] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:42] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:21] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:21] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:41] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:01] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:31] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:32] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[01:09:13] Operations, Discovery-Wikidata-Query-Service-Sprint, User-Smalyshev: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252#4012542 (Smalyshev)
[01:12:00] (PS1) Smalyshev: Enable Kafka poller for wdqs2001 [puppet] - https://gerrit.wikimedia.org/r/415493 (https://phabricator.wikimedia.org/T188252)
[01:28:21] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[01:28:21] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:28:41] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[01:29:01] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[01:29:31] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[01:29:41] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:29:42] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:29:42] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:30:12] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:30:12] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:31:22] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:31:31] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:31:31] RECOVERY -
puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:32:01] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:32:06] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:32:22] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:32:22] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:32:41] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:47:21] (PS2) Chad: Undeploy EmailAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/410081
[01:47:23] (CR) Chad: [C: 2] Undeploy EmailAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/410081 (owner: Chad)
[01:48:22] (Merged) jenkins-bot: Undeploy EmailAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/410081 (owner: Chad)
[01:49:54] !log demon@tin Synchronized wmf-config/: Undeploying EmailAuth from beta, no-op (duration: 01m 16s)
[01:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:24] (PS1) Chad: Remove extension-list-wikitech, it's pointless [mediawiki-config] - https://gerrit.wikimedia.org/r/415498
[01:59:46] (CR) Chad: [C: 2] Remove extension-list-wikitech, it's pointless [mediawiki-config] - https://gerrit.wikimedia.org/r/415498 (owner: Chad)
[02:00:31] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.46 seconds
[02:00:42] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.37 seconds
[02:00:51] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.81 seconds
[02:00:51] PROBLEM -
MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.18 seconds
[02:00:52] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.02 seconds
[02:01:01] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.26 seconds
[02:01:11] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.58 seconds
[02:02:53] (Merged) jenkins-bot: Remove extension-list-wikitech, it's pointless [mediawiki-config] - https://gerrit.wikimedia.org/r/415498 (owner: Chad)
[02:03:20] !log demon@tin Synchronized docroot/noc/: cleanup extension-list-wikitech removal (duration: 01m 12s)
[02:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:31] !log demon@tin Synchronized wmf-config/: removing extension-list-wikitech (duration: 01m 13s)
[02:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:58] (CR) jenkins-bot: beta: remove $wmgUse3d [mediawiki-config] - https://gerrit.wikimedia.org/r/414013 (owner: MaxSem)
[02:10:00] (CR) jenkins-bot: Remove $wgUsejQueryThree [mediawiki-config] - https://gerrit.wikimedia.org/r/414014 (owner: MaxSem)
[02:10:02] (CR) jenkins-bot: Clean up $wgEchoPerUserBlacklist [mediawiki-config] - https://gerrit.wikimedia.org/r/414015 (owner: MaxSem)
[02:10:04] (CR) jenkins-bot: beta: remove $wmgMinervaNeue [mediawiki-config] - https://gerrit.wikimedia.org/r/414016 (owner: MaxSem)
[02:10:06] (CR) jenkins-bot: Undeploy EmailAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/410081 (owner: Chad)
[02:10:08] (CR) jenkins-bot: Remove extension-list-wikitech, it's pointless [mediawiki-config] - https://gerrit.wikimedia.org/r/415498 (owner: Chad)
[02:11:33] MaxSem: <3
[02:11:40] I'm on a quest to reduce that delta
[02:20:03] (CR) Jforrester: [C: 1] "Removing open C+2s on config
just for safety." [mediawiki-config] - https://gerrit.wikimedia.org/r/414018 (owner: MaxSem)
[02:20:21] (CR) Jforrester: [C: 1] "Removing open C+2s on config just for safety." [mediawiki-config] - https://gerrit.wikimedia.org/r/414017 (owner: MaxSem)
[02:20:23] (PS4) Jforrester: beta: remove $wgReadingListsCentralWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/414017 (owner: MaxSem)
[02:20:25] (PS4) Jforrester: beta: remove $wmgUseReadingLists [mediawiki-config] - https://gerrit.wikimedia.org/r/414018 (owner: MaxSem)
[02:28:03] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.22) (duration: 06m 23s)
[02:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:09] (PS1) Dzahn: tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503
[02:58:34] (CR) jerkins-bot: [V: -1] tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503 (owner: Dzahn)
[03:07:44] (PS1) Ayounsi: LibreNMS: Add Transport interfaces to list of main interfaces [puppet] - https://gerrit.wikimedia.org/r/415504
[03:08:32] (PS2) Dzahn: tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503
[03:08:57] (CR) jerkins-bot: [V: -1] tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503 (owner: Dzahn)
[03:09:39] (PS3) Dzahn: tendril: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415503
[03:10:03] (CR) Ayounsi: [C: 2] LibreNMS: Add Transport interfaces to list of main interfaces [puppet] - https://gerrit.wikimedia.org/r/415504 (owner: Ayounsi)
[03:28:31] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 658.00 seconds
[03:30:51] (PS1) KartikMistry: Enable Compact Language Links by default in Beta Wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/415506
[03:33:21] RECOVERY - MariaDB Slave Lag: s4 on
db2051 is OK: OK slave_sql_lag Replication lag: 22.91 seconds
[03:33:41] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:33:51] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:33:52] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.40 seconds
[03:34:01] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 0.38 seconds
[03:34:01] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:34:02] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[03:36:15] (PS1) Dzahn: mediawiki_singlenode: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415508
[03:37:39] (PS1) Dzahn: xenon: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415509
[03:42:50] (PS1) Dzahn: simplelamp: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415510
[03:43:16] (CR) jerkins-bot: [V: -1] simplelamp: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415510 (owner: Dzahn)
[03:47:08] (PS1) Dzahn: tendril: add support for stretch/php7 [puppet] - https://gerrit.wikimedia.org/r/415511
[03:47:53] (PS2) Dzahn: simplelamp: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415510
[03:50:13] Operations, LuaSandbox: Build and deploy hhvm-luasandbox 3.0.1 to Wikimedia wikis - https://phabricator.wikimedia.org/T187673#4012767 (Legoktm)
[03:55:32] (PS1) Dzahn: varnish: add director for transparency-private [puppet] - https://gerrit.wikimedia.org/r/415512 (https://phabricator.wikimedia.org/T188362)
[04:08:17] Operations, DNS, Traffic, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4012780 (Prtksxna) >>!
In T188362#4010716, @Dzahn wrote: > Access is still denied to me on T175445 which is surprising because...
[04:26:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.75 seconds
[06:02:01] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[06:02:11] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:02:11] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:04:51] PROBLEM - Wikitech and wt-static content in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech
[06:09:53] !log Reload haproxy on dbproxy1005
[06:10:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:14:21] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[06:19:43] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/415521 (https://phabricator.wikimedia.org/T187089)
[06:20:22] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:21:48] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/415521 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:22:22] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[06:23:01] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/415521 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:23:15] (CR) jenkins-bot:
db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/415521 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[06:24:11] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:24:42] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:24:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 for alter table (duration: 01m 14s)
[06:25:53] !log Deploy schema change on db1074 - T187089 T185128 T153182
[06:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:56] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[06:26:57] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:26:57] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:27:31] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[06:27:31] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[06:28:21] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[06:33:21] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[06:33:31] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:40:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:45:03] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012902 (Marostegui) p:Triage>High
[06:45:03] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[06:45:14] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (Marostegui)
[06:48:33] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012915 (Marostegui)
[06:49:01] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:53:22] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating
[06:55:41] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0
[06:57:31] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active
[06:57:32] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active
[06:57:36] !log Restart nova-conductor on labcontrol1001
[06:57:41] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational
[06:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:47] !log restart nova-api on labnet1001
[06:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:36]
(Draft2) Tulsi Bhagat: Add Import sources on maiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374)
[07:03:01] Operations, MediaWiki-Configuration, Patch-For-Review, User-Joe, discovery-system: Prepare conftool for safely editing mediawiki-config values - https://phabricator.wikimedia.org/T185080#4012945 (Joe) Open>Resolved
[07:03:04] Operations, MediaWiki-Configuration, Patch-For-Review, discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#4012946 (Joe)
[07:03:18] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012947 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=admin-monitoring openstack server list +--------------------------------------+-----------------------+--------+---------------------+ | ID...
[07:03:22] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (madhuvishy) Things seem a lot better now since 06:57 madhuvishy: Restart nova-conductor on labcontrol1001 06:59 chasemp: restart nova-api on labnet1001
[07:03:50] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012950 (Marostegui) >>! In T188589#4012901, @Marostegui wrote: > I am killing sleeping connections to nova database in a screen on db1009 for now. This was stopped at 06:56AM as @madhuvishy started to res...
[07:04:07] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012951 (chasemp) @madhuvishy restarted nova-conductor and I restarted nova-api shortly thereafter. nova-conductor restart seems to have calmed things down. I restarted nova-api as it has a tendency to "g...
[07:11:23] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[07:11:31] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[07:16:33] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012970 (chasemp) It took some patience but post restarts I cleaned up nova-fullstack's mess ``` 2007 OS_TENANT_NAME=admin-monitoring openstack server list 2008 OS_TENANT_NAME=admin-monitoring openstack s...
[07:18:13] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012971 (Marostegui) p:High>Normal Decreasing the task back to Normal priority as things look stable and leaving it open as per: ``` ˜/chasemp 8:16> marostegui: no worries and let's leave it open ti...
[07:22:00] Operations, ops-codfw, DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T188286#4012976 (Marostegui) Open>Resolved Let's see if this lasts for long this time ``` root@db2048:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E3350)...
[07:24:18] !log demon@tin Synchronized php-1.31.0-wmf.22/maintenance/sql.php: adding --json output mode (duration: 01m 15s)
[07:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:25] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012978 (chasemp) ```root@labcontrol1001:~# OS_TENANT_NAME=contintcloud openstack server list +--------------------------------------+----------------------------+--------+---------------------+ | ID...
[07:26:01] !log starting rolling reboot of elasticsearch / cirrus - eqiad (kernel upgrade and config changes)
[07:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:35] Operations, Cloud-Services, DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012981 (madhuvishy) Some logs from nova-conductor corresponding to the time of incident, doesn't seem like the root cause but correlates with the db spike. https://phabricator.wikimedia.org/P6770
[07:45:31] (CR) Marostegui: "Backups finished!" [puppet] - https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: Jcrespo)
[07:48:07] (PS5) Ayounsi: LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101
[07:48:26] (PS7) Vgutierrez: Provide unique logging for BGP instances [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085)
[07:49:10] (CR) Ayounsi: [C: 2] LibreNMS: Allow librenms to write file in $install_dir [puppet] - https://gerrit.wikimedia.org/r/399101 (owner: Ayounsi)
[07:52:36] !log run kafka preferred-replica-election on kafka1012 to force broker 18 to get back among Kafka topic leaders
[07:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:54] !log reboot kafka-jumbo1001 for kernel updates - T188594
[07:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:09] T188594: Reboot all Analytics hosts for Kernel upgrade - https://phabricator.wikimedia.org/T188594
[07:56:54] (CR) Vgutierrez: [C: 2] Provide unique logging for BGP instances [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085) (owner: Vgutierrez)
[07:57:22] (Merged) jenkins-bot: Provide unique logging for BGP instances [debs/pybal] - https://gerrit.wikimedia.org/r/414716 (https://phabricator.wikimedia.org/T188085)
(owner: 10Vgutierrez) [08:02:11] PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100% [08:02:41] PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:01] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:01] PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:12] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:13] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:13] PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:13] PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:13] PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:21] PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:21] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:03:32] PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100% [08:03:41] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:02] ah there you go, I was wondering why bohrium's ssh was slow [08:04:02] :D [08:04:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4013001 (10Gehel) [08:04:51] RECOVERY - Wikitech and wt-static content in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (23643 200000s) [08:05:12] this one might have been me, the Piwik's archiver had to go through more days [08:05:27] I noticed that the cron is scheduled to start every morning at 8AM UTC [08:05:41] PROBLEM - Host elastic1021 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:44] Cc: akosiaris [08:09:01] quite probable [08:09:11] RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 1.64 ms [08:09:11] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 1.56 ms [08:09:11] 
RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [08:09:11] RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [08:09:11] RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [08:09:21] RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [08:09:28] it isn't DRBD alone btw [08:09:31] RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [08:09:31] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:09:35] I haven't managed to reproduce the issue [08:09:40] on the host that is [08:09:41] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [08:09:53] my only reproduction case up to now is doing heavy IO in a VM [08:10:06] that is DRBD backed (all of them are) [08:10:35] doing heavy IO in a plain VM or doing heavy IO on a DRBD device on the host does not reproduce it [08:10:41] RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [08:10:46] I want to test something else today though [08:10:51] RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms [08:10:51] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [08:11:01] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [08:11:11] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 48 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:12:31] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [08:14:11] (03PS1) 10Ayounsi: LibreNMS: fix puppet resource cycle [puppet] - 10https://gerrit.wikimedia.org/r/415538 [08:14:52] (03CR) 10Ayounsi: [C: 032] 
LibreNMS: fix puppet resource cycle [puppet] - 10https://gerrit.wikimedia.org/r/415538 (owner: 10Ayounsi) [08:16:11] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 9 probes of 289 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [08:19:54] (03CR) 10Ayounsi: [C: 031] Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [08:20:48] !log banning elastic1021 from cluster (failed memory) - T188595 [08:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:04] T188595: Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595 [08:22:21] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - etcd_request_latencies is 2817747 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:23:21] RECOVERY - etcd request latencies on chlorine is OK: OK - etcd_request_latencies is 1595 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [08:23:32] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [08:28:50] akosiaris: can I retry the piwik archiver? [08:32:48] elukey: is there anyway to limit the IO ? 
[08:33:07] I guess no [08:33:25] let me first move the VM into an emptyish node so we don't cause problems to all other VMs [08:33:31] okok [08:34:00] !log reboot kafka1012 for kernel updates - T188594 [08:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:18] T188594: Reboot all Analytics hosts for Kernel upgrade - https://phabricator.wikimedia.org/T188594 [08:39:07] elukey: 1h 12m 21s remaining (estimated) [08:41:22] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 617 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 514994 keys, up 6 hours 22 minutes - replication_delay is 617 [08:41:33] !log draining restbase2009 for eventual reboot for kernel security update [08:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:52] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 638 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 515515 keys, up 6 hours 19 minutes - replication_delay is 638 [08:49:38] akosiaris: thanks! 
[08:51:13] so redis on 2005 seems trying to resync with master but the connection gets logs [08:51:16] *lost [08:52:05] !log rebooting prometheus servers in eqiad for kernel security update [08:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:41] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1290 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 514994 keys, up 6 hours 33 minutes - replication_delay is 1290 [08:57:13] on rdb1007 - cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits [08:59:52] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 615 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 515698 keys, up 6 hours 43 minutes - replication_delay is 615 [09:00:46] buffer limit is "normal 0 0 0 slave 2147483648 536870912 60 pubsub 33554432 8388608 60" [09:00:51] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 510912 keys, up 6 hours 44 minutes - replication_delay is 0 [09:01:49] so if this doesn't auto-recover in a bit we might need to push the soft limits (512M for 60s) a bit up [09:01:51] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 508729 keys, up 6 hours 42 minutes - replication_delay is 0 [09:02:12] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 509508 keys, up 6 hours 39 minutes - replication_delay is 0 [09:03:34] gooood [09:04:49] now I am running redis-cli --big-keys on rdb2005:6480 to see if there is a huge data structure [09:07:50] don't see anything out of the ordinary [09:17:39] !log draining restbase2010 for eventual reboot for kernel security update [09:17:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:33] (03PS1) 10Gehel: wdqs: fix reassignment of existing variable [puppet] - 10https://gerrit.wikimedia.org/r/415540 (https://phabricator.wikimedia.org/T188252) [09:23:20] (03CR) 10Gehel: [C: 032] wdqs: fix reassignment of existing variable [puppet] - 10https://gerrit.wikimedia.org/r/415540 (https://phabricator.wikimedia.org/T188252) (owner: 10Gehel) [09:23:53] (03PS2) 10Gehel: Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415493 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [09:26:46] (03CR) 10Gehel: [C: 031] "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10211/wdqs2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/415493 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [09:26:56] (03CR) 10Fomafix: "> Seems like this can be merged w/o waiting for the other patches on" [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [09:29:35] !log rebooting analytics1030 for kernel updates [09:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4013130 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:33:14] 10Operations, 10Analytics, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377#4013131 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:33:26] 10Operations: Add Prometheus collector for Tor - https://phabricator.wikimedia.org/T188098#4013132 (10MoritzMuehlenhoff) p:05Triage>03Normal [09:40:11] 10Operations, 10Puppet, 10Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4013143 (10Joe) >! 
In T188544#4012015, @herron wrote: > The newer version of puppetdbquery requires newer puppetdb which in turn requires the newer pu... [09:41:44] 10Operations: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4013149 (10fgiunchedi) [09:43:29] !log reboot kafka1013 for kernel security updates [09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:57] 10Operations, 10Wikimedia-Apache-configuration, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4013161 (10Joe) [09:45:39] 10Operations: Decrease the amount of IRC spam in case of widespread puppet failures - https://phabricator.wikimedia.org/T188602#4013164 (10fgiunchedi) [09:47:54] the incident report for yesterday's puppetmaster failure -> https://wikitech.wikimedia.org/wiki/Incident_documentation/20180228-puppetmaster [09:48:17] feel free to edit/add/etc [09:49:21] that last task T188602 might be a duplicate as I seem to remember we had a similar task before, but couldn't find it [09:49:22] T188602: Decrease the amount of IRC spam in case of widespread puppet failures - https://phabricator.wikimedia.org/T188602 [09:57:38] !log draining restbase2011 for eventual reboot for kernel security update [09:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:07] !log reboot kafka1014 for kernel security updates [09:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:35] marostegui: we are going to reboot CI in case you had some DB patches for mediawiki-config [10:02:29] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4013204 (10Aklapper) T175445 is neither NDA nor Security but custom... Going to https://phabricator.wikimedia.org/maniphest/task/... 
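The Redis buffer limit quoted at 09:00:46 follows the `client-output-buffer-limit` layout of `<class> <hard limit> <soft limit> <soft seconds>` per client class. A minimal sketch of decoding that string is below; the function name `parse_output_buffer_limits` and the dictionary keys are illustrative choices, not part of Redis or of any Wikimedia tooling.

```python
def parse_output_buffer_limits(value):
    """Split a client-output-buffer-limit string into per-class limits.

    The string is a flat sequence of 4-token groups:
    <class> <hard limit bytes> <soft limit bytes> <soft seconds>.
    A limit of 0 means that limit is disabled.
    """
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of 4 tokens, got %d tokens" % len(tokens))
    limits = {}
    for i in range(0, len(tokens), 4):
        client_class, hard, soft, seconds = tokens[i:i + 4]
        limits[client_class] = {
            "hard_bytes": int(hard),
            "soft_bytes": int(soft),
            "soft_seconds": int(seconds),
        }
    return limits


# Value copied from the 09:00:46 message in this log.
LIMITS = parse_output_buffer_limits(
    "normal 0 0 0 slave 2147483648 536870912 60 pubsub 33554432 8388608 60"
)

MIB = 2 ** 20
for name, lim in LIMITS.items():
    print(
        "%s: hard=%d MiB, soft=%d MiB for %ds"
        % (name, lim["hard_bytes"] // MIB, lim["soft_bytes"] // MIB, lim["soft_seconds"])
    )
```

For the `slave` class this gives a 2048 MiB hard limit and a 512 MiB soft limit sustained for 60 seconds, which matches the "512M for 60s" mentioned at 09:01:49: a replica connection whose output buffer stays above the soft limit for that long is closed, as the rdb1007 message at 08:57:13 reports.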
[10:02:53] !log rebooting contint1001 for kernel security update [10:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:31] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013230 (10jcrespo) This was warned in advance at T188210 [10:11:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:12:12] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:12:31] elukey: feel free to do whatever you want with piwik [10:12:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:12:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1023 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:13:22] kafka unhappy eh? elukey ^ [10:13:38] it's probably related to the kafka1014 reboot ? [10:14:18] very likely yeah [10:15:01] checking! 
[10:15:22] (03PS4) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) [10:15:48] godog: it might have taken a bit more to recover due to kafka1014 [10:16:15] metrics in https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&from=now-3h&to=now are good [10:16:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:17:12] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:18:20] \o/ [10:18:23] poor kafka [10:18:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:18:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1023 is OK: OK: Less than 50.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1 [10:19:18] akosiaris: is bohrium ready to go? [10:20:03] !log rebooting labnodepool1001 for kernel security update [10:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:59] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013246 (10jcrespo) The immediate solution is T183469 [10:21:44] elukey: yup, I already pinged you about it. Probably lost in the kafka noise [10:22:00] akosiaris: argh sorry didn't see it! 
thanks :) [10:23:37] restarted the archiver, let's see how it goes [10:27:13] !log draining restbase2012 for eventual reboot for kernel security update [10:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:03] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:32:09] this is me, downtime expired [10:32:37] !log rolling reboot of parsoid in codfw for kernel security update [10:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] <_joe_> brb [10:36:26] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4013260 (10akosiaris) The following have failed up to now to reproduce * Heavy IO on a plain volume on a ganeti host (thankfully) * Heavy IO on a DRBD backed device on ganeti host * H... [10:38:03] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [10:38:18] !log eqsin LVSs: reboot for retpoline kernel updates T188092 [10:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:47] Hi, can you tell me which user I should add as reviewer for these patches: https://gerrit.wikimedia.org/r/#/c/412960/ https://gerrit.wikimedia.org/r/#/c/412963/ so the changes can be deployed? Thanks! 
[11:08:55] !log reboot kafka1020 for kernel updates [11:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:10] !log ulsfo LVSs: reboot for retpoline kernel updates T188092 [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:08] !log draining restbase1007 for eventual reboot for kernel security update [11:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:29] !log reboot kafka-jumbo1002 for kernel security updates [11:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:22] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - search_9200: Servers elastic1027.eqiad.wmnet, elastic1034.eqiad.wmnet, elastic1049.eqiad.wmnet, elastic1042.eqiad.wmnet, elastic1041.eqiad.wmnet, elastic1017.eqiad.wmnet, elastic1029.eqiad.wmnet, elastic1052.eqiad.wmnet are marked down but pooled [11:29:49] looking ^ [11:30:22] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [11:32:40] !log reboot kafka1022 for kernel updates [11:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:06] !log restarting labsdb1011 [11:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:56] gehel: troubles with elastic1*? https://grafana.wikimedia.org/dashboard/db/pybal?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=lvs1003&var-service=search-https_9243&from=1519901714639&to=1519904003119 [11:34:37] ema: looking [11:34:49] thanks! [11:35:25] ema: rolling restart in progress, but it should be mostly transparent... 
[11:36:09] !log reboot kafka-jumbo1003 for kernel updates [11:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:31] the reason why we got the alert above is that pybal reached the depooling threshold for the service, so it refused to depool those hosts even though they were found "down" by the healthchecks [11:37:21] ema: there has been a peak in 99%-ile response time, correlated with the graphs you just sent [11:37:39] * gehel is guessing master re-election, but needs to check [11:37:39] gehel: how "rolling" have the restarts been? :) [11:39:51] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: CRITICAL - kafka_broker_under_replicated_partitions is 11 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1004 [11:40:21] gehel: at 11:27 we had 35 pooled servers, then a few seconds after 3 were gone (32 pooled) [11:40:32] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: CRITICAL - kafka_broker_under_replicated_partitions is 10 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1005 [11:40:41] gehel: then something happened at 11:28 and they all went down from pybal's POW [11:40:42] ema: yep, that was master reelection... which again took more time than expected (similar thing on codfw yesterday) [11:40:48] ok [11:41:10] Jumbo should recover soon [11:41:33] I need to dig a bit more into this. Re-election is supposed to be fast, and reads are supposed to be possible even during election. [11:42:20] Something is fishy, most probably related to the restarts. [11:42:57] ema: the 3 depools at 11:27 were explicit depool before rebooting those servers [11:43:19] so very much expected. 
The all down at 11:28 is not normal [11:44:27] right [11:45:00] I'll get something to eat first. Things seems stable enough [11:45:25] yup, same here o/ [11:46:20] (03PS4) 10Vgutierrez: pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [11:48:27] !log powercycling wtp2013, stuck in reboot [11:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:39] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler02/10213/" [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [11:59:46] !log draining restbase1008 for eventual reboot for kernel security update [11:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:51] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: OK - kafka_broker_under_replicated_partitions is 4 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1005 [12:08:58] (03PS10) 10Jcrespo: mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) [12:09:02] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: OK - kafka_broker_under_replicated_partitions is 4 https://grafana.wikimedia.org/dashboard/db/prometheus-kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=kafka_jumbo&var-kafka_brokers=kafka-jumbo1004 [12:09:48] (03CR) 10Jcrespo: [C: 032] mariadb: Set up es2001 as the temporary backup target [puppet] - 10https://gerrit.wikimedia.org/r/415024 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:12:05] es2001 puppet is me [12:14:37] (03PS1) 10Jcrespo: mariadb-backups: Fix 
typo on config file [puppet] - 10https://gerrit.wikimedia.org/r/415552 (https://phabricator.wikimedia.org/T184696) [12:15:09] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Fix typo on config file [puppet] - 10https://gerrit.wikimedia.org/r/415552 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:21:06] !log reboot kafka1023 for kernel updates [12:21:17] (03CR) 10Jayprakash12345: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374) (owner: 10Tulsi Bhagat) [12:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:07] ema: it looks like the synchronization of cluster state after master re-election took 1.9 minute. That's a bit above the expected max of 30'. Not sure there is much we can do about it... [12:27:43] !log reboot kafka-jumbo1004 for kernel updates [12:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:04] Was elastic1021 being used by Wikidata as well? [12:38:45] Elastic search seems not to be updating on Wikidata.... [12:39:29] !log rolling reboot of parsoid in eqiad for kernel security update [12:39:30] (03Draft1) 10MarcoAurelio: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) [12:39:34] (03PS2) 10MarcoAurelio: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:22] (03CR) 10Jayprakash12345: [C: 031] "This should add even if It is not related to T188582." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) (owner: 10MarcoAurelio) [12:43:27] sjoerddebruin: while cluster restart is in progress, we can expect delay on updates [12:43:53] sjoerddebruin: but they should get through eventually. 
The loss of elastic1021 should be completely transparent [12:44:31] !log draining restbase1009 for eventual reboot for kernel security update [12:44:34] Just looking for causes, sorry. [12:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:05] sjoerddebruin: no problem! Do you have a ticket on those lost / delayed updates? I'd be interested in having a look! [12:45:45] Already informed Lydia_WMDE about it, can create some ticket if she wants. ^ [12:46:11] Because I can't even use newly created entities as value on Wikidata... [12:47:07] <_joe_> during a cirrus outage, jobs will fail [12:47:16] <_joe_> not sure when/if they will be retried [12:47:33] Yeah, that seems to be the case. [12:47:38] <_joe_> but that seems more serious than a problem on cirrus [12:47:45] _joe_, sjoerddebruin: we're not really in an outage, but in a cluster restart [12:47:45] <_joe_> < sjoerddebruin> Because I can't even use newly created entities as value on Wikidata... [12:48:02] <_joe_> gehel: well we had a moment when traffic would be sent to servers marked as down [12:48:04] It shows up now, delay of 15± minutes. [12:48:09] <_joe_> partial outage? [12:48:20] <_joe_> anyways, the jobqueue will retry typically, with some delay [12:48:37] <_joe_> I'm looking at the cirrus jobs right now and that seems to be the case too [12:49:06] _joe_: yeah, but for about 3 minutes, the real impact is probably more because we explicitly freeze writes during a cluster restart. [12:49:18] <_joe_> oh there it is [12:49:24] <_joe_> jobs are write operations :) [12:49:37] yep, exactly [12:49:45] (03PS5) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [12:49:47] Trending graphs on https://grafana.wikimedia.org/dashboard/db/job-queue-rate seem to confirm it. [12:49:55] Thank you! 
[12:50:08] (03CR) 10jerkins-bot: [V: 04-1] restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 (owner: 10ArielGlenn) [12:50:54] usually, having some delay in the updates for search is not a big deal, but the wikidata use case seems to be different [12:51:05] <_joe_> that's broken then [12:51:10] <_joe_> I mean wikidata [12:51:23] <_joe_> things cannot depend on async jobs to function [12:51:57] <_joe_> either we make the job synchronous (yuck, double yuck), or we fix the code to work even when the job isn't completed [12:52:04] it's probably more about managing expectations... but I don't know enough there. I'll find the right people to talk to... [12:52:15] (03PS6) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [12:52:22] <_joe_> yeah me neither [12:52:41] and no, we're not going to make writes synchronous :) [12:52:54] <_joe_> yeah I wasn't seriously suggesting that [12:52:57] Maybe https://gerrit.wikimedia.org/r/c/413492/ will improve things. 
[12:54:12] <_joe_> sjoerddebruin: that's what we're saying we don't want [12:54:19] Hmmm [12:54:33] <_joe_> that would make edit depend on write availability of the elasticsearch clusters [12:54:48] <_joe_> that's not something we really want imho [12:55:00] <_joe_> but I see it's merged by gehel's colleagues [12:55:01] <_joe_> :P [12:55:07] :) [12:55:13] <_joe_> bad dcausse [12:55:29] it's a best effort, if something bad happens this update is thrown away [12:55:39] <_joe_> meh [12:55:54] <_joe_> if the final result is entities will not be usable until indexed [12:56:05] no the normal update will still happen thanks to the jobqueue [12:56:16] <_joe_> yeah, but what I'm saying is [12:56:19] <_joe_> during maintenance [12:56:34] <_joe_> entities added to wikidata will not be usable, if I understand correctly [12:56:45] <_joe_> now either we stop doing maintenance [12:56:58] <_joe_> or we fix the UX to degrade nicely in that case [12:57:23] during maintenance this job will be ignored and users won't benefit from fast indexing, but normal job will pile up in the job queue and resume as soon as maintenance is done [12:58:20] it is all about the definition of "broken", new entities are still usable, but they are not searchable, which makes it less convenient to refer to them [12:58:45] <_joe_> oh ok [12:58:50] <_joe_> that's ok I guess [12:59:06] <_joe_> the original report was they were unusable [12:59:48] Yeah, don't ask me how Wikidata handles items as values. The item did show up and was able to be published 15± minutes later. [13:00:13] yes it can be misleading for editors to not see their pages just after creation, this patch aims to reduce that annoyance [13:00:48] dcausse: I barely understand that above patch. Does it still respect the freezes? [13:00:57] in most cases it'll be ok but during maintenance editors won't benefit from that feature [13:00:59] gehel: yes [13:01:07] ok, kool! 
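The behaviour discussed between 12:55 and 13:01 (a best-effort fast index update that is dropped while writes are frozen, with the regular job still queued and replayed after maintenance) can be modelled roughly as follows. This is a toy sketch under stated assumptions, not how CirrusSearch or the MediaWiki job queue implement it: `SearchCluster`, `save_entity` and `run_jobs` are hypothetical names, and the fast path is shown as a direct write for simplicity.

```python
from collections import deque


class SearchCluster:
    """Hypothetical stand-in for a search cluster that can freeze writes."""

    def __init__(self):
        self.frozen = False
        self.index = set()

    def write(self, doc_id):
        if self.frozen:
            raise RuntimeError("writes are frozen for maintenance")
        self.index.add(doc_id)


def save_entity(doc_id, cluster, job_queue):
    """Queue the normal update, then try a best-effort fast update.

    Returns True if the fast update succeeded. The edit itself never
    fails because of the search cluster.
    """
    job_queue.append(doc_id)  # the normal asynchronous update is always queued
    try:
        cluster.write(doc_id)  # best effort: thrown away on failure
        return True
    except RuntimeError:
        return False


def run_jobs(cluster, job_queue):
    """Process queued updates; they simply pile up while writes are frozen."""
    processed = 0
    while job_queue and not cluster.frozen:
        cluster.write(job_queue.popleft())
        processed += 1
    return processed


cluster = SearchCluster()
queue = deque()

cluster.frozen = True                        # cluster restart in progress
fast_ok = save_entity("Q1", cluster, queue)  # fast path is dropped
searchable_during = "Q1" in cluster.index
stalled = run_jobs(cluster, queue)           # nothing is processed yet

cluster.frozen = False                       # maintenance done
replayed = run_jobs(cluster, queue)          # the queued update goes through
searchable_after = "Q1" in cluster.index

print(fast_ok, searchable_during, stalled, replayed, searchable_after)
```

In this model the new entity exists throughout but only becomes searchable once the queue drains, which is the distinction drawn at 12:58:20 between "usable" and "searchable".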
[13:01:39] <_joe_> yeah I was worried by how I understood the original report [13:01:51] I'm hacking around my restart scripts, and I think I can probably shave 30 minute per server group. But of course, writes will mess that up... [13:01:52] <_joe_> but probably the UX is not great in that situation [13:02:10] very true, a small hint in the UI would be nice [13:03:02] Just like the watchlist notices that results may be X seconds behind? [13:03:42] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4013727 (10jcrespo) [13:03:45] yes something like that, so that editors don't get confused or afraid of their edit being lost [13:16:48] !log esams LVSs: reboot for retpoline kernel updates T188092 [13:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:25] !log reboot kafka-jumbo100[5,6] for kernel updates [13:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:58] (03PS2) 10Arturo Borrero Gonzalez: toollabs: introduce base class for all toolforge roles [puppet] - 10https://gerrit.wikimedia.org/r/415057 (https://phabricator.wikimedia.org/T187193) [13:18:13] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] toollabs: introduce base class for all toolforge roles [puppet] - 10https://gerrit.wikimedia.org/r/415057 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:18:32] !log draining restbase1010 for eventual reboot for kernel security update [13:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:21] jouncebot: next [13:26:22] In 0 hour(s) and 33 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1400) [13:33:44] !log force merging enwiki_general index on codfw to reclaim space [13:33:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4012890 (10Andrew) Sorry I slept through this last night! I'm catching up. A few facts: nova-api seems to connect directly to the database. Other than nova-api, nova-conductor is the service that marshals... [13:43:49] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013834 (10jcrespo) The problem I see is that each openstack (I think it is openstack) application, has its own pool of connections- which has some issues for our infrastructure- first because it "reserves" r... [13:44:14] jouncebot: refresh [13:44:17] I refreshed my knowledge about deployments. [13:48:46] zeljkof: we will be syncing some changes for the jobqueue just a bit before the swat window (in 10 mins or so), do you mind starting a couple of mins later? [13:50:34] !log codfw LVSs: reboot for retpoline kernel updates T188092 [13:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:53] mobrovac: no problemo, last time I have checked there were nothing for swat [13:51:05] oh cool [13:51:10] thnx zeljkof, will ping you once done [13:54:54] !log draining restbase1011 for eventual reboot for kernel security update [13:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:30] jouncebot: next [13:57:30] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1400) [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1400). 
[14:00:04] Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] I’ll be back in ~5mins [14:00:32] hold the swat for 5 mins please [14:00:49] (03PS3) 10Mobrovac: Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) (owner: 10Ppchelko) [14:00:51] Ok [14:01:05] hello [14:01:32] !log ppchelko@tin Started deploy [cpjobqueue/deploy@b5255f0]: Enable kafka queue for cdnPurge for all but wikipedia, commons and wikidata. T188540 [14:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:47] T188540: Switch cdnPurge to Kafka - https://phabricator.wikimedia.org/T188540 [14:02:16] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@b5255f0]: Enable kafka queue for cdnPurge for all but wikipedia, commons and wikidata. T188540 (duration: 00m 44s) [14:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:38] zeljkof: You wil SWAT today? [14:04:04] (03CR) 10Mobrovac: [C: 032] Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) (owner: 10Ppchelko) [14:04:36] Jayprakash12345: SWAT is being delayed for ~5 minutes per request. Maintenance ongoing :) [14:05:35] (03Merged) 10jenkins-bot: Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) (owner: 10Ppchelko) [14:07:11] Hauskatze: Thank you very much for Inform [14:07:37] Jayprakash12345: yes [14:07:45] I’m here now [14:07:57] still waiting... 
[14:07:57] though my patch for SWAT is optional anyways [14:08:14] (03CR) 10jenkins-bot: Disable redis queue for cdnPurge for all but wikipedia, commons and wikidata. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415371 (https://phabricator.wikimedia.org/T188540) (owner: 10Ppchelko) [14:10:39] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: JobQueue: Enable EventBus for cdnPurge for all but wikipedia, commons and wikidata, file 1/2 - T188540 (duration: 01m 14s) [14:10:46] (03PS2) 10Zfilipin: Enable Quiz Extension at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414506 (https://phabricator.wikimedia.org/T188213) (owner: 10Jayprakash12345) [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:55] T188540: Switch cdnPurge to Kafka - https://phabricator.wikimedia.org/T188540 [14:12:29] Lucas_WMDE: um, your patch has failing tests [14:12:34] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: JobQueue: Enable EventBus for cdnPurge for all but wikipedia, commons and wikidata, file 2/2 - T188540 (duration: 01m 13s) [14:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:09] zeljkof: we're done, you can take over now, sorry for the wait [14:13:23] mobrovac: no problem [14:13:26] I can SWAT today [14:14:01] (03PS1) 10Jcrespo: mariadb-backups: Fix typo on codfw configuration, remove regexes [puppet] - 10https://gerrit.wikimedia.org/r/415565 (https://phabricator.wikimedia.org/T184696) [14:14:30] (03PS2) 10Jcrespo: mariadb-backups: Fix typo on codfw configuration, remove regexes [puppet] - 10https://gerrit.wikimedia.org/r/415565 (https://phabricator.wikimedia.org/T184696) [14:15:40] Jayprakash12345: around for SWAT? 
[14:15:49] zeljkof: Yes [14:16:09] Jayprakash12345: ok, merging your patch, I will let you know when it's at mwdebug1002, so you can test there [14:16:13] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414506 (https://phabricator.wikimedia.org/T188213) (owner: 10Jayprakash12345) [14:17:23] zeljkof: yes, I know :/ [14:17:27] (03Merged) 10jenkins-bot: Enable Quiz Extension at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414506 (https://phabricator.wikimedia.org/T188213) (owner: 10Jayprakash12345) [14:17:41] but I’ll let you finish with Jayprakash12345 first and then try to explain my situation [14:18:14] Lucas_WMDE: start explaining :) [14:18:18] okay :) [14:18:19] (03CR) 10jenkins-bot: Enable Quiz Extension at zhwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414506 (https://phabricator.wikimedia.org/T188213) (owner: 10Jayprakash12345) [14:18:45] so my change is a backport of a fix for one of several bugs which, in combination, caused an incident recently (https://wikitech.wikimedia.org/wiki/Incident_documentation/20180226-WikibaseQualityConstraints) [14:18:54] zeljkof: we have a last-minute SWAT request as well, if there's still time: https://gerrit.wikimedia.org/r/#/c/415564/ /cc niedzielski [14:19:01] just a bug fix [14:19:06] the fix isn’t *necessary* right now – as far as I can tell, the bug can’t be triggered by anything else [14:19:12] mdholloway: sure, want to deploy it yourself, or should I? 
[14:19:18] T187955 is the underlying bug [14:19:19] T187955: Page preview shows icon instead of thumbnail - https://phabricator.wikimedia.org/T187955 [14:19:21] but that’s no guarantee, and the fix is trivial enough that I thought I’d suggest it for SWAT [14:19:27] especially since the SWAT was so empty anyways [14:19:40] I’m certain the CI failures are unrelated [14:19:49] but I’ll totally understand if you don’t want to apply it either [14:19:59] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Fix typo on codfw configuration, remove regexes [puppet] - 10https://gerrit.wikimedia.org/r/415565 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [14:20:10] zeljkof: would you please? i don't know the SWAT procedure (or think i have the rights to do it... at least AFAIK :) ) [14:20:44] Lucas_WMDE: I would not like to force merge the patch until CI failures are fixed, maybe hashar would know what is wrong [14:20:49] mdholloway: no problem, will do [14:20:55] Hey all. I heard I might be required [14:21:05] zeljkof: which change? 
[14:21:06] mdholloway o/ [14:21:13] hashar: https://gerrit.wikimedia.org/r/c/415319/ [14:21:35] raynor: howdy :) zeljkof is going to deploy the backport of https://gerrit.wikimedia.org/r/#/c/415564/ for us [14:21:37] Jayprakash12345: your patch is at mwdebug1002, please test and let me know if I can deploy [14:21:47] [14:22:31] (03PS7) 10Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - 10https://gerrit.wikimedia.org/r/415353 [14:23:03] zeljkof: https://zh.wikibooks.org/wiki/User:Jayprakash12345/Sandbox, Looks Goog [14:23:07] Good* [14:23:08] hashar: apparently the test includes a stub for PageImages which just provides that constant, so the test should still work even if the extension isn’t installed [14:23:14] but I don’t know more about that [14:23:21] zeljkof: Please Deploy [14:23:26] I’m not very familiar with the Wikibase tests [14:23:55] Jayprakash12345: deploying [14:24:48] mdholloway: can you please add the patch to the calendar? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1400 [14:24:56] zeljkof: sure, doing now [14:25:11] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:414506|Enable Quiz Extension at zhwikibooks (T188213)]] (duration: 01m 14s) [14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:27] T188213: Enable Quiz extension on zhwikibooks - https://phabricator.wikimedia.org/T188213 [14:25:44] Lucas_WMDE: maybe addshore can help figure out why phan complains about https://gerrit.wikimedia.org/r/#/c/415319/ [14:25:55] \o [14:26:00] *looks* [14:26:07] Lucas_WMDE: since one job is failing, I would prefer not to deploy it :( [14:26:18] okay [14:26:21] if addshore or hashar can figure out what is wrong... 
[14:26:26] [14:26:27] !log draining restbase1012 for eventual reboot for kernel security update [14:26:28] worst case, the non-backported version will be deployed next week :) [14:26:39] there is 34 minutes left :) [14:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:54] sounds like that class isnt being loaded or looked at by phan [14:27:10] Jayprakash12345: deployed, please check and thanks for deploying with #releng! :) [14:27:22] it has a stub at least :D [14:27:38] !is_dir( './../../extensions/PageImages' ) ? [ 'tests/phan/stubs/pageimages.php' ] : [], [14:27:52] yup [14:28:00] zeljkof: Hey I add one more patch. please see [14:28:03] maybe the PageImages extension is present but not loaded? [14:28:23] Lucas_WMDE: pageimage must also be added to directory_list in the config [14:28:42] !log rolling restart of swift frontends in codfw for kernel security update [14:28:53] addshore: strange thing is that in master the same patch has no failed tests cc Lucas_WMDE [14:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:04] Jayprakash12345: sure, reviewing [14:29:19] zeljkof: hashar Lucas_WMDE https://gerrit.wikimedia.org/r/#/c/415567/ [14:29:32] mdholloway, raynor: I don't see your patch in the calendar [14:29:38] zeljkof: interesting, lets try this cherry picked on top [14:29:48] zeljkof: yeah, fixing edit conflict :/ [14:30:03] (03PS3) 10Zfilipin: Add Import sources on maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374) (owner: 10Tulsi Bhagat) [14:30:08] Tulsi: Hey [14:30:26] https://gerrit.wikimedia.org/r/#/c/415568/ is on the branch with my change and running on top of https://gerrit.wikimedia.org/r/#/c/415319/, lets see what happens [14:30:39] Jayprakash12345: Hi [14:30:42] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 978 threshold =0.1 
breach: status: yellow, number_of_nodes: 32, unassigned_shards: 974, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3127, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 4, active_shard [14:30:42] : 89.6327833954, active_shards: 8421, initializing_shards: 0, number_of_data_nodes: 32, delayed_unassigned_shards: 0 [14:30:59] damn, again? [14:31:06] addshore: shouldn’t the commits be the other way around, first “add Pageimages” and then “fix empty condition list”? [14:31:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374) (owner: 10Tulsi Bhagat) [14:31:16] gehel: still restarting? [14:31:22] Lucas_WMDE: we can re order them if it works :) [14:31:26] replacing the entries that nuked... [14:31:26] fair enough :D [14:31:28] <_joe_> gehel: should we switch to codfw? [14:31:32] thanks for looking into it! [14:31:43] oh, nm, looks like we're good [14:31:52] yep, still restarting, cluster is already yellow, checking why Icinga complained [14:32:07] zeljkof: done https://wikitech.wikimedia.org/w/index.php?title=Deployments#Thursday,_March_01 [14:32:27] zeljkof: looks like the stub got used for test but the extension was present for gate? 
[14:32:27] (03Merged) 10jenkins-bot: Add Import sources on maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374) (owner: 10Tulsi Bhagat) [14:32:40] (03CR) 10jenkins-bot: Add Import sources on maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415522 (https://phabricator.wikimedia.org/T188374) (owner: 10Tulsi Bhagat) [14:32:42] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 35, unassigned_shards: 725, number_of_pending_tasks: 36, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3127, task_max_waiting_in_queue_millis: 33934, cluster_name: production-search-eqiad, relocating_shards: 2, active_shards_percent_as_numb [14:32:42] active_shards: 8658, initializing_shards: 12, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [14:32:48] mdholloway: please stand by, you are next, as soon as the current patch is deployed, in 5 or so minutes [14:33:06] zeljkof: cool, standing by. // niedzielski: raynor ^ [14:33:30] zeljkof: There is no way to test 415522, Deploy if there is logs error occurr. [14:33:35] Jayprakash12345: your second patch is at mwdebug, please test and let me know if I can deploy [14:33:37] 10Operations, 10Puppet, 10User-fgiunchedi: Upgrade hiera to stretch (version 3) - https://phabricator.wikimedia.org/T188623#4013942 (10fgiunchedi) p:05Triage>03Normal [14:33:40] 👍 [14:33:56] zeljkof: There is no way to test 415522, Deploy if there is logs error occurr. [14:33:56] Ok, the check is just a little bit too agressive (we fail if < 90% of shards are allocated, and in this case, we went down to 89.63%). [14:34:00] we're still good [14:34:01] Jayprakash12345: sorry, did not understand you, you can not test it? [14:34:03] :thumbs_up: [14:34:09] Jayprakash12345: ok, deploying [14:34:49] zeljkof: I am with Tulsi's touch, He is admin there. 
[14:35:14] so can I ask, what actually went awry, ge hel? [14:35:29] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013954 (10chasemp) This was in my email twice from last night. I am suspicious of this cron, but unsure if (part of) cause or effect really. `12:40 AM (7 hours ago)` (my local time) > Cron I looked in to see something about a restart but I couldn't find in the backread where it started [14:35:48] Lucas_WMDE: still failed https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/3626/console [14:35:56] damn [14:35:59] apergos: sure you can ask! I'll send an email once all is done with a recap. This rolling restart did not go as smooth as usual! [14:36:09] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:36:22] ok, I'll wait for that, thanks! [14:36:23] (03CR) 10Giuseppe Lavagetto: "LGTM, all of this needs a thorough refactoring btw, but if this unblocks you, all the better." [puppet] - 10https://gerrit.wikimedia.org/r/415353 (owner: 10Andrew Bogott) [14:36:27] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:415522|Add Import sources on maiwikimedia (T188374)]] (duration: 01m 13s) [14:36:32] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4013958 (10jcrespo) [14:36:38] Jayprakash12345: deployed, please check [14:36:39] I'm going to look at the labstore1003 situation there [14:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:42] T188374: Add Import sources on maiwikimedia - https://phabricator.wikimedia.org/T188374 [14:36:57] Tulsi: Hey, Please check. [14:36:59] That last ping was our check on the number of unassigned shards. 
When I restart a group of nodes, the shards allocated to those nodes disappear from the cluster (but we have replicas on other nodes, so we are fine). [14:37:05] addshore: Lucas_WMDE: if wikibase@wmf/1.31.0-wmf.23 has a broken phan, I guess we can force merge/deploy the change that was scheduled today [14:37:18] ACKNOWLEDGEMENT - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [106250000.0] cpettet looking https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:37:29] niedzielski: do you have an example on hand of an article that's still showing the icon? [14:37:43] mdholloway, niedzielski, raynor: please stand by, reviewing your patch [14:38:07] (03CR) 10Giuseppe Lavagetto: [C: 031] mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - 10https://gerrit.wikimedia.org/r/415353 (owner: 10Andrew Bogott) [14:38:14] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4013965 (10aborrero) >>! In T187994#4009038, @faidon wrote: > I don't think it's easy for anyone to calculate the amount of effort required for this, but the stated 1-2 year long migration sounds longer than... [14:38:18] We check the number of allocated shards, but the shards are not perfectly balanced between nodes (considerations like size of shard are taken into account). In this case, the 3 nodes that restarted contained more than 10% of the shards, so the alarm was raised. [14:38:55] olliv: do we have an example article that had the problem to test? (i'm searching for one atm) [14:38:59] 3 nodes on a 36 node cluster **should** be less than 10% but were not. And the cluster is only 35 nodes at the moment, due to elastic1021 having RAM issues [14:39:12] apergos: and that was the high level overview [14:39:20] mdholloway: we do, let me find it [14:39:48] where can we test? 
[14:40:04] olliv: oh, not deployed yet, just want to be ready to test when it is :) [14:40:12] so it's really "icinga not quite smart enough" somehow [14:40:13] got it [14:40:20] olliv: any minute now [14:40:34] hashar: zeljkof Lucas_WMDE yeh, the patch looks very unrelated to the phan failure [14:40:45] mdholloway: https://el.wikipedia.org/wiki/%CE%93%CF%81%CE%B1%CE%BC%CE%BC%CE%B9%CE%BA%CE%AE_%CE%AC%CE%BB%CE%B3%CE%B5%CE%B2%CF%81%CE%B1, first link [14:41:06] olliv: thanks! // cc niedzielski ^ [14:41:29] addshore: I am still very reluctant to force merge it :( and deploy the patch with failing tests [14:41:55] zeljkof: it is unrelated failure :] [14:42:18] zeljkof: the phan job is broken in some non subtile way [14:42:26] hashar, addshore: if both of you think the patch is fine to deploy, I'll force merge it :) [14:42:37] Yep, I think it is good :) [14:43:19] hashar, addshore: and please leave a +1 so I have a clear conscience ;P [14:43:56] zeljkof: done! [14:44:02] addshore: so now phan is failing on master as well… hm [14:44:11] addshore, hashar, zeljkof: thanks :) [14:44:25] Lucas_WMDE, addshore, hashar: taking a deep breath and merging the patch ;) [14:44:31] Lucas_WMDE: did something recently change with PageImages? [14:44:41] not as far as I could tell from the GitHub mirror [14:44:49] or perhaps in integration/config? [14:46:39] addshore: hm, on the non-backported change the phan build reported “no files found, config error?” https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/3547/console [14:46:47] indeed [14:46:50] perhaps that config error was fixed and only exposed a long-standing bug? 
[14:46:54] try a "recheck" and see what it does now [14:47:05] (03PS22) 10Elukey: eventlogging: add systemd support [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) [14:47:28] oh, I didn’t know you could recheck merged changes :) [14:48:22] mdholloway, niedzielski, raynor: your patch is at mwdebug1002, please test and let me know if I can deploy it [14:48:46] zeljkof: https://mai.wikimedia.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B7:%E0%A4%B2%E0%A5%89%E0%A4%97/import?uselang=en [14:48:54] tested thanhs [14:49:05] thanks* [14:49:11] Jayprakash12345: great, thanks for letting me know! [14:50:03] 👍👍👍 [14:50:43] addshore: no problem on https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/3628/console [14:51:26] (03CR) 10Elukey: [C: 032] "no relevant changes on eventlog1001, currently running in deployment-prep:" [puppet] - 10https://gerrit.wikimedia.org/r/413362 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [14:52:36] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014007 (10chasemp) >>! In T188589#4013954, @chasemp wrote: > This was in my email twice from last night. I am suspicious of this cron, but unsure if (part of) cause or effect really. > > `12:40 AM (7 hours... [14:54:06] niedzielski: was that a LGTM? :) //cc zeljkof [14:54:09] (03PS1) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [14:54:12] (03PS1) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [14:54:25] <_joe_> andrewbogott: ^^ this _should_ work [14:54:28] <_joe_> still untested [14:54:34] zeljkof: ping once swat is done, we have follow-ups to do [14:54:38] mdholloway: it's look good to me but i can't actually break it again when i turn off the header! 
[14:54:51] I'll try it [14:55:02] niedzielski, mdholloway: so, can I deploy, or not? [14:55:06] mobrovac: will do [14:55:16] mdholloway: i'd feel better if i could see it break again [14:55:20] mobrovac: we might be a few minutes late, still waiting for CI for a patch [14:55:36] k thnx [14:56:33] zeljkof: double checking with raynor and olliv [14:56:45] mdholloway: ^ [14:57:01] niedzielski: ok, take you time [14:57:02] niedzielski: yep, looking here as well [14:57:33] niedzielski: works for me with the header, rebreaks when i don't send [14:57:49] so a +1 for deploy from me [14:58:18] images are showing up, but they're huge - we'll probably have to fix that separately [14:58:24] using ModHeader on Chrome [14:58:25] else, looks good to me as well [15:00:32] !log eqiad LVSs: reboot for retpoline kernel updates T188092 [15:00:41] zeljkof: olliv mdholloway raynor ok on my side [15:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:15] mdholloway, niedzielski, olliv: ok to deploy? [15:01:35] zeljkof: ok for me [15:01:44] zeljkof: yep, go for it [15:01:46] addshore: so the same null change passes phan on master and fails on wmf.23 https://gerrit.wikimedia.org/r/c/415575 [15:01:47] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014023 (10chasemp) [15:01:50] very mysterious [15:01:56] <_joe_> andrewbogott: the patches are not ready, but that's because of other bugs I'll have to fix :P [15:01:57] mdholloway: ok, deploying [15:02:07] <_joe_> so go on with your one for now [15:02:13] ok, thanks [15:02:20] thanks zeljkof mdholloway olliv raynor [15:02:47] uh on [15:02:49] uh oh [15:02:53] ? [15:02:56] `sync-file failed: /srv/mediawiki-staging/php-1.31.0-wmf.22/extensions/Popups/.eslintrc.json is an invalid JSON file` [15:03:15] hashar, addshore: ever seen this?! 
^ [15:03:23] (03PS8) 10Andrew Bogott: mediawiki: move ::hhvm::admin include out of mediawiki module [puppet] - 10https://gerrit.wikimedia.org/r/415353 [15:03:34] zeljkof: i can make another patch that deletes the comments from that file if it helps [15:04:42] (03CR) 10Andrew Bogott: [C: 032] mediawiki: move ::hhvm::admin include out of mediawiki module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415353 (owner: 10Andrew Bogott) [15:04:48] Lucas_WMDE:,hashar, addshore: https://gerrit.wikimedia.org/r/#/c/415319/ is `Blocked on Verified Label` [15:05:37] niedzielski: please do, until I take a look, I've never seen scap complain about that :| [15:05:41] zeljkof: there's // style line comments in .eslintrc.json. it's understood by ESLint but apparently not by whatever JSON validation tool is being used [15:06:14] zeljkof: ok just a moment [15:06:47] (03PS31) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [15:07:21] zeljkof: try a full scap ? [15:07:25] or scap the parent dir? [15:07:41] hashar: I am already doing a scap of the entire extension [15:07:57] zeljkof: https://gerrit.wikimedia.org/r/#/c/415576/ [15:07:58] `scap sync-file php-1.31.0-wmf.22/extensions/Popups ...` [15:10:07] Lucas_WMDE:,hashar, addshore: we are already out of time, and 415319 is `Blocked on Verified Label`, if it's not urgent, let's deploy it on Monday, ok? 
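Note on the `sync-file failed ... is an invalid JSON file` error above: strict JSON has no comment syntax, so `//` line comments that ESLint accepts in `.eslintrc.json` are rejected by a strict parser. A minimal Python sketch of that behaviour follows. The log does not say which validator scap uses, so Python's `json` module is only a stand-in, and the sample config and the `strip_line_comments` helper are hypothetical.

```python
import json

# Hypothetical stand-in for an .eslintrc.json that contains a // line comment.
SAMPLE = '''{
  // ESLint tolerates this comment; strict JSON does not
  "root": true
}'''


def is_strict_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def strip_line_comments(text):
    """Drop whole lines that start with // (naive: trailing comments are not handled)."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return "\n".join(kept)


print(is_strict_json(SAMPLE))                       # commented file
print(is_strict_json(strip_line_comments(SAMPLE)))  # comments removed
```

Removing the comments from the file, as offered in the channel ("i can make another patch that deletes the comments from that file"), avoids the problem without needing to know which validator is involved.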
[15:10:28] hmm [15:10:48] well the patch got rebased on top of an attempted fix [15:10:57] I guess we can rebase again and force merge / deploy it [15:10:59] I can do it [15:11:21] it’s also okay if we delay it [15:11:32] or just wait for the regular train, if Wikibase CI on wmf.23 is broken for some reason… [15:11:52] as I said, it’s not an urgent patch, the bug it fixes can’t currently be triggered AFAIK [15:12:08] and the config change that *would* trigger it is also blocked on another thing [15:12:11] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [15:12:54] !log upload puppetdb 4.4.0-1~wmf1 to component/puppetdb4 - T177253 [15:13:02] Lucas_WMDE: sounds good for monday so ;- [15:13:09] hashar: go ahead if you want to deploy, I am still deploying popups [15:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:11] T177253: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 [15:13:25] hashar: well, we’d need to figure out how to fix CI until then ;) [15:13:27] hashar, Lucas_WMDE: ok, Monday it is then [15:13:42] hashar, Lucas_WMDE: I'll remove my +2 [15:14:23] I’ll add it to the calendar [15:14:33] Lucas_WMDE: thanks! [15:15:01] niedzielski: 415576 is merged, deploying [15:15:23] 👍 [15:15:55] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded - https://phabricator.wikimedia.org/T188589#4014054 (10Marostegui) I agree with Jaime here - it is key to find what is causing this overload. Even though we have to replace this old host, we really need to find out what is causing this overload, otherw... [15:16:02] I have a pybal question (which is maybe an akosiaris thing?) I'm setting up a new wikitech using shared ::mediawiki classes. But it's separate from the rest of the app servers so I don't want it to have anything to do with pybal. 
Is there something I can/should do to prevent it from registering itself with pybal? [15:16:38] <_joe_> andrewbogott: which profiles/roles are you including? [15:16:53] https://gerrit.wikimedia.org/r/#/c/415019/31/modules/profile/manifests/openstack/base/wikitech/web.pp [15:16:56] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4014055 (10Niedzielski) @bblack, @phuedx sorry to be a bother. This seems like an important issue as we're trying to rollout page previews to prod th... [15:17:07] <_joe_> also "registering itself with pybal" is not what the puppet code does [15:17:11] ::mediawiki and ::mediawiki::multimedia [15:17:14] ah not pybal, conftool you mean [15:17:25] I was perplexed by it [15:17:36] Honestly I'm not sure what I mean :) [15:17:50] <_joe_> so let's clarify that first [15:17:51] !log rolling restart of swift frontends in eqiad for kernel security update [15:17:54] But pybal is a thing, right? I don't want to be behind the load-balancers. [15:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:17] mediawiki::conftool [15:18:20] !log zfilipin@tin Synchronized php-1.31.0-wmf.22/extensions/Popups: SWAT: [[gerrit:415564|Fix: dont assume thumbnail URLs contain pixel size (T187955)]] (duration: 01m 14s) [15:18:28] <_joe_> andrewbogott: then don't configure it to be behind load-balancers [15:18:29] mdholloway, niedzielski, olliv, raynor_: deployed, please check and thanks for deploying with #releng! ;) [15:18:32] that's the class the does the "registration" [15:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:34] T187955: Page preview shows icon instead of thumbnail - https://phabricator.wikimedia.org/T187955 [15:18:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415577 [15:18:37] don't include it ? 
[15:18:39] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415577 [15:18:44] lol zeljkof [15:18:45] <_joe_> akosiaris: it's not what he's meaning [15:18:48] mobrovac: swat took longer than expected, but we are finally done [15:18:55] oh ? I misunderstood again ? [15:18:59] akosiaris, _joe_, if that's all there is to it then I'll just make sure not to [15:19:01] !log EU SWAT finished [15:19:03] kk zeljkof taking over the sync [15:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:15] <_joe_> andrewbogott: I think you just want to avoid including role::lvs::realserver [15:19:46] ok — I do include that, actually, but because I have my own service ip and lvs setup that this is behind. [15:19:58] <_joe_> ok so... [15:20:22] <_joe_> nothing will interfere with your setup [15:20:28] zeljkof: looking good! thanks again, zeljkof! [15:20:55] So probably my actual question is: "This mediawiki code is a bit spaghettified and I want to avoid having unintended impact on the production mediawiki cluster. What should I look out for?" [15:21:02] <_joe_> I'm not really sure what's bothering you - you don't want the load-balancers to send normal traffic to your machine [15:21:13] <_joe_> right? [15:21:17] right [15:21:24] <_joe_> that's taken care of by you not adding your servers to the pools [15:21:45] <_joe_> as long as you don't include labweb in the relevant groups in conftool-data, you're ok [15:21:53] <_joe_> it's quite decoupled from the rest [15:21:54] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415577 (owner: 10Marostegui) [15:21:56] Ok, that's great. 
I just wanted to confirm that by including ::mediawiki I wasn't implicitly adding my server to the pool [15:22:00] cool [15:22:08] !log draining restbase1013 for eventual reboot for kernel security update [15:22:09] thank you, this is encouraging [15:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:38] * andrewbogott has a meeting and then will try turning some of this on [15:23:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415577 (owner: 10Marostegui) [15:23:26] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415577 (owner: 10Marostegui) [15:23:41] thanks so much zeljkof mdholloway olliv raynor!! [15:24:35] marostegui: you scap syncing? [15:24:47] (03PS5) 10BBlack: Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) [15:24:49] (03PS5) 10BBlack: reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) [15:24:51] (03PS5) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [15:24:56] mobrovac: yeah [15:25:04] mobrovac: did I interrupt you?? [15:25:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 after alter table (duration: 01m 13s) [15:25:23] I am finished now [15:25:26] marostegui: will be done soon? 
i kind of need to get some fixes out for jobqueue, we're losing jobs [15:25:31] kk thnx marostegui [15:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:05] mobrovac: go ahead, that was my only change :) [15:26:49] !log mobrovac@tin Synchronized php-1.31.0-wmf.22/extensions/EventBus/includes/JobQueueEventBus.php: EventBus: Specify that EventBus queue supports delayed jobs (wmf/1.31.0-wmf.22) - T188540 (duration: 01m 13s) [15:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:06] T188540: Switch cdnPurge to Kafka - https://phabricator.wikimedia.org/T188540 [15:30:08] !log mobrovac@tin Synchronized php-1.31.0-wmf.23/extensions/EventBus/includes/JobQueueEventBus.php: EventBus: Specify that EventBus queue supports delayed jobs (wmf/1.31.0-wmf.23) - T188540 (duration: 01m 14s) [15:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:12] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4014123 (10faidon) First, I don't think we should be thinking in terms of "using software from the 90s", at least not for something that is still as widely used and well-maintained as iptables (and to someth... [15:33:51] (03PS1) 10Marostegui: db-eqiad.php: Repooling db1060 for API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415578 [15:34:10] (03CR) 10BBlack: [C: 032] Add hiera max_core_rtt data [puppet] - 10https://gerrit.wikimedia.org/r/413180 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [15:34:12] (03CR) 10BBlack: [C: 032] reload-vcl refactors/improvements [puppet] - 10https://gerrit.wikimedia.org/r/415204 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [15:36:02] PROBLEM - puppet last run on cp5001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:38:20] (03PS6) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [15:38:23] (03PS1) 10BBlack: varnish: remove service dep on defaults file [puppet] - 10https://gerrit.wikimedia.org/r/415579 [15:39:01] (03CR) 10BBlack: [C: 032] varnish: remove service dep on defaults file [puppet] - 10https://gerrit.wikimedia.org/r/415579 (owner: 10BBlack) [15:39:46] mobrovac: can I deploy? [15:39:53] (it is not urgen) [15:40:05] yup, sure sorry, forgot to say i was done marostegui [15:40:17] No worries it was a last time change I just realised [15:40:17] Thanks [15:40:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repooling db1060 for API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415578 (owner: 10Marostegui) [15:41:02] RECOVERY - puppet last run on cp5001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:41:38] (03Merged) 10jenkins-bot: db-eqiad.php: Repooling db1060 for API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415578 (owner: 10Marostegui) [15:41:52] (03CR) 10jenkins-bot: db-eqiad.php: Repooling db1060 for API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415578 (owner: 10Marostegui) [15:43:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 as API (duration: 01m 13s) [15:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:35] (03PS1) 10Urbanecm: Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 [15:43:46] (03PS2) 10Urbanecm: Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) [15:51:55] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4014170 
(10Dzahn) Yes, I can see the ticket now, thanks @Prtksxna and @Aklapper. [15:52:45] !log draining restbase1014 for eventual reboot for kernel security update [15:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:38] !log rebooting radium (tor relay) for kernel security update [16:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:19] PROBLEM - ElasticSearch health check for shards on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1003 threshold =0.1 breach: status: yellow, number_of_nodes: 32, unassigned_shards: 972, number_of_pending_tasks: 15, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3127, task_max_waiting_in_queue_millis: 136075, cluster_name: production-search-eqiad, relocating_shards: 30, acti [16:02:19] s_number: 89.643427355, active_shards: 8422, initializing_shards: 1, number_of_data_nodes: 32, delayed_unassigned_shards: 0 [16:02:38] hello elasticsearch [16:02:52] <_joe_> I'm here if needed [16:03:04] damn, we're missing 0.36% again. [16:03:23] no worries, I just have not fixed that check yet, coming up... 
[16:03:29] and all my apologies for the noise [16:03:31] ok [16:04:28] RECOVERY - ElasticSearch health check for shards on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 35, unassigned_shards: 780, number_of_pending_tasks: 60, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3127, task_max_waiting_in_queue_millis: 261965, cluster_name: production-search-eqiad, relocating_shards: 23, active_shards_percent_as_nu [16:04:28] , active_shards: 8603, initializing_shards: 12, number_of_data_nodes: 35, delayed_unassigned_shards: 0 [16:06:03] !log Fix s7 replication on labsdb1010 - T186579 [16:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] T186579: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579 [16:08:42] (03PS1) 10Gehel: elasticsearch: raise alerting threshold of unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/415581 [16:10:34] (03CR) 10DCausse: [C: 031] elasticsearch: raise alerting threshold of unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/415581 (owner: 10Gehel) [16:10:49] (03PS2) 10Gehel: elasticsearch: raise alerting threshold of unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/415581 [16:13:24] (03CR) 10Gehel: [C: 032] elasticsearch: raise alerting threshold of unassigned shards [puppet] - 10https://gerrit.wikimedia.org/r/415581 (owner: 10Gehel) [16:14:34] !log rebooting hafnium for kernel security update [16:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:01] akosiaris: as FYI piwik just completed the archive work (5h), nothing weird registered [16:23:57] nice [16:24:04] thanks for the info [16:24:53] 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4014290 (10MoritzMuehlenhoff) FWIF, I checked DRDB commits between 4.9 and current Linux git, but nothing 
obvious (could be related to all kinds of side effects in changes in kvm or ge... [16:25:21] (03PS3) 10Ottomata: Automate installation of spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [16:28:31] (03PS1) 10Ottomata: Temporarliy run banner impression spark streaming job from 2.2.1 .jar [puppet] - 10https://gerrit.wikimedia.org/r/415584 (https://phabricator.wikimedia.org/T159962) [16:29:18] (03CR) 10Ottomata: [C: 032] Temporarliy run banner impression spark streaming job from 2.2.1 .jar [puppet] - 10https://gerrit.wikimedia.org/r/415584 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [16:30:26] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415585 (https://phabricator.wikimedia.org/T186321) [16:31:43] (03PS1) 10Marostegui: db1093: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/415586 (https://phabricator.wikimedia.org/T186321) [16:31:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415585 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [16:32:08] (03PS1) 10Volans: CLI: fix setup_logging() when without path [software/cumin] - 10https://gerrit.wikimedia.org/r/415587 (https://phabricator.wikimedia.org/T188627) [16:32:12] (03PS2) 10Marostegui: db1093: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/415586 (https://phabricator.wikimedia.org/T186321) [16:32:55] (03CR) 10Marostegui: [C: 032] db1093: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/415586 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [16:34:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415585 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [16:34:42] (03PS4) 10Ottomata: Automate installation of spark2 oozie sharelib [puppet] - 
10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [16:35:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 - T186321 (duration: 01m 13s) [16:36:02] (03PS5) 10Ottomata: Automate installation of spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:13] T186321: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321 [16:36:16] !log Restart mariadb on db1093 for binlog format change - T186321 [16:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] (03PS1) 10Herron: initial commit of 4.4.0-1 [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 [16:38:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415585 (https://phabricator.wikimedia.org/T186321) (owner: 10Marostegui) [16:38:36] (03PS3) 10Ottomata: Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics [puppet] - 10https://gerrit.wikimedia.org/r/413792 (https://phabricator.wikimedia.org/T188136) [16:38:56] (03CR) 10Elukey: "Seems fine!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [16:40:31] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014352 (10bd808) [16:41:28] !log reimporting database testreduce_0715 from db1009 to db2037 [16:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] (03CR) 10Ottomata: Automate installation of spark2 oozie sharelib (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [16:46:37] (03CR) 10Ottomata: [C: 032] "Looks good! https://puppet-compiler.wmflabs.org/compiler02/10217/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/413792 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [16:46:53] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014404 (10Marostegui) I would like to do the following to be able to replace db1009. Get db1114 (512G) to replace db1073 (160G API in s1) as API in s... [16:47:09] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014408 (10jcrespo) "by idle connections to the nova database" I don't think that is accurate- that is making things worse, but probably not the root cause. max_conne... 
[16:48:43] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415594 [16:49:50] (03PS2) 10Ottomata: Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) [16:49:53] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014428 (10jcrespo) I was thinking of making db1114 a multi-misc. But at this point, anything goes as long as we get rid of db1009- the problems is the... [16:49:56] (03CR) 10Ottomata: Point Mediawiki Monolog at Kafka jumbo in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [16:50:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415594 (owner: 10Marostegui) [16:50:48] (03PS3) 10Ottomata: Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) [16:52:02] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014447 (10jcrespo) We need to update the original plan: https://gerrit.wikimedia.org/r/#/c/399792/3/wmf-config/db-eqiad.php [16:52:08] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415594 (owner: 10Marostegui) [16:52:14] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014448 (10Marostegui) >>! In T183469#4014428, @jcrespo wrote: > I was thinking of making db1114 a multi-misc. But at this point, anything goes as long... 
[16:52:22] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415594 (owner: 10Marostegui) [16:53:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1093 (duration: 01m 13s) [16:53:44] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014452 (10jcrespo) +1. If you have time to update a db-eqiad.php, that would be great to know pending steps. If not, I will do it when I have the time... [16:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:23] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014453 (10Marostegui) >>! In T183469#4014452, @jcrespo wrote: > +1. If you have time to update a db-eqiad.php, that would be great to know pending ste... [16:55:38] 10Operations, 10Cloud-Services, 10DBA: db1009 overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589#4014473 (10Andrew) > > Is this something that would be more safely done with `keystone-manage token_flush` or is that unrelated? These are two different things. To... [16:59:07] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014490 (10Marostegui) Let's go for mariadb 10.1+ stretch for db1114 so we can have a 10.1 as API in s1 - (we already have the two rc slaves as 10.1) [17:00:04] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:02:45] (03PS1) 10Marostegui: mariadb: Move db1114 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) [17:03:50] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415598 [17:04:33] (03PS1) 10BryanDavis: dynamicproxy: raise client_max_body_size to 256m [puppet] - 10https://gerrit.wikimedia.org/r/415599 (https://phabricator.wikimedia.org/T186731) [17:05:07] (03PS7) 10BBlack: Make inter-varnish probes great again [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) [17:05:36] (03CR) 10BBlack: [C: 032] "PCC confirms values in commitmsg examples: https://puppet-compiler.wmflabs.org/compiler02/10218/" [puppet] - 10https://gerrit.wikimedia.org/r/415205 (https://phabricator.wikimedia.org/T157430) (owner: 10BBlack) [17:06:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415598 (owner: 10Marostegui) [17:07:36] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415598 (owner: 10Marostegui) [17:08:13] (03CR) 10Jcrespo: [C: 04-1] "Please set the host hiera key for section and disabling notifications." 
[puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [17:08:19] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415598 (owner: 10Marostegui) [17:09:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1093 (duration: 01m 13s) [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:30] (03PS4) 10Sau226: Disable main page deletion on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414509 (https://phabricator.wikimedia.org/T184959) [17:09:52] (03PS2) 10Marostegui: mariadb: Move db1114 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) [17:09:57] (03CR) 10Marostegui: "> Please set the host hiera key for section and disabling" [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [17:11:11] (03CR) 10EBernhardson: [C: 031] Point Mediawiki Monolog at Kafka jumbo in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413796 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [17:13:22] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/10220/" [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [17:15:03] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415600 [17:17:13] (03CR) 10Ottomata: [C: 032] Automate installation of spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [17:17:20] (03PS6) 10Ottomata: Automate installation of spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) [17:17:26] (03CR) 10Ottomata: [V: 032 C: 
032] Automate installation of spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415465 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [17:17:31] (03CR) 10Jcrespo: [C: 031] mariadb: Move db1114 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [17:17:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415600 (owner: 10Marostegui) [17:18:52] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4014579 (10Marostegui) Thanks Chris! ``` root@db1068:~# megacli -PDRbld -ShowProg -PhysDrv [32:2] -aALL Rebuild Progress on Device at Enclosure 32, Slot 2 Completed 6% in 8 Minutes. ``` [17:19:27] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415600 (owner: 10Marostegui) [17:19:41] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415600 (owner: 10Marostegui) [17:20:40] (03PS5) 10Sau226: Disable main page deletion on enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414509 (https://phabricator.wikimedia.org/T184959) [17:21:37] gehel replaced the ethernet cable and I am not able to login to wdqs1004 [17:21:56] Debian GNU/Linux 8 auto-installed on Wed Sep 6 12:24:33 UTC 2017. [17:21:57] cmjohnson@wdqs1004:~$ [17:22:31] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4014590 (10Niedzielski) I tested a few more endpoints just using the [[ https://en.wikipedia.org/api/rest_v1/ | documentation site ]] and here's what... [17:22:41] cmjohnson1, not able? 
[17:22:52] sorry...able [17:22:59] :) [17:23:22] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014592 (10Cmjohnson) I replaced the ethernet cable and was able to login Debian GNU/Linux 8 auto-installed on Wed Sep 6 12:24:33 UTC 2017. cmjohnson@wdqs1004:~$ [17:24:08] (03PS3) 10Marostegui: mariadb: Move db1114 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) [17:24:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1093 (duration: 01m 28s) [17:24:59] (03CR) 10Marostegui: [C: 032] mariadb: Move db1114 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/415597 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [17:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:11] (03PS1) 10Ottomata: Properly install spark2_oozie_sharelib_install.sh [puppet] - 10https://gerrit.wikimedia.org/r/415602 (https://phabricator.wikimedia.org/T159962) [17:26:59] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014605 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1114.eqiad.wmnet'] ``` Th... [17:27:49] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4014606 (10aborrero) >>! In T187994#4014123, @faidon wrote: > First, I don't think we should be thinking in terms of "using software from the 90s", at least not for something that is still as widely used and... 
[17:28:18] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415603 [17:29:12] 10Operations, 10Puppet, 10User-fgiunchedi: Upgrade hiera to stretch (version 3) - https://phabricator.wikimedia.org/T188623#4014608 (10fgiunchedi) [17:29:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4014609 (10fgiunchedi) [17:29:19] 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215#4014607 (10fgiunchedi) [17:29:44] 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215#3909595 (10fgiunchedi) I've experienced the same with hiera 3 and a puppet master on stretch, likely related to "segmented keys" lookups [17:34:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415603 (owner: 10Marostegui) [17:35:16] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415603 (owner: 10Marostegui) [17:35:18] 10Operations, 10Puppet, 10puppet-compiler: Puppet compiler failure to lookup some keys - https://phabricator.wikimedia.org/T185215#3909595 (10jcrespo) "the change applied fine in production and the keys are looked up just fine", actually I saw this change fail yesterday on production? Check logs for puppet f... 
[17:35:49] (03PS1) 10Mobrovac: Mathoid chart: Use port 10042 [deployment-charts] - 10https://gerrit.wikimedia.org/r/415605 (https://phabricator.wikimedia.org/T184919) [17:37:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1093 (duration: 01m 14s) [17:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415603 (owner: 10Marostegui) [17:43:00] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014644 (10Dzahn) Thanks Chris! First i wasn't able to login, then i was and could start a puppet run. Let's see if Icinga clears up and stays green for a while. [17:44:35] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4014652 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1114.eqiad.wmnet'] ``` and were **ALL** successful. 
[17:45:19] (03CR) 10Bstorm: [C: 031] "Sounds good, IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/415599 (https://phabricator.wikimedia.org/T186731) (owner: 10BryanDavis) [17:46:16] !log re-enabling icinga notifications for wdqs1004 services, ethernet cable has been replaced (T188045) [17:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:30] T188045: wdqs1004 broken - https://phabricator.wikimedia.org/T188045 [17:46:45] (03PS3) 10Gehel: Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415493 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [17:46:47] (03PS3) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [17:47:27] !log restarting wdqs-updater on wdqs1004 -T188045 [17:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:43] gehel: i'm not entirely convinced yet. just mostly, heh [17:48:00] mutante: same here, I'll keep an eye on it before repooling it [17:48:04] sounds good, ack [17:48:21] (03CR) 10Gehel: [C: 032] Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415493 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [17:48:51] it feels like my ssh connection to it sometimes freezes for a second.. then it goes on [17:49:28] to wdqs1005 it doesnt do that. 
on the other hand i ran ping a while from internal and external and there was no loss [17:49:36] (03PS1) 10Jcrespo: mariadb-backups: Change backup format to YYYY-MM-dd--HH-mm-SS [puppet] - 10https://gerrit.wikimedia.org/r/415608 (https://phabricator.wikimedia.org/T184696) [17:50:38] mutante: and I still have the DNS resolution errors in the logs, something is not right [17:51:07] gehel: my feeling says the NIC might be broken in a subtle way [17:51:22] !log depooling wdqs2001 and switching to kafka poller - T188252 [17:51:23] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014677 (10Smalyshev) Still getting this: ``` Mar 01 17:48:34 wdqs1004 bash[1382]: Exception in thread "main" java.lang.RuntimeException: java.net.UnknownHostException: www.wikidata... [17:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:35] T188252: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252 [17:52:15] i mean the ARP caches were already checked as i saw on ticket.. hmm [17:52:40] wdq4 definitely feels like network is broken... is there a way to replace network HW there? [17:52:53] also i wanted to say the same thing you said "Could you try moving the cable to another port on the switch" [17:56:11] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014696 (10Dzahn) Ack, i was able to connect to it and run ping for a while but .. it's like it's freezing sometimes and then continues. Also Gehel could confirm there is still an... [17:56:51] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014697 (10Dzahn) It's intermittent. Most of the time it works,but then it doesn't... 
``` [wdqs1004:~] $ host www.wikidata.org www.wikidata.org has address 208.80.154.224 www.wik... [17:58:31] (03CR) 10Framawiki: "Ok @Urbanecm, thank you for your explanation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:20] (03CR) 10Zoranzoki21: [C: 031] Enable rollback for editors at zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [18:00:36] Nothing for ORES. [18:02:15] going to deploy MCS soon [18:02:39] nothing for parsoid today [18:02:39] (03PS1) 10Jcrespo: Revert "Depool labsdb1011 to copy its data away" [puppet] - 10https://gerrit.wikimedia.org/r/415610 [18:03:13] (03CR) 10Jcrespo: [C: 04-1] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/415610 (owner: 10Jcrespo) [18:03:24] mutante: switch port changed for wdqs1004 and I am able to login for now [18:03:40] and now it freezes [18:03:50] oh..wait n/m [18:04:29] (03PS4) 10Imarlier: [WIP] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [18:04:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [18:07:05] cmjohnson1: confirmed but also behaviour still like before with intermittent freezes [18:07:15] yea [18:07:25] !log bsitzmann@tin Started deploy [mobileapps/deploy@ada38aa]: Update mobileapps to 0db4a60 (T183833) [18:07:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:40] T183833: [Bug report] Removing parentheses breaks chemical formulas - https://phabricator.wikimedia.org/T183833 [18:08:41] (03PS1) 10Gehel: Revert "Enable Kafka poller for wdqs2001" [puppet] - 10https://gerrit.wikimedia.org/r/415613 (https://phabricator.wikimedia.org/T188252) [18:09:16] mutante or gehel if you have any log entries please add to task. I need to pull a Dell report and open a ticket [18:09:23] (03PS5) 10Imarlier: [WIP] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [18:09:40] (03CR) 10Gehel: [C: 032] Revert "Enable Kafka poller for wdqs2001" [puppet] - 10https://gerrit.wikimedia.org/r/415613 (https://phabricator.wikimedia.org/T188252) (owner: 10Gehel) [18:09:45] * cmjohnson1 switches to wmf wifi in datacenter...back in a min [18:10:51] yes, i just found some. flapping link in dmesg [18:11:01] [203721.906779] tg3 0000:02:00.0 eth0: Link is down [18:11:01] [203725.760204] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex [18:11:08] [203782.391437] tg3 0000:02:00.0 eth0: Link is down [18:11:09] [203798.154227] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex [18:12:04] great..thx [18:12:30] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4014871 (10Dzahn) from dmesg: ``` [203721.906779] tg3 0000:02:00.0 eth0: Link is down [203725.760204] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex [203725.760214] t... 
[18:12:49] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4014873 (10madhuvishy) [18:13:26] !log bsitzmann@tin Finished deploy [mobileapps/deploy@ada38aa]: Update mobileapps to 0db4a60 (T183833) (duration: 06m 01s) [18:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:42] T183833: [Bug report] Removing parentheses breaks chemical formulas - https://phabricator.wikimedia.org/T183833 [18:15:19] (03PS2) 10Andrew Bogott: dynamicproxy: raise client_max_body_size to 256m [puppet] - 10https://gerrit.wikimedia.org/r/415599 (https://phabricator.wikimedia.org/T186731) (owner: 10BryanDavis) [18:15:51] it's hard to get more out of logs because it keeps freezing on me. i disabled the icinga notifications again. it's odd that i didn't see any from icinga-wm meanwhile. [18:16:44] 10Operations, 10Ops-Access-Requests, 10Discovery-Search: Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4014910 (10EBjune) p:05Triage>03Normal I'm providing a proactive approval, if that's what's needed, and I can certainly echo the usefulness for the Se... [18:16:55] no, it's normal, in scheduled downtime :) [18:17:12] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: raise client_max_body_size to 256m [puppet] - 10https://gerrit.wikimedia.org/r/415599 (https://phabricator.wikimedia.org/T186731) (owner: 10BryanDavis) [18:17:23] mutante: yep, I did add the downtime... 
[18:21:15] alright :) [18:22:42] 10Operations, 10Ops-Access-Requests: Need access to graphite servers - https://phabricator.wikimedia.org/T188649#4014947 (10Imarlier) [18:23:13] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10222/mwlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/415509 (owner: 10Dzahn) [18:23:27] (03PS2) 10Dzahn: xenon: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415509 [18:23:53] 10Operations, 10Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4014970 (10Imarlier) [18:27:27] (03CR) 10Jforrester: [C: 031] Run initSiteStats twice a month [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad) [18:30:55] 10Operations, 10Ops-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4015045 (10EBjune) [18:32:14] (03PS1) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [18:32:42] (03PS1) 10Andrew Bogott: nova.conf: adjust db pool settings for all services [puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) [18:33:16] (03CR) 10jerkins-bot: [V: 04-1] NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [18:33:31] (03CR) 10Andrew Bogott: "The bad news is that the max pool size was already set to 10, so it's possible this isn't being observed at all by conductor." 
[puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) (owner: 10Andrew Bogott) [18:35:39] (03PS2) 10Andrew Bogott: nova.conf: adjust db pool settings for all services [puppet] - 10https://gerrit.wikimedia.org/r/415619 (https://phabricator.wikimedia.org/T188589) [18:37:29] (03PS2) 10Andrew Bogott: wikireplica_dns: Adjust dblist retrieval path [puppet] - 10https://gerrit.wikimedia.org/r/414107 (owner: 10BryanDavis) [18:38:07] (03PS1) 10Chad: Moving PerformanceInspector to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 [18:39:00] (03CR) 10Andrew Bogott: [C: 032] wikireplica_dns: Adjust dblist retrieval path [puppet] - 10https://gerrit.wikimedia.org/r/414107 (owner: 10BryanDavis) [18:40:54] (03CR) 10Chad: [C: 032] Moving PerformanceInspector to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 (owner: 10Chad) [18:42:09] (03CR) 10Jforrester: "Woohoo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 (owner: 10Chad) [18:42:11] (03Merged) 10jenkins-bot: Moving PerformanceInspector to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 (owner: 10Chad) [18:42:23] (03CR) 10jenkins-bot: Moving PerformanceInspector to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 (owner: 10Chad) [18:42:56] James_F: Oh man, this makes me so happy :) [18:43:31] no_justification: Yup. [18:43:48] I think it'll help discourage Beta as the infinite playground of things we never end up deploying [18:43:49] !log demon@tin Synchronized docroot/noc/: killing extension-list-labs (duration: 01m 14s) [18:43:52] no_justification: Next epic is to remove the string "labs" and replace with "staging cluster", right? [18:43:58] * James_F grins. That too. 
[18:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:13] Since you now deploy to beta in a way that preps production [18:44:24] (and helps to discourage the race condition we currently hit in swapping) [18:44:27] (03CR) 1020after4: [C: 031] Moving PerformanceInspector to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415620 (owner: 10Chad) [18:45:52] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4015116 (10Ottomata) [18:45:58] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: disable performance inspector in prod explicitly (duration: 01m 14s) [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:50] (03PS1) 10Smalyshev: Add port to kafka config for WDQS poller [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) [18:47:30] !log demon@tin Synchronized wmf-config/: killing extension-list-labs (duration: 01m 17s) [18:47:36] James_F: Honestly though: the naming is so low on my priority list [18:47:40] (03CR) 10jerkins-bot: [V: 04-1] Add port to kafka config for WDQS poller [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [18:47:41] I don't care that it says labs [18:47:42] :) [18:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:07] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 3 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4015126 (10Ottomata) [18:50:35] (03PS2) 10Smalyshev: Add port to kafka config for WDQS poller [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) [18:50:44] no_justification: Yeah, but not g.reg-g's. 
;-) [18:53:11] (03CR) 10Ottomata: Add port to kafka config for WDQS poller (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [18:54:23] (03PS1) 10Gergő Tisza: Temporary account creation limit raise for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415624 (https://phabricator.wikimedia.org/T188630) [18:54:40] 10Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, and 2 others: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#4015160 (10Ottomata) [18:54:44] (03CR) 10Smalyshev: Add port to kafka config for WDQS poller (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [18:55:35] (03PS3) 10Smalyshev: Add port to kafka config for WDQS poller [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) [18:56:01] ottomata: I assume kafka_config['string'] already has ports and everything else? [18:56:10] yup [18:56:19] 10Operations, 10ops-eqiad, 10fundraising-tech-ops, 10Patch-For-Review: Rack/setup frmon1001 - https://phabricator.wikimedia.org/T186073#4015171 (10Cmjohnson) management password changed [18:56:21] (03PS2) 10Gergő Tisza: Temporary account creation limit raise for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415624 (https://phabricator.wikimedia.org/T188630) [18:56:23] cool! 
should have been using it from the start :) [18:56:32] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4015172 (10Cmjohnson) management password changed [18:56:33] thanks for pointing it out [18:56:33] SMalyshev: https://github.com/wikimedia/puppet/blob/production/modules/role/lib/puppet/parser/functions/kafka_config.rb#L28 [18:56:54] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frbast1001 - https://phabricator.wikimedia.org/T187363#4015173 (10Cmjohnson) management password changed [18:57:16] ottomata: yep, exactly what I needed! [19:00:03] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100% [19:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T1900). [19:00:04] framawiki: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:24] I can SWAT [19:00:32] I added a last minute patch, which is rather urgent [19:00:51] sorry to cut in the line [19:01:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415624 (https://phabricator.wikimedia.org/T188630) (owner: 10Gergő Tisza) [19:02:03] tgr: ah, crap. Is ^ supposed to be 2018-03-01? 
[19:02:16] (03PS4) 10Smalyshev: Fix kafka configuration ports & cluster string [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) [19:02:39] (03Merged) 10jenkins-bot: Temporary account creation limit raise for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415624 (https://phabricator.wikimedia.org/T188630) (owner: 10Gergő Tisza) [19:02:53] (03CR) 10jenkins-bot: Temporary account creation limit raise for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415624 (https://phabricator.wikimedia.org/T188630) (owner: 10Gergő Tisza) [19:02:54] PROBLEM - Host analytics1062.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:03:34] tgr: if so, could you make a followup? [19:03:59] d'oh! [19:04:57] sorry analytics1062 is me [19:05:25] thcipriani: done [19:05:34] o/ [19:05:35] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4015226 (10Cmjohnson) Record: 7 Date/Time: 02/13/2018 09:09:38 Source: system Severity: Critical Description: Multi-bit memory errors detected on a memory device at location... [19:05:51] (03CR) 10Anomie: "Good start. There aren't really 28 comments, most of them are repeating the same thing multiple times." 
(0328 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:05:53] oops, not actually done, didn't see the patch is merged already [19:06:08] yeah, sorry :( [19:06:11] thcipriani: and I'm adding another last minute too, but it is a beta-only no-op [19:06:20] okie doke [19:07:33] (03PS1) 10Gergő Tisza: Fix throttle date for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415629 [19:07:41] thcipriani: ^ done for reals [19:08:03] RECOVERY - Host analytics1062.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [19:08:24] probably should have cleaned up all the old stuff in there, will do that for the next swat [19:08:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415629 (owner: 10Gergő Tisza) [19:08:33] cool, thanks [19:08:44] just blindly assumed that all those are recent and copied the date [19:09:09] (03CR) 10BBlack: [C: 031] NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:09:11] (03PS3) 10Dzahn: xenon: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415509 [19:09:13] (03CR) 10Rush: "I know it's not my changeset but thanks anomie. Really helpful." 
[puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:09:55] (03PS3) 10MarcoAurelio: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) [19:09:58] (03Merged) 10jenkins-bot: Fix throttle date for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415629 (owner: 10Gergő Tisza) [19:10:28] (03PS4) 10Dzahn: xenon: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415509 [19:11:11] (03PS2) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [19:11:45] (03CR) 10Imarlier: "> Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:12:09] (03PS1) 10Gergő Tisza: Make last throttle limit raise work accross all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415630 [19:12:19] thcipriani: ^ will have one more change, sorry [19:12:21] (03CR) 10jenkins-bot: Fix throttle date for outreach dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415629 (owner: 10Gergő Tisza) [19:12:23] PROBLEM - Host scb1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:12:52] (03PS6) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:13:04] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:415629|Fix throttle date for outreach dashboard]] T188630 (duration: 01m 13s) [19:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:19] T188630: Accounts created through OAuth are rate-limited even when the user has account creator rights - https://phabricator.wikimedia.org/T188630 [19:13:23] (03CR) 10Andrew Bogott: [C: 032] labweb: 
include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:13:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415630 (owner: 10Gergő Tisza) [19:13:33] trying to check scb1003 that just went down [19:13:43] PROBLEM - Host scb1003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:13:46] also applying change on mwlog2001, but no oops [19:13:55] oh, mgmt also down, i was gonna say [19:13:56] (03PS32) 10Andrew Bogott: labweb: include mediawiki profiles [puppet] - 10https://gerrit.wikimedia.org/r/415019 (https://phabricator.wikimedia.org/T168470) [19:14:48] (03Merged) 10jenkins-bot: Make last throttle limit raise work accross all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415630 (owner: 10Gergő Tisza) [19:15:56] !log powercycling crashed scb1003 [19:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:31] (03PS2) 10Ottomata: Properly install spark2_oozie_sharelib_install.sh [puppet] - 10https://gerrit.wikimedia.org/r/415602 (https://phabricator.wikimedia.org/T159962) [19:16:49] (03PS7) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:16:51] (03CR) 10jenkins-bot: Make last throttle limit raise work accross all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415630 (owner: 10Gergő Tisza) [19:16:58] (03PS2) 10Thcipriani: Enable responsive references by default on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414115 (https://phabricator.wikimedia.org/T187997) (owner: 10Framawiki) [19:17:00] (03CR) 10Dzahn: [C: 032] "no-op on mwlog1001/mwlog2001" [puppet] - 10https://gerrit.wikimedia.org/r/415509 (owner: 10Dzahn) [19:17:16] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:415630|Make last throttle limit raise work accross all 
wikis]] T188630 (duration: 01m 13s) [19:17:27] ^ tgr all live :) [19:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:39] thanks! [19:18:10] yw :) [19:18:27] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10224/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/415602 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [19:18:29] (03CR) 10Ottomata: [C: 032] Properly install spark2_oozie_sharelib_install.sh [puppet] - 10https://gerrit.wikimedia.org/r/415602 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [19:18:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414115 (https://phabricator.wikimedia.org/T187997) (owner: 10Framawiki) [19:18:40] (03PS1) 10BBlack: numa_networking: test on 2/N caches per site+cluster [puppet] - 10https://gerrit.wikimedia.org/r/415631 [19:19:11] (03CR) 10Ottomata: [C: 032] "Luca you will be happy about that hiera var. 
Dunno why, but the if defined() thing didn't work :/" [puppet] - 10https://gerrit.wikimedia.org/r/415602 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [19:19:57] (03PS5) 10Gehel: Fix kafka configuration ports & cluster string [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [19:20:02] (03Merged) 10jenkins-bot: Enable responsive references by default on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414115 (https://phabricator.wikimedia.org/T187997) (owner: 10Framawiki) [19:20:28] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=scb1003.eqiad.wmnet [19:20:38] framawiki: responsive references for rowiki is live on mwdebug1002, check please [19:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:34] !log scb1003 depooled scb1003 from all services on scb because it went down, including mgmt [19:21:42] (03CR) 10Krinkle: coal: Process from Kafka instead of from ZMQ (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:21:44] 10Operations, 10ops-eqiad: Memory initialization error on scb1003 - https://phabricator.wikimedia.org/T188385#4015278 (10Cmjohnson) @MoritzMuehlenhoff The DIMM at A2 was bad. I tested it by swapping to B2 and it the failure moved to B2. Fortunately we have a few R420 with similar DIMM decommissioned already a... 
[19:21:46] (03CR) 10Gehel: [C: 032] Fix kafka configuration ports & cluster string [puppet] - 10https://gerrit.wikimedia.org/r/415621 (https://phabricator.wikimedia.org/T188252) (owner: 10Smalyshev) [19:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:08] (03CR) 10jenkins-bot: Enable responsive references by default on rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414115 (https://phabricator.wikimedia.org/T187997) (owner: 10Framawiki) [19:22:13] 10Operations, 10ops-eqiad: rack/setup/install wdqs100[6-8] - https://phabricator.wikimedia.org/T188432#4015281 (10Cmjohnson) [19:22:13] RECOVERY - Host scb1003 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [19:22:23] thcipriani: looks good [19:22:33] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:22:43] framawiki: ok, going live [19:23:20] wtf, it's back up ? [19:23:26] mutante https://phabricator.wikimedia.org/T188385 [19:23:50] cmjohnson1: oh..! 
gotcha, thanks [19:24:13] RECOVERY - Host scb1003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms [19:24:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:414115|Enable responsive references by default on rowiki]] T187997 (duration: 01m 15s) [19:24:57] ^ framawiki live now [19:24:59] (03PS1) 10Gehel: Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415633 (https://phabricator.wikimedia.org/T188252) [19:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:03] T187997: Enable responsive references by default on the Romanian Wikipedia - https://phabricator.wikimedia.org/T187997 [19:25:51] (03CR) 10Krinkle: coal: Process from Kafka instead of from ZMQ (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:25:55] tgr: I just noticed this in the logs: Notice: Undefined index: dbname in /srv/mediawiki/wmf-config/throttle-analyze.php on line 40 [19:26:42] (03CR) 10Krinkle: coal: Process from Kafka instead of from ZMQ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:26:59] (03PS2) 10Thcipriani: Enable rollback for editors at zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [19:27:33] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:28:01] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [19:29:23] (03Merged) 10jenkins-bot: Enable rollback for editors at zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [19:29:26] !log gehel@tin 
Started deploy [wdqs/wdqs@86da751]: new updater to fix kafka poller issues [19:29:32] (03PS1) 10Ottomata: Also copy in hive-site.xml to spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415634 (https://phabricator.wikimedia.org/T159962) [19:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:00] thcipriani: hm, that was broken in https://gerrit.wikimedia.org/r/c/373698 apparently [19:30:01] (03CR) 10Bstorm: "Great catch of some typos (undeleted bits) and everything else. Working on the changes..." [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:30:04] (03PS2) 10Ottomata: Also copy in hive-site.xml to spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415634 (https://phabricator.wikimedia.org/T159962) [19:30:54] (03CR) 10Ottomata: [V: 032 C: 032] Also copy in hive-site.xml to spark2 oozie sharelib [puppet] - 10https://gerrit.wikimedia.org/r/415634 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [19:31:21] framawiki: rollback for editors at zh_classicalwiki is on mwdebug1002, check please [19:31:38] !log gehel@tin Finished deploy [wdqs/wdqs@86da751]: new updater to fix kafka poller issues (duration: 02m 12s) [19:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:53] tgr: hrm, should that change be reverted? or should a isset check be in there some place? 
[19:31:57] (03CR) 10jenkins-bot: Enable rollback for editors at zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414114 (https://phabricator.wikimedia.org/T188064) (owner: 10Framawiki) [19:31:58] thcipriani: it's ok too [19:32:02] I don't get what that patch is trying to do, it sets dbname well after it has been used [19:32:16] framawiki: thanks for checking, going live [19:32:23] (03PS2) 10Dzahn: exempt mwlog hosts from screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/407179 [19:33:33] (03CR) 10Dzahn: [C: 032] exempt mwlog hosts from screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/407179 (owner: 10Dzahn) [19:34:24] Urbanecm: ping as you are the author of throttle-analyze.php ^ [19:34:41] framawiki, what's happening? [19:34:49] (03PS1) 10Gergő Tisza: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 [19:35:07] thcipriani: ^ reverted, I'm pretty sure it didn't work anyway [19:35:31] tgr: thank you! [19:36:06] The patch should add wikidatawiki and commonswiki to all throttle rules, regardless of what was set in the file.
[19:36:14] (03CR) 10Gehel: [C: 032] Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415633 (https://phabricator.wikimedia.org/T188252) (owner: 10Gehel) [19:36:20] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:414114|Enable rollback for editors at zh_classicalwiki]] T188064 (duration: 01m 14s) [19:36:24] (03PS2) 10Gehel: Enable Kafka poller for wdqs2001 [puppet] - 10https://gerrit.wikimedia.org/r/415633 (https://phabricator.wikimedia.org/T188252) [19:36:28] (03CR) 10Krinkle: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:36:30] (03CR) 10jerkins-bot: [V: 04-1] Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:36:32] (03CR) 10Krinkle: [C: 031] Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:34] T188064: Enable rollback permission at zh-classical Wikipedia. - https://phabricator.wikimedia.org/T188064 [19:36:48] framawiki: all's live now. Thanks for the patches and clean up! [19:36:51] tgr, do you think you can guide me with writing this piece of code? ;) [19:36:58] (03CR) 10Krinkle: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:36:58] thcipriani: thanks ! [19:37:03] it doesn't though, dbname is checked at the beginning of the file, if commons is the current wiki and it's not in there, that code is never reached [19:37:30] Urbanecm: Shouldn't the addition of commons and wikidata to the list of allowed bdds take place before the first if? 
[19:38:00] framawiki, probably [19:38:18] Urbanecm: probably wrapping the first continue in that file into something like if (!in_array($wgDBname, ['commonswiki', 'wikidatawiki'], true)) is the easiest [19:38:35] I can write a patch if you think this is important [19:38:55] but given that it was broken for half a year and no one noticed, maybe it isn't? [19:39:21] At least we need to do something with T184685 [19:39:22] T184685: Do not allow users to add commonswiki, wikidatawiki to dbname in throttle.php - https://phabricator.wikimedia.org/T184685 [19:39:35] And the reason why I created the patch initially is v [19:39:36] T163872 [19:39:36] T163872: Automatically include commons and wikidata in $wmgThrottlingExceptions - https://phabricator.wikimedia.org/T163872 [19:40:13] AFAIK all throttle rules were extended to wikidatawiki/commonswiki even when not requested [19:40:25] (03CR) 10Thcipriani: [C: 032] "SWAT (related)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:40:37] 10Operations, 10ops-eqiad, 10SCB, 10Services (watching): Memory initialization error on scb1003 - https://phabricator.wikimedia.org/T188385#4015362 (10mobrovac) Please add the #scb tag in the future to tasks pertaining to SCB hosts so that we are aware of any upcoming changes. [19:41:03] pretty sure that code didn't actually work [19:41:15] thcipriani: am I next? [19:41:27] tgr, agree, I made a mistake, it happens :). 
I'd prefer to create a patch myself with adding you as a reviewer [19:41:49] ok [19:41:51] Hauskatze: getting there :) let me merge a revert so I can quiet down the logs [19:42:02] sure :) [19:42:08] (03PS2) 10Thcipriani: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:42:15] (03CR) 10Thcipriani: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:42:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:42:46] Urbanecm: I don't really understand the use case though [19:43:12] once you create an account on e.g. enwiki, you can go to commons and your account will be autocreated [19:43:22] since it's the same central login [19:43:28] (03Merged) 10jenkins-bot: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:43:32] there is no throttling for autocreations [19:43:42] (03CR) 10jenkins-bot: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:44:09] Yes, but what when you create an account at commons/wikidata for some reason? 
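For readers following the throttle-analyze.php thread above (19:37 to 19:38), here is a minimal sketch of the guard tgr describes: do not skip a throttle rule on Commons or Wikidata even when the rule's dbname list omits them. It is written in Python rather than PHP, and the rule layout, the function name `rule_applies`, and the behaviour for rules without a dbname key are all assumptions made for illustration, not the contents of the real wmf-config/throttle-analyze.php.

```python
# Hypothetical sketch; the rule layout and all names are assumptions,
# not the real wmf-config/throttle-analyze.php.
ALWAYS_INCLUDED = ("commonswiki", "wikidatawiki")


def rule_applies(rule, current_db):
    """Return True if a throttle exception rule covers current_db."""
    # .get() sidesteps the "Undefined index: dbname" situation from the log.
    dbnames = rule.get("dbname")
    if dbnames is None:
        # Assumed here: a rule without dbname applies to every wiki.
        return True
    if isinstance(dbnames, str):
        dbnames = [dbnames]
    if current_db in dbnames:
        return True
    # The suggested guard: never skip the rule on Commons or Wikidata.
    return current_db in ALWAYS_INCLUDED
```

The point of the sketch is ordering: the Commons/Wikidata check has to sit in the same place as the dbname test, otherwise the code that adds those wikis is never reached when one of them is the current wiki.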
[19:44:48] I just have a hard time thinking of any such reason [19:45:13] I suppose the whole concept of filtering by wiki is somewhat pointless for the same reason [19:45:43] yup [19:45:49] (03PS1) 10Ottomata: Parameterize kafka_cluster_name for streams_check job [puppet] - 10https://gerrit.wikimedia.org/r/415636 (https://phabricator.wikimedia.org/T185136) [19:46:01] Maybe "The list of wikis you wish to allow account creation on (don't forget sister projects like Commons) [19:46:01] " from https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap is such kind of reason? [19:47:16] whoever wrote it might just have not understood the difference between creation and autocreation [19:47:30] or maybe that's super old pre-SUL documentation [19:47:56] I don't know, but I guess this is the reason for including commons when it might not be necessary [19:48:01] (03PS4) 10Thcipriani: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) (owner: 10MarcoAurelio) [19:48:08] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) (owner: 10MarcoAurelio) [19:48:17] anyway, I'm not objecting to adding Commons/Wikidata to all throttle overrides, there is no harm in it, I just don't think it will achieve anything useful [19:48:17] !log thcipriani@tin Synchronized wmf-config/throttle-analyze.php: SWAT: [[gerrit:415635|Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions"]] (duration: 01m 14s) [19:48:28] thcipriani: cannot be tested so feel free to directly push [19:48:30] BTW (regarding throttle) I experienced a strange behaviour with a throttle rule set with enough reserve: after an edit a "rate limited" error was shown. As an admin I granted confirmed status manually but...shouldn't it raise rate limits for editing as well?
[19:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:43] s/push/deploy [19:49:06] no, that code only affects account creation throttles [19:49:18] Hauskatze: sure, so changes on beta get deployed after they're merged by https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [19:49:27] (03Merged) 10jenkins-bot: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) (owner: 10MarcoAurelio) [19:49:32] Hauskatze: I'll push around on prod as well, but that's really just to ensure things are tidy [19:49:35] well, those and login throttles [19:49:56] the part towards the end where it changes $wgAccountCreationThrottle is what does the work [19:50:04] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/throttle-analyze.php#L46 and bad captcha rate limits [19:50:14] !log new kafka based poller for wdqs now enabled on wdqs2001 - T188252 [19:50:16] Do you think it is a good idea to include increasing edit rates as well? [19:50:21] elukey: ^^ [19:50:24] thcipriani: so for beta files there's no need to scap in prod? [19:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:28] T188252: Activate kafka-based recent change poller for wikidata query service - https://phabricator.wikimedia.org/T188252 [19:50:37] rate limits are about speed [19:50:45] file's on operations/mediawiki-config [19:51:05] account creation is limited for X / day, edit rate limits for X / minute or something on that scale [19:51:26] Hauskatze: if it's beta only pushing it out in prod won't affect beta. SWAT is as fine a time as any to merge those changes. [19:51:28] tgr, I know. Imagine I'm your instructor and you're a newbie attending particular wikicamp. 
I instruct you to create your sandbox with "bla bla" in it and the whole audience will do it at the same time and, thanks to IPv4, from the same IP [19:51:31] so it's not very likely to be hit unless you are having an unusually productive editathon [19:51:58] fair point, with test edits it can happen [19:52:47] Test edits are how senior citizens learn to use the editor interface at Czech courses. [19:53:25] so yeah if you want to raise edit/stashedit/upload limits automatically that makes sense [19:53:56] (03PS1) 10Andrew Bogott: labweb: include standard vhost snippets from mediawiki::web::sites [puppet] - 10https://gerrit.wikimedia.org/r/415639 [19:54:43] (03CR) 10Andrew Bogott: [C: 032] labweb: include standard vhost snippets from mediawiki::web::sites [puppet] - 10https://gerrit.wikimedia.org/r/415639 (owner: 10Andrew Bogott) [19:55:03] RECOVERY - MegaRAID on db1068 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:55:27] (03PS2) 10BryanDavis: wiki replicas: add GlobalPreferences to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: 10MaxSem) [19:55:45] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4015403 (10Marostegui) It worked this time! ``` root@db1068:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prim... [19:55:53] thcipriani: is SWAT done? Can I do a quick MCS deploy? 
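The difference discussed above between a per-day account creation throttle and a per-minute edit rate limit, both counted per IP, can be sketched with a simple sliding-window counter. This is a minimal Python illustration only, not MediaWiki's implementation (which is PHP and configured differently); the class name `WindowLimiter`, the limit of 8 edits per 60 seconds, and the IP address are all hypothetical.

```python
from collections import defaultdict


class WindowLimiter:
    """Allow at most max_hits events per key within window_seconds (hypothetical helper)."""

    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self._hits = defaultdict(list)  # key -> timestamps of allowed events

    def allow(self, key, now):
        # Keep only events that still fall inside the window ending at `now`.
        cutoff = now - self.window_seconds
        recent = [t for t in self._hits[key] if t > cutoff]
        allowed = len(recent) < self.max_hits
        if allowed:
            recent.append(now)
        self._hits[key] = recent
        return allowed


# Hypothetical numbers: 8 edits per minute per IP.
edit_limiter = WindowLimiter(max_hits=8, window_seconds=60)

# A class of 30 people behind one IPv4 address, each saving a sandbox edit
# one second apart: only the first 8 saves pass the per-minute limit.
shared_ip = "198.51.100.7"
results = [edit_limiter.allow(shared_ip, now=second) for second in range(30)]
allowed_count = results.count(True)
```

Under these assumed numbers a per-day account creation allowance would not help the class at all, since the saves are blocked by the much shorter edit window; that is the case for raising both kinds of limit together.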
[19:56:43] !log thcipriani@tin Synchronized langlist-labs: SWAT: [[gerrit:415555|beta: add nlwiki to langlist]] T188582 (beta-only change) (duration: 01m 13s) [19:56:51] (03CR) 10BryanDavis: [C: 031] wiki replicas: add GlobalPreferences to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/403833 (https://phabricator.wikimedia.org/T184666) (owner: 10MaxSem) [19:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:59] T188582: nlwikipedia on Beta Cluster is wrongly shown under "Other Wikimedia Projects" - https://phabricator.wikimedia.org/T188582 [19:57:35] bearND: SWAT is done, but I'm going to be doing Train here in a minute, if it's not urgent I'd rather wait [19:57:45] (03CR) 10Gergő Tisza: Revert "Automatically include commons and wikidata in $wmgThrottlingExceptions" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [19:59:01] thcipriani: ok, i'll wait then. Would you ping me when the train is done? [20:00:04] thcipriani: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180301T2000). [20:00:04] No GERRIT patches in the queue for this window AFAICS. [20:00:05] (03PS1) 10Andrew Bogott: Revert "labweb: include standard vhost snippets from mediawiki::web::sites" [puppet] - 10https://gerrit.wikimedia.org/r/415640 [20:00:25] bearND: sure. My current plan is to merge a backport, roll forward group1 (since we're a day behind), wait to ensure nothing new pops up, then do all wikis. I'll ping you after group1 goes out if it looks stable and you can squeeze in. 
[20:01:03] (03PS1) 10Andrew Bogott: labweb: include standard vhost snippets from mediawiki::web::sites [puppet] - 10https://gerrit.wikimedia.org/r/415641 [20:01:11] (03CR) 10Andrew Bogott: [C: 032] Revert "labweb: include standard vhost snippets from mediawiki::web::sites" [puppet] - 10https://gerrit.wikimedia.org/r/415640 (owner: 10Andrew Bogott) [20:01:28] (03CR) 10jenkins-bot: beta: add nlwiki to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415555 (https://phabricator.wikimedia.org/T188582) (owner: 10MarcoAurelio) [20:01:35] thcipriani: great. ty [20:02:02] (03PS2) 10Andrew Bogott: labweb: include standard vhost snippets from mediawiki::web::sites [puppet] - 10https://gerrit.wikimedia.org/r/415641 [20:03:17] (03CR) 10Andrew Bogott: [C: 032] labweb: include standard vhost snippets from mediawiki::web::sites [puppet] - 10https://gerrit.wikimedia.org/r/415641 (owner: 10Andrew Bogott) [20:04:13] (03CR) 10Imarlier: coal: Process from Kafka instead of from ZMQ (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:04:19] (03PS8) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [20:04:48] (03CR) 10jerkins-bot: [V: 04-1] coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:05:07] (03PS1) 10Gergő Tisza: Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 [20:05:40] (03PS2) 10Gergő Tisza: Clarify throttling documentation a bit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415642 [20:06:44] (03PS9) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [20:07:31] (03PS1) 10Andrew Bogott: labweb: remove 'php_flag 
engine off' from vhost [puppet] - 10https://gerrit.wikimedia.org/r/415644 [20:09:36] (03PS1) 10Andrew Bogott: add new misc-web ip for 'newwikitech.wikimedia.org' [dns] - 10https://gerrit.wikimedia.org/r/415645 (https://phabricator.wikimedia.org/T168470) [20:09:40] (03CR) 10Andrew Bogott: [C: 032] labweb: remove 'php_flag engine off' from vhost [puppet] - 10https://gerrit.wikimedia.org/r/415644 (owner: 10Andrew Bogott) [20:14:23] (03PS1) 10Andrew Bogott: labweb: set up temporary 'newwikitech.wikimedia.org' host [puppet] - 10https://gerrit.wikimedia.org/r/415647 (https://phabricator.wikimedia.org/T168470) [20:15:31] !log rebooting labweb1001 [20:15:34] !log thcipriani@tin Synchronized php-1.31.0-wmf.23/includes/specials/pagers/NewPagesPager.php: SWAT: [[gerrit:415592|NewPagesPages: Use array_merge rather than + for RC query info fields]] T188555 (duration: 01m 14s) [20:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:58] T188555: Notice: Undefined property: stdClass::$rc_timestamp in /srv/mediawiki/php-1.31.0-wmf.23/includes/specials/SpecialNewpages.php - https://phabricator.wikimedia.org/T188555 [20:16:23] (03CR) 10Andrew Bogott: [C: 032] add new misc-web ip for 'newwikitech.wikimedia.org' [dns] - 10https://gerrit.wikimedia.org/r/415645 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:16:26] (03PS2) 10Andrew Bogott: add new misc-web ip for 'newwikitech.wikimedia.org' [dns] - 10https://gerrit.wikimedia.org/r/415645 (https://phabricator.wikimedia.org/T168470) [20:17:30] (03CR) 10Andrew Bogott: [C: 032] labweb: set up temporary 'newwikitech.wikimedia.org' host [puppet] - 10https://gerrit.wikimedia.org/r/415647 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:22:35] (03PS1) 10Thcipriani: Revert "Revert "Group1 to 1.31.0-wmf.23"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415649 [20:25:35] 
(03CR) 10Thcipriani: [C: 032] Revert "Revert "Group1 to 1.31.0-wmf.23"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415649 (owner: 10Thcipriani) [20:26:48] (03Merged) 10jenkins-bot: Revert "Revert "Group1 to 1.31.0-wmf.23"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415649 (owner: 10Thcipriani) [20:28:28] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group1 back to 1.31.0-wmf.23 [20:28:32] (03CR) 10jenkins-bot: Revert "Revert "Group1 to 1.31.0-wmf.23"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415649 (owner: 10Thcipriani) [20:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:49] Krinkle: , yt? [20:29:56] !log restarting labweb1002 [20:29:59] q about navtiming and userAgent [20:30:02] ottomata: OK [20:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:36] Krinkle: does this work? https://github.com/wikimedia/puppet/blob/production/modules/webperf/files/navtiming.py#L405 [20:30:50] !log thcipriani@tin Synchronized php: php link to 1.31.0-wmf.23 (duration: 01m 12s) [20:30:55] atm, userAgent in the EL data in kafka is an already parsed map... [20:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:50] ottomata: Yes, https://github.com/wikimedia/puppet/blob/d8c8047ff475fce9a65187f4cd8d762b5cc21a67/modules/webperf/files/navtiming.py#L95-L118 [20:31:58] ottomata: Both the old and new format are supported [20:32:01] AHHH it is your own parse_ua [20:32:02] coooool [20:32:03] so that we could merge it before the EL format change [20:32:03] great. [20:32:18] ottomata: However, you are right that the 'or' never happens because both sub-methods never return None [20:32:50] But oh well :) [20:32:54] ottomata: Thanks for checking though :) [20:33:16] We've got a unit test as well (woo!, don't see those much in puppet code) and they're passing time Jenkins checked :) [20:33:30] Python unit tests that is. 
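The point above about supporting both the old and the new userAgent format can be sketched as a single entry point that dispatches on the input type. This is a minimal Python illustration and not the code in navtiming.py; the function body, the key names `browser_family` and `browser_major`, and the two regex patterns are assumptions made for the example.

```python
import re

# Hypothetical, heavily simplified patterns; real UA parsing covers many more cases.
_RAW_PATTERNS = [
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
]


def parse_ua(user_agent):
    """Return (family, major_version) for a pre-parsed map or a raw UA string."""
    if isinstance(user_agent, dict):
        # New format: already parsed upstream. The key names are assumptions.
        return (
            user_agent.get("browser_family", "Other"),
            str(user_agent.get("browser_major", "_")),
        )
    # Old format: a raw string that still has to be matched here.
    for family, pattern in _RAW_PATTERNS:
        match = pattern.search(user_agent or "")
        if match:
            return (family, match.group(1))
    return ("Other", "_")


new_style = parse_ua({"browser_family": "Firefox", "browser_major": "58"})
old_style = parse_ua(
    "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"
)
```

Because every branch returns a non-empty tuple, chaining such parsers as `a(ua) or b(ua)` would never reach `b`, which is the pitfall noted above; dispatching on the type avoids relying on a falsy return value.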
[20:33:31] ok well, i got what I need...am considering bringing back raw user agent into event (maybe as different field now)...its annoying to have it removed at this stage. What we should have done in the past is added the parsed one as its own field, not removed the raw one [20:33:32] and then purged later [20:33:41] great :) [20:34:05] we need raw ua in hadoop to do bot/spider detection stuff [20:34:06] ottomata: Okay. I quite like having it pre-parsed, but I suppose there's use in the raw one as well in short-term? [20:35:25] 10Operations, 10Code-Stewardship-Reviews, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4015476 (10mobrovac) [20:39:18] ottomata: Oh cool. So we're bringing the pageview bot/spider detection to eventlogging? [20:40:42] (03PS2) 10BBlack: numa_networking: test on 2/N caches per site+cluster [puppet] - 10https://gerrit.wikimedia.org/r/415631 [20:41:37] (03CR) 10BBlack: [C: 032] numa_networking: test on 2/N caches per site+cluster [puppet] - 10https://gerrit.wikimedia.org/r/415631 (owner: 10BBlack) [20:42:59] 10Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#4015516 (10mobrovac) [20:44:32] (03CR) 10Krinkle: coal: Process from Kafka instead of from ZMQ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:46:06] (03CR) 10Volans: "I'm not too familiar with this script, so excuse me if I've missed something obvious." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415608 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [20:47:13] (03CR) 10Krinkle: [C: 04-1] "LGTM, one point remaining about multiple topics. Looks we're missing out on SaveTiming right now. 
I'll also try to get a test working loca" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:47:29] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (next): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#1111859 (10mobrovac) @Eevans @fgiunchedi is there something left to be done here? [20:51:59] mdholloway: Is that GeoData API available in beta as well? If so, would be good to verify on beta and then backport in SWAT soon. [20:55:23] mdholloway: sounds good. sorry for breaking. not sure how to verify whether it's specifically ApiQueryGeoSearchElastic.php being followed (cc tgr) [20:55:40] Also I can backport now if you can test on the mwdebug machines. [20:56:18] thcipriani: sure, can do [20:57:45] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GeoData/+/7997c22e9abeb42d77c0b5018761f71b27242489/includes/Hooks.php#362 [20:58:25] Krinkle: thcipriani: this is only deployed to group 1 now, right? [20:58:29] or has it gone out to group2 [20:58:33] group1 [20:58:36] kk, thx [20:59:22] tgr: aha, thx [20:59:33] waiting on backport to wind its way through jenkins now, FYI https://gerrit.wikimedia.org/r/#/c/415656/ [20:59:42] and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/CommonSettings.php#2735 [20:59:52] so yeah, that class is used always [21:01:49] there is a log option in the X-Wikimedia-Debug browser plugin that will give you a link to a logstash search with the log records from just that request, so it should be easy to see if it is still throwing warnings [21:08:58] tgr: thanks for the pointer to the extension, btw. 
i'd just been using ModHeader [21:12:04] mdholloway: GeoData change is live on mwdebug1002, check please [21:12:09] * mdholloway looking [21:13:02] thcipriani: response looks good, not sure how to find that log link tgr mentioned (i have the checkbox checked) [21:13:18] should be somewhere in the page footer [21:13:58] https://wikitech.wikimedia.org/wiki/Incident_documentation/20180129-MediaWiki [21:14:02] grr, wrong window [21:14:03] if you know that the response was fixed, not worth the effort, though [21:14:18] yeah, response is fixed. [21:14:20] just in case you had nothing but the logspam to go on [21:14:45] k, going live [21:16:17] (03CR) 10Krinkle: NavigtationTiming: Enable oversampling for Singapore (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [21:16:19] :+1 [21:16:25] 👍 [21:17:32] !log thcipriani@tin Synchronized php-1.31.0-wmf.23/extensions/GeoData/includes/api/ApiQueryGeoSearchElastic.php: [[gerrit:415656|Fix undefined property error in ApiQueryGeoSearchElastic]] T188659 (duration: 01m 15s) [21:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:49] T188659: Notice: Undefined property: stdClass::$gt_primary in /srv/mediawiki/php-1.31.0-wmf.23/extensions/GeoData/includes/api/ApiQueryGeoSearchElastic.php on line 175 - https://phabricator.wikimedia.org/T188659 [21:17:52] ^ mdholloway live now. Thank you for the quick fix! [21:18:10] thcipriani: np, sorry for the bug! [21:18:27] (03CR) 10BBlack: [C: 031] "My earlier +1 was a conceptual one and remains, as in "sounds good to do this". I leave it you guys to sort out syntax issues :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [21:25:12] bearND: if you want to do a quick MCS deploy now works fine. [21:25:54] thcipriani: ok, thanks. 
Going to start one soon [21:26:29] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015735 (10Volker_E) @Dzahn: Repos are requested. [21:27:14] (03PS1) 10ArielGlenn: update recompressxml so it can handle the new html dump schema [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/415689 [21:28:05] !log bsitzmann@tin Started deploy [mobileapps/deploy@bd9924e]: Update mobileapps to 1056fde (T183833) [21:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:23] T183833: [Bug report] Removing parentheses breaks chemical formulas - https://phabricator.wikimedia.org/T183833 [21:31:57] (03PS7) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [21:33:19] !log bsitzmann@tin Finished deploy [mobileapps/deploy@bd9924e]: Update mobileapps to 1056fde (T183833) (duration: 05m 15s) [21:33:21] thcipriani: done. thanks! [21:33:32] sure thing :) [21:33:35] thanks for the ping [21:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:36] T183833: [Bug report] Removing parentheses breaks chemical formulas - https://phabricator.wikimedia.org/T183833 [21:33:39] (03PS1) 10Ottomata: Parse raw user_agent out of raw eventlogging client side event [puppet] - 10https://gerrit.wikimedia.org/r/415691 (https://phabricator.wikimedia.org/T188673) [21:35:20] ok, our logs look normal for us. I'm going to finish the 1.31.0-wmf.23 rollout. 
[21:35:39] +1 [21:35:40] (03PS5) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [21:36:03] (03PS1) 10Rush: openstack: glance should use glance_data hiera param [puppet] - 10https://gerrit.wikimedia.org/r/415693 (https://phabricator.wikimedia.org/T188266) [21:37:06] (03CR) 10Andrew Bogott: [C: 031] openstack: glance should use glance_data hiera param [puppet] - 10https://gerrit.wikimedia.org/r/415693 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:38:16] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015774 (10Dzahn) @Volker_E Alright, thanks for the update! As you can see above i started uploading some changes and merged what... [21:39:16] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015776 (10Volker_E) Great! 
[21:39:18] (03PS1) 10Thcipriani: all wikis to 1.31.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415694 [21:40:05] (03CR) 10Thcipriani: [C: 032] all wikis to 1.31.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415694 (owner: 10Thcipriani) [21:41:20] (03CR) 10Rush: [C: 032] openstack: glance should use glance_data hiera param [puppet] - 10https://gerrit.wikimedia.org/r/415693 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:41:31] (03Merged) 10jenkins-bot: all wikis to 1.31.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415694 (owner: 10Thcipriani) [21:41:45] (03CR) 10jenkins-bot: all wikis to 1.31.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415694 (owner: 10Thcipriani) [21:42:46] (03PS1) 10Dzahn: microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) [21:44:09] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:44:18] (03PS1) 10Rush: glance: fix hiera template variable [puppet] - 10https://gerrit.wikimedia.org/r/415749 (https://phabricator.wikimedia.org/T188266) [21:44:46] (03CR) 10jerkins-bot: [V: 04-1] microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [21:45:16] (03CR) 10Rush: [C: 032] glance: fix hiera template variable [puppet] - 10https://gerrit.wikimedia.org/r/415749 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:45:17] !log thcipriani@tin rebuilt and synchronized wikiversions files: all wikis to 1.31.0-wmf.23 [21:45:18] (03CR) 10Andrew Bogott: [C: 031] glance: fix hiera template variable [puppet] - 10https://gerrit.wikimedia.org/r/415749 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:52] (03CR) 10Dzahn: [C: 032] icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 (owner: 10Dzahn) [21:47:01] (03PS6) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [21:47:06] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10226/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/409204 (owner: 10Dzahn) [21:49:22] (03PS1) 10Rush: glance: fix hiera template variable liberty [puppet] - 10https://gerrit.wikimedia.org/r/415751 (https://phabricator.wikimedia.org/T188266) [21:49:58] (03CR) 10Rush: [C: 032] glance: fix hiera template variable liberty [puppet] - 10https://gerrit.wikimedia.org/r/415751 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:54:09] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:57:02] (03CR) 10TerraCodes: "If it was not working, then it should have been fixed rather than 
removed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415635 (owner: 10Gergő Tisza) [22:03:55] 10Operations, 10Puppet, 10Patch-For-Review: PuppetDB misbehaving on 2017-07-15 - https://phabricator.wikimedia.org/T170740#4015854 (10Volans) We had OOMs also with puppet disabled on tegmen, so that's not the culprit. [22:05:07] (03PS7) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [22:09:57] (03PS1) 10EBernhardson: Setup Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) [22:10:21] (03PS8) 10Dzahn: icinga: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409204 [22:10:52] (03CR) 10jerkins-bot: [V: 04-1] Setup Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [22:17:55] (03CR) 10Dzahn: [C: 032] "no-op. first on tegmen then einsteinium" [puppet] - 10https://gerrit.wikimedia.org/r/409204 (owner: 10Dzahn) [22:21:03] (03PS2) 10EBernhardson: Setup Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) [22:22:08] (03CR) 10jerkins-bot: [V: 04-1] Setup Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [22:22:57] 10Operations, 10Ops-Access-Requests: reinstate ezachte's access - https://phabricator.wikimedia.org/T188335#4015881 (10ezachte) Thanks @ottomata, I just managed to ssh to stat1005 from Ubuntu. following advice on wikitech page I (obviously) also regained access to phabricator. 
:-) [22:23:37] (03PS3) 10EBernhardson: Setup Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) [22:29:44] (03PS2) 10Dzahn: design.wm.org: prepare for second dir for style guide [puppet] - 10https://gerrit.wikimedia.org/r/414008 (https://phabricator.wikimedia.org/T185282) [22:30:40] (03PS3) 10Dzahn: design.wm.org: prepare for second dir for style guide [puppet] - 10https://gerrit.wikimedia.org/r/414008 (https://phabricator.wikimedia.org/T185282) [22:31:44] (03CR) 10Dzahn: [C: 032] design.wm.org: prepare for second dir for style guide [puppet] - 10https://gerrit.wikimedia.org/r/414008 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [22:31:47] (03PS1) 10Andrew Bogott: labweb: include hhvm::admin [puppet] - 10https://gerrit.wikimedia.org/r/415758 [22:33:01] (03PS5) 10MaxSem: beta: remove $wgReadingListsCentralWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414017 [22:33:03] (03PS5) 10MaxSem: beta: remove $wmgUseReadingLists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414018 [22:33:05] (03PS1) 10MaxSem: beta: remove $wgAutoloadAttemptLowercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415759 (https://phabricator.wikimedia.org/T166759) [22:35:05] !log rolling restart of elsticsearch / cirrus - eqiad complete, cluster is green [22:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:28] note: that's the first time I manage to restart a full cluster in a single day! Things are improving! 
[22:37:02] \o/ [22:39:32] (03PS2) 10Andrew Bogott: labweb: include equivalent functionality to hhvm::admin [puppet] - 10https://gerrit.wikimedia.org/r/415758 [22:40:10] (03CR) 10jerkins-bot: [V: 04-1] labweb: include equivalent functionality to hhvm::admin [puppet] - 10https://gerrit.wikimedia.org/r/415758 (owner: 10Andrew Bogott) [22:41:10] (03PS2) 10Dzahn: microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) [22:41:37] (03CR) 10jerkins-bot: [V: 04-1] microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [22:43:59] (03PS3) 10Dzahn: microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) [22:44:57] gehel: :) [22:45:55] At least one thing went right today... [22:48:18] (03PS1) 10Dzahn: icinga: add stretch compat for php-gd/php7.0-gd [puppet] - 10https://gerrit.wikimedia.org/r/415764 [22:53:58] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4015968 (10Bawolff) > Please also consider for your planning enough time for the security team to do a review. They will be be abl... 
[22:55:51] (03PS1) 10Dzahn: openstack:labtest:web: add some php7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/415765 [22:56:18] (03CR) 10jerkins-bot: [V: 04-1] openstack:labtest:web: add some php7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/415765 (owner: 10Dzahn) [22:57:04] (03PS1) 10Dzahn: icinga: fix php version number on stretch, 7 -> 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/415767 [22:57:35] (03PS2) 10Dzahn: openstack:labtest:web: add some php7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/415765 [22:58:30] (03CR) 10Dzahn: [C: 032] icinga: fix php version number on stretch, 7 -> 7.0 [puppet] - 10https://gerrit.wikimedia.org/r/415767 (owner: 10Dzahn) [22:58:45] 10Operations, 10Ops-Access-Requests: reinstate ezachte's access - https://phabricator.wikimedia.org/T188335#4015975 (10DarTar) Thanks folks, and welcome back @ezachte. [23:00:06] report of commons deletion failures in #wikimedia-tech -- "Deletion is down on Commons in all forms; [WpiFVQpAME4AAAjJP1UAAABT] 2018-03-01 22:57:41: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"" [23:00:28] (03PS1) 10Dzahn: openstack/wikitech: add some php7 support [puppet] - 10https://gerrit.wikimedia.org/r/415768 [23:02:11] The error in logstash is a Lock wait timeout exceeded on the page table [23:02:30] RileyH: was this one time or repeatable? [23:02:37] Repeatable [23:02:46] Tried to nuke 250+ pages, didn't work [23:02:50] Then tried individually, didn't work [23:02:55] Then tried delete.py; all failed [23:03:39] thcipriani ^^ [23:03:39] of course it would.. 
[23:04:03] It's probably not deployment related [23:04:50] mediawiki log for the error id above at https://logstash.wikimedia.org/goto/c0a9b587789e85e690486a9b4ca23ee2 [23:04:52] wmf.23 has been there for a few hours now [23:05:13] its probably commons db load I would guess [23:05:15] deletions are now performed fwiw and logged [23:05:24] I guess it's db saturation [23:05:29] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0 [23:05:42] lesson of the day: don't delete over 250 pages at once [23:06:42] anomie: now that wmf.23 is everywhere can we run the script again for T188014 ? :) [23:06:42] T188014: MassMessage doesn't work on dty.wikipedia - https://phabricator.wikimedia.org/T188014 [23:09:59] RileyH: I told you to slow down the bot :-P [23:10:10] ;-) [23:10:36] Reedy: So lesson of the day is don't run Special:Nuke? [23:10:45] Stellar advice [23:10:49] WFM [23:10:55] We can undeploy it too if it helps? [23:11:02] Silly users, using the interface as designed [23:11:04] (03PS1) 10TerraCodes: Remove wmf zero things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) [23:11:23] it works great on a small wiki with low activity! [23:11:24] (03CR) 10Reedy: [C: 04-2] "Far too soon" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) (owner: 10TerraCodes) [23:11:41] RileyH: File a task that the extension should batch things [23:12:37] (03CR) 10TerraCodes: "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) (owner: 10TerraCodes) [23:15:06] (03PS2) 10Jcrespo: Revert "Depool labsdb1011 to copy its data away" [puppet] - 10https://gerrit.wikimedia.org/r/415610 [23:15:37] Reedy: such as "Special:Nuke should batch deletions"? 
[23:15:48] Yup, something like that [23:15:50] reason: cuz reedy said so [23:16:01] (03CR) 10Jforrester: "> The task is in the "config - to process" part of the workboard?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) (owner: 10TerraCodes) [23:16:09] I think I can do that [23:16:19] the task, not the batching part [23:16:32] (now you'll say that's easy, etc.) [23:16:34] Step one [23:17:42] (03CR) 10TerraCodes: "> > The task is in the "config - to process" part of the workboard?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) (owner: 10TerraCodes) [23:18:48] (03CR) 10Jcrespo: [C: 032] Revert "Depool labsdb1011 to copy its data away" [puppet] - 10https://gerrit.wikimedia.org/r/415610 (owner: 10Jcrespo) [23:19:13] (03Abandoned) 10Jforrester: Remove wmf zero things [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415772 (https://phabricator.wikimedia.org/T187716) (owner: 10TerraCodes) [23:20:44] Reedy: here you go: T188679 [23:20:45] T188679: Nuke should batch the deletions - https://phabricator.wikimedia.org/T188679 [23:24:06] 10Operations, 10Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4014970 (10Dzahn) Hi! So the relevant admin group is called "perf-team" and you are a member of it. Also the perf-team admin group gets added to a host when the role(webperf) is added to it. That... 
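The batching requested in T188679 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Nuke extension's code (the extension is PHP); `delete_in_batches`, `delete_page`, `batch_size` and `pause_seconds` are hypothetical names, and the pause between batches is only one possible way to give the database room between groups of deletions.

```python
import time


def delete_in_batches(titles, delete_page, batch_size=50, pause_seconds=1.0,
                      sleep=time.sleep):
    """Delete titles in small groups, pausing between groups.

    delete_page is a caller-supplied callable (hypothetical). Titles whose
    deletion raises are collected and returned instead of aborting the run.
    """
    failed = []
    for start in range(0, len(titles), batch_size):
        for title in titles[start:start + batch_size]:
            try:
                delete_page(title)
            except Exception:
                # Real code would catch the specific database error instead.
                failed.append(title)
        # Pause only when another batch is still to come.
        if start + batch_size < len(titles):
            sleep(pause_seconds)
    return failed


# Usage with stand-in callables so the sketch runs without a wiki.
deleted = []
pauses = []
failed = delete_in_batches(
    ["Page %d" % i for i in range(5)],
    delete_page=deleted.append,
    batch_size=2,
    pause_seconds=0,
    sleep=pauses.append,
)
```

Smaller batches hold locks on the page table for a shorter time each, which is the motivation given above; the right batch size and pause would have to be measured rather than guessed.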
[23:30:27] (03PS2) 10Jcrespo: mariadb-backups: Change backup format to YYYY-MM-dd--HH-mm-SS [puppet] - 10https://gerrit.wikimedia.org/r/415608 (https://phabricator.wikimedia.org/T184696) [23:30:29] (03PS1) 10Jcrespo: labsdb: Depool labsdb1010 in preparatio for its recovery [puppet] - 10https://gerrit.wikimedia.org/r/415774 (https://phabricator.wikimedia.org/T186579) [23:31:21] (03PS2) 10Jcrespo: labsdb: Depool labsdb1010 in preparation for its recovery [puppet] - 10https://gerrit.wikimedia.org/r/415774 (https://phabricator.wikimedia.org/T186579) [23:32:15] (03CR) 10Jcrespo: [C: 032] labsdb: Depool labsdb1010 in preparation for its recovery [puppet] - 10https://gerrit.wikimedia.org/r/415774 (https://phabricator.wikimedia.org/T186579) (owner: 10Jcrespo) [23:32:21] (03PS3) 10Jcrespo: labsdb: Depool labsdb1010 in preparation for its recovery [puppet] - 10https://gerrit.wikimedia.org/r/415774 (https://phabricator.wikimedia.org/T186579) [23:35:45] (03PS1) 10Krinkle: Remove remnant references to mira.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) [23:35:53] (03CR) 10EBernhardson: "Thanks! This has allowed it to mostly work, although still needs workarounds :S For whatever reason it's finding libhfds0 in /usr/lib, but" [puppet] - 10https://gerrit.wikimedia.org/r/411464 (owner: 10EBernhardson) [23:36:24] (03CR) 10Krinkle: "Disclaimer: I have no idea what I'm doing. Just spotted them in grep. 
In particular, I do not know if or when we usually remove the instal" [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle) [23:39:27] (03CR) 10Dzahn: "removing from DHCP: ack, releases: this is actually releases1001/2001 not deployment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle) [23:41:13] (03CR) 10Krinkle: Remove remnant references to mira.codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle) [23:45:48] (03CR) 10Dzahn: Remove remnant references to mira.codfw.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle) [23:46:14] (03CR) 10Dzahn: [C: 032] Remove remnant references to mira.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle) [23:46:20] (03PS2) 10Dzahn: Remove remnant references to mira.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/415775 (https://phabricator.wikimedia.org/T164588) (owner: 10Krinkle)