[01:31:05] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:31:15] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:31:45] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:31:55] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:32:45] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:32:45] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:32:55] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:33:15] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:33:46] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:34:05] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:34:16] PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:35:15] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:35:16] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:35:26] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:58:46] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [01:59:16] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [02:00:16] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [02:00:16] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:00:26] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:01:05] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:01:15] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:01:45] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:01:53] (03PS6) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (https://phabricator.wikimedia.org/T187499) [02:01:56] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:02:45] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:02:45] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:02:55] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:03:15] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:04:05] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is 
currently enabled, last run 4 minutes ago with 0 failures [02:09:40] (03PS7) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (https://phabricator.wikimedia.org/T187499) [02:14:37] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (https://phabricator.wikimedia.org/T187499) (owner: 10Andrew Bogott) [02:50:09] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.21) (duration: 10m 59s) [02:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:55] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 841.45 seconds [03:35:55] PROBLEM - Check systemd state on labpuppetmaster1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:40:35] RECOVERY - Check systemd state on labpuppetmaster1001 is OK: OK - running: The system is fully operational [03:40:46] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1775 MB (3% inode=96%) [03:41:55] RECOVERY - Check systemd state on labpuppetmaster1002 is OK: OK - running: The system is fully operational [03:59:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 57.86 seconds [04:07:46] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. 
Failed resources (up to 3 shown): Package[eventstreams/deploy],Exec[chown /srv/deployment/eventstreams for deploy-service],Service[pdfrender] [04:37:45] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:11:23] 10Operations, 10LuaSandbox: Build and deploy hhvm-luasandbox 3.0.0 to Wikimedia wikis - https://phabricator.wikimedia.org/T187673#3982203 (10Legoktm) [05:12:45] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:35] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 77362 bytes in 0.154 second response time [05:26:11] !log andrew@tin Started deploy [horizon/deploy@6a40f84]: rolling out several horizon bugfixes [05:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:26] !log andrew@tin Finished deploy [horizon/deploy@6a40f84]: rolling out several horizon bugfixes (duration: 03m 14s) [05:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:00] 10Operations: Remove dpatrick from security@ - https://phabricator.wikimedia.org/T187615#3982223 (10tstarling) 05Open>03Resolved a:03tstarling [06:12:01] (03PS5) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) [06:12:03] (03PS1) 10Andrew Bogott: horizon nova_policy.json: forbid some cinder and neutron actions [puppet] - 10https://gerrit.wikimedia.org/r/412619 (https://phabricator.wikimedia.org/T187493) [06:13:22] (03CR) 10Andrew Bogott: [C: 032] horizon nova_policy.json: forbid some cinder and neutron actions [puppet] - 10https://gerrit.wikimedia.org/r/412619 (https://phabricator.wikimedia.org/T187493) (owner: 10Andrew Bogott) [06:17:46] (03PS6) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) 
[06:17:48] (03PS1) 10Andrew Bogott: horizon nova_policy.json: add a forgotten comma [puppet] - 10https://gerrit.wikimedia.org/r/412621 [06:18:30] (03CR) 10Andrew Bogott: [C: 032] horizon nova_policy.json: add a forgotten comma [puppet] - 10https://gerrit.wikimedia.org/r/412621 (owner: 10Andrew Bogott) [06:27:05] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 and db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412622 (https://phabricator.wikimedia.org/T162807) [06:28:15] (03PS1) 10Marostegui: db1089: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/412623 [06:37:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 and db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:39:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:39:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 and db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412622 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [06:40:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 and db1105 - T162807 (duration: 00m 56s) [06:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:56] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [06:42:10] !log Deploy schema change on s6 codfw master (db2039), this will generate lag on codfw - T187089 T185128 T153182 [06:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:26] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:42:26] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - 
https://phabricator.wikimedia.org/T153182 [06:42:27] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [06:44:52] !log Stop MySQL on db1089 to update its socket path [06:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:40] (03CR) 10Marostegui: [C: 032] db1089: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/412623 (owner: 10Marostegui) [06:55:07] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412624 [06:55:52] (03PS1) 10Marostegui: db1090: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/412625 [06:56:34] (03PS2) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412624 [06:58:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412624 (owner: 10Marostegui) [06:59:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412624 (owner: 10Marostegui) [06:59:46] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412624 (owner: 10Marostegui) [07:00:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 (duration: 00m 55s) [07:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:13] !log Reboot db1090 for kernel upgrade, mariadb upgrade, socket path location upgrade [07:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:55] (03CR) 10Marostegui: [C: 032] db1090: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/412625 (owner: 10Marostegui) [07:14:40] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1090 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/412627 (owner: 10Marostegui) [07:20:01] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412627 (owner: 10Marostegui) [07:20:11] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412627 (owner: 10Marostegui) [07:21:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1090 (duration: 00m 55s) [07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:32] (03PS1) 10Marostegui: db2034.yaml: Add role master to db2034 [puppet] - 10https://gerrit.wikimedia.org/r/412629 (https://phabricator.wikimedia.org/T184888) [07:42:40] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10User-Elukey: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3982280 (10elukey) @mmodell let's remember that https://wikitech.wikimedia.org/wiki/Incident_doc... 
[07:42:44] !log Change topology on x1 codfw - T184888 [07:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:59] T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888 [07:43:42] (03CR) 10Marostegui: [C: 032] db2034.yaml: Add role master to db2034 [puppet] - 10https://gerrit.wikimedia.org/r/412629 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [07:44:17] (03PS1) 10Muehlenhoff: Record extended MOU date for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/412631 [07:45:20] (03PS2) 10Muehlenhoff: Record extended MOU date for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/412631 [07:46:07] (03CR) 10Muehlenhoff: [C: 032] Record extended MOU date for nettrom [puppet] - 10https://gerrit.wikimedia.org/r/412631 (owner: 10Muehlenhoff) [07:47:21] (03PS1) 10Muehlenhoff: Record extended MOU date for shiladsen [puppet] - 10https://gerrit.wikimedia.org/r/412632 [07:47:59] (03PS1) 10Marostegui: db-codfw.php: Change x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412633 (https://phabricator.wikimedia.org/T184888) [07:48:31] (03CR) 10Muehlenhoff: [C: 032] Record extended MOU date for shiladsen [puppet] - 10https://gerrit.wikimedia.org/r/412632 (owner: 10Muehlenhoff) [07:51:26] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3982284 (10elukey) >>! In T182832#3981076, @mmodell wro... 
[07:58:17] !log installing werkzeug security updates on trusty [07:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:19] (03CR) 10Marostegui: [C: 032] db-codfw.php: Change x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412633 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:00:47] (03Merged) 10jenkins-bot: db-codfw.php: Change x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412633 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:00:54] (03PS1) 10Marostegui: db2033.yaml: Remove master role from db2033 [puppet] - 10https://gerrit.wikimedia.org/r/412635 (https://phabricator.wikimedia.org/T184888) [08:00:58] (03CR) 10jenkins-bot: db-codfw.php: Change x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412633 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:01:39] (03PS2) 10Marostegui: db2033.yaml: Remove master role from db2033 [puppet] - 10https://gerrit.wikimedia.org/r/412635 (https://phabricator.wikimedia.org/T184888) [08:02:05] PROBLEM - DPKG on labtestservices2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:02:17] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Promote db2034 to x1 codfw master - T184888 (duration: 00m 56s) [08:02:19] (03CR) 10Marostegui: [C: 032] db2033.yaml: Remove master role from db2033 [puppet] - 10https://gerrit.wikimedia.org/r/412635 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:32] T184888: Replace codfw x1 master (db2033) (WAS: Failed BBU on db2033 (x1 master)) - https://phabricator.wikimedia.org/T184888 [08:04:05] RECOVERY - DPKG on labtestservices2001 is OK: All packages OK [08:07:49] (03PS1) 10Marostegui: mysql-core_codfw: Add master/slave x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/412637 [08:08:15] PROBLEM - DPKG on 
labtestservices2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:08:58] ^that's me, harmless [08:10:15] RECOVERY - DPKG on labtestservices2002 is OK: All packages OK [08:11:16] !log repool mw1227 - T149287 [08:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:30] T149287: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 [08:11:59] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#3982305 (10fgiunchedi) >>! In T149287#3963590, @fgiunchedi wrote: > mw1227 has been alerting over the weekend of high load, I depooled it and noticed it was on the list of machines with temperature ove... [08:12:24] (03PS2) 10Marostegui: mysql-core_codfw: Add master/slave x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/412637 (https://phabricator.wikimedia.org/T184888) [08:12:52] (03CR) 10jerkins-bot: [V: 04-1] mysql-core_codfw: Add master/slave x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/412637 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:13:51] (03PS3) 10Marostegui: mysql-core_codfw: Add master/slave x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/412637 (https://phabricator.wikimedia.org/T184888) [08:14:36] (03CR) 10Marostegui: [C: 032] mysql-core_codfw: Add master/slave x1 codfw [puppet] - 10https://gerrit.wikimedia.org/r/412637 (https://phabricator.wikimedia.org/T184888) (owner: 10Marostegui) [08:18:05] (03PS1) 10Marostegui: site.pp: Clarify db2033 and db2034 status [puppet] - 10https://gerrit.wikimedia.org/r/412640 [08:18:40] (03CR) 10Marostegui: [C: 032] site.pp: Clarify db2033 and db2034 status [puppet] - 10https://gerrit.wikimedia.org/r/412640 (owner: 10Marostegui) [08:20:30] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412641 [08:22:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1090 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/412641 (owner: 10Marostegui) [08:22:54] (03PS1) 10Muehlenhoff: Decomission old video scalers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412642 (https://phabricator.wikimedia.org/T187466) [08:23:46] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412641 (owner: 10Marostegui) [08:24:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1090 (duration: 00m 55s) [08:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:43] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412641 (owner: 10Marostegui) [08:29:30] (03CR) 10Elukey: [C: 031] Decomission old video scalers in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412642 (https://phabricator.wikimedia.org/T187466) (owner: 10Muehlenhoff) [08:35:47] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411276 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:36:38] (03PS2) 10Filippo Giunchedi: mail: switch icinga check to LE variant [puppet] - 10https://gerrit.wikimedia.org/r/410758 (https://phabricator.wikimedia.org/T181519) [08:38:26] (03CR) 10Filippo Giunchedi: [C: 032] mail: switch icinga check to LE variant [puppet] - 10https://gerrit.wikimedia.org/r/410758 (https://phabricator.wikimedia.org/T181519) (owner: 10Filippo Giunchedi) [08:38:54] (03PS2) 10Filippo Giunchedi: icinga: tweak thresholds for LE certs alerting [puppet] - 10https://gerrit.wikimedia.org/r/410759 (https://phabricator.wikimedia.org/T181519) [08:39:31] (03CR) 10Filippo Giunchedi: [C: 032] icinga: tweak thresholds for LE certs alerting [puppet] - 10https://gerrit.wikimedia.org/r/410759 (https://phabricator.wikimedia.org/T181519) (owner: 10Filippo 
Giunchedi) [08:41:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412650 (https://phabricator.wikimedia.org/T187089) [08:43:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412650 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [08:45:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412650 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [08:46:54] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412650 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [08:47:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 for alter table (duration: 00m 55s) [08:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:07] !log Deploy schema change on db1098:3316 - T187089 T185128 T153182 [08:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:21] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [08:49:21] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [08:49:22] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [08:51:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412652 [08:52:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412652 (owner: 10Marostegui) [08:54:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412652 (owner: 10Marostegui) [08:55:38] !log 
marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 for mariadb and kernel upgrade (duration: 00m 55s) [08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412652 (owner: 10Marostegui) [09:07:20] (03PS2) 10Filippo Giunchedi: hieradata: enable SMART for misc wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) [09:07:36] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable SMART for misc wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [09:10:08] (03PS3) 10Filippo Giunchedi: hieradata: enable SMART for misc wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) [09:13:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412655 (https://phabricator.wikimedia.org/T162807) [09:14:12] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART on cp* in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412656 (https://phabricator.wikimedia.org/T86552) [09:15:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412655 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:17:08] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412655 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:17:19] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412655 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui) [09:18:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105 - T162807 (duration: 00m 55s) [09:18:43] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:43] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [09:25:41] 10Operations, 10Mail, 10Patch-For-Review: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3982524 (10fgiunchedi) [09:25:45] 10Operations, 10Mail, 10Patch-For-Review: tls expiry check for mx vs acme-setup renewal period - https://phabricator.wikimedia.org/T181519#3982521 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is done! [09:30:52] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412658 [09:31:49] 10Operations, 10netops: rhenium running out of disk space on / - https://phabricator.wikimedia.org/T187688#3982552 (10fgiunchedi) [09:33:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412658 (owner: 10Marostegui) [09:34:55] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412658 (owner: 10Marostegui) [09:35:57] I don't know much about rhenium, though / is slowly filling up, I suspect postgres needs to be moved to /srv instead [09:36:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1090 (duration: 00m 55s) [09:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:55] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412658 (owner: 10Marostegui) [09:39:27] (03PS2) 10Jcrespo: Revert "mariadb: Exclude labsdb1001,2,3 from megacli policy check" [puppet] - 10https://gerrit.wikimedia.org/r/405343 [09:39:31] (03CR) 10Jcrespo: Revert "mariadb: Exclude labsdb1001,2,3 from megacli policy check" [puppet] - 10https://gerrit.wikimedia.org/r/405343 (owner: 10Jcrespo) [09:41:28] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Exclude 
labsdb1001,2,3 from megacli policy check" [puppet] - 10https://gerrit.wikimedia.org/r/405343 (owner: 10Jcrespo) [09:43:22] !log Upgrade mariadb and kernel on db2033 [09:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:40] godog: afaik it hosts pmacct to push netflow data to kafka and kafkatee [09:47:43] yeah definitely, postgres is eating up the root partition [09:47:56] does pmacct use postgres? [09:48:04] yep [09:48:12] (03PS1) 10Filippo Giunchedi: prometheus: default retention to 24 weeks [puppet] - 10https://gerrit.wikimedia.org/r/412660 (https://phabricator.wikimedia.org/T160677) [09:53:09] (03PS2) 10Muehlenhoff: Decomission old video scalers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412642 (https://phabricator.wikimedia.org/T187466) [09:53:14] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw1259.eqiad.wmnet [09:53:24] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw1260.eqiad.wmnet [09:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] !log reenable gtid replication on db1053 and db2042 [09:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] !log Enable GTID on dbstore2002 and dbstore2001 for x1 [09:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:47] (03PS3) 10Muehlenhoff: Decomission old video scalers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412642 (https://phabricator.wikimedia.org/T187466) [09:59:33] !log Enable GTID on dbstore2002:3313 and dbstore2001:3316 [09:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:08] (03CR) 10Muehlenhoff: [C: 032] Decomission old video scalers in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412642 (https://phabricator.wikimedia.org/T187466) (owner: 
10Muehlenhoff) [10:09:42] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#3982661 (10MoritzMuehlenhoff) [10:10:04] !log Upgrade mariadb and kernel on db1098 [10:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:23] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#3976137 (10MoritzMuehlenhoff) [10:11:36] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#3982664 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None The two hosts have been switched to role::spare, dropped from conftool and marked as downtime until the e... [10:19:42] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412661 [10:23:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412661 (owner: 10Marostegui) [10:25:22] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412661 (owner: 10Marostegui) [10:26:36] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412661 (owner: 10Marostegui) [10:27:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1098 s6 and s7 (duration: 00m 56s) [10:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412662 (https://phabricator.wikimedia.org/T187089) [10:32:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412662 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:34:06] (03Merged) 
10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412662 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:35:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 for alter table (duration: 00m 56s) [10:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] (03PS3) 10Jon Harald Søby: Deploy Draft namespace on hiwikiversity. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [10:35:40] !log Deploy schema change on db1093 - T187089 T185128 T153182 [10:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:53] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [10:35:53] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [10:35:53] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [10:36:18] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412663 [10:36:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412662 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [10:41:06] (03PS1) 10Ladsgroup: Enable x-kill feature on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412664 (https://phabricator.wikimedia.org/T186714) [10:42:17] (03CR) 10Ema: [C: 031] hieradata: enable SMART on cp* in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412656 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [10:42:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412663 (owner: 
10Marostegui) [10:44:19] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412663 (owner: 10Marostegui) [10:45:10] (03PS4) 10Jon Harald Søby: Deploy Draft namespace on hiwikiversity. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [10:45:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1098 s6 and s7 (duration: 00m 55s) [10:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:38] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412663 (owner: 10Marostegui) [10:47:14] (03CR) 10Muehlenhoff: [C: 04-1] "I'm not really sure that's correct, the require_package is for openjdk-7-jdk (i.e. the full JDK), while Zookeeper itself depends only on J" [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [10:50:57] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412667 [10:52:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412668 (https://phabricator.wikimedia.org/T128546) [10:53:47] 10Operations, 10Traffic, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3435393 (10Vgutierrez) Could be interesting for us to roll out openssl 1.1.0? (https://www.openssl.org/blog/blog/2018/02/08/tlsv1.3/). OpenSSL 1.1.1 should be ABI and API compatible with OpenSSL 1.1.0, so... [10:57:27] 10Operations, 10Traffic, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3982896 (10MoritzMuehlenhoff) Yeah, that's the most plausible option (and we're already using custom OpenSSL 1.1 packages on Debian jessie to support e.g. 
chacha), but 1.1.1 has only just seen its first... [11:00:24] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412668 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:01:55] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412668 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:08] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412668 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:06:49] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia portals Update: [[gerrit:412657|Bumping portals to master (T128546)]] (duration: 00m 56s) [11:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:04] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:07:08] (03PS4) 10Filippo Giunchedi: hieradata: enable SMART for misc wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) [11:07:46] !log jdrewniak@tin Synchronized portals: Wikimedia portals Update: [[gerrit:412657|Bumping portals to master (T128546)]] (duration: 00m 57s) [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:59] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable SMART for misc wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/410413 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [11:12:01] (03PS2) 10Filippo Giunchedi: hieradata: enable SMART on cp* in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412656 (https://phabricator.wikimedia.org/T86552) [11:12:36] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable SMART on cp* in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/412656 
(https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [11:15:32] (03CR) 10Filippo Giunchedi: [C: 031] profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [11:34:00] (03PS1) 10Urbanecm: New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [11:35:27] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) (owner: 10Urbanecm) [11:35:36] (03PS3) 10Muehlenhoff: Move python3 into standard packages, it's 2018 after all. [puppet] - 10https://gerrit.wikimedia.org/r/411211 [11:36:08] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: update scripts to work with conftool 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/412673 [11:36:57] (03CR) 10Muehlenhoff: [C: 032] Move python3 into standard packages, it's 2018 after all. 
[puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff) [11:47:29] 10Operations, 10Patch-For-Review: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574#3983025 (10fgiunchedi) Also swift should attempt to mount filesystems only if they are not commented in fstab [11:50:36] (03PS1) 10Muehlenhoff: Add two exceptions to long-running screen monitoring checks [puppet] - 10https://gerrit.wikimedia.org/r/412674 [11:53:50] (03PS1) 10Jon Harald Søby: Add meta namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) [12:02:47] (03PS2) 10Urbanecm: New throttle rule, clean expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [12:03:33] (03PS3) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) [12:04:01] (03CR) 10Muehlenhoff: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [12:04:37] broken puppet in toolforge instances [12:04:44] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Resource Statement, Duplicate declaration: Package[python3] is already declared in file /etc/puppet/modules/base/manifests/standard_packages.pp:9; cannot redeclare at [12:04:45] /etc/puppet/modules/toollabs/manifests/exec_environ.pp:18 at /etc/puppet/modules/toollabs/manifests/exec_environ.pp:18:5 on node tools-exec-1433.tools.eqiad.wmflabs [12:05:01] (03CR) 10jerkins-bot: [V: 04-1] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412672 (https://phabricator.wikimedia.org/T187171) (owner: 10Urbanecm) 
[12:05:14] is there any recent merge that may be the cause of this? [12:05:27] arturo: yes, python3 was added to the standard package list, cc moritzm [12:05:48] arturo: https://github.com/wikimedia/puppet/commit/b276f245d52fb5c4e64d803fcc599fa2be590d57 [12:07:48] ok, we could either rollback or move forward and remove explicit python3 declaration from toolforge instances [12:08:01] whatever is fine for you volans moritz [12:08:07] moritzm* [12:08:26] (03PS2) 10Jon Harald Søby: Add meta namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) [12:08:33] I'd say the second one, it was removed from all the other references, apparently this one was missed [12:09:11] (03PS1) 10Jcrespo: MariaDB: Setup db1115 and db2093 as new tendril databases [puppet] - 10https://gerrit.wikimedia.org/r/412678 [12:09:13] (03PS1) 10Jcrespo: Remove python3 installation from tools, it is installed everywhere [puppet] - 10https://gerrit.wikimedia.org/r/412679 [12:09:22] ^volans, arturo my take [12:09:31] (03PS3) 10Jon Harald Søby: Add namespace localization for sdwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) [12:10:11] jynus: +1 and -1, +1 I agree, -1 I see other references that apparently were missed with a quick git grep [12:10:15] I'm checking them [12:10:17] RECOVERY - Disk space on rhenium is OK: DISK OK [12:10:38] (03PS2) 10Jcrespo: Remove python3 installation from tools, it is installed everywhere [puppet] - 10https://gerrit.wikimedia.org/r/412679 [12:11:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412667 (owner: 10Marostegui) [12:11:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 [12:11:32] (03CR) 10Jon Harald Søby: "Whoever merges this needs to also run the following 
script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412676 (https://phabricator.wikimedia.org/T186943) (owner: 10Jon Harald Søby) [12:11:34] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 [12:12:07] (03CR) 10Jcrespo: [C: 04-2] "Not yet ready." [puppet] - 10https://gerrit.wikimedia.org/r/412678 (owner: 10Jcrespo) [12:12:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412667 (owner: 10Marostegui) [12:13:09] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic db1098 (s6 and s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412667 (owner: 10Marostegui) [12:13:24] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 [12:14:09] (03CR) 10Volans: [C: 04-1] "modules/striker/manifests/build.pp and modules/contint/manifests/packages/python.pp seems to refer it too." [puppet] - 10https://gerrit.wikimedia.org/r/412679 (owner: 10Jcrespo) [12:14:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1098 s6 and s7 (duration: 00m 56s) [12:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 (owner: 10Marostegui) [12:15:43] (03CR) 10Muehlenhoff: "contint was requested by hashar to remain as-is since it's used without base for CI images." 
[puppet] - 10https://gerrit.wikimedia.org/r/412679 (owner: 10Jcrespo) [12:16:13] volans: I am not going to work on amending the patch; someone else will have to do it, as it wasn't me who broke it [12:16:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 (owner: 10Marostegui) [12:16:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1093" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412680 (owner: 10Marostegui) [12:17:34] (03CR) 10Muehlenhoff: [C: 031] "Looks good, seems I missed the declaration in toolforge" [puppet] - 10https://gerrit.wikimedia.org/r/412679 (owner: 10Jcrespo) [12:17:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 (duration: 00m 55s) [12:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:29] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Remove python3 installation from tools, it is installed everywhere [puppet] - 10https://gerrit.wikimedia.org/r/412679 (owner: 10Jcrespo) [12:22:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412682 (https://phabricator.wikimedia.org/T187089) [12:24:00] (03CR) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [12:24:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412682 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [12:25:57] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412682 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [12:26:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412682 
(https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [12:27:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1063 for alter table (duration: 00m 55s) [12:27:27] !log Deploy schema change on db1063 - T187089 T185128 T153182 [12:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:42] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [12:27:42] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [12:27:42] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [12:41:28] (03CR) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [12:41:35] moritzm: ^^^ [12:42:27] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1098 (s6,s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412687 [12:43:49] (03PS9) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) [12:45:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1098 (s6,s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412687 (owner: 10Marostegui) [12:46:57] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1098 (s6,s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412687 (owner: 10Marostegui) [12:47:07] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1098 (s6,s7) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412687 (owner: 10Marostegui) [12:48:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1098 s6 and s7 
(duration: 00m 55s) [12:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:11] (03CR) 10Marostegui: MariaDB: Setup db1115 and db2093 as new tendril databases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412678 (owner: 10Jcrespo) [13:06:00] (03PS1) 10Ema: cache_text: upgrade eqsin to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/412693 (https://phabricator.wikimedia.org/T184448) [13:06:55] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411276 (https://phabricator.wikimedia.org/T135991) [13:08:20] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1063" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 [13:08:23] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1063" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 [13:08:32] (03CR) 10Marostegui: [C: 04-2] "Wait for alter table to finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 (owner: 10Marostegui) [13:09:22] (03CR) 10Ema: [C: 032] cache_text: upgrade eqsin to varnish 5 [puppet] - 10https://gerrit.wikimedia.org/r/412693 (https://phabricator.wikimedia.org/T184448) (owner: 10Ema) [13:10:27] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411276 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:10:32] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411276 (https://phabricator.wikimedia.org/T135991) [13:13:43] (03PS2) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [13:14:20] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) 
[13:14:58] (03PS3) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [13:16:26] !log upgrade cache_text@eqsin to varnish 5 [13:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:16] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1063" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 (owner: 10Marostegui) [13:18:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1063" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 (owner: 10Marostegui) [13:18:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1063" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412695 (owner: 10Marostegui) [13:19:16] !log Deploy schema change on dbstore1002 - T187089 T185128 T153182 [13:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:19:33] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [13:19:34] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [13:19:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1063 (duration: 00m 55s) [13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:37] (03CR) 10Gehel: "Puppet compiler seems happy (https://puppet-compiler.wmflabs.org/compiler02/10016/). 
Not sure how to test the prometheus discovery without" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [13:22:35] (03PS2) 10Filippo Giunchedi: prometheus: default retention to 24 weeks [puppet] - 10https://gerrit.wikimedia.org/r/412660 (https://phabricator.wikimedia.org/T160677) [13:30:34] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: default retention to 24 weeks [puppet] - 10https://gerrit.wikimedia.org/r/412660 (https://phabricator.wikimedia.org/T160677) (owner: 10Filippo Giunchedi) [13:31:20] (03CR) 10Jcrespo: [C: 04-2] "I was thinking of disabling puppet until taking it into effect to avoid working twice, but this was just a bit of work preparation in adva" [puppet] - 10https://gerrit.wikimedia.org/r/412678 (owner: 10Jcrespo) [13:33:13] (03PS1) 10Arturo Borrero Gonzalez: toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 [13:33:42] (03CR) 10jerkins-bot: [V: 04-1] toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:34:18] (03CR) 10Marostegui: "> I was thinking of disabling puppet until taking it into effect to" [puppet] - 10https://gerrit.wikimedia.org/r/412678 (owner: 10Jcrespo) [13:37:24] (03PS2) 10Arturo Borrero Gonzalez: toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 [13:37:50] (03CR) 10jerkins-bot: [V: 04-1] toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [13:39:24] 10Operations, 10Cloud-Services, 10hardware-requests, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559#3983387 (10Andrew) [13:39:40] (03PS3) 10Arturo Borrero Gonzalez: toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 [13:44:41] !log roll-restart prometheus after retention period bump [13:44:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:32] (03PS4) 10Arturo Borrero Gonzalez: toollabs: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/412699 [13:53:06] PROBLEM - MegaRAID on db2030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [13:53:08] ACKNOWLEDGEMENT - MegaRAID on db2030 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T187722 [13:53:12] 10Operations, 10ops-codfw: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3983430 (10ops-monitoring-bot) [13:54:00] 10Operations, 10Ops-Access-Requests: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#3983433 (10JAllemandou) [13:57:59] 10Operations, 10Ops-Access-Requests: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#3983455 (10elukey) I support the request, and it might be wise to allow this simple diff for the whole analytics team: ``` diff --git a/modules/admin/data/data... [14:01:59] (03PS1) 10Elukey: admin::data: allow analytics-admins to sudo as yarn [puppet] - 10https://gerrit.wikimedia.org/r/412704 (https://phabricator.wikimedia.org/T187723) [14:04:36] (03PS7) 10Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) [14:04:38] (03PS1) 10Andrew Bogott: create new (transitional) 'newtoolsadmin.wikimedia.org' host on labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/412705 (https://phabricator.wikimedia.org/T168470) [14:07:23] (03PS7) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) [14:07:45] is there a way to get puppet facts for a given toolforge instance using the puppet compiler jenkins job (web)? 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10018/console [14:08:50] (03CR) 10Elukey: [C: 032] profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:09:13] (03PS1) 10Andrew Bogott: add labweb1002 as a second backend for horizon [puppet] - 10https://gerrit.wikimedia.org/r/412706 [14:15:06] (03CR) 10Andrew Bogott: "I have two concerns:" [puppet] - 10https://gerrit.wikimedia.org/r/412699 (owner: 10Arturo Borrero Gonzalez) [14:20:00] (03PS1) 10Elukey: profile::prometheus::burrow_exporter: set 0.0.0.0 as default listen addr [puppet] - 10https://gerrit.wikimedia.org/r/412709 (https://phabricator.wikimedia.org/T180442) [14:20:42] (03CR) 10Elukey: [C: 032] profile::prometheus::burrow_exporter: set 0.0.0.0 as default listen addr [puppet] - 10https://gerrit.wikimedia.org/r/412709 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey) [14:21:29] (03Abandoned) 10Andrew Bogott: add labweb1002 as a second backend for horizon [puppet] - 10https://gerrit.wikimedia.org/r/412706 (owner: 10Andrew Bogott) [14:23:42] (03PS2) 10Andrew Bogott: create new (transitional) 'newtoolsadmin.wikimedia.org' host on labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/412705 (https://phabricator.wikimedia.org/T168470) [14:25:11] (03CR) 10Andrew Bogott: [C: 032] create new (transitional) 'newtoolsadmin.wikimedia.org' host on labweb1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/412705 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [14:32:38] (03PS1) 10Andrew Bogott: labweb: add striker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/412711 [14:33:17] (03CR) 10jerkins-bot: [V: 04-1] labweb: add striker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/412711 (owner: 10Andrew Bogott) [14:36:04] (03PS2) 10Andrew Bogott: labweb: add striker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/412711 
[14:38:30] (03CR) 10Andrew Bogott: [C: 032] labweb: add striker role/profile [puppet] - 10https://gerrit.wikimedia.org/r/412711 (owner: 10Andrew Bogott) [14:48:37] (03CR) 10Filippo Giunchedi: "See inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [14:49:22] (03CR) 10Jon Harald Søby: "Whoever merges this should also run the following script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412081 (https://phabricator.wikimedia.org/T187535) (owner: 10Tulsi Bhagat) [14:49:24] (03CR) 10Andrew Bogott: [C: 04-2] "Probably we aren't going to do T187506 at all, but keeping this around pending discussion" [puppet] - 10https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: 10Andrew Bogott) [14:52:55] (03PS1) 10Andrew Bogott: labweb: add integrated hiera settings, mostly for Striker [puppet] - 10https://gerrit.wikimedia.org/r/412714 [14:58:59] !log testing new dbproxy1010 configuration locally to pool labsdb1010 for analytics [14:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:30] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) [15:00:52] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:02:33] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) [15:02:35] (03PS1) 10Filippo Giunchedi: hieradata: enable SMART on recursor [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) [15:02:49] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo 
Giunchedi) [15:02:58] (03CR) 10jerkins-bot: [V: 04-1] hieradata: enable SMART on recursor [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:05:23] (03PS2) 10Filippo Giunchedi: hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) [15:05:23] take #2 [15:05:25] (03PS2) 10Filippo Giunchedi: hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) [15:05:27] (03PS2) 10Filippo Giunchedi: hieradata: enable SMART on recursor [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) [15:12:25] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 41.80, 41.15, 40.06 [15:14:20] (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/412719 [15:14:25] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 40.60, 40.78, 40.04 [15:15:50] <_joe_> looking at mw1277 [15:20:25] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CRITICAL - load average: 42.22, 40.68, 40.21 [15:21:11] (03PS2) 10Volans: CHANGELOG: add changelogs for release v3.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/412719 [15:21:30] at least one host has been complaining over the weekend. I didn't act because it was only 30 load, which I didn't believe was emergency level (how many cores does one app server have?) 
[15:22:34] (03PS1) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [15:27:25] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 10.44, 18.57, 29.99 [15:33:26] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v3.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/412719 (owner: 10Volans) [15:34:26] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3983773 (10Volans) [15:36:56] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#3983778 (10MarcoAurelio) [15:37:02] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#3983788 (10MarcoAurelio) p:05Triage>03High [15:37:04] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/412719 (owner: 10Volans) [15:38:06] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3983793 (10Marostegui) p:05Triage>03Normal a:03Papaul @Papaul if you got spares, could you replace it? Thanks! 
[15:38:40] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v3.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/412719 (owner: 10Volans) [15:38:52] (03PS2) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [15:40:19] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: update scripts to work with conftool 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/412673 [15:51:05] (03CR) 10Ema: hieradata: enable SMART on recursor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:52:55] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#3983822 (10Nuria) Agreed.Let's do it for the whole team, we want the person with ops duty to be able to look at logs. [15:52:56] (03CR) 10Ema: [C: 031] hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:54:47] (03PS2) 10Andrew Bogott: labweb: add integrated hiera settings, mostly for Striker [puppet] - 10https://gerrit.wikimedia.org/r/412714 [15:57:06] (03CR) 10Ema: [C: 031] hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:59:16] (03PS3) 10Andrew Bogott: labweb: add integrated hiera settings, mostly for Striker [puppet] - 10https://gerrit.wikimedia.org/r/412714 [16:02:40] (03CR) 10Andrew Bogott: [C: 032] labweb: add integrated hiera settings, mostly for Striker [puppet] - 10https://gerrit.wikimedia.org/r/412714 (owner: 10Andrew Bogott) [16:03:28] (03PS4) 10Gehel: elasticsearch: collect elasticsearch metrics on per node percentiles [puppet] - 10https://gerrit.wikimedia.org/r/412670 [16:05:20] (03CR) 10Gehel: elasticsearch: collect 
elasticsearch metrics on per node percentiles (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/412670 (owner: 10Gehel) [16:05:22] (03PS1) 10Andrew Bogott: striker::uwsgi: remove requirement for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/412727 [16:06:04] (03CR) 10Andrew Bogott: [C: 032] striker::uwsgi: remove requirement for Trusty [puppet] - 10https://gerrit.wikimedia.org/r/412727 (owner: 10Andrew Bogott) [16:07:02] (03PS3) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [16:08:47] (03CR) 10Filippo Giunchedi: hieradata: enable SMART on recursor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:09:01] (03PS3) 10Filippo Giunchedi: hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) [16:09:53] !log andrew@tin Started deploy [striker/deploy@8a79195]: deploying striker to labweb1001 and 1002 [16:10:06] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable SMART on bastions [puppet] - 10https://gerrit.wikimedia.org/r/412715 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:10] !log andrew@tin Finished deploy [striker/deploy@8a79195]: deploying striker to labweb1001 and 1002 (duration: 00m 17s) [16:10:16] (03PS3) 10Filippo Giunchedi: hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) [16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:49] !log andrew@tin Started deploy [striker/deploy@8a79195]: deploying striker to labweb1001 and 1002 [16:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] !log andrew@tin Finished deploy 
[striker/deploy@8a79195]: deploying striker to labweb1001 and 1002 (duration: 00m 22s) [16:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:22] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable SMART on authdns [puppet] - 10https://gerrit.wikimedia.org/r/412716 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:13:37] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable SMART on recursor [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [16:13:44] (03PS3) 10Filippo Giunchedi: hieradata: enable SMART on recursor [puppet] - 10https://gerrit.wikimedia.org/r/412717 (https://phabricator.wikimedia.org/T86552) [16:15:36] (03PS1) 10Volans: Upstream release v3.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 [16:17:03] (03PS4) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [16:17:56] (03PS1) 10Andrew Bogott: striker: alternate package requirements for Debian [puppet] - 10https://gerrit.wikimedia.org/r/412729 (https://phabricator.wikimedia.org/T168470) [16:17:59] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:18:41] (03CR) 10Andrew Bogott: [C: 032] striker: alternate package requirements for Debian [puppet] - 10https://gerrit.wikimedia.org/r/412729 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [16:22:02] !log andrew@tin Started deploy [striker/deploy@8a79195]: further attempt to cram striker onto labweb1001 and 1002 [16:22:12] !log andrew@tin Finished deploy [striker/deploy@8a79195]: further attempt to cram striker onto labweb1001 and 1002 (duration: 00m 10s) [16:22:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:26] (03CR) 10Marostegui: [C: 031] dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 (owner: 10Jcrespo) [16:23:54] (03PS1) 10Andrew Bogott: added temporary newtoolsadmin hostname [dns] - 10https://gerrit.wikimedia.org/r/412730 [16:24:19] (03CR) 10Giuseppe Lavagetto: [C: 031] "makes sense. Could use a specific unit test, but I think it's an important fix anyways." (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765) (owner: 10Ema) [16:24:26] (03CR) 10Andrew Bogott: [C: 032] added temporary newtoolsadmin hostname [dns] - 10https://gerrit.wikimedia.org/r/412730 (owner: 10Andrew Bogott) [16:25:34] (03PS5) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [16:25:52] (03PS6) 10Jcrespo: dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 [16:26:32] (03CR) 10Jcrespo: [C: 032] dbproxy: Fix load balancing syntax and apply it to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/412721 (owner: 10Jcrespo) [16:28:17] noop on master templates [16:28:30] testing now the change on load-balancing proxies [16:28:44] aka dbproxy1010 [16:29:57] <_joe_> !log uploading conftool 1.0.0beta1 to reprepro for jessie [16:29:59] nope, didn't work [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:10] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:34:45] that is me, working on it [16:40:26] (03PS1) 10Jcrespo: dbproxy: Apply followup to fix load balancing syntax [puppet] - 10https://gerrit.wikimedia.org/r/412735 [16:41:04] (03CR) 10Jcrespo: [C: 032] dbproxy: Apply followup to fix load balancing syntax [puppet] - 10https://gerrit.wikimedia.org/r/412735 (owner: 10Jcrespo) [16:41:57] moritzm: in which APT component for jessie should I include clustershell 1.8? Is jessie-wikimedia ok? (it's a backport from Debian sid package as is, without local modifications) [16:48:48] (03PS1) 10Jcrespo: dbproxy: Transform the yaml array into a proper hash [puppet] - 10https://gerrit.wikimedia.org/r/412736 [16:49:15] (03CR) 10Jcrespo: [C: 032] dbproxy: Transform the yaml array into a proper hash [puppet] - 10https://gerrit.wikimedia.org/r/412736 (owner: 10Jcrespo) [16:49:23] (03PS1) 10Ema: reload-vcl: discard old VCL after switching to the new one [puppet] - 10https://gerrit.wikimedia.org/r/412737 [16:51:12] volans: main, backports is deprecated (but still around for jessie) [16:51:42] moritzm: so directly into main, not jessie-wikimedia? [16:53:00] I mean in the reprepro command: -C main include COMPONENT [16:53:03] volans: jessie-wikimedia/main yes, if you want to be able to upload to Debian/main I'd be happy to walk you through the Debian maintainer process! [16:53:30] clustershell is already into Debian, but version 1.8 only in sid [16:53:45] reprepro -C main include jessie-wikimedia clustershell_1.8_changes [16:54:06] ok then, where I thought it should go, thanks a lot! 
[16:54:11] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:04:24] !log elukey@tin Started deploy [eventlogging/analytics@8bebdf7]: (no justification provided) [17:04:29] !log elukey@tin Finished deploy [eventlogging/analytics@8bebdf7]: (no justification provided) (duration: 00m 05s) [17:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:20] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:11] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 77400 bytes in 0.298 second response time [17:44:33] (03PS2) 10Elukey: profile::zookeeper::server: remove explicit java-7 dependency [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) [17:45:09] (03CR) 10Elukey: "> I'm not really sure that's correct, the require_package is for" [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [17:47:44] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3983957 (10zeljkofilipin) @EddieGP I would also recommend asking for a... [17:48:10] (03PS1) 10Andrew Bogott: striker: include apache proxy mod [puppet] - 10https://gerrit.wikimedia.org/r/412742 [17:49:12] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Give 'sudo -u yarn' asccess to joal on analytics-hadoop-workers nodes - https://phabricator.wikimedia.org/T187723#3983965 (10RobH) Technically this is modifying a sudo rule, and should be reviewed in the operations weekly team meetings. I'm moving it... 
[17:50:19] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3983970 (10RobH) [17:50:33] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3981063 (10RobH) p:05Triage>03Normal [17:52:25] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3983977 (10RobH) [17:53:59] (03PS1) 10Elukey: Simplify zookeeper's default template to be systemd friendly [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) [17:56:55] (03PS2) 10Andrew Bogott: labweb: include apache proxy mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412742 [17:58:32] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3983982 (10RobH) @katielin: In reviewing the checklist, it seems we're going to need a few more items from you. In... 
[17:58:54] 10Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3983984 (10RobH) [17:59:14] (03PS3) 10Andrew Bogott: labweb: include apache proxy mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412742 [17:59:42] (03CR) 10Andrew Bogott: [C: 032] labweb: include apache proxy mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412742 (owner: 10Andrew Bogott) [17:59:53] (03CR) 10Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler02/10026/" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:01:37] (03CR) 10Elukey: "As far as I can see from zookeeper server it should also not restart the daemons when deployed" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:03:25] (03PS1) 10RobH: new shell user katie lin [puppet] - 10https://gerrit.wikimedia.org/r/412745 (https://phabricator.wikimedia.org/T187623) [18:03:27] (03PS1) 10Andrew Bogott: labweb: include apache expires mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412746 [18:04:13] (03CR) 10Andrew Bogott: [C: 032] labweb: include apache expires mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412746 (owner: 10Andrew Bogott) [18:04:49] !log uploaded clustershell_1.8-1~wmf1_all.deb, python-clustershell_1.8-1~wmf1_all.deb and python3-clustershell_1.8-1~wmf1_all.deb to apt.wikimedia.org jessie-wikimedia [18:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:18] (03PS2) 10Volans: Upstream release v3.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 [18:05:29] (03PS1) 10RobH: adding shell user katie to groups [puppet] - 10https://gerrit.wikimedia.org/r/412747 (https://phabricator.wikimedia.org/T187623) [18:05:50] 
10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3983998 (10RobH) [18:11:18] (03CR) 10Muehlenhoff: profile::zookeeper::server: remove explicit java-7 dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:11:37] (03PS3) 10Volans: Upstream release v3.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 [18:13:40] (03PS1) 10Andrew Bogott: bweb: include apache proxy_http mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412748 [18:13:58] (03PS2) 10Andrew Bogott: labweb: include apache proxy_http mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412748 [18:14:15] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/412744 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:14:50] (03CR) 10Andrew Bogott: [C: 032] labweb: include apache proxy_http mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412748 (owner: 10Andrew Bogott) [18:18:09] (03PS1) 10RobH: adding ldap user cgauthier [puppet] - 10https://gerrit.wikimedia.org/r/412750 (https://phabricator.wikimedia.org/T187720) [18:18:38] (03CR) 10RobH: [C: 032] adding ldap user cgauthier [puppet] - 10https://gerrit.wikimedia.org/r/412750 (https://phabricator.wikimedia.org/T187720) (owner: 10RobH) [18:20:04] (03CR) 10Elukey: profile::zookeeper::server: remove explicit java-7 dependency (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [18:21:15] (03PS3) 10Elukey: profile::zookeeper::server: remove explicit java-7 dependency [puppet] - 10https://gerrit.wikimedia.org/r/410957 (https://phabricator.wikimedia.org/T166081) [18:21:33] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: 
Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3984031 (10RobH) a:03MoritzMuehlenhoff This seems to be pending @MoritzMuehlenhoff's confirmation with legal regarding NDA requirements. [18:24:57] (03PS1) 10Andrew Bogott: bweb: include apache proxy_balancer mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412752 [18:25:32] (03PS2) 10Andrew Bogott: labweb: include apache proxy_balancer mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412752 [18:26:11] (03CR) 10Andrew Bogott: [C: 032] labweb: include apache proxy_balancer mod for striker [puppet] - 10https://gerrit.wikimedia.org/r/412752 (owner: 10Andrew Bogott) [18:26:45] (03CR) 10Elukey: [C: 04-2] "Wait for the ops meeting's approval before proceeding." [puppet] - 10https://gerrit.wikimedia.org/r/412704 (https://phabricator.wikimedia.org/T187723) (owner: 10Elukey) [18:31:06] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3984060 (10MoritzMuehlenhoff) Rachel wanted to doublecheck within the WMF Legal department, but that will take until tomorrow at least due to the WMF holiday for U... [18:34:11] (03PS1) 10Andrew Bogott: labweb: include lbmethod_byrequests [puppet] - 10https://gerrit.wikimedia.org/r/412754 [18:35:08] (03CR) 10Andrew Bogott: [C: 032] labweb: include lbmethod_byrequests [puppet] - 10https://gerrit.wikimedia.org/r/412754 (owner: 10Andrew Bogott) [18:44:08] (03PS1) 10Giuseppe Lavagetto: Improve debianization; change source package name. [software/conftool] - 10https://gerrit.wikimedia.org/r/412755 [18:45:39] (03CR) 10jerkins-bot: [V: 04-1] Improve debianization; change source package name. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/412755 (owner: 10Giuseppe Lavagetto) [18:48:36] PROBLEM - mysqld processes on db2030 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:48:37] PROBLEM - Check systemd state on db2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:48:37] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [18:48:51] PROBLEM - Disk space on db2030 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [18:48:54] * volans looking [18:48:57] PROBLEM - MariaDB disk space on db2030 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [18:49:10] I am here [18:49:12] <_joe_> uh [18:49:14] That host had a broken disk earlier [18:49:16] <_joe_> that seems bad [18:49:18] <_joe_> oh ok [18:49:18] it's gone [18:49:24] Input/output error [18:49:24] I guess controller issue... [18:49:26] Right... [18:49:30] <_joe_> yeah, I would say it has more than one broken disk [18:49:36] Yeah, clearly [18:49:38] <_joe_> can I go back to dinner? :P [18:49:43] _joe_: yep! [18:49:45] <_joe_> it's already depooled I guess [18:49:50] Yeah, it is an m5 codfw slave [18:49:53] so you can go [18:49:55] thanks! [18:49:56] <_joe_> what's up with dbproxy1005 though? [18:50:17] <_joe_> well, ttyl [18:50:18] did it fail? [18:50:28] marostegui: I'm here anyway [18:50:39] volans: can you disable notifications for it? [18:50:46] _joe_: nah, because db2030 failed [18:50:55] jynus: yeah, that host that had a broken disk earlier... [18:50:58] sure [18:51:04] volans: thanks! [18:51:14] marostegui: according to /usr/local/lib/nagios/plugins/get-raid-status-megacli only one disk is failed [18:51:26] Yeah, I saw that earlier [18:51:36] but [18:51:37] RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0 [18:51:37] State: =====> Offline <===== [18:51:41] not degraded [18:51:54] so the RAID is kinda gone? 
[18:52:10] I am checking logs [18:52:19] Might be that the controller totally failed [18:52:30] don't waste time, disable alerts, create a ticket and we will have a look tomorrow [18:52:34] nothing is broken [18:52:41] I can create the ticket and all that [18:52:42] marostegui: I just replaced the disk on db2030 [18:52:46] Oh [18:52:57] maybe you replaced the wrong one :-) [18:52:59] !log disabled all notifications on Icinga for db2030 [18:53:03] Maybe that triggered a controller failure [18:53:06] I am checking the logs [18:53:09] that, too [18:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:16] in any case, don't worry [18:53:24] I was about to leave, but can take care of this [18:53:28] it is not an emergency [18:53:40] papaul: which disk did you replace? [18:53:48] volans: slot 5 [18:54:02] the logs say [18:54:11] is the host still up? [18:54:15] RecordData = *4*Drive 4 is removed. [18:54:20] he he [18:54:22] RecordData = *2*Drive 4 is installed. [18:54:25] what is curious is that from the get-raid-status-megacli it seems that the errors for the slot 4 disappeared [18:54:30] compared to T187722 [18:54:30] T187722: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722 [18:54:31] but it could be it counts 0 or not [18:54:40] the ones for slot 5 are the same as in the task [18:55:09] I still think we should figure everything out tomorrow [18:55:32] this is a passive replica as of now [18:55:46] agreed [18:55:57] that host was going to be decommissioned anyways too [18:55:57] we will investigate tomorrow [18:56:16] is the host down, or is it still technically up?
[18:56:17] jynus: ok, to reply to your question, OS is still UP for whatever is already in RAM, but no disk I/O is possible [18:56:22] sure [18:56:28] jynus: up but with I/O errors on any command :) [18:56:30] I was just asking to ack the host too [18:56:40] I will do it [18:56:48] I've disabled host+all checks [18:56:54] notifications [18:57:44] down also so it doesn't appear on icinga [18:57:51] including the host [18:57:59] ack, will do [18:58:12] I already did [18:58:22] I will not ack haproxy [18:58:38] so we see it if by bad luck the primary server goes down [18:58:43] but I will leave a comment [18:58:55] ok [18:58:57] thx [19:11:38] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3984130 (10Papaul) @chasemp I have no hosts whit the name labneutron2001 and labneutron2002. Are you referring to labtestneutron2001 and labtestneutron2002? [19:12:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#3984131 (10MarcoAurelio) `PROBLEM - Puppet errors on deployment-secureredirexperiment is CRITICAL: CRITICAL: 100.00% of data above t... [19:15:33] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3984136 (10Marostegui) This host has completely failed - looks like (as per my chat with @Papaul that there were two disks blinking badly on the server chassis), the one replaced was not yet detected by mega...
[19:19:51] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3984145 (10Papaul) stat2001 port information asw-a1-codfw ge-1/0/0 wmf3641 port information asw-b5-codfw ge-5/0/6 [19:22:50] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3984146 (10Papaul) [19:29:01] !log uploaded python3-requests-mock, python-requests-mock and python-requests-mock-doc for version 1.3.0-3~wmf1 to apt.wikimedia.org jessie-wikimedia [19:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:21] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2030 - https://phabricator.wikimedia.org/T187722#3984170 (10Marostegui) Also we could remove 2x160GB from s6 with a large server and use one of those 160GB ones for this replacement [19:34:05] (03CR) 10Volans: "recheck" [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 (owner: 10Volans) [19:37:04] 10Operations, 10Performance-Team, 10Traffic, 10Performance: missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3984187 (10Gilles) Now that HTTP/2 is a thing and Zero won't be anymore, could we put connection coalescing for upload.wikim... [19:41:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:43:47] (03CR) 10Volans: [C: 032] "Build fine on boron, despite the error in CI, I will open a task to check that." 
[software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 (owner: 10Volans) [19:46:50] (03Merged) 10jenkins-bot: Upstream release v3.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412728 (owner: 10Volans) [20:03:40] !log uploaded cumin_3.0.0-1_amd64.deb to apt.wikimedia.org jessie-wikimedia [20:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:30] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:08:25] (03PS1) 10Volans: CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 [21:11:53] (03CR) 10jerkins-bot: [V: 04-1] CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 (owner: 10Volans) [21:12:52] (03PS2) 10Volans: CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 [21:17:39] (03CR) 10Volans: [C: 032] CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 (owner: 10Volans) [21:19:51] no_justification: twentyafterfour that BadMethodCallException can probably go back from High to Normal now IMO (see T187731) [21:21:12] (03Merged) 10jenkins-bot: CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 (owner: 10Volans) [21:22:32] (03CR) 10jenkins-bot: CLI: fix help message [software/cumin] - 10https://gerrit.wikimedia.org/r/412761 (owner: 10Volans) [21:22:48] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#3984337 (10MarcoAurelio) `PROBLEM - Puppet errors on deployment-mx02 is CRITICAL: CRITICAL: 100.00% of data above the critical thres... 
[21:26:06] (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.0.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/412800 [21:28:09] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3984342 (10EddieGP) >>! In T176754#3983957, @zeljkofilipin wrote: > @Ed... [21:29:16] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v3.0.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/412800 (owner: 10Volans) [21:31:51] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.0.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/412800 (owner: 10Volans) [21:33:04] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#3983778 (10Paladox) I think this was just deleted but was never removed from shinken. [21:33:07] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v3.0.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/412800 (owner: 10Volans) [21:36:05] (03PS1) 10Volans: Upstream release v3.0.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412813 [21:47:00] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3984368 (10chasemp) [21:47:28] 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labtestneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3978705 (10chasemp) >>! In T187552#3984130, @Papaul wrote: > @chasemp I have no hosts whit the name labneutron2001 and labneutron2002. Are you referring to labtestneutron2001 and l... 
[21:54:25] (03CR) 10Volans: [C: 032] Upstream release v3.0.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412813 (owner: 10Volans) [21:56:54] (03Merged) 10jenkins-bot: Upstream release v3.0.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/412813 (owner: 10Volans) [21:57:02] 10Operations, 10Traffic, 10Performance, 10Performance-Team (Radar): missing H2 coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#3984383 (10Gilles) [22:03:33] !log uploaded cumin_3.0.1-1_amd64.deb to apt.wikimedia.org jessie-wikimedia [22:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:22] any ops available? [23:11:30] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:20] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 77414 bytes in 0.300 second response time [23:19:48] (03PS1) 10Krinkle: mediawiki: Enable auto_prepend_file setting for HHVM on Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) [23:22:50] (03CR) 10Krinkle: "Could use feedback on whether this is the right approach." [puppet] - 10https://gerrit.wikimedia.org/r/412827 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [23:26:59] (03PS5) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) [23:27:27] (03CR) 10Krinkle: "Rebased on latest head, and also verified that the file is still byte-identical to the current file in mediawiki-config.git."
[puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [23:29:35] (03PS6) 10Krinkle: mediawiki/hhvm: Move fatal-error.php to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) [23:54:28] 10Operations, 10Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754#3984504 (10Krinkle) [23:58:38] 10Operations, 10Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754#3984519 (10Krinkle) [23:58:47] 10Operations, 10Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754#3984504 (10Krinkle) a:05Krinkle>03None