[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T0000).
[00:08:59] (CR) BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1002/17310/" [puppet] - https://gerrit.wikimedia.org/r/522008 (owner: BryanDavis)
[00:29:28] Operations, Analytics, Traffic, Browser-Support-Apple-Safari, and 3 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921 (leila)
[00:43:34] Operations, MediaWiki-Containers, Release-Engineering-Team, Core Platform Team Workboards (Done with CPT), and 4 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456 (Pchelolo)
[00:46:55] (CR) Krinkle: "Even if just for a few days, might be worth doing in mc-labs.php first just to be safe. Or at least to set in both."
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/521967 (owner: Aaron Schulz)
[01:16:11] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:16:25] PROBLEM - cassandra-b SSL 10.64.16.127:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[01:16:41] PROBLEM - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.16.125 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[01:16:45] PROBLEM - cassandra-c SSL 10.64.16.128:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[01:17:01] PROBLEM - cassandra-c service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:17:15] PROBLEM - cassandra-b service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:17:23] PROBLEM - cassandra-b CQL 10.64.16.127:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.127 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:17:25] PROBLEM - cassandra-c CQL 10.64.16.128:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.128 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:17:29] PROBLEM - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.126 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:17:33] PROBLEM - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
https://phabricator.wikimedia.org/T120662
[01:18:39] PROBLEM - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 18 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[cassandra/metrics-collector],Package[restbase/deploy],Package[cassandra/logstash-logback-encoder],Package[cassandra/twcs]
[01:36:25] PROBLEM - puppet last run on db1113 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[02:03:41] RECOVERY - puppet last run on db1113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:22:13] (PS1) Felipe L. Ewald: Removed dead FTP link from C3SL. [puppet] - https://gerrit.wikimedia.org/r/522015
[02:23:47] (PS2) Felipe L. Ewald: Removed dead FTP link from C3SL. [puppet] - https://gerrit.wikimedia.org/r/522015
[02:41:06] ACKNOWLEDGEMENT - Restbase root url on restbase1017 is CRITICAL: connect to address 10.64.16.125 and port 7231: Connection refused eevans Down until its not (T222960). https://wikitech.wikimedia.org/wiki/RESTBase
[02:41:06] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.126 and port 9042: Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T93886
[02:41:06] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T120662
[02:41:06] ACKNOWLEDGEMENT - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive eevans Down until its not (T222960).
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:41:06] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.16.127:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.127 and port 9042: Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T93886
[02:41:06] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.16.127:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T120662
[02:41:06] ACKNOWLEDGEMENT - cassandra-b service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive eevans Down until its not (T222960). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:41:07] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.16.128:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.128 and port 9042: Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T93886
[02:41:07] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.16.128:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Down until its not (T222960). https://phabricator.wikimedia.org/T120662
[02:41:08] ACKNOWLEDGEMENT - cassandra-c service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive eevans Down until its not (T222960). https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:41:08] ACKNOWLEDGEMENT - puppet last run on restbase1017 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 10 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[cassandra/metrics-collector],Package[restbase/deploy],Package[cassandra/logstash-logback-encoder],Package[cassandra/twcs] eevans Down until its not (T222960).
[02:44:57] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 93777912 and 3 seconds
[02:45:01] RECOVERY - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-a valid until 2020-06-24 13:01:17 +0000 (expires in 349 days) https://phabricator.wikimedia.org/T120662
[02:45:15] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:45:50] mutante: are we OK to bootstrap restbase1017?
[02:46:25] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 841864 and 0 seconds
[02:46:36] hrrm, maybe not quite yet
[02:49:21] ACKNOWLEDGEMENT - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Rebuilding (T222960)
[02:49:21] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Rebuilding (T222960) https://phabricator.wikimedia.org/T120662
[02:49:21] ACKNOWLEDGEMENT - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans Rebuilding (T222960) https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:43:08] (CR) Jforrester: [C: +1] Sort wmgMonologChannels alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/521901 (owner: Krinkle)
[03:43:13] (CR) Jforrester: [C: +1] Remove dead 'wmgMonologChannels' entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521902 (owner: Krinkle)
[04:44:35] (PS1) ArielGlenn: handle exception when setting up Wiki object for misc dumps [dumps] - https://gerrit.wikimedia.org/r/522018 (https://phabricator.wikimedia.org/T227730)
[04:45:36] PROBLEM - Check systemd state on cp4029 is CRITICAL:
CRITICAL - degraded: The system is operational but one or more units failed.
[04:54:25] Operations, Release-Engineering-Team, Release-Engineering-Team-TODO, Category, and 3 others: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 (greg) Open→Resolved Being bold and closing this old annual plan goal tracking task.
[04:58:24] (CR) ArielGlenn: [C: +2] Removed dead FTP link from C3SL. [puppet] - https://gerrit.wikimedia.org/r/522015 (owner: Felipe L. Ewald)
[04:58:56] (CR) ArielGlenn: [C: +2] "recheck" [puppet] - https://gerrit.wikimedia.org/r/522015 (owner: Felipe L. Ewald)
[05:09:44] (CR) ArielGlenn: [C: +2] "Thanks for the fix!" [puppet] - https://gerrit.wikimedia.org/r/522015 (owner: Felipe L. Ewald)
[05:53:42] (PS1) Elukey: profile::hadoop::spark2: limit the driver block manager's ports [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826)
[05:54:38] (CR) jerkins-bot: [V: -1] profile::hadoop::spark2: limit the driver block manager's ports [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[05:56:58] (PS2) Elukey: profile::hadoop::spark2: limit the driver block manager's ports [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826)
[06:03:41] (PS1) MaxSem: Remove mentions of Zero [mediawiki-config] - https://gerrit.wikimedia.org/r/522026
[06:09:20] Operations, Release-Engineering-Team, Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (MaxSem)
[06:09:40] Operations, Release-Engineering-Team, Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (MaxSem)
[06:13:17] (PS3) Elukey: profile::hadoop::spark2: limit the driver block manager's ports [puppet] -
https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826)
[06:14:30] (PS4) Elukey: profile::hadoop::spark2: limit the driver block manager's ports [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826)
[06:17:16] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17311/" [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[06:17:22] (CR) Elukey: [C: +2] profile::hadoop::spark2: limit the driver block manager's ports [puppet] - https://gerrit.wikimedia.org/r/522024 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[06:28:43] (PS1) Elukey: profile::statistics::gpu: add rccl package [puppet] - https://gerrit.wikimedia.org/r/522029 (https://phabricator.wikimedia.org/T224723)
[06:30:24] PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:28] (CR) Elukey: [C: +2] profile::statistics::gpu: add rccl package [puppet] - https://gerrit.wikimedia.org/r/522029 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[06:32:16] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:33:16] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:50:37] (PS1) Elukey: aptrepo: pin a specific version of the AMD ROCm suite [puppet] - https://gerrit.wikimedia.org/r/522031 (https://phabricator.wikimedia.org/T224723)
[06:52:56] (CR) Elukey: [C: +2] aptrepo: pin a specific version of the AMD ROCm suite [puppet] - https://gerrit.wikimedia.org/r/522031 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[06:55:25] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/521903 (https://phabricator.wikimedia.org/T198939) (owner: Jbond)
[06:57:44] RECOVERY - puppet last run on db2085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:34] RECOVERY - puppet last run on analytics1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:07] (CR) Muehlenhoff: "Yeah, I wanted to keep it consistent with the other existing systems with that role." [dns] - https://gerrit.wikimedia.org/r/521895 (owner: Muehlenhoff)
[07:00:11] (PS3) Muehlenhoff: Add DNS entries for ldap-codfw-replica* [dns] - https://gerrit.wikimedia.org/r/521895
[07:00:36] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:04:33] (CR) Muehlenhoff: [C: +2] Add DNS entries for ldap-codfw-replica* [dns] - https://gerrit.wikimedia.org/r/521895 (owner: Muehlenhoff)
[07:10:40] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[07:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:51] !log jmm@cumin2001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)
[07:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:17] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[07:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:28] (rebooting stat1005 for some tests)
[07:21:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm
(exit_code=0)
[07:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:36] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[07:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[07:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:31] (PS1) Jcrespo: prometheus-mysqld-exporter: move variable to profile [labs/private] - https://gerrit.wikimedia.org/r/522032 (https://phabricator.wikimedia.org/T143896)
[07:35:16] (CR) Jcrespo: [V: +2 C: +2] prometheus-mysqld-exporter: move variable to profile [labs/private] - https://gerrit.wikimedia.org/r/522032 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[07:43:01] !log installing ldap-codfw-replica* T227669
[07:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:06] T227669: codfw: 2 VMs for LDAP replicas - https://phabricator.wikimedia.org/T227669
[07:57:25] (CR) Jcrespo: [C: -1] "Error: Failed to compile catalog for node prometheus1003.eqiad.wmnet: Evaluation Error: Error while evaluating a Function Call, Could not " [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[08:10:17] (PS2) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:13:22] PROBLEM - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:14:11] !log cp-ulsfo: downgrade mtail to 3.0.0~rc5-1~bpo9+1 to fix varnishmtail-backend T225604
[08:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:17] T225604: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604
[08:18:12] PROBLEM - Check systemd state on cp4022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:18:21] known ^
[08:19:38] (PS1) Urbanecm: Remove commonswiki from mobilemainpagelegacy [mediawiki-config] - https://gerrit.wikimedia.org/r/522035 (https://phabricator.wikimedia.org/T227719)
[08:19:42] RECOVERY - Check systemd state on cp4022 is OK: OK - running: The system is fully operational
[08:20:11] (PS1) Elukey: profile::statistics::gpu: add rocm-libs [puppet] - https://gerrit.wikimedia.org/r/522036 (https://phabricator.wikimedia.org/T224723)
[08:20:22] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational
[08:22:04] (PS3) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:23:37] (CR) Elukey: [C: +2] profile::statistics::gpu: add rocm-libs [puppet] - https://gerrit.wikimedia.org/r/522036 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[08:27:42] ACKNOWLEDGEMENT - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
Gehel data import in progress
[08:36:43] (PS4) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:37:28] (CR) jerkins-bot: [V: -1] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[08:42:40] (PS5) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:43:29] (CR) jerkins-bot: [V: -1] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[08:46:22] (PS6) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:47:12] (CR) jerkins-bot: [V: -1] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[08:48:56] (PS1) Elukey: aptrepo: add thirdparty/amd-rocm25 [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723)
[08:50:18] (PS7) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:51:03] (CR) jerkins-bot: [V: -1] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[08:51:15] !log upload mtail 3.0.0~rc5-1~bpo9+1wmf1 to stretch-wikimedia - T225604
[08:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:21] T225604: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604
[08:52:17] (PS8) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate
targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[08:59:13] (PS1) Jcrespo: prometheus: move prometheus secrets back to the original role [labs/private] - https://gerrit.wikimedia.org/r/522040 (https://phabricator.wikimedia.org/T143896)
[09:00:34] (PS9) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:00:59] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 4: Code-Review+1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/521826 (https://phabricator.wikimedia.org/T220784) (owner: Elukey)
[09:03:10] (CR) Muehlenhoff: [C: -1] aptrepo: add thirdparty/amd-rocm25 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[09:04:21] (CR) Elukey: aptrepo: add thirdparty/amd-rocm25 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[09:04:58] fixed --^
[09:04:59] (PS2) Elukey: aptrepo: add thirdparty/amd-rocm25 [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723)
[09:06:56] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[09:07:40] (CR) Elukey: [C: +2] aptrepo: add thirdparty/amd-rocm25 [puppet] - https://gerrit.wikimedia.org/r/522039 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[09:08:15] (PS10) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:19:37] (PS11) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:22:44] (PS1) Elukey: profile::statistics::gpu: switch to thirdparty/amd-rocm25
[puppet] - https://gerrit.wikimedia.org/r/522042 (https://phabricator.wikimedia.org/T224723)
[09:23:53] (PS12) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:24:11] (CR) Alexandros Kosiaris: [C: +1] puppetmaster: remove severmon custom reporter [puppet] - https://gerrit.wikimedia.org/r/521903 (https://phabricator.wikimedia.org/T198939) (owner: Jbond)
[09:24:13] (CR) Elukey: [C: +2] profile::statistics::gpu: switch to thirdparty/amd-rocm25 [puppet] - https://gerrit.wikimedia.org/r/522042 (https://phabricator.wikimedia.org/T224723) (owner: Elukey)
[09:24:19] (CR) Jcrespo: [V: +2 C: +2] prometheus: move prometheus secrets back to the original role [labs/private] - https://gerrit.wikimedia.org/r/522040 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[09:27:45] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:27:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:16] (PS1) Elukey: profile::statistics::gpu: fix relationship [puppet] - https://gerrit.wikimedia.org/r/522043
[09:30:51] (CR) Elukey: [C: +2] profile::statistics::gpu: fix relationship [puppet] - https://gerrit.wikimedia.org/r/522043 (owner: Elukey)
[09:31:14] !log disabling puppet temporarily (for puppetdb reboots)
[09:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:19] (PS13) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:35:48] !log rebooting puppetdb2001 to pick up MDS-enabled qemu
[09:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:10]
Operations, Wikimedia-Mailing-lists, Space (Jan-Mar-2020): Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (Qgil) a:Qgil→None Unassigning only to keep my current backlog sane. It will come.
[09:39:48] (CR) Filippo Giunchedi: "See inline, LGTM overall!" (4 comments) [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[09:40:17] (PS14) Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - https://gerrit.wikimedia.org/r/521852
[09:40:45] jouncebot, next
[09:40:45] In 1 hour(s) and 19 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1100)
[09:42:50] (Abandoned) DCausse: [cirrus] add cloudelastic service [mediawiki-config] - https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: DCausse)
[09:45:05] (CR) Jcrespo: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1001/17320/" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[09:45:51] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (fgiunchedi) >>! In T184942#5320841, @Gehel wrote: >>>! In T184942#5306856, @fgiunchedi wrote: >> I don't know offhand, although I'd be interested to k...
[09:47:18] !log rebooting puppetdb1001 to pick up MDS-enabled qemu
[09:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:16] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/521852 (owner: Jcrespo)
[09:56:37] !log re-enabling puppet (puppetdb reboots completed)
[09:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:01] Is tools down?
[10:02:09] meh :p
[10:02:33] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (Gehel) >>! In T184942#5323915, @fgiunchedi wrote: > Since the maps performance dashboard with graphite metrics is broken anyways ATM I think it makes...
[10:08:45] !log creating swift docker_registry_container_backup T227570
[10:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:51] T227570: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570
[10:13:12] (CR) Filippo Giunchedi: [C: +1] "Ready to remove varnishstatsd, PTAL" [puppet] - https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) (owner: Filippo Giunchedi)
[10:14:00] (CR) jenkins-bot: Drop the ability to use ZeroBanner and ZeroPortal from production [mediawiki-config] - https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[10:14:04] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[10:14:21] (CR) jenkins-bot: Stop configuring ZeroBanner and ZeroPortal, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[10:14:31] (CR) jenkins-bot: Stop loading i18n for ZeroBanner and ZeroPortal, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[10:14:52] (CR) jenkins-bot: Remove /w/skin-1.5 symlink [mediawiki-config] - https://gerrit.wikimedia.org/r/521562 (https://phabricator.wikimedia.org/T156319) (owner: Krinkle)
[10:15:52] (PS8) Mvolz: Enable reftabs on testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197)
[10:17:58] (CR) Jbond:
[C: +2] puppetmaster: remove severmon custom reporter [puppet] - https://gerrit.wikimedia.org/r/521903 (https://phabricator.wikimedia.org/T198939) (owner: Jbond)
[10:18:06] (PS3) Jbond: puppetmaster: remove severmon custom reporter [puppet] - https://gerrit.wikimedia.org/r/521903 (https://phabricator.wikimedia.org/T198939)
[10:21:42] (PS5) Elukey: Add prometheus node exporter for AMD ROCm's GPU stats [puppet] - https://gerrit.wikimedia.org/r/521826 (https://phabricator.wikimedia.org/T220784)
[10:23:43] (PS1) Muehlenhoff: Add DHCP entries for ldap-codfw-replica* and add to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/522053 (https://phabricator.wikimedia.org/T227669)
[10:24:17] (CR) jerkins-bot: [V: -1] Add DHCP entries for ldap-codfw-replica* and add to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/522053 (https://phabricator.wikimedia.org/T227669) (owner: Muehlenhoff)
[10:25:04] (CR) Elukey: [C: +2] Add prometheus node exporter for AMD ROCm's GPU stats [puppet] - https://gerrit.wikimedia.org/r/521826 (https://phabricator.wikimedia.org/T220784) (owner: Elukey)
[10:25:57] (PS2) Muehlenhoff: Add DHCP entries for ldap-codfw-replica* and add to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/522053 (https://phabricator.wikimedia.org/T227669)
[10:28:07] (PS3) Muehlenhoff: Add DHCP entries for ldap-codfw-replica* and add to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/522053 (https://phabricator.wikimedia.org/T227669)
[10:28:37] !log depooling ms-fe2005 for docker_registry_backups T227570
[10:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:43] T227570: docker registry swift replication is not replicating content between DCs - https://phabricator.wikimedia.org/T227570
[10:29:28] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:32:25] (CR) Muehlenhoff: [C: +2] Add DHCP entries for ldap-codfw-replica* and add to netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/522053 (https://phabricator.wikimedia.org/T227669) (owner: Muehlenhoff)
[10:35:37] (PS1) Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548)
[10:41:32] Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (MoritzMuehlenhoff) These updates have been fully deployed: ` firmware-nonfree `
[10:43:31] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (elukey) https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu
[10:43:54] Operations, Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created - https://phabricator.wikimedia.org/T220784 (elukey)
[10:44:40] (PS2) Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548)
[10:45:01] !log installing ldap-codfw-replica*
[10:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:36] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1100).
[11:00:04] dcausse and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:12] o/
[11:00:21] commands must be followed, so let's SWAT :)
[11:00:24] dcausse, want to start?
[11:00:28] Urbanecm: sure
[11:00:59] I'm C+2'ing my backports to give time for CI
[11:01:10] (CR) DCausse: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)
[11:01:19] ok
[11:02:23] (Merged) jenkins-bot: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder [mediawiki-config] - https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)
[11:02:25] (PS2) DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group2 [mediawiki-config] - https://gerrit.wikimedia.org/r/520769
[11:02:28] (CR) jenkins-bot: [cirrus] Use correct factory declaration for EntityFullTextQueryBuilder [mediawiki-config] - https://gerrit.wikimedia.org/r/516750 (https://phabricator.wikimedia.org/T216429) (owner: DCausse)
[11:05:01] sigh...
failure [11:05:43] (03PS1) 10DCausse: Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522057 [11:05:53] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522057 (owner: 10DCausse) [11:06:48] (03Merged) 10jenkins-bot: Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522057 (owner: 10DCausse) [11:07:33] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520769 (owner: 10DCausse) [11:08:12] (03PS15) 10Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - 10https://gerrit.wikimedia.org/r/521852 (https://phabricator.wikimedia.org/T143896) [11:08:14] (03CR) 10jenkins-bot: Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522057 (owner: 10DCausse) [11:08:23] (03PS16) 10Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - 10https://gerrit.wikimedia.org/r/521852 (https://phabricator.wikimedia.org/T143896) [11:08:31] (03Merged) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520769 (owner: 10DCausse) [11:09:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - 10https://gerrit.wikimedia.org/r/521852 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [11:10:18] (03CR) 10jenkins-bot: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520769 (owner: 10DCausse) [11:10:29] (03PS17) 10Jcrespo: Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - 
10https://gerrit.wikimedia.org/r/521852 (https://phabricator.wikimedia.org/T143896) [11:11:21] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"" [puppet] - 10https://gerrit.wikimedia.org/r/521852 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [11:12:57] dcausse, do you know, if I'm watching https://logstash.wikimedia.org/goto/7740298eee10bab011b6101af26135c1, is it auto-refreshed, or do I need to F5 when I want fresh data? [11:13:32] there should be an option to enable autorefresh [11:13:34] Urbanecm: it depends on if Auto-refresh is Off or other [11:13:34] Urbanecm: I generally hit refresh but perhaps it's automatic [11:13:46] what jynus said [11:13:57] from what I remember from logstash-beta [11:14:01] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group2 (duration: 01m 02s) [11:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:21] jynus, how can I check if "autorefresh" is enabled? [11:14:23] (or hauskatze ) [11:14:30] click the range [11:14:33] then autorefresh [11:14:42] the time options above [11:14:45] (03PS1) 10Jcrespo: Revert "prometheus: move prometheus secrets back to the original role" [labs/private] - 10https://gerrit.wikimedia.org/r/522058 [11:14:47] thanks jynus [11:14:49] I don't have access right now [11:14:53] found it [11:15:16] (03PS2) 10Jcrespo: Revert "prometheus: move prometheus secrets back to the original role" [labs/private] - 10https://gerrit.wikimedia.org/r/522058 [11:15:26] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "prometheus: move prometheus secrets back to the original role" [labs/private] - 10https://gerrit.wikimedia.org/r/522058 (owner: 10Jcrespo) [11:15:50] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate-mysqld-exporter-config] [11:16:25] Urbanecm: I'm done [11:16:28] thanks dcausse [11:16:29] (03PS1) 10Jcrespo: Revert "Revert "prometheus: move prometheus secrets back to the original role"" [labs/private] - 10https://gerrit.wikimedia.org/r/522059 [11:16:38] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "Revert "prometheus: move prometheus secrets back to the original role"" [labs/private] - 10https://gerrit.wikimedia.org/r/522059 (owner: 10Jcrespo) [11:16:45] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522035 (https://phabricator.wikimedia.org/T227719) (owner: 10Urbanecm) [11:16:50] (03PS2) 10Urbanecm: Remove usergroup communityapps from officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521933 (https://phabricator.wikimedia.org/T227680) [11:16:55] (03PS1) 10Jcrespo: Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db""" [puppet] - 10https://gerrit.wikimedia.org/r/522060 [11:16:57] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521933 (https://phabricator.wikimedia.org/T227680) (owner: 10Urbanecm) [11:17:03] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db""" [puppet] - 10https://gerrit.wikimedia.org/r/522060 (owner: 10Jcrespo) [11:17:43] (03Merged) 10jenkins-bot: Remove commonswiki from mobilemainpagelegacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522035 (https://phabricator.wikimedia.org/T227719) (owner: 10Urbanecm) [11:17:58] (03Merged) 10jenkins-bot: Remove usergroup communityapps from officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521933 (https://phabricator.wikimedia.org/T227680) (owner: 10Urbanecm) [11:19:37] (03CR) 10jenkins-bot: Remove commonswiki from mobilemainpagelegacy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522035 
(https://phabricator.wikimedia.org/T227719) (owner: 10Urbanecm) [11:19:54] !log urbanecm@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: SWAT: [[:gerrit:522035|Remove commonswiki from mobilemainpagelegacy]] (T227719) (duration: 00m 58s) [11:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:11] T227719: Turn off mobile main page special casing on Commons - https://phabricator.wikimedia.org/T227719 [11:20:23] (03PS1) 10Jcrespo: Revert "Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"""" [puppet] - 10https://gerrit.wikimedia.org/r/522061 [11:21:34] I am stupid [11:22:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521933|Remove usergroup communityapps from officewiki]] (T227680) (duration: 01m 02s) [11:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] T227680: Remove usergroup communityapps from officewiki - https://phabricator.wikimedia.org/T227680 [11:23:26] (03PS2) 10Jcrespo: Revert "Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"""" [puppet] - 10https://gerrit.wikimedia.org/r/522061 (https://phabricator.wikimedia.org/T143896) [11:24:00] !log urbanecm@deploy1001 Started scap: Namespace translation for Punjabi (T226959) [11:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:05] T226959: Adding new Namespaces and renaming some in Punjabi language at Punjabi Wikisource. 
- https://phabricator.wikimedia.org/T226959 [11:24:19] ^^full scap is required, because this is namespace translations^^ [11:24:51] (03PS3) 10Jcrespo: Revert "Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"""" [puppet] - 10https://gerrit.wikimedia.org/r/522061 (https://phabricator.wikimedia.org/T143896) [11:25:44] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "Revert "Revert "prometheus-mysqld-exporter: Automate targets based on zarcillo db"""" [puppet] - 10https://gerrit.wikimedia.org/r/522061 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [11:27:05] how's scap going Urbanecm ? [11:27:38] hauskatze, fine so far, change should be synced by now, updating localisation cache for wmf.11 now [11:28:18] I'm keeping an eye on pa.wikisource recent changes to see when the namespace names change [11:28:35] but will leave in some minutes [11:28:40] lunch with colleagues [11:28:44] I'm keeping an eye on fatalmonitor for sanity's sake :D [11:28:52] even better heh [11:29:21] no change so far (and i hope no change will be there) [11:30:25] namespacedupes required afterwards? [11:30:45] I think we should doc all this process if it was not clear [11:30:50] methinks [11:30:56] depends if there are any pages with "translated namespace" by now [11:30:59] will run it anyway, to be sure [11:31:11] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [11:32:56] 10Operations, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) `lang=bash root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-* -r--r--r-- 1 root root 2592 Jul 11 11:27 mysql-core_codfw.yaml -r--r--r-- 1 root... 
[11:33:07] 10Operations, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) [11:38:25] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:46:43] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CRITICAL - load average: 182.56, 101.18, 44.31 https://wikitech.wikimedia.org/wiki/Swift [11:50:39] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 57.99, 22.76, 13.20 https://wikitech.wikimedia.org/wiki/Application_servers [11:50:41] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CRITICAL - load average: 62.16, 28.12, 18.43 https://wikitech.wikimedia.org/wiki/Application_servers [11:50:55] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 69.51, 27.48, 17.53 https://wikitech.wikimedia.org/wiki/Application_servers [11:51:33] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 69.34, 30.26, 18.79 https://wikitech.wikimedia.org/wiki/Application_servers [11:51:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:51:43] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 28.54, 24.83, 17.95 https://wikitech.wikimedia.org/wiki/Application_servers [11:52:41] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 16.16, 19.43, 13.22 https://wikitech.wikimedia.org/wiki/Application_servers [11:53:01] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 24.87, 27.03, 18.84 https://wikitech.wikimedia.org/wiki/Application_servers 
[11:53:37] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 22.78, 27.39, 19.34 https://wikitech.wikimedia.org/wiki/Application_servers [11:54:13] !log urbanecm@deploy1001 Finished scap: Namespace translation for Punjabi (T226959) (duration: 30m 13s) [11:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:19] T226959: Adding new Namespaces and renaming some in Punjabi language at Punjabi Wikisource. - https://phabricator.wikimedia.org/T226959 [11:56:05] !log EU SWAT done [11:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:58:31] PROBLEM - Host ms-be2031 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) All right so ROCm 2.5 and tensorflow-rocm 1.13.3 seems to work. Other versions of TF (1.13.4 and 1.14.0) lead to the follow... [12:01:22] godog: ms-be --^ [12:01:44] Urbanecm: o/ - are the MW exceptions related to the deploy? (just seen the alarms) [12:01:59] elukey: thanks, I'll take a look [12:02:26] elukey, don't know, which exceptions? The deploy was just supposed to change few *.namespace.php files [12:02:56] Urbanecm: you can see it above --^ [12:03:24] do you mean "High CPU load on API appserver on mw1222 is OK"? 
[12:03:35] Urbanecm: there was also "MediaWiki exceptions and fatals per minute" [12:03:42] yes exactly [12:03:43] !log power reset ms-be2031, stuck and nothing on console [12:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:32] Urbanecm: elukey: looking at logstash the error was predominantly "ErrorException from line 322 of /srv/mediawiki/php-1.34.0-wmf.11/includes/context/RequestContext.php: PHP Warning: Recursion detected in RequestContext::getLanguage" [12:04:59] cdanis, which whould be known as T180050 [12:05:00] T180050: PHP Warning: Recursion detected in RequestContext - https://phabricator.wikimedia.org/T180050 [12:05:04] (aside from a number of timeouts, which AIUI are ~normal around release windows :| ) [12:05:25] *should [12:05:48] 10Operations, 10ops-codfw, 10media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 (10fgiunchedi) Today ms-be2031 locked up as well, I'll upgrade the firmware once it is back ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V4.52) 14 Logical... [12:06:04] cdanis: is the warning counted among the fatals reported? [12:06:12] (never checked it, I assumed not) [12:06:29] RECOVERY - Host ms-be2031 is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [12:06:31] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 15.58, 3.80, 1.27 https://wikitech.wikimedia.org/wiki/Swift [12:06:45] elukey: sorry, which warning? [12:06:48] elukey, regarding high cpu load, I THINK I saw someone complaining about this happening while deployment, however, can't remember details and can't find a ticket, so it can definitely be a false memory [12:07:06] Urbanecm: it's been happening routinely around rollout time whenever scap's refreshCdbJsonFiles step happens [12:07:06] *while deploying [12:07:15] for the past... about three months? at least? 
[12:07:33] cdanis, in that case, it's definitely related, since I was running a full scap [12:07:56] !log ms-be2031 raid controller firmware upgrade 4.52 -> 6.88 - T141756 [12:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:01] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [12:08:47] elukey: the error spike around the release window looks to just be the usual release-related timeouts https://logstash.wikimedia.org/goto/3230df27de79f8796675fd76175b6ba9 [12:09:52] cdanis: the Recursion detected etc.. IIUC is listed as "warning" in logstash [12:10:10] this is what I was asking, I thought only ERROR+ were counted for the alarm [12:10:13] but never checked :) [12:10:16] I think that is true [12:10:39] elukey, fyi, that error is known as T180050 [12:10:39] T180050: PHP Warning: Recursion detected in RequestContext - https://phabricator.wikimedia.org/T180050 [12:10:41] it wasn't obvious that the fatals were timeouts because they each timed out at a different file/line # [12:11:02] Urbanecm: didn't mean to blame your deployment etc.., just to know if it was expected and/or if SRE needed to check, thanks! :) [12:11:28] I totally understand, it's good to know that someone watches what's happening and errs on the safe side :) [12:12:27] (03CR) 10Elukey: [C: 03+1] Allow analytics-privatedata-users group to access swift auth env file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/521954 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [12:26:11] (03CR) 10Elukey: "Thanks! Created the missing wiki page, hope it is ok!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520959 (owner: 10Dzahn) [12:26:56] (03PS2) 10Filippo Giunchedi: varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) [12:28:17] (03CR) 10jerkins-bot: [V: 04-1] varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [12:33:22] (03PS3) 10Filippo Giunchedi: varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) [12:35:24] (03CR) 10Ema: [C: 03+1] varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [12:36:26] (03PS2) 10Ottomata: Allow analytics-privatedata-users group to access swift auth env file [puppet] - 10https://gerrit.wikimedia.org/r/521954 (https://phabricator.wikimedia.org/T219544) [12:37:28] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) (owner: 10Filippo Giunchedi) [12:37:30] (03PS4) 10Filippo Giunchedi: varnish: remove varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/520187 (https://phabricator.wikimedia.org/T184942) [12:39:37] !log Disable puppet on mw1222, server will be depooled and pooled a few times for tests - T224538 [12:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:42] T224538: Socket Errors on PHP7 - https://phabricator.wikimedia.org/T224538 [12:40:27] (03CR) 10Ottomata: [C: 03+2] Allow analytics-privatedata-users group to access swift auth env file [puppet] - 10https://gerrit.wikimedia.org/r/521954 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [12:40:29] (03PS3) 10Ottomata: Allow analytics-privatedata-users group to access swift auth env file [puppet] - 
10https://gerrit.wikimedia.org/r/521954 (https://phabricator.wikimedia.org/T219544) [12:40:31] (03PS1) 10MSantos: Disable replicate and admin cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) [12:44:12] !log Running purgePage.php on pages in Page: NS on pawikisource (T226959) [12:44:15] !log cp-ulsfo: upgrade mtail to 3.0.0~rc5-1~bpo9+1wmf1 [12:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:18] T226959: Adding new Namespaces and renaming some in Punjabi language at Punjabi Wikisource. - https://phabricator.wikimedia.org/T226959 [12:44:21] PROBLEM - Varnish traffic logger - varnishstatsd on cp1077 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [12:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:49] PROBLEM - Varnish traffic logger - varnishstatsd on cp2006 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [12:44:59] godog: ah, puppet hasn't run on icinga.wm.org yet [12:45:03] ack'ing the alerts [12:45:31] PROBLEM - Varnish traffic logger - varnishstatsd on cp5007 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp1077 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp1081 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp2006 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic 
logger - varnishstatsd on cp3032 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp4029 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp4032 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:41] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp5007 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:45:50] (03CR) 10MSantos: [C: 04-1] "Wait for monday to not affect production in the weekend." [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) (owner: 10MSantos) [12:45:54] (03PS1) 10Ottomata: Use 640 mode for swift auth env file [puppet] - 10https://gerrit.wikimedia.org/r/522074 (https://phabricator.wikimedia.org/T219544) [12:45:56] ema: thanks! 
[12:46:09] sorry for the spam :( [12:46:37] we were clearly just testing if icinga-wm still works [12:47:51] (03PS1) 10Muehlenhoff: Remove Diamond from production hosts [puppet] - 10https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231) [12:47:53] (03CR) 10Ottomata: [C: 03+2] Use 640 mode for swift auth env file [puppet] - 10https://gerrit.wikimedia.org/r/522074 (https://phabricator.wikimedia.org/T219544) (owner: 10Ottomata) [12:48:40] !log fleet-wide: remove obsolete file /etc/debdeploy-autorestarts.conf [12:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:22] PROBLEM - Varnish traffic logger - varnishstatsd on cp5009 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [12:51:30] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp4030 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:51:30] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp5009 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema varnishstatsd removed https://wikitech.wikimedia.org/wiki/Varnish [12:51:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) @EBernhardson analytics-search user should now be able to access the auth file [12:54:06] PROBLEM - Varnish traffic logger - varnishstatsd on cp3040 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [12:54:55] I'm acking the remaining ones too [12:55:40] err, downtiming [12:55:53] godog: I'm running puppet on icinga1001 [12:56:41] done, the alerts should stop now [12:58:28] unless we're forcing a puppet run on all cp hosts running puppet on icinga now will remove only checks on hosts where puppet has already ran, if I'm not mistaken [12:58:49] 
fine either way, I've silenced the remaining alerts [13:00:00] ah good point [13:00:52] !log cp-ulsfo: varnish frontend rolling restarts for 5.1.3-1wm11 upgrades T227672 [13:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:58] T227672: Upgrade Varnish to 5.1.3-1wm11 - https://phabricator.wikimedia.org/T227672 [13:05:10] (03CR) 10jenkins-bot: Remove usergroup communityapps from officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521933 (https://phabricator.wikimedia.org/T227680) (owner: 10Urbanecm) [13:06:48] (03CR) 10Muehlenhoff: Introduce openldap_config in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [13:15:56] (03CR) 10Andrew Bogott: [C: 03+2] keyholder: hiera setting for require_encrypted_keys [puppet] - 10https://gerrit.wikimedia.org/r/522008 (owner: 10BryanDavis) [13:28:27] (03CR) 10Filippo Giunchedi: [C: 03+1] varnish::logging: mask mtail.service [puppet] - 10https://gerrit.wikimedia.org/r/522081 (owner: 10Ema) [13:28:29] (03CR) 10Hashar: [C: 03+1] "Cwhite: you can most probably merge this change right now, then another change that would provide the missing extended description?" 
[debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519363 (owner: 10Hashar) [13:34:34] (03CR) 10Ema: [C: 03+2] varnish::logging: mask mtail.service [puppet] - 10https://gerrit.wikimedia.org/r/522081 (owner: 10Ema) [13:35:38] (03PS3) 10Andrew Bogott: cloud: Add default LVS hiera settings [puppet] - 10https://gerrit.wikimedia.org/r/521947 (owner: 10BryanDavis) [13:38:40] (03PS2) 10Andrew Bogott: cloud: set scap::version: present in hiera [puppet] - 10https://gerrit.wikimedia.org/r/521959 (owner: 10BryanDavis) [13:40:49] !log roll restart ms-be2016 ms-be2017 ms-be2018 ms-be2019 ms-be2020 ms-be2021 ms-be2028 ms-be2029 ms-be2030 ms-be2031 ms-be2032 ms-be2033 ms-be2034 ms-be2035 ms-be2036 - T225713 [13:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:55] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [13:52:10] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [13:54:08] (03PS1) 10DCausse: Revert "Revert "[cirrus] Use correct factory declaration for EntityFullTextQueryBuilder"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522092 [13:56:59] (03PS3) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [14:03:10] (03Abandoned) 10Andrew Bogott: no-op patch for catalog testing [puppet] - 10https://gerrit.wikimedia.org/r/517651 (owner: 10Andrew Bogott) [14:11:37] RECOVERY - Device not healthy -SMART- on ms-be2018 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2018&var-datasource=codfw+prometheus/ops [14:11:56] 10Operations, 10vm-requests: codfw: 2 VMs for LDAP replicas - https://phabricator.wikimedia.org/T227669 (10MoritzMuehlenhoff) 05Open→03Resolved VMs have been created. 
[14:16:27] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:17:14] !log restart wikibugs [14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:37] 10Operations, 10Puppet, 10Packaging: Hiera incompatible with newer versions of puppet - https://phabricator.wikimedia.org/T227779 (10jbond) [14:21:55] (03PS2) 10Jbond: hiera backends: [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) [14:21:57] (03CR) 10Jbond: "PCC running https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17325/" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [14:21:59] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/522094 (owner: 10Ema) [14:22:16] (03PS2) 10Vgutierrez: Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) [14:22:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [14:24:53] I'll deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/522013/ in a few minutes to unblock the train [14:26:10] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include buster-wikimedia /home/volans/conftool/buster/conftool_1.1.0-1+deb10u1_amd64.changes [14:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:20] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include jessie-wikimedia /home/volans/conftool/jessie/conftool_1.1.0-1+deb8u1_amd64.changes [14:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:28] !log cdanis@install1002.wikimedia.org ~ % sudo -E reprepro -C main include stretch-wikimedia /home/volans/conftool/stretch/conftool_1.1.0-1_amd64.changes [14:26:31] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:48] (03PS3) 10Vgutierrez: Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) [14:28:08] (03CR) 10Ottomata: [C: 03+1] geoip::data::archive: move to kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/520775 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [14:30:15] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: pass either 'present' or 'absent' to ensure [puppet] - 10https://gerrit.wikimedia.org/r/522094 (owner: 10Ema) [14:30:18] (03CR) 10Jbond: "Looks good one small nit. I did something similar to this for the standard module which also pulls in the labs/private repo. if that als" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar) [14:31:20] (03CR) 10Ema: [C: 03+2] mtail: pass either 'present' or 'absent' to ensure [puppet] - 10https://gerrit.wikimedia.org/r/522094 (owner: 10Ema) [14:33:58] elukey: o/ [14:34:04] i'm trying to run a spark job in yarn client mode [14:34:11] both stat1004 and stat1007 seem to be hanging at [14:34:32] e.g. [14:34:33] Registering block manager analytics1076.eqiad.wmnet:39211 [14:34:35] (03PS4) 10Vgutierrez: Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) [14:34:37] oops wrong chat room! 
[14:34:43] (03CR) 10Jbond: [C: 03+1] "LGTM: upstream bug in case its useful to others https://github.com/rubocop-hq/rubocop/issues/893" [puppet] - 10https://gerrit.wikimedia.org/r/484410 (owner: 10Hashar) [14:36:08] (03CR) 10BBlack: [C: 03+1] Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:36:36] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) (owner: 10Hashar) [14:38:41] (03CR) 10Jbond: "puppet stuff looks fine to me but dont know swift" [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [14:40:49] (03PS2) 10Jbond: Create two LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [14:41:53] (03CR) 10Jbond: [C: 03+1] "LGTM: i also updated the bug ref" [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [14:42:40] (03PS1) 10Elukey: profile::hadoop::spark2: fix port range [puppet] - 10https://gerrit.wikimedia.org/r/522105 (https://phabricator.wikimedia.org/T170826) [14:42:51] * Krinkle staging on mwdebug1002 [14:42:58] (03CR) 10Ottomata: [C: 03+1] profile::hadoop::spark2: fix port range [puppet] - 10https://gerrit.wikimedia.org/r/522105 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:43:36] (03CR) 10Elukey: [C: 03+2] profile::hadoop::spark2: fix port range [puppet] - 10https://gerrit.wikimedia.org/r/522105 (https://phabricator.wikimedia.org/T170826) (owner: 10Elukey) [14:45:04] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/includes/libs/rdbms/database/Database.php: 903f3f94f5d2e3 / T227708 (duration: 00m 59s) [14:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] T227708: DatabaseMysqlBase: domain schemas are not supported. 
- https://phabricator.wikimedia.org/T227708 [14:45:21] (03PS5) 10Vgutierrez: Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) [14:45:48] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) [14:45:57] (03CR) 10BBlack: [C: 03+1] Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:47:41] (03CR) 10Gehel: "LGTM, I'll synchronize with Mateus for the deployment." [puppet] - 10https://gerrit.wikimedia.org/r/522072 (https://phabricator.wikimedia.org/T215641) (owner: 10MSantos) [14:48:24] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [14:48:35] (03PS6) 10Vgutierrez: Add ncredir-lb records [dns] - 10https://gerrit.wikimedia.org/r/521414 (https://phabricator.wikimedia.org/T133548) [14:51:23] !log upgrade to python3-conftool 1.1.0-1 on mwdebug2001 [14:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:50] ! setting CPU governor to performance for elastic1052 - T225713 [14:52:51] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [14:54:17] (03PS4) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [14:54:39] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) [14:55:38] gehel: thanks! also there's a missing !log there [14:55:45] !log setting CPU governor to performance for elastic1052 - T225713 [14:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:51] godog: thanks for spotting it! 
[14:56:01] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) [14:56:16] (03PS1) 10BBlack: discovery-map remove [1/4]: remove refs [puppet] - 10https://gerrit.wikimedia.org/r/522110 [14:56:17] (03PS1) 10BBlack: discovery-map remove [3/4]: stop deploying [puppet] - 10https://gerrit.wikimedia.org/r/522111 [14:56:19] godog: I'll keep an eye on that one for a few days and depending on the result apply the the whole cluster [14:56:19] (03PS1) 10BBlack: discovery-map remove [4/4]: Remove completely [puppet] - 10https://gerrit.wikimedia.org/r/522112 [14:56:22] (03PS1) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 [14:57:11] (03CR) 10jerkins-bot: [V: 04-1] discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 (owner: 10BBlack) [14:57:52] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s mw-canary [14:57:54] gehel: sounds good, just to make sure we're on the same page, apply the whole fix as in the bios settings and reboot ? I've edited the description of https://phabricator.wikimedia.org/T225713 just now to make that clearer [14:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [14:58:22] godog: nope, atm just via /sys/... [14:59:12] * godog nods [14:59:38] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Gehel) >>! In T225713#5324717, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AWvhiHZWOwpQ-3Pk9Mvn} [2019-07-11T14:55:45Z] ... 
[15:00:27] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) Another two weeks passed... How can we help to bring some action? [15:00:32] !log restarted Jenkins for plugins upgrades [15:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:47] (03PS2) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 [15:01:13] (03CR) 10jerkins-bot: [V: 04-1] discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 (owner: 10BBlack) [15:01:18] ^ will keep failing CI until 1/4 is deployed, that's ok :) [15:03:42] bblack: I'm wondering if adding in 2/4 a Depends-On: Change_id_of_1/4 would solve the issue [15:04:18] also, isn't 2/4 depending on 1/4 strange? [15:04:35] oh, I see, it is an upload only, not a merge [15:05:26] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s cp-canary [15:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:31] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [15:06:36] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) [15:08:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack mwopenstackclients: Add designateclient (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513752 (https://phabricator.wikimedia.org/T224708) (owner: 10Alex Monk) [15:08:57] cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s mw-api-canary [15:12:50] vgutierrez: I don't think depends-on can help here with both cross-repo and CI tooling concerns in play [15:13:10] can try and see just in case [15:13:44] (03PS3) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 
10https://gerrit.wikimedia.org/r/522113 [15:14:17] (03CR) 10jerkins-bot: [V: 04-1] discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 (owner: 10BBlack) [15:14:27] (the issue is the ops/dns CI checks in 2/4 fail because they're pulling ops/puppet stuff that doesn't have the 1/4 change) [15:15:59] oh wait. that may still be a problem, but I also forgot all the mock stuff heh [15:19:33] (03CR) 10Cwhite: [C: 03+1] Remove Diamond from production hosts [puppet] - 10https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [15:21:28] (03PS4) 10BBlack: discovery-map remove [2/4]: remove ops/dns refs [dns] - 10https://gerrit.wikimedia.org/r/522113 [15:21:59] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Gehel) I observe a pretty significant drop in CPU usage on elastic1052 (>50% to ~25%), so that looks good. I'll wait until Monday to apply to the whole cluster. [15:22:27] apparently junkins is happy now! 
[15:23:01] lol @ junkins [15:24:21] (03PS2) 10Elukey: geoip::data::archive: move to kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/520775 (https://phabricator.wikimedia.org/T226698) [15:24:58] that's a nice new nick it acquired [15:27:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] helmfile.d: adding eqiad,codfw admin helmfiles (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [15:28:19] !log upgrade to python3-conftool 1.1.0-1 on cp4022 [15:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:25] (03CR) 10Fsero: helmfile.d: adding eqiad,codfw admin helmfiles (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [15:28:35] !log setting CPU governor to performance for wdqs1004 - T225713 [15:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [15:29:36] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10MSantos) @dr0ptp4kt and @JoeWalsh what do you think? 
[15:29:58] (03CR) 10Elukey: [C: 03+2] geoip::data::archive: move to kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/520775 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [15:30:29] (03PS2) 10Fsero: helmfile.d: adding eqiad,codfw admin helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) [15:30:31] (03Restored) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [15:30:44] (03PS5) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 [15:33:50] (03PS1) 10Andrew Bogott: Removed checks for Trusty in os_version [puppet] - 10https://gerrit.wikimedia.org/r/522120 [15:34:56] (03CR) 10Andrew Bogott: [C: 03+2] Removed checks for Trusty in os_version [puppet] - 10https://gerrit.wikimedia.org/r/522120 (owner: 10Andrew Bogott) [15:35:36] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for mbsantos - https://phabricator.wikimedia.org/T227695 (10dr0ptp4kt) Approved as Engineering Director. [15:36:06] (03CR) 10Elukey: "Saw the code change passing by, left a comment :) Hello Ori!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [15:37:39] (03CR) 10Thcipriani: [V: 03+2] its-phabricator: new build with updated its-base [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/521944 (owner: 10Thcipriani) [15:38:29] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:38:31] (03PS1) 10Elukey: geoip::data::archive: remove parameter [puppet] - 10https://gerrit.wikimedia.org/r/522123 (https://phabricator.wikimedia.org/T226698) [15:38:57] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [15:39:13] this is me --^ [15:40:39] (03CR) 10Elukey: [C: 03+2] geoip::data::archive: remove parameter [puppet] - 10https://gerrit.wikimedia.org/r/522123 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [15:41:53] I'm going to push up a quick gerrit plugin update, FYI. No restart needed. 
[15:42:46] (03PS1) 10Lucas Werkmeister (WMDE): Specify $wmgWBRepoConceptBaseUri again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522125 (https://phabricator.wikimedia.org/T225212) [15:42:48] (03PS1) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522126 (https://phabricator.wikimedia.org/T225212) [15:42:58] (03Abandoned) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518239 (https://phabricator.wikimedia.org/T225212) (owner: 10Lucas Werkmeister (WMDE)) [15:44:05] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@4daa16c]: it-phabricator plugin update (gerrit2001 only) [15:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:17] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@4daa16c]: it-phabricator plugin update (gerrit2001 only) (duration: 00m 11s) [15:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:23] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:45:25] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@4daa16c]: it-phabricator plugin update (cobalt) [15:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:36] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@4daa16c]: it-phabricator plugin update (cobalt) (duration: 00m 11s) [15:45:37] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove Diamond from production hosts [puppet] - 10https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [15:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:57] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] 
https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [15:50:43] !log deactivate ping-offload in codfw for server reboot [15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:15] 10Operations, 10Analytics, 10Research-Backlog, 10serviceops-radar, and 4 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10leila) [15:53:20] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:53:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:56] !log rebooting ping2001 to pick up MDS-enabled qemu [15:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:20] !log installing dnspython update from stretch point release [15:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:08] !log revert deactivate ping-offload in codfw for server reboot [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:24] !log depool cp4022 for testing conftool change [15:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:44] (03PS1) 10Paladox: Gerrit v2.15.14 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/522133 [15:58:50] (03PS2) 10Paladox: Gerrit v2.15.14 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/522133 [15:59:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] Create two LDAP replicas in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [15:59:41] !log deactivate ping-offload in eqiad for server reboot [15:59:45] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:45] (03CR) 10Muehlenhoff: Create two LDAP replicas in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [16:02:22] !log repool cp4022 after testing conftool change [16:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:40] urandom: no, the reimage script failed :( [16:03:26] !log rebooting ping1001 to pick up MDS-enabled qemu [16:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] (03CR) 10Alexandros Kosiaris: [C: 04-1] Create two LDAP replicas in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [16:05:22] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10leila) [16:06:55] (03CR) 10Muehlenhoff: Create two LDAP replicas in codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522102 (https://phabricator.wikimedia.org/T227669) (owner: 10Muehlenhoff) [16:12:00] !log revert deactivate ping-offload in eqiad for server reboot [16:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Missed a couple on the previous one, added them now. 
Otherwise, LGTM" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [16:19:09] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s ulsfo [16:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:26] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [16:19:46] (03CR) 10Alexandros Kosiaris: Introduce openldap_config in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [16:27:05] (03CR) 10Elukey: Introduce openldap_config in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [16:28:16] (03CR) 10Elukey: "I left a comment in https://phabricator.wikimedia.org/T225642#5306522 last week, I didn't progress since I didn't get any answer and was w" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [16:28:50] (03CR) 10Muehlenhoff: Introduce openldap_config in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [16:42:06] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) [16:42:24] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All complete [16:44:00] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Set up a subdomain for Phame to enable caching - https://phabricator.wikimedia.org/T226044 (10greg) [16:44:06] 10Operations, 10ops-codfw: Request for hard drives - https://phabricator.wikimedia.org/T227800 (10ayounsi) 
[16:47:57] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s eqsin [16:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:07] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [16:54:40] 10Operations, 10DC-Ops: Request for hard drives - https://phabricator.wikimedia.org/T227800 (10akosiaris) p:05Triage→03Normal [16:57:42] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10greg) [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1700). [17:00:28] 10Operations, 10DC-Ops: Request for hard drives - https://phabricator.wikimedia.org/T227800 (10Papaul) @HMarcus unfortunately codfw has not 8TB+ disks [17:02:22] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s esams [17:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:26] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [17:07:53] (03CR) 10Alexandros Kosiaris: Introduce openldap_config in hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [17:12:11] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85% [17:36:16] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [17:37:35] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s codfw [17:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:44] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [17:38:54] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) [17:44:27] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) I have removed the ops-eqiad tag, if you have an issue that required DC ops pleas... 
[17:46:02] (03CR) 10Ppchelko: [C: 03+1] RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [17:47:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) 05Open→03Resolved I am resolving this task [17:48:20] (03CR) 10Alexandros Kosiaris: [V: 03+2] RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [17:48:25] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [17:49:14] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn) > the server can be installed whenever you need it. Yea, actually this still needs a... [17:49:40] 10Operations, 10ops-eqiad: (OoW) Broken memory on mw1239 - https://phabricator.wikimedia.org/T209139 (10Cmjohnson) 05Open→03Declined The server is out of warranty and I do not have any spare DIMM [17:49:52] (03PS1) 10Alexandros Kosiaris: Publish restrouter 0.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/522151 (https://phabricator.wikimedia.org/T223953) [17:51:43] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Publish restrouter 0.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/522151 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [17:51:55] It looks like there are no SWAT deploys scheduled for 17:00 UTC, so I plan to start the train early since we have two groups to do today. 
If I am mistaken about the SWAT please let me know [17:55:09] 10Operations, 10Traffic: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) I am removing the ops-eqiad tag on this task, if you need additional dc ops work please add the tag back. [17:55:30] 10Operations, 10ops-eqiad, 10serviceops: Heating alerts / memory errors on mw1254 - https://phabricator.wikimedia.org/T204491 (10Cmjohnson) 05Open→03Resolved [17:56:30] 10Operations: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) I am removing the ops-eqiad tag, if you onsite work is still required please add the ops-eqiad tag. [18:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:37] jouncebot: next [18:00:38] In 0 hour(s) and 59 minute(s): MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1900) [18:01:53] longma: I'm about to do a deploy of something else (conftool) in eqiad, but should only take a minute or so [18:02:04] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo debdeploy deploy -u T197126-2019-07-11-conftool.yaml -s eqiad [18:02:07] okay, I'll stand by [18:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:09] T197126: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 [18:02:48] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Cmjohnson) @dzahn, I need to know I don't know what that means? What does DC-ops need to tr... 
[18:03:11] longma: all done! [18:03:19] Thanks! [18:04:19] (03PS1) 10Jeena Huneidi: group1 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522156 [18:04:21] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522156 (owner: 10Jeena Huneidi) [18:06:16] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522156 (owner: 10Jeena Huneidi) [18:06:43] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522156 (owner: 10Jeena Huneidi) [18:08:31] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.13 refs T220738 [18:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:58] T220738: 1.34.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T220738 [18:09:29] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.13 refs T220738 (duration: 00m 57s) [18:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:58] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:12:19] that's a new one [18:18:24] 10Operations, 10Operations-Software-Development: debmonitor: Race condition between package updated triggered by apt hook and daily cron run - https://phabricator.wikimedia.org/T198850 (10CDanis) Just a note that it happened again today ;) [18:19:16] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:27:51] 10Operations, 10netops: Standardize cross confederation BGP policies - 
https://phabricator.wikimedia.org/T227808 (10ayounsi) p:05Triage→03Low [18:29:08] sbassett: hola, ahem that query you are doing on cluster might be a bit too heavy [18:29:12] 10Operations, 10netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (10ayounsi) [18:29:16] 10Operations, 10netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (10ayounsi) [18:29:27] nuria: yep, just killed it, sorry [18:29:30] sbassett: webrequest is petabytes of data and you might be looking at it all [18:30:14] sbassett: we can help you reformat query, normally looking at a day timespam is good if you want to get an overall idea of data [18:30:26] nuria: thought I had decent enough where clauses but I guess not. [18:30:29] sbassett: like year=2019 and month=01 and day=09 [18:31:04] nuria: yes, that's a little cumbersome for the window I want. Was hoping I could select across a larger range of days. [18:32:42] sbassett: you can with day in (10, 11, 12,13) [18:32:49] sbassett: and is_pageview=1 [18:32:53] nuria: ok, thanks! [18:32:56] sbassett: if you are looking for pageviews [18:35:34] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:36:50] akosiaris: we just enabled termbox, right? 
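(Editor's note) nuria's advice above, to bound a webrequest query by explicit year/month/day partitions rather than scanning everything, can be sketched as follows. This is only an illustration under stated assumptions: the helper names `partition_predicate` and `pageview_query` are hypothetical, the bare table name `webrequest` and the `is_pageview=1` filter are taken from the chat as written, and the real schema may differ.

```python
from datetime import date, timedelta


def partition_predicate(start: date, end: date) -> str:
    """Build a WHERE fragment covering the days start..end inclusive.

    Hypothetical helper: it spells out one (year, month, day) clause per
    day so that ranges crossing a month boundary still name every partition.
    """
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    day = start
    while day <= end:
        clauses.append(
            f"(year={day.year} AND month={day.month} AND day={day.day})"
        )
        day += timedelta(days=1)
    return "(" + " OR ".join(clauses) + ")"


def pageview_query(start: date, end: date) -> str:
    """Assemble an example query string; the table and column names are
    assumptions based on the chat, not a verified schema."""
    return (
        "SELECT COUNT(*) FROM webrequest WHERE "
        + partition_predicate(start, end)
        + " AND is_pageview=1"
    )


if __name__ == "__main__":
    # Example: the four-day window mentioned in the discussion.
    print(pageview_query(date(2019, 7, 10), date(2019, 7, 13)))
```

For a window inside a single month, the shorter `day in (10, 11, 12, 13)` form from the chat expresses the same restriction; the per-day clauses are only needed when the window crosses a month or year boundary.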
[18:37:20] sbassett: ping us on wikimedia-analytics if you need help [18:38:18] hm no, that's wrong [18:38:28] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:39:24] * hauskatze always reads 'termomix' instead of 'termox' [18:40:40] (03PS1) 10Jhedden: toolforge: install fish shell [puppet] - 10https://gerrit.wikimedia.org/r/522161 (https://phabricator.wikimedia.org/T219054) [18:59:25] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Cmjohnson) I still need to move the DIMM around ...I need the server taken down. If this needs to be scheduled, please let me kno... [19:00:04] longma: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - American version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T1900). [19:01:29] (03PS1) 10Jeena Huneidi: group2 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522168 [19:01:31] (03CR) 10Jeena Huneidi: [C: 03+2] group2 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522168 (owner: 10Jeena Huneidi) [19:03:00] (03CR) 10Jeena Huneidi: group2 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522168 (owner: 10Jeena Huneidi) [19:10:21] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10Cmjohnson) @elukey I am not sure which disk this? I think it's a smaller ssd? Can you confirm the disk type and size please ? 
[19:17:08] (03PS1) 10Andrew Bogott: labsdb: remove alias for zerowiki [puppet] - 10https://gerrit.wikimedia.org/r/522170 (https://phabricator.wikimedia.org/T227716) [19:23:37] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Cmjohnson) Confirming that I've seen this task and the server is under warranty. As long as the failed disk is on helium and not the array this should be a warranty issue. [19:25:06] (03Abandoned) 10Jeena Huneidi: group2 wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522168 (owner: 10Jeena Huneidi) [19:25:53] (03PS2) 10Andrew Bogott: labsdb: remove alias for zerowiki [puppet] - 10https://gerrit.wikimedia.org/r/522170 (https://phabricator.wikimedia.org/T227716) [19:25:57] (03PS1) 10Jeena Huneidi: all wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522171 [19:25:59] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522171 (owner: 10Jeena Huneidi) [19:27:09] (03CR) 10Andrew Bogott: [C: 03+2] labsdb: remove alias for zerowiki [puppet] - 10https://gerrit.wikimedia.org/r/522170 (https://phabricator.wikimedia.org/T227716) (owner: 10Andrew Bogott) [19:27:12] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522171 (owner: 10Jeena Huneidi) [19:27:29] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.13 refs T220738 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522171 (owner: 10Jeena Huneidi) [19:29:33] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.13 refs T220738 [19:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:39] T220738: 1.34.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T220738 [19:35:58] (03CR) 10Jhedden: "> Patch Set 1: Code-Review-1" [puppet] - 
10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [19:44:25] !log milimetric@deploy1001 Started deploy [analytics/refinery@3296aab]: Fix to reimport cu_changes [19:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:29] (03PS5) 10CDanis: swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:48:41] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:50:10] (03CR) 10CDanis: [C: 03+2] swift: hiera-ize object-replicator interval [puppet] - 10https://gerrit.wikimedia.org/r/513053 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:50:43] (03PS5) 10CDanis: beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:51:14] (03CR) 10CDanis: [C: 03+2] beta: tweak swift replicator [puppet] - 10https://gerrit.wikimedia.org/r/513054 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:51:37] (03PS5) 10CDanis: swift: hiera-ize object server number of workers [puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:51:44] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:53:35] (03CR) 10CDanis: [C: 03+2] swift: hiera-ize object server number of workers [puppet] - 10https://gerrit.wikimedia.org/r/513058 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:53:46] So, T227814 / T227815. 
[19:53:46] T227815: Message wikibase-sitelinks-wikipedia missing from wikidata.org - https://phabricator.wikimedia.org/T227815 [19:53:46] T227814: Wikidata localization is broken - https://phabricator.wikimedia.org/T227814 [19:53:52] (03PS5) 10CDanis: beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:53:59] (03CR) 10CDanis: [C: 03+2] beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:54:10] James_F: right grep -c wikibase-sitelinks-wikipedia php-1.34.0-wmf.13/cache/l10n/upstream/l10n_cache-en.cdb.json -> 0 [19:54:11] longma, thcipriani: We could try a full scap? [19:54:14] so that message isn't there. [19:54:27] But the user report suggests other things are broken, at least on Wikidata. [19:54:42] We could revert Wikidata to wmf.11, but I worry that it's actually more widespread. [19:54:56] (03PS5) 10CDanis: swift: hierarize container_replicator settings [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:55:04] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:55:05] Without an explanation as to what broke, maybe other wikis are broken too and we just haven't seen it yet? [19:55:08] (03PS2) 10Jhedden: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [19:55:36] that's possible, it's also possible something went weird branching the extension that has that message? [19:55:46] what is a full scap? [19:55:59] I'd say roll back wikidata and lets try to figure out what went wrong [19:56:01] longma: `scap sync`. Syncs everything, including a full i18n rebuild. 
(Takes ~40 minutes.) [19:56:13] ah [19:56:41] (03CR) 10CDanis: [C: 03+2] swift: hierarize container_replicator settings [puppet] - 10https://gerrit.wikimedia.org/r/513062 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:57:07] (03PS4) 10CDanis: beta: slow down swift container replication [puppet] - 10https://gerrit.wikimedia.org/r/513063 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:57:15] so to roll back wikidata, do I update wikiversions.json in mediawiki-staging and make a CR? [19:57:44] (03CR) 10CDanis: [C: 03+2] beta: slow down swift container replication [puppet] - 10https://gerrit.wikimedia.org/r/513063 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [19:57:48] right, change wikidatawiki back to the previous version in wikiversions.json and then run a sync-wikiversions [19:58:37] the only thing that changed with scap recently is clearing resourceloader blobs IIRC [19:59:43] Which if anything should have fixed things like this. 
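[editor's note] The rollback described above (set wikidatawiki back to the previous version in wikiversions.json, then run sync-wikiversions) can be sketched roughly as follows. This is only an illustration: the two-entry mapping, the `php-` prefix on values, and the helper name `set_wiki_version` are assumptions made for the example, not taken from the log, and the sync step itself is not shown.

```python
import json


def set_wiki_version(wikiversions: dict, dbname: str, version: str) -> dict:
    """Return a copy of the mapping with one wiki pinned to `version`.

    Hypothetical helper: assumes values look like "php-<version>".
    """
    if dbname not in wikiversions:
        raise KeyError(f"unknown wiki: {dbname}")
    updated = dict(wikiversions)          # leave the input untouched
    updated[dbname] = f"php-{version}"
    return updated


# Hypothetical excerpt of a wikiversions.json-style mapping.
current = {
    "enwiki": "php-1.34.0-wmf.13",
    "wikidatawiki": "php-1.34.0-wmf.13",
}

# Pin only wikidatawiki back to the previous branch.
rolled_back = set_wiki_version(current, "wikidatawiki", "1.34.0-wmf.11")
print(json.dumps(rolled_back, indent=4, sort_keys=True))
```

In the log, the edited file was then pushed out with sync-wikiversions and later committed as a mediawiki-config change.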
[19:59:52] i.e., calling refreshMessageBlobs.php at the end of scap; although, yeah, don't think it should have caused this [20:00:02] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:00:07] other recent changes haven't touched l10n stuff [20:00:10] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: /srv 1035 MB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:00:20] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:01:12] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:01:48] running sync-wikiversions [20:01:52] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:03:16] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: Revert wikidata to 1.34.0-wmf.11 [20:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:38] at a quick glance, the revert seems to have fixed the history page issue mentioned in T227814 [20:04:39] T227814: [Regression wmf.13] Wikidata localisation is broken - https://phabricator.wikimedia.org/T227814 [20:06:16] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:07:04] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:07:20] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:07:40] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:11:59] !log milimetric@deploy1001 deploy aborted: Fix to reimport cu_changes (duration: 27m 34s) [20:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:42] (03PS1) 10Andrew Bogott: shinken: monitor the cloudinfra project [puppet] - 10https://gerrit.wikimedia.org/r/522178 [20:12:52] (03PS2) 10Andrew Bogott: shinken: monitor the cloudinfra project [puppet] - 10https://gerrit.wikimedia.org/r/522178 [20:13:20] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [20:13:49] (03CR) 10Andrew Bogott: [C: 03+2] shinken: monitor the cloudinfra project [puppet] - 10https://gerrit.wikimedia.org/r/522178 (owner: 10Andrew Bogott) [20:18:33] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab]: (no justification provided) [20:18:36] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab]: (no justification provided) (duration: 00m 02s) [20:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:33] 10Operations, 10DC-Ops, 10Office-IT: Request for hard drives - https://phabricator.wikimedia.org/T227800 (10Peachey88) [20:19:33] !log otto@deploy1001 
Started deploy [analytics/refinery@3296aab]: (no justification provided) [20:19:35] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab]: (no justification provided) (duration: 00m 02s) [20:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:23] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab]: (no justification provided) [20:20:26] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab]: (no justification provided) (duration: 00m 03s) [20:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:41] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab]: (no justification provided) [20:22:44] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab]: (no justification provided) (duration: 00m 02s) [20:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:34] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:35:50] (03PS1) 10Jeena Huneidi: Revert wikidata to 1.34.0-wmf.11 refs T227814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522181 [20:40:51] (03CR) 10Thcipriani: [C: 03+1] Revert wikidata to 1.34.0-wmf.11 refs T227814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522181 (owner: 10Jeena Huneidi) [20:41:03] (03CR) 10Jeena Huneidi: [C: 03+2] Revert wikidata to 1.34.0-wmf.11 refs T227814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522181 (owner: 10Jeena Huneidi) [20:41:55] (03Merged) 10jenkins-bot: Revert wikidata to 1.34.0-wmf.11 refs T227814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522181 (owner: 10Jeena Huneidi) [20:42:10] (03CR) 10jenkins-bot: Revert wikidata to 1.34.0-wmf.11 refs T227814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522181 (owner: 10Jeena Huneidi) [21:02:41] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Krinkle) See also T137291. [21:21:30] !log otto@deploy1001 Started deploy [analytics/refinery@3296aab]: (no justification provided) [21:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] !log otto@deploy1001 Finished deploy [analytics/refinery@3296aab]: (no justification provided) (duration: 02m 02s) [21:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:53] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet f... [21:30:18] (03CR) 10MaxSem: "906 requests this month. Also, after the DNS record is gone, the site needs to be removed from Apache config." 
[dns] - 10https://gerrit.wikimedia.org/r/521966 (owner: 10Jforrester) [21:30:59] (03PS3) 10Jhedden: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [21:31:37] (03CR) 10jerkins-bot: [V: 04-1] wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [21:33:06] (03PS4) 10Jhedden: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [21:36:51] (03PS2) 10Dzahn: install_server: decom netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/502178 (https://phabricator.wikimedia.org/T220355) [21:38:08] (03CR) 10Dzahn: [C: 03+2] install_server: decom netmon1003 [puppet] - 10https://gerrit.wikimedia.org/r/502178 (https://phabricator.wikimedia.org/T220355) (owner: 10Dzahn) [21:40:08] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Failed: 1I:1:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T227829 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:40:12] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (10ops-monitoring-bot) [21:40:29] (03CR) 10Jhedden: [C: 03+2] wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [21:41:06] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:41:53] restbase1017 again? [21:41:54] ^ that is due to an ongoing server reinstall. just like yesterday [21:41:59] ah [21:42:01] hah okay [21:42:03] yea, because that install failed :( [21:42:14] with something vague about not being able to get puppet state [21:42:16] on the first run [21:42:41] but first i want to see if i can reproduce [21:42:50] (03PS6) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 [21:44:22] (03CR) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [21:51:35] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott T227785 [21:52:07] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
andrew bogott T227785 [21:54:42] (03PS1) 10Andrew Bogott: Revert "openstack mwopenstackclients: Use designateclient in ensure functions" [puppet] - 10https://gerrit.wikimedia.org/r/522192 (https://phabricator.wikimedia.org/T227785) [21:55:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "openstack mwopenstackclients: Use designateclient in ensure functions" [puppet] - 10https://gerrit.wikimedia.org/r/522192 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [21:57:01] 10Operations, 10Traffic: Wikipedia is unavailable on Symbian phone's btowsers - https://phabricator.wikimedia.org/T227828 (10Krinkle) [21:57:07] 10Operations, 10Traffic: Wikipedia is unavailable on Symbian phone's browsers - https://phabricator.wikimedia.org/T227828 (10Krinkle) [21:58:17] (03PS2) 10Andrew Bogott: Revert "openstack mwopenstackclients: Use designateclient in ensure functions" [puppet] - 10https://gerrit.wikimedia.org/r/522192 (https://phabricator.wikimedia.org/T227785) [22:01:19] 10Operations, 10Traffic: Wikipedia is unavailable on Symbian phone's browsers - https://phabricator.wikimedia.org/T227828 (10Aklapper) Thanks! Do you have any way to find out which exact Symbian version you are using? And/or would you share in public which exact phone this is about? [22:05:07] (03PS2) 10Dzahn: turn netmon1003 into a spare, delete servermon role [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:07:11] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:07:16] !log thcipriani@deploy1001 Started scap: no op scap sync to rebuild l10n-cache (T227814) [22:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:21] T227814: [Regression wmf.13] Wikidata localisation is broken - https://phabricator.wikimedia.org/T227814 [22:08:42] (03CR) 10Andrew Bogott: [C: 03+2] Revert "openstack mwopenstackclients: Use designateclient in ensure functions" [puppet] - 10https://gerrit.wikimedia.org/r/522192 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [22:09:53] PROBLEM - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [22:11:10] (03PS3) 10Dzahn: turn netmon1003 into a spare [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:12:46] (03PS4) 10Dzahn: turn netmon1003 into a spare [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:12:58] looks like ExtensionMessages got WikibaseLib for wmf.13 this time around. [22:13:04] no idea why it didn't the first time around. 
[22:18:15] (03PS5) 10Dzahn: turn netmon1003 into a spare [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:21:07] (03PS1) 10Andrew Bogott: Use designateclient in ensure functions [puppet] - 10https://gerrit.wikimedia.org/r/522196 (https://phabricator.wikimedia.org/T227785) [22:21:33] (03CR) 10Andrew Bogott: [C: 04-1] "blocked until we've upgraded to Openstack Newton" [puppet] - 10https://gerrit.wikimedia.org/r/522196 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [22:22:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1007 predicted raid failure - https://phabricator.wikimedia.org/T209861 (10Cmjohnson) 05Open→03Resolved This was completed awhile ago...never updated task [22:23:56] (03CR) 10Dzahn: [C: 03+2] turn netmon1003 into a spare [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [22:24:26] (03PS3) 10Cwhite: set up debian packaging [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) [22:25:31] (03CR) 10Cwhite: set up debian packaging (034 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [22:26:50] !log thcipriani@deploy1001 Finished scap: no op scap sync to rebuild l10n-cache (T227814) (duration: 19m 34s) [22:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:58] T227814: [Regression wmf.13] Wikidata localisation is broken - https://phabricator.wikimedia.org/T227814 [22:27:17] (03PS6) 10Dzahn: remove netmon1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:27:52] longma: James_F this seems like a positive sign: https://en.wikipedia.org/wiki/MediaWiki:Wikibase-sitelinks-wikipedia [22:28:44] I think I will roll forward 
wikidata. [22:29:18] okay, let me know if there's anything I should do [22:29:56] (03PS7) 10Dzahn: remove netmon1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) [22:30:31] Very odd. [22:31:29] but only 20 minutes, which is nice [22:32:52] kind of terrifying. Doing the same thing twice shouldn't give us different results :\ [22:33:17] yeah :( [22:34:14] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: wikidatawiki back to 1.34.0-wmf.13 [22:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:28] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [22:36:05] (03CR) 10Dzahn: [C: 03+2] remove netmon1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [22:37:44] !log Deployed fix for T224240, accidentally rode along with Tyler's no-op scap [22:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:50] (03PS1) 10Thcipriani: Revert "Revert wikidata to 1.34.0-wmf.11 refs T227814" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522201 [22:39:14] RoanKattouw: you're welcome :) [22:40:02] hah [22:40:13] when a no-op isn't a no-op [22:41:18] (03CR) 10Thcipriani: [C: 03+2] Revert "Revert wikidata to 1.34.0-wmf.11 refs T227814" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522201 (owner: 10Thcipriani) [22:41:49] :-( [22:42:27] (03Merged) 10jenkins-bot: Revert "Revert wikidata to 1.34.0-wmf.11 refs T227814" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522201 (owner: 10Thcipriani) [22:42:45] (03PS1) 10Andrew Bogott: mwopenstackclients: 30 second timeouts for all http actions [puppet] - 10https://gerrit.wikimedia.org/r/522204 (https://phabricator.wikimedia.org/T227785) [22:43:03] now there's just https://phabricator.wikimedia.org/T227822 which I'm not 100% sure about [22:43:10] (as 
blockers to this week's train) [22:43:28] (03CR) 10jenkins-bot: Revert "Revert wikidata to 1.34.0-wmf.11 refs T227814" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522201 (owner: 10Thcipriani) [22:44:03] I wonder if that's now resolved as a result of all wikis being on the same version again? [22:44:54] seems to be gone from the fatalmonitor/mediawiki-errors dashboard [22:46:31] huh, so it'll come back on Tuesday? [22:47:15] (03PS2) 10Andrew Bogott: mwopenstackclients: 300 second timeouts for all http actions [puppet] - 10https://gerrit.wikimedia.org/r/522204 (https://phabricator.wikimedia.org/T227785) [22:47:15] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) All items on rows A, B and C have been updated. Row D will need some on-site verification [22:49:02] or there was some incompatibility introduced between wmf.11 and wmf.13 [22:49:14] meaning that it was a one-time thing [22:51:30] (03CR) 10BryanDavis: [C: 03+1] mwopenstackclients: 300 second timeouts for all http actions [puppet] - 10https://gerrit.wikimedia.org/r/522204 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [22:55:35] (03CR) 10Krinkle: [C: 03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [22:56:58] greg-g: thcipriani: likely one-off. It looks like the problem happened because the code changed from an unversioned cache key to a versioned cache key, and then made a breaking change in the value, in the same branch. [22:57:27] Krinkle: ah, well, that's "good" [22:57:38] So what happened is probably something like wmf.13 populating the versioned key with the new value format and wmf.11 reading it regardless and then breaking. 
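[editor's note] The hazard described above (one branch writes a new value format into a shared cache, another branch reads it expecting the old format) can be illustrated with a minimal sketch. It is not the MediaWiki code involved; the key name, both value formats, and all function names are hypothetical, and the log does not establish how the older branch came to read the newer entry.

```python
from typing import Optional, Tuple


def write_new_format(cache: dict, key: str) -> None:
    # Newer branch: stores a structured value (hypothetical format).
    cache[key] = {"title": "Q42", "labels": 3}


def read_old_format(cache: dict, key: str) -> Tuple[str, int]:
    # Older branch: assumes a "title|labels" string and does no checking.
    title, labels = cache[key].split("|")
    return title, int(labels)


def read_defensively(cache: dict, key: str) -> Optional[Tuple[str, int]]:
    # Treat any value that is not in the expected format as a cache miss.
    value = cache.get(key)
    if not isinstance(value, str):
        return None
    try:
        title, labels = value.split("|")
        return title, int(labels)
    except ValueError:
        return None


shared_cache: dict = {}  # stands in for a cache shared by both branches
write_new_format(shared_cache, "example:v2:42")

try:
    read_old_format(shared_cache, "example:v2:42")
except AttributeError as err:
    # The dict written by the newer branch has no .split() method.
    print("old reader broke:", err)

# The defensive reader reports a miss instead of failing.
print(read_defensively(shared_cache, "example:v2:42"))
```

Validating the value on read is only one possible mitigation; the approach suggested by the discussion is to ship the key change and the value-format change in separate branches so that no two deployed versions disagree about the same key.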
[22:57:47] not sure how/why but it wouldn't happen again [22:58:24] greg-g: haven't finished crawling through all of https://logstash.wikimedia.org/app/kibana#/dashboard/0a9ecdc0-b6dc-11e8-9d8f-dbc23b470465, but nothing frequent remaining at least. [22:59:23] jouncebot: next [22:59:23] In 0 hour(s) and 0 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T2300) [22:59:36] !log netmon1003 - removing servermon - servermon.wikimedia.org is being decom'ed (T198939) [22:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:42] T198939: Decommission servermon - https://phabricator.wikimedia.org/T198939 [23:00:04] MaxSem, RoanKattouw, and Niharika: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190711T2300). [23:00:04] MatmaRex and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:52] hi [23:04:04] I suppose I can SWAT since I'm already setup [23:04:35] (03PS3) 10Thcipriani: Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 10DLynch) [23:05:07] (03PS1) 10BryanDavis: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) [23:05:21] RoanKattouw: still around for SWAT? [23:05:51] thanks thcipriani [23:06:41] MatmaRex: is there any particular order for deploying these files? Will one way or the other cause a log explosion? 
[23:07:42] urandom: restbase1017 should be usable now [23:07:46] 10Operations, 10ops-eqiad, 10Performance-Team (Radar): (OoW) tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628 (10Cmjohnson) 05Open→03Declined Since there is no need to replace these disks...declining the task [23:08:04] urandom: i see your shell user has been created and it's on stretch now [23:08:08] thcipriani: it shouldn't matter [23:08:32] MatmaRex: okie doke, I'll just do a sync-dir then [23:08:50] (03CR) 10Thcipriani: [C: 03+2] Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 10DLynch) [23:09:03] urandom: puppet currently fails with "Error: Execution of '/usr/bin/scap deploy-local ..etc" so i guess it needs a deployment step next [23:09:05] (03PS2) 10BryanDavis: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) [23:09:50] (03Merged) 10jenkins-bot: Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 10DLynch) [23:10:00] mutante: yeah, that could be, I'll have a look [23:10:09] (03CR) 10jenkins-bot: Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 10DLynch) [23:10:17] mutante: thanks btw! [23:11:33] urandom: yw. 
i repeated the reimage command and it kind of failed but after puppet was already done running so this time i could ssh to it and just run puppet again and that should be it [23:13:00] MatmaRex: live on mwdebug1002, check please (if possible) [23:13:36] thcipriani: seems good [23:13:46] cool, going live [23:15:42] !log thcipriani@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:521338|Oversample all EditAttemptStep events on VE-as-mobile-default wikis]] T227317 (duration: 00m 50s) [23:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:48] ^ MatmaRex should be live now [23:15:49] T227317: Oversample mobile wikitext EditAttemptStep events on default editor A/B test wikis - https://phabricator.wikimedia.org/T227317 [23:17:00] thcipriani: yup, seem to be now. thank you! [23:17:26] yw! glad all's working! [23:18:36] (03PS3) 10BryanDavis: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) [23:26:02] (03PS4) 10BryanDavis: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) [23:31:20] (03CR) 10BryanDavis: "PCC output: https://puppet-compiler.wmflabs.org/compiler1001/17333/" [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [23:31:37] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10stjn) >>! In T211881#4878324, @akosiaris wrote: > On an unrelated note, I find funny that the top wiki in... [23:32:05] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn) >>! 
In T222960#5325727, @Cmjohnson wrote: > @dzahn, I need to know I don't know what... [23:32:07] (03CR) 10BryanDavis: "Worth discussing if this is the "right" fix for this problem or not." [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [23:32:53] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn) a:03Eevans [23:33:41] 10Operations, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10Ladsgroup) [23:33:51] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Dzahn) This should be good to use now so you can take it back into service. Let us know if y... [23:33:53] 10Operations, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10Ladsgroup) p:05Triage→03Unbreak! [23:35:30] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Cmjohnson) 05Open→03Declined Declining the task since the server is out of warranty. 
[23:36:42] !log eevans@deploy1001 Started deploy [cassandra/logstash-logback-encoder@d085ffa]: (no justification provided) [23:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:42] !log eevans@deploy1001 deploy aborted: (no justification provided) (duration: 02m 00s) [23:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:40] thcipriani: Oh crap, here now sorry [23:40:10] I can SWAT my own patch if that makes things easier for you, sorry for missing your ping earlier [23:40:11] RoanKattouw: no worries :) I'm still around if you want me to SWAT [23:40:17] Yes please [23:40:24] 10Operations, 10User-DannyS712, 10Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (10DannyS712) 05Open→03Resolved a:03Jfoster81747 Per comment on https://gerrit.wikimedia.org/r/517775: `Sorry, my fault,... [23:40:33] * thcipriani does [23:40:40] (03PS2) 10Dzahn: delete servermon role and module [puppet] - 10https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) [23:45:31] thcipriani: I don't think there's any testing I can do because I can't reproduce the error using any of the URLs from previous occurrences in logstash, so we'll just have to monitor the logs to see if it goes away [23:45:32] (03PS1) 10CDanis: conftool::client: convert hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/522217 [23:45:44] !log eevans@deploy1001 Started deploy [cassandra/logstash-logback-encoder@d085ffa]: (no justification provided) [23:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:08] RoanKattouw: k, sounds good. 
[23:47:09] (PS2) Dzahn: mariadb: revoke servermon grants [puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) [23:47:32] (CR) jerkins-bot: [V: -1] mariadb: revoke servermon grants [puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn) [23:47:38] Operations, Cassandra, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1017.eqiad.wmnet'] ` Of wh... [23:47:41] !log eevans@deploy1001 Finished deploy [cassandra/logstash-logback-encoder@d085ffa]: (no justification provided) (duration: 01m 56s) [23:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:21] !log eevans@deploy1001 Started deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960) [23:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:27] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [23:48:27] (PS3) Dzahn: mariadb: revoke servermon grants [puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) [23:49:08] !log eevans@deploy1001 Finished deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960) (duration: 00m 47s) [23:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:31] (PS2) Dzahn: mariadb::ferm_misc: remove firewall rule for servermon [puppet] - https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) [23:52:46] (PS2) Dzahn: traffic servers: remove netmon1003 director and backend [puppet] - https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) [23:55:14] (PS1) Eevans: 
Updated list of RESTBase hosts [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/522218 (https://phabricator.wikimedia.org/T222960) [23:56:01] RoanKattouw: okie doke, going live [23:57:02] (PS2) Dzahn: proxysql: add icinga notes_url [puppet] - https://gerrit.wikimedia.org/r/521382 [23:58:19] !log thcipriani@deploy1001 Synchronized php-1.34.0-wmf.13/includes/watcheditem/WatchedItemStore.php: SWAT: [[gerrit:522155|WatchedItemStore: Fix fatal when revision is deleted]] T226741 (duration: 00m 51s) [23:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:26] T226741: WatchedItemStore fatals on some logged-in pageviews: Argument to RevisionStore::getNextRevision must be RevisionRecord, null given - https://phabricator.wikimedia.org/T226741 [23:58:28] ^ RoanKattouw live!