[03:07:05] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:19:27] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 22312 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[05:00:05] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[05:59:19] (PS1) Privacybatm: Add favicon to Tendril [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110)
[05:59:21] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[06:10:55] (CR) Marostegui: [C: +2] production-m5.sql: Remove nova DB grants [puppet] - https://gerrit.wikimedia.org/r/583052 (https://phabricator.wikimedia.org/T248313) (owner: Marostegui)
[06:13:07] !log Remove grants 'nova'@'208.80.154.23' on nova.* - T248313
[06:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:14] T248313: Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313
[06:15:34] !log Rename tables on db1133 (m5 master) nova_api database - T248313
[06:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:13] Operations, Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (Privacybatm) {F31701001} This patch is tested for 404-page only!
[06:24:18] (CR) Privacybatm: "> Patch Set 1:" [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[06:30:10] (CR) Privacybatm: "This patch is tested for 404-page only!" [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[06:33:54] (PS1) Marostegui: install_server: Do not reimage db1111 [puppet] - https://gerrit.wikimedia.org/r/583205
[06:33:58] (PS2) Marostegui: wmnet: Replace dbproxy1010 with dbproxy1018 [dns] - https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520)
[06:35:38] Operations, DBA, Data-Services, Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (Marostegui) This can now go after merging: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/582961/ https://gerrit.wikimedia.o...
[06:35:53] (PS2) Marostegui: wikireplica_dns: Replace dbproxy1010 with dbproxy1018 [puppet] - https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520)
[06:38:24] Operations, LDAP-Access-Requests: Add user to ops ldap group - https://phabricator.wikimedia.org/T248445 (DubOSv10)
[06:39:09] Operations, DBA, Data-Services, Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (Marostegui) Everything seems to be working fine on dbproxy1019 and dbproxy1018 after merging the above changes. Everything is reachab...
[06:39:36] (CR) Marostegui: [C: +2] install_server: Do not reimage db1111 [puppet] - https://gerrit.wikimedia.org/r/583205 (owner: Marostegui)
[06:45:13] (PS2) Muehlenhoff: Add deneb to site.pp [puppet] - https://gerrit.wikimedia.org/r/583104
[06:52:12] (CR) Muehlenhoff: [C: +2] Add deneb to site.pp [puppet] - https://gerrit.wikimedia.org/r/583104 (owner: Muehlenhoff)
[06:57:33] !log Deploy schema change on db2129 (s6 codfw master)
[06:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 for upgrade', diff saved to https://phabricator.wikimedia.org/P10757 and previous config saved to /var/cache/conftool/dbconfig/20200325-070946-marostegui.json
[07:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:15] (PS1) Marostegui: install_server: Allow reimage db1137 [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604)
[07:20:53] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[07:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:44] (PS4) Giuseppe Lavagetto: ProductionServices: switch eventgate-main to use envoy [mediawiki-config] - https://gerrit.wikimedia.org/r/576009 (https://phabricator.wikimedia.org/T244843)
[07:38:13] (CR) Elukey: "The change makes sense, I am wondering if it is wise to wait a bit before deploying since it will cause changes to the way we use memcache" [mediawiki-config] - https://gerrit.wikimedia.org/r/575098 (owner: Aaron Schulz)
[07:42:02] !log reboot scs-eqsin for CPU usage
[07:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:48] <_joe_> jouncebot: next
[07:42:48] In 3 hour(s) and 17 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T1100)
[07:44:35] (CR) Giuseppe Lavagetto: "Ditto for me. I'm not 100% sure I want to activate this in the current uncertain situation. I'll have to discuss this with the rest of SRE" [mediawiki-config] - https://gerrit.wikimedia.org/r/575098 (owner: Aaron Schulz)
[07:46:36] (CR) Thiemo Kreuz (WMDE): [C: +1] TwoColConflict: Limited default deployment CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) (owner: WMDE-Fisch)
[07:49:00] (CR) Muehlenhoff: base::firewall: add a new global abuse_nets (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583090 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[07:50:25] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
[07:50:25] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
[07:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:44] (CR) Jcrespo: [C: -1] "See comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[08:02:02] (CR) Marostegui: "thank you! fixing!" [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[08:02:15] PROBLEM - MariaDB read only m5 on db1117 is CRITICAL: Could not connect to localhost:3325 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:02:29] PROBLEM - MariaDB Slave SQL: m3 on db1117 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:03:01] PROBLEM - MariaDB Slave Lag: m5 on db1117 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:03:03] PROBLEM - MariaDB Slave IO: m3 on db1117 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:03:03] PROBLEM - MariaDB Slave SQL: m5 on db1117 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:03:07] what?
[08:03:10] checking
[08:03:10] <_joe_> uh
[08:03:17] no user impact though
[08:03:18] <_joe_> I'll stop deploying
[08:03:19] it is a sby host
[08:03:29] PROBLEM - MariaDB read only m3 on db1117 is CRITICAL: Could not connect to localhost:3323 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:03:29] PROBLEM - MariaDB Slave Lag: m3 on db1117 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:03:29] PROBLEM - MariaDB Slave IO: m5 on db1117 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:04:59] mmmm
[08:05:18] mysql is up but the socket is gone :-/
[08:05:50] so mysql is up and running, it is a monitoring issue, the socket is gone and hence the monitoring cannot happen
[08:06:30] going to downtime it so I can troubleshoot
[08:07:57] I wonder why only m3 and m5 sockets got deleted and not m1 and m2
[08:08:51] wow, it has been gone for days :-/
[08:09:09] <_joe_> marostegui: so maybe an expired downtime?
[08:09:14] maybe
[08:09:18] but m1 and m2 sockets are there
[08:09:24] <_joe_> can I proceed with my deployments?
[08:09:31] so weird
[08:09:33] _joe_: yep
[08:10:27] (PS2) Marostegui: install_server: Allow reimage db1137 [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604)
[08:12:05] they are gone since the 23rd
[08:12:21] m3 and m5 that is
[08:12:36] I think I know what might have happened
[08:12:44] !log Stop all mysql daemons on db1117
[08:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:14] <_joe_> !log upgrading all eventgate-main to envoy 1.13.1 T246868
[08:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:19] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868
[08:15:27] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
[08:15:27] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
[08:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:24] Yeah, it has to do with the bug we had on multiinstance hosts and the old package, as I restarted m1 and m2 past week, it deleted all the other sockets (that's the bug). I have upgraded the package now to the one that has it fixed
[08:17:45] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:17:52] ^ me
[08:18:05] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:18:19] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:18:23] !log Reboot db1117 for full-upgrade
[08:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:35] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:18:35] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:21] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:29] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:37] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:19:43] ^ all that is expected
[08:19:57] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:22:33] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:22:49] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:22:49] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:23:07] RECOVERY - MariaDB read only m5 on db1117 is OK: Version 10.1.44-MariaDB, Uptime 38s, read_only: True, 35.08 QPS, connection latency: 0.002527s, query latency: 0.000476s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:23:17] RECOVERY - MariaDB Slave SQL: m3 on db1117 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:23:20] (CR) Jcrespo: "Hi, Privacybatm," [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[08:23:35] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:23:43] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:23:49] RECOVERY - MariaDB Slave Lag: m5 on db1117 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:23:51] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:23:51] RECOVERY - MariaDB Slave IO: m3 on db1117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:23:53] RECOVERY - MariaDB Slave SQL: m5 on db1117 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:24:05] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:24:09] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:24:25] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:26:26] Operations, Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (jcrespo) Thank you very much for your contribution. I have left a comment on the change. In order for you to not lose much time, I have left a transparent .ico file here: (you can use it or create...
[08:26:45] Operations, DBA, Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (jcrespo)
[08:27:41] (PS1) ArielGlenn: fix name of option in a fixup script help message [dumps] - https://gerrit.wikimedia.org/r/583289
[08:29:34] (CR) Jcrespo: [C: +1] install_server: Allow reimage db1137 [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[08:29:43] (CR) Marostegui: [C: +2] install_server: Allow reimage db1137 [puppet] - https://gerrit.wikimedia.org/r/583211 (https://phabricator.wikimedia.org/T246604) (owner: Marostegui)
[08:34:52] (PS1) Marostegui: install_server: Reimage db1137 with buster [puppet] - https://gerrit.wikimedia.org/r/583290
[08:36:34] (CR) Marostegui: [C: +2] install_server: Reimage db1137 with buster [puppet] - https://gerrit.wikimedia.org/r/583290 (owner: Marostegui)
[08:37:50] Operations, LDAP-Access-Requests: Add user "dubosv10" to ops ldap group - https://phabricator.wikimedia.org/T248445 (Aklapper)
[08:38:32] !log Reimage db1137
[08:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:42] Operations, LDAP-Access-Requests: Add user "dubosv10" to ops ldap group - https://phabricator.wikimedia.org/T248445 (Aklapper) Open→Stalled Hi @DubOSv10, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Please see and follow https://phabricator.wikimedia.org/tag/ld...
[08:51:33] (PS1) Elukey: profile::presto::server: add nagios process check [puppet] - https://gerrit.wikimedia.org/r/583291
[08:53:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:41] (PS2) Jcrespo: mariadb-backups: Update rowinfo format to include index name [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579365 (https://phabricator.wikimedia.org/T244884)
[08:54:06] (CR) jerkins-bot: [V: -1] mariadb-backups: Update rowinfo format to include index name [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579365 (https://phabricator.wikimedia.org/T244884) (owner: Jcrespo)
[08:55:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:52] (CR) Elukey: [C: +2] profile::presto::server: add nagios process check [puppet] - https://gerrit.wikimedia.org/r/583291 (owner: Elukey)
[08:56:04] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[09:02:08] (PS3) Jcrespo: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[09:02:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10758 and previous config saved to /var/cache/conftool/dbconfig/20200325-090227-marostegui.json
[09:02:30] (CR) jerkins-bot: [V: -1] CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[09:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:04] (PS1) Vgutierrez: ATS: Enable inbound TLSv1.3 on upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/583292 (https://phabricator.wikimedia.org/T170567)
[09:07:28] (PS2) Vgutierrez: ATS: Enable inbound TLSv1.3 on upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/583292 (https://phabricator.wikimedia.org/T170567)
[09:12:10] (CR) Vgutierrez: "pcc shows a NOOP on text and the expected change on upload: https://puppet-compiler.wmflabs.org/compiler1003/21557/" [puppet] - https://gerrit.wikimedia.org/r/583292 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:13:15] (CR) Jcrespo: "So Volans is our expert on Cumin, added here, but I think the best way to add a test here for our execution method is to mock the run comm" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[09:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10759 and previous config saved to /var/cache/conftool/dbconfig/20200325-091421-marostegui.json
[09:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:55] (PS1) Ema: ATS: disable transaction_active_timeout_in for EventStreams [puppet] - https://gerrit.wikimedia.org/r/583295 (https://phabricator.wikimedia.org/T242767)
[09:16:17] (CR) Ema: [C: +1] ATS: Enable inbound TLSv1.3 on upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/583292 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:19:45] (CR) Vgutierrez: [C: +1] ATS: disable transaction_active_timeout_in for EventStreams [puppet] - https://gerrit.wikimedia.org/r/583295 (https://phabricator.wikimedia.org/T242767) (owner: Ema)
[09:23:34] !log upgrade ATS to 8.0.6-1wm3 on upload@eqsin - T170567
[09:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:39] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[09:23:49] (CR) Ema: "pcc here for reference: https://puppet-compiler.wmflabs.org/compiler1003/21558/" [puppet] - https://gerrit.wikimedia.org/r/583295 (https://phabricator.wikimedia.org/T242767) (owner: Ema)
[09:27:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:58] (CR) Vgutierrez: [C: +2] ATS: Enable inbound TLSv1.3 on upload@eqsin [puppet] - https://gerrit.wikimedia.org/r/583292 (https://phabricator.wikimedia.org/T170567) (owner: Vgutierrez)
[09:54:36] !log Enable inbound TLSv1.3 on upload@eqsin - T170567
[09:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:42] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567
[09:55:39] Operations, netops: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 (ayounsi) `lang=diff,name=cr2-eqdfw [edit routing-options rib inet6.0 aggregate route 2620:0:860::/46] - policy BGP_aggregate_contributors_eqiad; + policy BGP_from_LVS; [edit routi...
[09:56:35] !log change aggregate policy for 2620:0:860::/46 on cr2-eqdfw - T236785
[09:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:40] T236785: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785
[10:04:35] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' .
[10:04:35] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' .
[10:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:15] (PS8) Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/580343
[10:15:31] Operations, Traffic, netops: BGP: Investigate isolating codfw and eqiad - https://phabricator.wikimedia.org/T246721 (ayounsi) digging more into it, codfw already advertise its own /23 (and v6 /48).
[10:19:46] !log change aggregate policy for v4 prefixes on cr2-eqdfw - T236785
[10:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:51] T236785: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785
[10:20:56] (PS1) Elukey: Set maximum failover retry attempts for HDFS in Hadoop Test [puppet] - https://gerrit.wikimedia.org/r/583303 (https://phabricator.wikimedia.org/T244499)
[10:22:53] (CR) Arturo Borrero Gonzalez: "In jessie this package still exists. I wonder if this will break jessie servers (the few that are left anyway)." [puppet] - https://gerrit.wikimedia.org/r/583108 (owner: Muehlenhoff)
[10:22:55] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:27:27] (CR) Elukey: [C: +2] Set maximum failover retry attempts for HDFS in Hadoop Test [puppet] - https://gerrit.wikimedia.org/r/583303 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[10:28:49] (CR) Muehlenhoff: "puppet-common also exists on buster-wikimedia, but we don't already install it for a while. It's also a transition package on jessie, see " [puppet] - https://gerrit.wikimedia.org/r/583108 (owner: Muehlenhoff)
[10:31:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22366 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:36:10] Operations, netops: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785 (ayounsi) `lang=diff,name=cr3-knams [edit routing-options rib inet6.0 aggregate route 2620:0:862::/48] - policy BGP_aggregate_contributors; + policy BGP_from_LVS; [edit routing-opt...
[10:37:52] !log change aggregate policy for 2620:0:862::/48 on cr3-knams - T236785
[10:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:57] T236785: Configure conditional advertising in eqdfw and knams - https://phabricator.wikimedia.org/T236785
[10:39:02] (CR) Giuseppe Lavagetto: "seems to DTRT https://puppet-compiler.wmflabs.org/compiler1001/21559/" [puppet] - https://gerrit.wikimedia.org/r/580343 (owner: Giuseppe Lavagetto)
[10:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1137', diff saved to https://phabricator.wikimedia.org/P10760 and previous config saved to /var/cache/conftool/dbconfig/20200325-103938-marostegui.json
[10:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:48] (CR) Arturo Borrero Gonzalez: [C: +1] "I did a quick cumin tests and couldn't find a single VM with this package installed anyway." [puppet] - https://gerrit.wikimedia.org/r/583108 (owner: Muehlenhoff)
[10:44:46] (CR) Jbond: [C: +1] Remove puppet-common [puppet] - https://gerrit.wikimedia.org/r/583108 (owner: Muehlenhoff)
[10:54:47] (CR) Arturo Borrero Gonzalez: [C: +1] "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/583108 (owner: Muehlenhoff)
[10:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1137', diff saved to https://phabricator.wikimedia.org/P10761 and previous config saved to /var/cache/conftool/dbconfig/20200325-105503-marostegui.json
[10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:43] (PS2) Privacybatm: Add favicon to Tendril [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T1100).
[11:00:04] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:51] * Urbanecm around, but leaves to CFisch_WMDE
[11:01:43] (CR) Privacybatm: "> Patch Set 1:" [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[11:02:12] * CFisch_WMDE currently in a call so happy if Urbanecm can do it.
[11:02:41] CFisch_WMDE: I am, will you be able to test it?
[11:02:54] Yepp
[11:02:55] (PS4) Urbanecm: TwoColConflict: Limited default deployment InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: Andrew-WMDE)
[11:02:57] nice
[11:03:02] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: Andrew-WMDE)
[11:03:34] (PS4) Urbanecm: TwoColConflict: Limited default deployment CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) (owner: WMDE-Fisch)
[11:03:39] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) (owner: WMDE-Fisch)
[11:04:00] (Merged) jenkins-bot: TwoColConflict: Limited default deployment InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: Andrew-WMDE)
[11:04:41] (Merged) jenkins-bot: TwoColConflict: Limited default deployment CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) (owner: WMDE-Fisch)
[11:05:43] CFisch_WMDE: pulled onto mwdebug1001, lmk
[11:06:53] Works like a charm. Thanks Urbanecm !
[11:06:58] okay, syncing
[11:08:22] !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1091 load, increase main traffic on all other s4 instances', diff saved to https://phabricator.wikimedia.org/P10762 and previous config saved to /var/cache/conftool/dbconfig/20200325-110821-jynus.json
[11:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 81cda0f: TwoColConflict: Limited default deployment InitialiseSettings.php (T244863) (duration: 01m 17s)
[11:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:57] T244863: Deploy Two-Column Edit Conflict as the default workflow for a small set of wikis - https://phabricator.wikimedia.org/T244863
[11:10:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 81cda0f: TwoColConflict: Limited default deployment InitialiseSettings.php (T244863; take II) (duration: 01m 06s)
[11:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:25] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 7b8d7c5: TwoColConflict: Limited default deployment CommonSettings.php (T244863) (duration: 01m 06s)
[11:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:30] CFisch_WMDE: should be all!
[11:12:49] Cool, thanks again!
[11:13:11] happy to help!
[11:13:26] (PS1) Giuseppe Lavagetto: services_proxy: add keepalive, retries to eventgate-main too [puppet] - https://gerrit.wikimedia.org/r/583305
[11:13:33] (PS3) Urbanecm: Add gwtoolset to available rights to allow granting to global groups [mediawiki-config] - https://gerrit.wikimedia.org/r/582876
[11:14:03] (CR) Urbanecm: [C: +2] Add gwtoolset to available rights to allow granting to global groups [mediawiki-config] - https://gerrit.wikimedia.org/r/582876 (owner: Urbanecm)
[11:16:29] (Merged) jenkins-bot: Add gwtoolset to available rights to allow granting to global groups [mediawiki-config] - https://gerrit.wikimedia.org/r/582876 (owner: Urbanecm)
[11:19:21] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: 59412db: Add gwtoolset to available rights to allow granting to global groups (duration: 01m 07s)
[11:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw125[0-3].eqiad.wmnet
[11:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:48] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw123[2-5].eqiad.wmnet
[11:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:54] (CR) Jcrespo: [C: +1] "Great job- you correctly were able to submit and amendment, and your comment here and a heads up on Phabricator was very useful." [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[11:21:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[11:21:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:28] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: decom ` mw[1232-1235].eqiad.wmnet `
[11:21:29] !log EU SWAT done
[11:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[11:22:14] (CR) Marostegui: [C: +1] "Thank you!!" [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[11:22:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:20] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 4 host(s) and their services with reason: decom ` mw[1250-1253].eqiad.wmnet `
[11:22:44] (CR) Jcrespo: [C: +2] Add favicon to Tendril [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[11:23:00] (CR) Jcrespo: [V: +2 C: +2] Add favicon to Tendril [software/tendril] - https://gerrit.wikimedia.org/r/583203 (https://phabricator.wikimedia.org/T204110) (owner: Privacybatm)
[11:23:35] (CR) Dzahn: "more old servers bought in 2014 in https://rt.wikimedia.org/Ticket/Display.html?id=8786" [puppet] - https://gerrit.wikimedia.org/r/583114 (https://phabricator.wikimedia.org/T247780) (owner: Dzahn)
[11:24:22] (CR) MSantos: [C: +2] Update config template for mobileapps [deployment-charts] - https://gerrit.wikimedia.org/r/583166 (owner: Mholloway)
[11:24:32] (PS1) Marostegui: Revert "install_server: Allow reimage db1137" [puppet] - https://gerrit.wikimedia.org/r/583307
[11:24:39] (Merged) jenkins-bot: Update config template for mobileapps [deployment-charts] - https://gerrit.wikimedia.org/r/583166 (owner: Mholloway)
[11:25:21] jouncebot: now
[11:25:21] For the next 0 hour(s) and 34 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T1100)
[11:26:13] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/21560/mw1261.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/583305 (owner: Giuseppe Lavagetto)
[11:26:16] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db1137" [puppet] - https://gerrit.wikimedia.org/r/583307 (owner: Marostegui)
[11:26:45] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw123[2-5].eqiad.wmnet
[11:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:03] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw125[0-3].eqiad.wmnet
[11:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:17] (PS1) KartikMistry: apertium-mk-en: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-mk-en] - https://gerrit.wikimedia.org/r/583308 (https://phabricator.wikimedia.org/T247585)
[11:31:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[11:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:15] !log decom mw1232 - mw1235
[11:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:14] (PS1) CDanis: depool codfw for router maintenance [dns] - https://gerrit.wikimedia.org/r/583309 (https://phabricator.wikimedia.org/T248394)
[11:33:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[11:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:49] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1232-1235].eqiad.wmnet` - mw1232.eqiad.wmnet (**PASS**) - Downtimed host on Icinga...
[11:34:17] (CR) CDanis: [C: +2] depool codfw for router maintenance [dns] - https://gerrit.wikimedia.org/r/583309 (https://phabricator.wikimedia.org/T248394) (owner: CDanis)
[11:35:03] !log depool codfw for router maintenance T248394
[11:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:08] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394
[11:37:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[11:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:56] !log decom mw1250 - mw1253
[11:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[11:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:38] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for
hosts: `mw[1250-1253].eqiad.wmnet` - mw1250.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... [11:40:54] (03PS2) 10Dzahn: site: decom mw125[0-3] and mw123[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/583114 (https://phabricator.wikimedia.org/T247780) [11:41:49] (03CR) 10Dzahn: [C: 03+2] "decom'ed with cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/583114 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [11:43:03] 10Operations, 10DBA, 10Patch-For-Review: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (10jcrespo) 05Open→03Resolved The work is completed after we deployed the change to production: {F31701300} Thank you very much @Privacybatm for your contribution. [11:46:10] (03CR) 10Volans: "> Patch Set 3:" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [11:46:53] (03PS1) 10Dzahn: DHCP: remove decom'ed appservers from rack D5 [puppet] - 10https://gerrit.wikimedia.org/r/583313 (https://phabricator.wikimedia.org/T247780) [11:46:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2115 for upgrade', diff saved to https://phabricator.wikimedia.org/P10763 and previous config saved to /var/cache/conftool/dbconfig/20200325-114655-marostegui.json [11:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:52] (03CR) 10Dzahn: [C: 03+2] "the entire "mw122", "mw123", "mw124" ranges are already state decom in netbox. 
plus half of mw125" [puppet] - 10https://gerrit.wikimedia.org/r/583313 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [11:49:10] (03PS2) 10Dzahn: DHCP: remove decom'ed appservers from rack D5 [puppet] - 10https://gerrit.wikimedia.org/r/583313 (https://phabricator.wikimedia.org/T247780) [11:50:22] !log cr1-codfw: `set chassis fpc 5 inline-services flex-flow-sizing` and `request chassis fpc restart slot 5` T248394 [11:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:27] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [11:52:16] <_joe_> mutante: uhhhh i asked to stop decomming those servers on monday :/ [11:52:27] PROBLEM - PHP7 jobrunner on mw2265 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:27] PROBLEM - PHP7 jobrunner on mw2158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:29] PROBLEM - PHP7 jobrunner on mw2281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:48] _joe_: oh :/ wasn't aware. stopping now. 
i did only 4 per group to be careful [11:52:54] PROBLEM - PHP7 jobrunner on mw2152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:54] PROBLEM - PHP7 jobrunner on mw2154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:54] PROBLEM - PHP7 jobrunner on mw2156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:54] PROBLEM - PHP7 jobrunner on mw2160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:52:58] PROBLEM - PHP7 jobrunner on mw2260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:01] <_joe_> what's going on? [11:53:02] PROBLEM - PHP7 jobrunner on mw2161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:06] PROBLEM - PHP7 jobrunner on mw2282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:20] PROBLEM - PHP7 jobrunner on mw2267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:24] PROBLEM - PHP7 jobrunner on mw2278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:24] PROBLEM - PHP7 jobrunner on mw2279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:28] PROBLEM - Nginx local proxy to videoscaler on mw2158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:28] PROBLEM - Nginx local proxy to apache on mw2172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:53:28] PROBLEM - PHP7 rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:28] PROBLEM - PHP7 rendering on mw2189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:28] PROBLEM - Apache HTTP on mw2175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:53:28] PROBLEM - PHP7 rendering on mw2292 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:28] PROBLEM - PHP7 rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:29] PROBLEM - PHP7 rendering on mw2294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:29] PROBLEM - PHP7 rendering on mw2287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:30] PROBLEM - PHP7 rendering on mw2361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:30] PROBLEM - PHP7 rendering on mw2328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:31] PROBLEM - PHP7 jobrunner on mw2157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [11:53:31] PROBLEM - Nginx local proxy to apache on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:53:32] PROBLEM - PHP7 rendering on mw2195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:53:32] PROBLEM - Nginx local proxy to apache on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [11:54:01] <_joe_> I can't reach those servers [11:54:15] VRRP should have failed over [11:54:19] <_joe_> codfw Is unreachable [11:54:25] same for OSPF [11:54:26] XioNoX: anyting WIP? [11:54:27] _joe_: router maintenance and codfw depooled, afAict [11:54:31] yes [11:54:34] (03PS1) 10Marostegui: install_server: Allow reimage db2115 [puppet] - 10https://gerrit.wikimedia.org/r/583315 [11:54:36] <_joe_> ok sorry [11:54:42] Request from 2a00:23c6:9200:9700:41e:44a7:39c8:d465 via cp3062.esams.wmnet, ATS/8.0.6 [11:54:42] Error: 502, internal error - server connection terminated at 2020-03-25 11:54:22 GMT [11:54:44] <_joe_> so not an issue for now? [11:54:48] <_joe_> heh [11:54:51] <_joe_> see? [11:55:02] if I set mwdebug too codfw, works fine otherwise. [11:55:13] RhinosF1: okay that's expected [11:55:19] <_joe_> RhinosF1: that's expected yes [11:55:25] waiting for a restart, right? [11:55:31] yes, it should be just a few more minutes [11:55:35] paged... [11:55:49] RECOVERY - PHP7 jobrunner on mw2282 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:55:52] ugh.. laptop about to shutdown and need to find an adapter for euro socket [11:55:52] RECOVERY - PHP7 jobrunner on mw2267 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:55:57] and there we go.. 
that seems to come back [11:55:58] RECOVERY - PHP7 jobrunner on mw2278 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:55:58] RECOVERY - PHP7 jobrunner on mw2279 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:56:04] RECOVERY - Apache HTTP on mw2175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:04] RECOVERY - PHP7 jobrunner on mw2157 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:56:04] RECOVERY - Nginx local proxy to videoscaler on mw2158 is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:56:04] RECOVERY - Nginx local proxy to apache on mw2172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 650 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:04] RECOVERY - PHP7 rendering on mw2294 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:05] RECOVERY - PHP7 rendering on mw2292 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:05] RECOVERY - PHP7 rendering on mw2328 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:05] RECOVERY - PHP7 rendering on mw2361 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:06] RECOVERY - PHP7 rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK 
- 77143 bytes in 0.311 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:06] RECOVERY - PHP7 rendering on mw2287 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.304 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:07] RECOVERY - PHP7 rendering on mw2189 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.320 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:07] RECOVERY - PHP7 rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.328 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:08] RECOVERY - Apache HTTP on mw2237 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:08] RECOVERY - Apache HTTP on mw2240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:09] RECOVERY - Nginx local proxy to apache on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 650 bytes in 0.201 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:09] RECOVERY - Nginx local proxy to apache on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 650 bytes in 0.225 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:10] RECOVERY - Nginx local proxy to apache on mw2192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 650 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:10] RECOVERY - PHP7 rendering on mw2195 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:10] apergos: router maintenance / reboot [11:56:11] 
RECOVERY - PHP7 rendering on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 77143 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:11] RECOVERY - PHP7 rendering on mw2155 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [11:56:12] RECOVERY - Apache HTTP on mw2295 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:12] RECOVERY - Apache HTTP on mw2292 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:13] RECOVERY - Apache HTTP on mw2301 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:13] RECOVERY - Nginx local proxy to jobrunner on mw2278 is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:56:48] RECOVERY - PHP7 jobrunner on mw2159 is OK: HTTP OK: HTTP/1.1 200 OK - 321 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:56:48] RECOVERY - Apache HTTP on mw2215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:48] RECOVERY - Apache HTTP on mw2169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:49] RECOVERY - Apache HTTP on mw2316 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:49] RECOVERY - Apache HTTP on mw2321 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.121 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:49] RECOVERY - Apache HTTP on mw2196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:56:49] RECOVERY - Apache HTTP on mw2220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 649 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:57:45] looks good now if we didn't send icinga that excited it got killed [11:58:03] and there's the recovery pages [11:58:43] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp2022 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2001 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp2006 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:58:59] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp2018 is OK: HTTP OK: HTTP/1.1 200 
OK - 543 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:59:01] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [11:59:01] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp2025 is OK: HTTP OK: HTTP/1.1 200 OK - 543 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:00:19] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 266 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:03:37] (03PS1) 10Arturo Borrero Gonzalez: cloud: refactor novaproxy role [puppet] - 10https://gerrit.wikimedia.org/r/583316 (https://phabricator.wikimedia.org/T135046) [12:05:45] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:08:57] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:12:54] (03PS3) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [12:13:19] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:14:18] <_joe_> uh [12:17:34] 10Operations, 10netops: Configure conditional advertising in eqdfw and knams - 
https://phabricator.wikimedia.org/T236785 (10ayounsi) Previous change rolled back as: `* 2620:0:862::/48 Self I` was not being advertised to the world anymore While there are contributing... [12:19:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: refactor novaproxy role [puppet] - 10https://gerrit.wikimedia.org/r/583316 (https://phabricator.wikimedia.org/T135046) (owner: 10Arturo Borrero Gonzalez) [12:23:45] 10Operations, 10LDAP-Access-Requests: Add user "dubosv10" to ops ldap group - https://phabricator.wikimedia.org/T248445 (10Volans) @DubOSv10 thanks for your interest. Unfortunately Netbox contains information that may be sensitive in nature, so access to it is restricted on a need-to-know basis. [12:25:09] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22324 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:25:47] 10Operations, 10SRE-Access-Requests: Request Netbox access for user "dubosv10" - https://phabricator.wikimedia.org/T248445 (10Volans) p:05Triage→03Medium [12:27:07] (03PS1) 10Arturo Borrero Gonzalez: cloud: novaproxy: fix default value for String typed hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/583318 (https://phabricator.wikimedia.org/T135046) [12:30:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: novaproxy: fix default value for String typed hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/583318 (https://phabricator.wikimedia.org/T135046) (owner: 10Arturo Borrero Gonzalez) [12:32:36] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow reimage db2115 [puppet] - 10https://gerrit.wikimedia.org/r/583315 (owner: 10Marostegui) [12:37:56] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db2115 [puppet] - 10https://gerrit.wikimedia.org/r/583315 (owner: 10Marostegui) [12:40:23] (03PS1) 10Marostegui: install_server: Reimage db2115 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/583320 
[12:40:55] 10Operations, 10MediaWiki-extensions-Translate, 10Schema-change, 10Wikimedia-Incident, 10Wikimedia-production-error: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Nikerabbit) [12:41:44] (03PS2) 10Marostegui: install_server: Reimage db2115 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/583320 [12:43:52] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2115 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/583320 (owner: 10Marostegui) [12:45:49] !log Stop MySQL on db2115 for reimage to buster [12:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:09] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:47:40] ACKNOWLEDGEMENT - HTTPS-planet on en.planet.wikimedia.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-04-21 07:03:51 +0000 (expires in 26 days) daniel_zahn new cert is already staged and will be updated in 4 days per Valentin https://wikitech.wikimedia.org/wiki/Planet.wikimedia.org [12:47:40] ACKNOWLEDGEMENT - HTTPS-wmfusercontent on phab.wmfusercontent.org is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org valid until 2020-04-21 07:03:51 +0000 (expires in 26 days) daniel_zahn new cert is already staged and will be updated in 4 days per Valentin https://phabricator.wikimedia.org/tag/phabricator/ [12:51:46] (03CR) 10Arturo Borrero Gonzalez: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) (owner: 10Arturo Borrero Gonzalez) [12:56:06] !log Deploy schema change on db1139:3316 [12:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:47] (03PS4) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [12:57:15] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22367 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:01:53] (03PS5) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [13:02:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:53] (03CR) 10Dzahn: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/583144 (owner: 10Papaul) [13:03:03] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:08] (03CR) 10Ottomata: [C: 03+1] Convert test2wiki from EventLogging repo to client-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583149 (https://phabricator.wikimedia.org/T196309) (owner: 10Krinkle) [13:05:07] (03PS6) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [13:07:24] (03PS2) 10Muehlenhoff: Remove puppet-common [puppet] - 10https://gerrit.wikimedia.org/r/583108 [13:07:51] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.08165 ge 0.01 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:08:39] (03PS4) 10Dzahn: microsites::httpd: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/572352 [13:08:56] hmm.. re: puppet failures, looks like a spike already over [13:09:25] (03CR) 10jerkins-bot: [V: 04-1] microsites::httpd: close port 80 for caching servers [puppet] - 10https://gerrit.wikimedia.org/r/572352 (owner: 10Dzahn) [13:09:53] (03PS7) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [13:10:01] was esams cp .. it seems [13:11:48] oh.. or not [13:11:59] 250 warnings of puppet dependency cycles all over the place [13:12:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove puppet-common [puppet] - 10https://gerrit.wikimedia.org/r/583108 (owner: 10Muehlenhoff) [13:13:23] mutante: it is not over [13:13:26] puppet is broken [13:13:27] moritzm: any relation to: [13:13:29] Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, (): did not find expected key while parsing a block mapping at line 164 column 5 (file: /etc/puppet/manifests/realm.pp [13:13:33] ? [13:13:34] yeah ^ that [13:15:12] looking [13:15:44] reverting, but I don't see how it could have been the root cause [13:15:57] hmm.. it started before you merged [13:16:11] but the other recent merges seem harmless too [13:16:39] (03PS1) 10RhinosF1: Removed expired throttle.php entries. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583325 [13:17:26] "zero resources tracked by Puppet" [13:17:51] (03PS2) 10RhinosF1: Removed expired throttle.php entries. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/583325 [13:17:58] there are puppet failures in Icinga dating back to 15 mins, that can't be my patch for puppet-common [13:18:10] the line it talks about: [13:18:11] $app_routes = hiera('discovery::app_routes') [13:18:18] "2020-03-25 13:14:12 0d 0h 13m 22s 3/3 WARNING: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle." [13:18:26] well that is when i check on the puppetmaster itself [13:19:18] nothing really got merged 15 min ago ... [13:19:22] puppetdb ? [13:20:03] <_joe_> cdanis: did you touch that ^^ [13:20:13] (03CR) 10Ottomata: [C: 03+1] ATS: disable transaction_active_timeout_in for EventStreams [puppet] - 10https://gerrit.wikimedia.org/r/583295 (https://phabricator.wikimedia.org/T242767) (owner: 10Ema) [13:20:28] <_joe_> mutante: gimme a couple minutes and I'll take a look [13:20:34] thanks _joe_ [13:21:04] _joe_: I don't think so [13:21:07] yea, right before that line there is the "$other_site = $site ?" logic [13:21:12] I've made no puppet changes this morning [13:21:21] <_joe_> ack thanks [13:21:57] (03PS1) 10Muehlenhoff: Revert "Remove puppet-common" [puppet] - 10https://gerrit.wikimedia.org/r/583326 [13:23:39] <_joe_> oh yes I think what moritzm did could be the problem [13:23:56] (03PS8) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [13:24:03] <_joe_> that's the first call to hiera() [13:24:09] <_joe_> in the whole catalog [13:24:11] there will be icinga storm coming in once these upgrade from WARN to CRIT. 
let's get ready to stop icinga-wm [13:24:30] <_joe_> mutante: there is no more an icinga storm [13:24:35] <_joe_> we changed that months ago [13:24:42] ok [13:24:52] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Remove puppet-common" [puppet] - 10https://gerrit.wikimedia.org/r/583326 (owner: 10Muehlenhoff) [13:25:29] <_joe_> moritzm: I'll re-install puppet-common on all the puppetmasters via cumin [13:26:01] package was removed via cumin before puppet merge? [13:26:21] _joe_: ahhh, gotcha, meh [13:26:25] <_joe_> !log cumin A:puppetmaster 'apt-get -y install puppet-common' [13:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:59] but it's just an empty transition on the puppet masters as well? [13:27:02] <_joe_> I still get the same error [13:27:06] <_joe_> so it wasn't it [13:27:17] <_joe_> there is something wrong in some hiera yaml file I guess [13:27:28] and it wasn't installed on the puppet masters per my debmonitor tab [13:27:30] <_joe_> can someone check the hieradata tree ? [13:28:01] <_joe_> there must be a yaml file that doesn't parse correctly [13:28:36] (03PS1) 10Jbond: add abuse_networks [labs/private] - 10https://gerrit.wikimedia.org/r/583327 [13:28:57] _joe_: when it started? 
[13:29:01] looking at recent changes for anything that touched .yaml
[13:29:14] 15 minutes ago and does not seem to match a merge in the repo
[13:29:17] <_joe_> volans: no idea
[13:29:19] well now 20 or so i think
[13:29:23] there was one in the private 25 minutes ago
[13:29:24] oldest errors dating back in Icinga to 24m
[13:29:25] checking
[13:29:30] <_joe_> volans: hah
[13:29:35] sans delay
[13:29:52] <_joe_> yeah I wasn't finding anything in the puppet repo
[13:29:58] <_joe_> it might be private indeed
[13:30:08] about 24 min is the oldest one i see
[13:30:13] could be the 012d1733)(jbond) abuse_networks: add new global to be used in ACL's in puppet.private
[13:30:18] matches timing-wise
[13:30:29] jbond42: yeah
[13:30:32] it's invalid
[13:30:38] i made a change to the global private hiera let me revert
[13:30:48] sorry was just reading back log
[13:30:50] a hash that contains keys and array like
[13:31:12] jbond42: line 165 and following
[13:31:27] looks a mix between a hash and an array to me
[13:31:47] ah yeah it's missing the networks keyword
[13:31:50] and related indentation
[13:31:53] jbond42: ^^^
[13:32:01] want me to fix it?
[13:32:14] volans: i think i just pushed a fix
[13:32:20] testing now
[13:32:28] puppet works again
[13:32:31] indentation is still messy
[13:32:33] but yeah
[13:32:56] wanna run a "failed only" cumin or just wait?
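The triage above boils down to lining up the oldest Icinga failure with merges across every repository that feeds the puppetmasters, including the private one. A minimal Python sketch of that correlation, assuming a hypothetical list of commit records (the `candidate_commits` helper, the record fields, the 10-minute window, and the exact timestamps are illustrative, only roughly reconstructed from the times quoted in the chat):

```python
from datetime import datetime, timedelta


def candidate_commits(commits, first_failure, window=timedelta(minutes=10)):
    """Return commits merged at most `window` before the first failure, newest first.

    Commits merged after the first failure cannot have caused it and are skipped.
    """
    hits = [
        c for c in commits
        if timedelta(0) <= first_failure - c["merged"] <= window
    ]
    return sorted(hits, key=lambda c: c["merged"], reverse=True)


# Illustrative data only: times are approximations, not taken from real repo history.
first_failure = datetime(2020, 3, 25, 13, 5)  # "oldest errors dating back ... to 24m" at 13:29
commits = [
    {"repo": "puppet", "merged": datetime(2020, 3, 25, 12, 40),
     "subject": "hypothetical earlier merge"},
    {"repo": "labs/private", "merged": datetime(2020, 3, 25, 13, 4),
     "subject": "abuse_networks: add new global to be used in ACL's"},
    {"repo": "puppet", "merged": datetime(2020, 3, 25, 13, 25),
     "subject": 'Revert "Remove puppet-common"'},
]

for c in candidate_commits(commits, first_failure):
    print(c["repo"], c["merged"].time(), c["subject"])
```

Feeding it commits from all repositories at once, rather than only the main puppet repo, is the point: the change that broke catalog compilation here lived in the private hiera data.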
[13:33:19] i guess it will be equal to running on * though
[13:33:47] probably just waiting is the same result and we don't cause a stampede on the master
[13:33:51] if running use the suggested batch size on wikitech
[13:33:56] to not kill the masters
[13:34:25] yea, and then it's not faster than the random agent run and we can just let it recover
[13:35:01] (PS9) Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046)
[13:37:19] moritzm: you can re-revert i guess
[13:37:38] ack, will do in a bit when Icinga is cleared a bit
[13:37:55] <_joe_> I would just let nature run its course
[13:38:00] volans: indentation should be fixed now
[13:38:07] _joe_: ack
[13:39:24] jbond42: <3
[13:40:57] (PS1) Muehlenhoff: Remove puppet-common [puppet] - https://gerrit.wikimedia.org/r/583335
[13:43:42] (CR) Holger Knust: [V: +1 C: +1] changeprop: Move prometheus query to use regex [deployment-charts] - https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[13:44:43] (PS3) Hnowlan: changeprop: Move prometheus query to use regex [deployment-charts] - https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193)
[13:45:55] (CR) Arturo Borrero Gonzalez: [C: +1] "PCC https://puppet-compiler.wmflabs.org/compiler1002/21567/" [puppet] - https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) (owner: Arturo Borrero Gonzalez)
[13:46:16] (CR) Hnowlan: [C: +2] changeprop: Move prometheus query to use regex [deployment-charts] - https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[13:46:34] (Merged) jenkins-bot: changeprop: Move prometheus query to use regex [deployment-charts] - https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[13:48:11] !log
hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[13:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:52] (CR) Muehlenhoff: [C: +1] "Looks good, two nits inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[13:53:48] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[13:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:46] (PS1) Jbond: network: add new function to return ip lists used in ACL's [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945)
[13:56:48] (PS1) Jbond: profile::base::firewall: add support for abuse_networks [puppet] - https://gerrit.wikimedia.org/r/583341 (https://phabricator.wikimedia.org/T233945)
[13:56:50] (PS1) Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - https://gerrit.wikimedia.org/r/583342
[13:57:25] (CR) jerkins-bot: [V: -1] network: add new function to return ip lists used in ACL's [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[13:59:48] (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[14:01:21] (PS2) Jbond: network: add new function to return ip lists used in ACL's [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945)
[14:05:26] (PS2) Jbond: profile::base::firewall: add support for abuse_networks [puppet] - https://gerrit.wikimedia.org/r/583341 (https://phabricator.wikimedia.org/T233945)
[14:06:06] (PS2) Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - https://gerrit.wikimedia.org/r/583342
[14:06:27] !log
marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:54] (PS5) Dzahn: microsites::httpd: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572352
[14:12:24] (PS3) Jbond: profile::idp::client::httpd: add check for sso redirect [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743)
[14:12:40] (CR) Jbond: "updated thanks" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:15:03] (Abandoned) Jbond: base::firewall: add a new global abuse_nets [puppet] - https://gerrit.wikimedia.org/r/583090 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[14:16:07] (PS3) Thcipriani: Integration Cluster: update gitcache nightly [puppet] - https://gerrit.wikimedia.org/r/579602
[14:16:26] (CR) Thcipriani: Integration Cluster: update gitcache nightly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579602 (owner: Thcipriani)
[14:20:09] Operations, MediaWiki-Debug-Logger, Traffic, Developer Productivity: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (Krinkle)
[14:20:37] Operations, MediaWiki-Debug-Logger, Traffic, Developer Productivity: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (Krinkle)
[14:20:49] Operations, MediaWiki-Debug-Logger, Traffic, Developer Productivity: noc.wikimedia.org doesn't route to the docroot when WikimediaDebug browser extension is live - https://phabricator.wikimedia.org/T245552 (Krinkle)
[14:26:39] (CR) Ema: [C: +2] ATS: disable transaction_active_timeout_in for EventStreams [puppet] - https://gerrit.wikimedia.org/r/583295 (https://phabricator.wikimedia.org/T242767) (owner: Ema)
[14:28:52] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:32:15] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21568/" [puppet] - https://gerrit.wikimedia.org/r/572352 (owner: Dzahn)
[14:33:14] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:33:38] (CR) Muehlenhoff: [C: +1] "LGTM, one final nit (feel free to ignore)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:35:26] (CR) Dzahn: "on vega.codfw.wmnet: -&R_SERVICE(tcp, 80, $CACHES);" [puppet] - https://gerrit.wikimedia.org/r/572352 (owner: Dzahn)
[14:38:32] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 22383 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:39:18] !log static microsites (annual.wikimedia.org, research.wikimedia.org, static-bugzilla etc). closed port 80 for caching servers, finalizing switch to https behind caching servers
[14:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:23] (PS4) Jbond: profile::idp::client::httpd: add check for sso redirect [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743)
[14:41:07] (CR) Muehlenhoff: [C: +1] "Looks good!"
[puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:46:06] !log closed port 80 for caching servers on misc backends https://gerrit.wikimedia.org/r/q/topic:%22applayer-tls%22+(status:open%20OR%20status:merged) as final step per service on T210411
[14:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:12] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411
[14:48:36] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (Dzahn) @bblack https://gerrit.wikimedia.org/r/q/topic:%2522decom-eqiad%2522+(status:open)
[14:48:55] (CR) Thcipriani: [C: +1] Do not update the globals cache file while opcache needs regeneration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/575469 (https://phabricator.wikimedia.org/T236104) (owner: Giuseppe Lavagetto)
[14:49:57] (PS1) CDanis: Revert "depool codfw for router maintenance" [dns] - https://gerrit.wikimedia.org/r/583354 (https://phabricator.wikimedia.org/T248394)
[14:50:07] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @akosiaris fix for partman?
https://gerrit.wikimedia.org/r/c/operations/puppet/+/576887
[14:51:08] (CR) CDanis: [C: +2] Revert "depool codfw for router maintenance" [dns] - https://gerrit.wikimedia.org/r/583354 (https://phabricator.wikimedia.org/T248394) (owner: CDanis)
[14:51:37] !log repool codfw T248394
[14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:44] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394
[14:54:04] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE)
[14:56:34] (PS2) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[14:57:20] (CR) Arturo Borrero Gonzalez: [C: +1] Revert "Neutron l3_agent: refresh on config changes" [puppet] - https://gerrit.wikimedia.org/r/583093 (owner: Andrew Bogott)
[14:58:38] (PS3) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[14:59:00] (PS1) Muehlenhoff: Add DHCP/partman config for deneb.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/583357
[14:59:23] (CR) jerkins-bot: [V: -1] noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[15:00:54] (PS4) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[15:01:20] (CR) Dzahn: [C: +1] Add DHCP/partman config for deneb.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/583357 (owner: Muehlenhoff)
[15:01:42] (CR) Muehlenhoff: [C: +2] Add DHCP/partman config for deneb.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/583357 (owner: Muehlenhoff)
[15:03:38] (CR) jerkins-bot: [V: -1] noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[15:04:32] (CR) Dzahn: [V: -1 C: -1] "Illegal variable name, The given name 'CUMIN_MASTERS' does not conform to the naming rule" [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[15:05:34] (CR) Andrew Bogott: [C: +2] Revert "Neutron l3_agent: refresh on config changes" [puppet] - https://gerrit.wikimedia.org/r/583093 (owner: Andrew Bogott)
[15:06:14] (PS5) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[15:06:55] (CR) jerkins-bot: [V: -1] noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[15:11:31] Operations, Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (bd808) >>! In T247517#5995228, @jbond wrote: > Perhaps we could be even more aggressive. I'm not sure how much we could script things or what the capabilities of openstack are but i...
[15:13:23] Operations, Cloud-VPS (Project-requests): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (bd808)
[15:13:31] (PS3) Dzahn: phabricator: remove firewall holes for port 80 [puppet] - https://gerrit.wikimedia.org/r/569100
[15:14:57] !log installing deneb.codfw.wmnet T248165
[15:15:01] (PS4) Dzahn: phabricator: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/569100
[15:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:03] T248165: codfw: 1 VM for builder - https://phabricator.wikimedia.org/T248165
[15:16:58] (PS6) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[15:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2115 after reimage to Buster', diff saved to https://phabricator.wikimedia.org/P10767 and previous config saved to /var/cache/conftool/dbconfig/20200325-152148-marostegui.json
[15:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:20] (CR) Dzahn: [C: -2] "This is to help with migrating contint to buster" [puppet] - https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[15:27:02] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:27:02] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (Dzahn) Open→Stalled @hashar Per your c...
[15:27:04] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[15:28:17] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (MoritzMuehlenhoff) Nothing can be blocked on 5...
[15:29:42] Operations: webproxy: 501 Not Implemented - https://phabricator.wikimedia.org/T248485 (hashar)
[15:30:17] Operations: webproxy: 501 Not Implemented - https://phabricator.wikimedia.org/T248485 (hashar) Open→Resolved a:hashar Trying locally without a proxy: 2020-03-25 16:29:51 ERROR 501: HTTPS Required. I will switch the URL to https
[15:31:07] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (Dzahn) >>! In T226236#5505636, @MoritzMuehlenhoff wrote: >>>! In T226236#5505608, @hashar wrote: >> Anyway, I am declini...
[15:31:35] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (Dzahn) also see T226236#5998777
[15:31:45] Operations, netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis)
[15:37:51] (PS12) Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376)
[15:38:22] Operations, netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (CDanis) Deployed to codfw but caused an outage. Incident report in progress
[15:38:24] (CR) Papaul: [C: +2] DNS: Add mgmt and production DNS for cp2027 to cp2042 [dns] - https://gerrit.wikimedia.org/r/583144 (owner: Papaul)
[15:38:29] (PS2) Papaul: DNS: Add mgmt and production DNS for cp2027 to cp2042 [dns] - https://gerrit.wikimedia.org/r/583144
[15:46:31] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) p:Triage→Medium
[15:48:50] (PS1) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366
[15:48:52] (PS1) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367
[15:48:54] (PS1) Jbond: idp: update the idp proxy config to use localhost and use_remote_address [puppet] - https://gerrit.wikimedia.org/r/583368
[15:49:15] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond)
[15:49:23] (CR) Jbond: "check experimental" [puppet] -
https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[15:49:30] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583368 (owner: Jbond)
[15:51:13] (CR) jerkins-bot: [V: -1] envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond)
[15:51:38] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans)
[15:51:39] (CR) jerkins-bot: [V: -1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[15:51:55] Operations, Product-Infrastructure-Team-Backlog, Traffic: Elevated 503 responses between 2020-03-15 and 2020-03-19 - https://phabricator.wikimedia.org/T248132 (Mholloway)
[15:52:00] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (Volans) Looping @KFrancis to verify that we have a valid NDA on file. I can see the line in the related spreadsheet but there are no dates, so asking for confirmation. @ItamarWMDE...
[15:53:51] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE)
[15:54:37] (CR) Jgreen: [C: +2] Add frnetmon1001 as a host to check [puppet] - https://gerrit.wikimedia.org/r/582912 (https://phabricator.wikimedia.org/T232137) (owner: Dwisehaupt)
[15:54:47] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (ItamarWMDE) @Volans Done. Thank you for the quick processing.
[15:55:16] (PS2) Jbond: envoy: introduce use_remote_address parameter [puppet] - https://gerrit.wikimedia.org/r/583366
[15:56:58] Operations, Cloud-VPS (Project-requests): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (jbond) ack thanks I'll pass this on in the sre-foundations meeting in 5 mins
[15:57:54] (PS1) Muehlenhoff: Use builder role for deneb [puppet] - https://gerrit.wikimedia.org/r/583373
[15:58:44] (PS2) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367
[15:59:46] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583366 (owner: Jbond)
[16:01:43] (PS7) Dzahn: noc::site: close port 80 for caching servers [puppet] - https://gerrit.wikimedia.org/r/572337
[16:02:23] (CR) jerkins-bot: [V: -1] profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[16:03:27] (CR) Dzahn: "works now https://puppet-compiler.wmflabs.org/compiler1001/21572/mwmaint1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/572337 (owner: Dzahn)
[16:04:20] Operations, Analytics, Analytics-Kanban, EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (ema) >>! In T242767#5998492, @gerritbot wrote: > Change 583295 **merged** by Ema: > [operations/puppet@pr...
[16:06:22] (PS3) Jbond: profile::tlsproxy::envoy: allow users to override the cluster addr [puppet] - https://gerrit.wikimedia.org/r/583367
[16:07:18] !log updating blubberoid to envoy 1.13.1 T246868
[16:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:23] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868
[16:12:03] (PS1) Dzahn: remove IPs of decom'ed appservers [dns] - https://gerrit.wikimedia.org/r/583377
[16:13:22] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (WMDE-leszek)
[16:13:50] Operations, ops-codfw, Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (Papaul)
[16:14:40] (PS1) Papaul: DHCP Partman: Add MAC address and partman for cp2027 to cp2042 [puppet] - https://gerrit.wikimedia.org/r/583381 (https://phabricator.wikimedia.org/T247340)
[16:14:53] Operations, SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (WMDE-leszek)
[16:15:07] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' .
[16:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:31] (PS2) Dzahn: remove IPs of recently decom'ed appservers in eqiad D5 [dns] - https://gerrit.wikimedia.org/r/583377 (https://phabricator.wikimedia.org/T247780)
[16:16:06] (PS1) KartikMistry: apertium-oc-ca: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-oc-ca] - https://gerrit.wikimedia.org/r/583383 (https://phabricator.wikimedia.org/T247585)
[16:17:18] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[16:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:09] (CR) Dzahn: "looks good to me. assuming that is the right partman recipe. except we should probably rename this "partman/custom/cp2018.cfg" if it's not" [puppet] - https://gerrit.wikimedia.org/r/583381 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[16:18:14] (CR) Dzahn: [C: +1] DHCP Partman: Add MAC address and partman for cp2027 to cp2042 [puppet] - https://gerrit.wikimedia.org/r/583381 (https://phabricator.wikimedia.org/T247340) (owner: Papaul)
[16:31:33] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583367 (owner: Jbond)
[16:31:35] (PS2) Jbond: idp: update the idp proxy config to use localhost and use_remote_address [puppet] - https://gerrit.wikimedia.org/r/583368
[16:32:57] !log upload cescout 0.1.0-1 to apt.wm.o (buster) - T247273
[16:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:03] T247273: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273
[16:33:08] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' .
[16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:16] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583368 (owner: Jbond)
[16:34:34] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:40:28] (PS2) Giuseppe Lavagetto: profile::tlsproxy::envoy: allow defining timeouts, disable retries [puppet] - https://gerrit.wikimedia.org/r/583086
[16:46:06] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:47:05] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 81, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:47:27] !log updated jenkins packages on apt.wikimedia.org to 2.222.1
[16:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:01] (PS3) Giuseppe Lavagetto: profile::tlsproxy::envoy: allow defining timeouts, disable retries [puppet] - https://gerrit.wikimedia.org/r/583086
[16:50:03] !log installing python-bleach security updates
[16:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:02] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 266, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:51:06] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 85, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:51:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:52:10] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:52:26] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21576/restbase1016.eqiad.wmnet/ the change is a noop, in practice if not in theory. I wil" [puppet] - https://gerrit.wikimedia.org/r/583086 (owner: Giuseppe Lavagetto)
[16:53:08] Operations, Cloud-VPS (Project-requests), cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (bd808) Discussed and approved in the 2020-03-25 WMCS team meeting. We want to build the instance reaper script //before// we turn the proj...
[17:00:50] (PS1) Muehlenhoff: Extend Cumin alias for logstash with ELK7 roles [puppet] - https://gerrit.wikimedia.org/r/583386
[17:06:42] Operations, ops-eqiad, Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (Volans) Open→Resolved a:Volans So far no more errors in racadm. Let's keep an eye on it, but I'm resolving it for now. Feel free to re-open on re-occurrence.
[17:08:17] Operations, ops-codfw, ops-eqiad, ops-eqsin, and 2 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (RobH)
[17:09:48] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 543 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:11:27] (PS3) Mvolz: Update citoid to use buster/node10 [deployment-charts] - https://gerrit.wikimedia.org/r/582861
[17:12:10] (CR) Mvolz: [C: +2] Update citoid to use buster/node10 [deployment-charts] - https://gerrit.wikimedia.org/r/582861 (owner: Mvolz)
[17:12:27] (Merged) jenkins-bot: Update citoid to use buster/node10 [deployment-charts] - https://gerrit.wikimedia.org/r/582861 (owner: Mvolz)
[17:23:01] (PS1) Hashar: contint: add acl package for file permissions tweak [puppet] - https://gerrit.wikimedia.org/r/583392 (https://phabricator.wikimedia.org/T210271)
[17:29:57] (CR) Bstorm: [C: +2] toolforge cleanup: remove the ferm_handlers profile [puppet] - https://gerrit.wikimedia.org/r/583170 (https://phabricator.wikimedia.org/T246689) (owner: Bstorm)
[17:33:26] !log mvolz@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' .
[17:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:07] (CR) 20after4: [C: +1] ATS: directly talk wss:// to aphlict [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[17:35:43] (PS1) C. Scott Ananian: The official name of Parsoid is 'Parsoid' [mediawiki-config] - https://gerrit.wikimedia.org/r/583394
[17:35:45] (PS1) C.
Scott Ananian: Link to the phab task for VE/Parsoid being disabled on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/583395
[17:37:30] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:38:34] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'citoid' for release 'production' .
[17:38:36] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:52] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 85, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:43:16] !log mvolz@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'citoid' for release 'production' .
[17:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:48] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:44:46] (PS2) Ottomata: eventstreams: Remove from scb role [puppet] - https://gerrit.wikimedia.org/r/583076 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris)
[17:56:45] (PS13) Herron: logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376)
[18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T1800).
[18:00:04] tgr: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:33] I'll self-SWAT [18:03:05] tgr: Please ping when you're done. [18:05:14] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@4bdf55b]: Stop rerendering experimental PCS endpoints [18:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:54] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@4bdf55b]: Stop rerendering experimental PCS endpoints (duration: 01m 40s) [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:14] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Banning IPs / subnets from accessing login/validation endpoint - https://phabricator.wikimedia.org/T233945 (10chasemp) > >>We might need functionality for banning networks or specific IPs in case of abuse, either via built-in CAS functionali... [18:13:14] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) >>! In T244792#5945742, @HMarcus wrote: > Thanks @chasemp > > @MoritzMuehlenhoff please confirm you can access the admin dashboard,... [18:13:49] !log tgr@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/GrowthExperiments/modules/homepage/: SWAT: [[gerrit:583393|Mentorship module: Update for root screen refactor (T248422)]] (duration: 03m 23s) [18:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] T248422: [regression wmf.25] Mentorship module shows guidance root screen - https://phabricator.wikimedia.org/T248422 [18:15:07] 18:13:31 1 proxies had sync errors [18:15:20] ...on mw1251.eqiad.wmnet returned [255]: ssh: connect to host mw1251.eqiad.wmnet port 22: Connection timed out [18:15:30] Fun. 
[18:16:01] sync-apaches has succeeded though [18:16:16] proxies are used to fan out to the apaches, right? [18:16:20] Yes. [18:16:37] not sure where that leaves us [18:16:42] should I re-sync? [18:16:57] So does that mean the boxes didn't get the new code? [18:17:06] Yeah, re-sync just in case. [18:17:38] But 1251 does seem to be down. [18:17:39] I don't know; wouldn't sync-apaches also have errors then? [18:18:45] Maybe. [18:19:24] please ping me as well when done deploying [18:21:40] !log tgr@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/GrowthExperiments/modules/homepage/: re-sync, mw1251 failed (duration: 03m 18s) [18:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] James_F: Pchelolo: I'm done (as in, got the same error again) [18:21:58] OK. [18:22:15] it's a JS-only bugfix, so not the end of the world if the old code is stuck on a few boxes [18:22:22] Pchelolo: What did you need to do? [18:22:26] James_F: I'm going to deploy restbase, but it will trigger some icinga alerts [18:22:40] OK. [18:22:47] so, not related to MW deployment, but might create some noize for you [18:23:14] Yeah, I'm not doing an MW deploy, working on infrastructure stuff instead. [18:23:21] and alerts are expected of course and will self-resolve :) [18:23:42] should I file a task about mw1251 or is that a transient error? 
[18:24:16] (03PS1) 10DannyS712: Stop using $wgContentHandlerUseDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583406 [18:24:38] (03PS1) 10Lucas Werkmeister (WMDE): Don’t check constraints on “category contains” qualifiers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) [18:24:39] !log ppchelko@deploy1001 Started deploy [restbase/deploy@777b881]: Remove experimental PCS endpoints [18:24:40] (03PS2) 10DannyS712: Stop using $wgContentHandlerUseDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583406 [18:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10Tarrow) [18:27:22] (03CR) 10DannyS712: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583407 (https://phabricator.wikimedia.org/T235292) (owner: 10Lucas Werkmeister (WMDE)) [18:31:30] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:31:45] tgr: Please file a task, looks like it's actually down. [18:31:55] Eh, I'll do it. [18:32:04] James_F: o/ [18:32:12] ^^ expected. [18:32:50] tgr: T248501 [18:32:51] T248501: mw1251 down (no ssh) but still in dsh group? - https://phabricator.wikimedia.org/T248501 [18:32:53] James_F: it was decommissioned (mw1251) https://phabricator.wikimedia.org/T247780#5997935 [18:33:00] thx [18:33:06] mutante: ^^^ [18:33:22] volans: Ah. That'd do it, yes. [18:33:26] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:35:35] James_F: lets wait for the mediawiki/core patch in gate and submit [18:35:37] then upgrade [18:35:41] unless something is blocked [18:35:59] Sure. 
[18:37:06] hashar: Go now? [18:37:21] yes [18:37:57] (03CR) 10Jforrester: [C: 04-1] "Let's wait for wmf.26." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583406 (owner: 10DannyS712) [18:38:29] (03PS1) 10Jdlrobson: Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) [18:39:07] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@777b881]: Remove experimental PCS endpoints (duration: 14m 28s) [18:39:10] James_F: i am in a meeting with tyler [18:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:33] hashar: OK, will wait. [18:39:37] !log ppchelko@deploy1001 Started deploy [restbase/deploy@a1c3be4]: Add restbase202[123] T244178 [18:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:42] T244178: Deploy restbase to restbase202[123] - https://phabricator.wikimedia.org/T244178 [18:40:25] (03CR) 10jerkins-bot: [V: 04-1] Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [18:43:14] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [18:44:32] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [18:46:29] (03CR) 10DannyS712: "> Let's wait for wmf.26." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/583406 (owner: 10DannyS712) [18:47:03] (03CR) 10Jdlrobson: "Hey James - more logo consolidation (yay!!)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [18:48:09] (03CR) 10Herron: [C: 03+2] ELK7: require disktype "hdd" for new indices [puppet] - 10https://gerrit.wikimedia.org/r/579338 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [18:53:36] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@a1c3be4]: Add restbase202[123] T244178 (duration: 14m 00s) [18:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:42] T244178: Deploy restbase to restbase202[123] - https://phabricator.wikimedia.org/T244178 [18:54:29] (03CR) 10Herron: [C: 03+2] logstash: add new SSD hosts to ELK7 cluster with disktype attr "ssd" [puppet] - 10https://gerrit.wikimedia.org/r/579340 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [18:55:11] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/572337 (owner: 10Dzahn) [18:59:06] Just been a report of timeouts from India in #wikipedia-en [18:59:42] https://www.dotcom-tools.com/ping-test.aspx shows Shanghai and Beijing timeouts as well [19:00:04] twentyafterfound and dduvall: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T1900). [19:00:55] A curl to text-lb.eqsin.wm.o gets a 400 Varnish error [19:01:12] Request from via cp5007 frontend, Varnish XID 333331812
Upstream caches: cp5007 int
Error: 400, at Wed, 25 Mar 2020 18:59:57 GMT [19:01:27] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10Volans) p:05Triage→03Medium [19:01:31] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10Volans) Looping @KFrancis to verify that we have a valid NDA on file. I can see the line in the related spreadsheet but there are no dates, so asking for confirmation. [19:02:23] (03PS1) 10Mstyles: kibana: refactor kibana profile into two profiles [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [19:02:56] AntiComposite: looks like we might be inaccessible in Asia [19:03:04] (03Abandoned) 10Mstyles: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/581111 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [19:03:37] (03CR) 10jerkins-bot: [V: 04-1] kibana: refactor kibana profile into two profiles [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [19:03:39] Should I create a phab task? [19:04:03] RhinosF1: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [19:05:18] bd808: passed it onto the user in India that reported [19:07:33] bd808: see AntiComposite’s curl above for esquin but they can’t even get on wikitech. Screenshot it for them [19:08:25] RhinosF1: https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue may work for them, but the Phab part would still be a problem if the routes from their POP to us are busted. [19:08:26] Esqin* [19:08:34] Either 400 is the expected response or everything is broken. 
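Note on the error footer pasted at 19:01:12: it names the cache host, the Varnish XID, and the time of the error, which are the details worth putting in a connectivity report. The following is a minimal sketch of pulling those fields out of such a paste. It assumes the footer always has exactly the three-part shape seen above; the function name `parse_error_footer` and the field names are hypothetical and not part of any Wikimedia tooling.

```python
import re
from email.utils import parsedate_to_datetime

# Pattern modelled only on the footer pasted in the log above (an assumption:
# other error pages may be laid out differently). The client address is
# optional because the paste has nothing between "from" and "via".
FOOTER_RE = re.compile(
    r"Request from (?:(?P<client>\S+) )?via (?P<frontend>\S+) frontend, "
    r"Varnish XID (?P<xid>\d+)\s+"
    r"Upstream caches: (?P<upstream>.+?)\s+"
    r"Error: (?P<status>\d{3}), at (?P<when>.+? GMT)"
)


def parse_error_footer(text):
    """Return the footer fields as a dict, or None if the text does not match."""
    m = FOOTER_RE.search(text)
    if m is None:
        return None
    return {
        "client": m.group("client"),              # None when omitted from the paste
        "frontend": m.group("frontend"),          # cache host that served the error
        "xid": int(m.group("xid")),               # Varnish transaction ID
        "upstream": m.group("upstream").split(),  # e.g. ["cp5007", "int"]
        "status": int(m.group("status")),
        "when": parsedate_to_datetime(m.group("when")),
    }


sample = """Request from via cp5007 frontend, Varnish XID 333331812
Upstream caches: cp5007 int
Error: 400, at Wed, 25 Mar 2020 18:59:57 GMT"""

info = parse_error_footer(sample)
print(info["frontend"], info["xid"], info["status"], info["when"].isoformat())
```

The same pattern also matches when the three parts arrive flattened onto one line separated by spaces, as they often do when relayed through IRC.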
[19:08:56] bd808: they might be able to provide some idea [19:09:09] If they can’t then I’ll get them to paste and copy it [19:09:22] an email to noc@ might also get through, depending on what and where the fault is [19:12:14] bd808: apparently it’s back [19:12:18] AntiComposite: that 400 response is actually expected. The test curl commands on that page are about just testing the first hop. not all the way to a MediaWiki response [19:12:36] * bd808 should make a note on the page about that [19:12:39] * AntiComposite has forgotten that multiple times [19:12:41] RhinosF1: AntiComposite: mtr or traceroutes would be helpful [19:12:57] * AntiComposite is not having issues, someone in -en connecting from india is [19:12:59] any pastebin site, they can omit the first hop if they're worried about exposing their own IP [19:13:16] and they're back up now [19:13:19] cdanis: maybe pop into #wikipedia-en so you can talk to them directly [19:14:40] (03PS1) 1020after4: group1 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583418 [19:14:42] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583418 (owner: 1020after4) [19:16:08] (03PS2) 10Mstyles: kibana: refactor kibana profile into two profiles [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [19:16:10] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583418 (owner: 1020after4) [19:16:26] I'm on a call, but monitoring it [19:16:49] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [19:18:08] (03CR) 10Jdlrobson: "How can we make it work? I need to do this for new Vector skin..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [19:19:48] (03PS3) 10Mstyles: kibana: refactor kibana profile into two profiles [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [19:23:16] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [19:24:28] James_F: meeting ended but I gotta get dinner / take care of kids a bit :/ [19:24:36] + there is another meeting in half hour bah [19:24:44] Yeah. [19:24:49] I can JFDI? [19:24:52] sure! [19:25:06] OK everyone, I'm about to restart jenkins. [19:25:22] it should be fine [19:25:23] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:26:05] for the jenkins logs: journalctl -u jenkins -f [19:26:19] !log scap sync-proxies failed on mw1251 [19:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:24] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.25 refs T233873 [19:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:30] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [19:26:39] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:27:48] !log upgrading Jenkins # T248122 [19:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:52] !log group1 looks good after deploying wmf.25 refs T233873 [19:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:28] !log updating eventstreams to envoy 1.13.1 T246868 [19:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:33] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [19:30:08] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [19:30:08] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [19:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:46] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Dwisehaupt) Monitoring enabled. Just need to update netbox and we can close this out. [19:36:05] !log Jenkins restarted on all machines [19:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:36] (03PS4) 10Mstyles: kibana: refactor kibana profile into two profiles [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [19:39:15] cdanis: looks like https://mobile.twitter.com/kums3570/status/1242883310612664320 could correlate - they’re India as well. 
Not much useful info though [19:39:47] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:39:51] RhinosF1: thanks. that's close to the start of the interval as Sohom_Datta described it [19:40:28] cdanis: yeah, I’ll reply with info on reporting issues if it happens again [19:40:44] we should simplify that page a bit [19:41:08] RhinosF1: make sure to use the wikitech-static.wikimedia.org link, that one is on a server outside our infrastructure, so it will work even if we're down from the user's POV [19:41:33] Cool [19:42:03] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:43:23] cdanis: https://mobile.twitter.com/FormulaRhino/status/1242899828683804674 [19:43:27] (well, in most cases) [19:43:40] RhinosF1: thanks! [19:43:50] Np! Happy to help! [19:48:36] 10Operations, 10netops, 10Wikimedia-Incident: Add linecard diversity to the router-to-router interconnect in codfw - https://phabricator.wikimedia.org/T248506 (10CDanis) [19:49:20] (03CR) 10Jdlrobson: "If you have time that you would be super helpful. I'll be adding icon for now but in about a week or so I'll also need to add a tagline pr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [19:56:07] (03CR) 10Mstyles: "There are changes detected, which is expected, but I don't think parameters are quite in the right place. 
Puppet compilter -> https://pupp" [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [20:00:05] halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T2000). [20:01:09] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [20:01:10] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [20:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:57] (03CR) 10Jhedden: [C: 03+2] techblog.wikimedia.org: Point at upstream service provider [dns] - 10https://gerrit.wikimedia.org/r/577371 (https://phabricator.wikimedia.org/T246507) (owner: 10BryanDavis) [20:07:02] (03PS4) 10Jhedden: techblog.wikimedia.org: Point at upstream service provider [dns] - 10https://gerrit.wikimedia.org/r/577371 (https://phabricator.wikimedia.org/T246507) (owner: 10BryanDavis) [20:08:10] (03PS3) 10Jhedden: redirects: Remove redirect handling for techblog.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/577373 (https://phabricator.wikimedia.org/T246507) (owner: 10BryanDavis) [20:16:40] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'production' . [20:16:40] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[20:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:48] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [20:18:31] (03CR) 10Jhedden: [C: 03+2] redirects: Remove redirect handling for techblog.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/577373 (https://phabricator.wikimedia.org/T246507) (owner: 10BryanDavis) [20:19:48] !log updating citoid to envoy 1.13.1 T246868 [20:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:53] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [20:19:55] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [20:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:55] oh never mind, thanks mvolz for updating it earlier :) [20:22:38] !log updating cxserver to envoy 1.13.1 T246868 [20:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:47] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [20:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:22] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10Nuria) @ItamarWMDE hello, can you explain what is the work you need data access for? 
[20:29:03] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22375 bytes in 3.302 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:30:35] (03PS1) 10Herron: ELK7: require disktype "ssd" for new indices [puppet] - 10https://gerrit.wikimedia.org/r/583438 (https://phabricator.wikimedia.org/T247376) [20:32:07] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [20:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:31] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:06] (03CR) 10Herron: [C: 03+2] ELK7: require disktype "ssd" for new indices [puppet] - 10https://gerrit.wikimedia.org/r/583438 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [20:40:47] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Logstash: add SSD tier to ELK7 cluster - https://phabricator.wikimedia.org/T247376 (10herron) [20:52:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Hardware): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) 05Resolved→03Open I just stress-tested this and it crashed again. Stress test was: ` sudo cumin --force -... 
[21:00:24] (03PS1) 10Herron: ELK7: set "index.query.default_field":"all" in logstash ES template [puppet] - 10https://gerrit.wikimedia.org/r/583449 (https://phabricator.wikimedia.org/T247014) [21:01:50] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Dwisehaupt) [21:04:14] (03CR) 10Herron: [C: 03+2] "moving forward with this to have a fix in place before indices roll over tonight" [puppet] - 10https://gerrit.wikimedia.org/r/583449 (https://phabricator.wikimedia.org/T247014) (owner: 10Herron) [21:06:54] (03CR) 10Herron: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/583386 (owner: 10Muehlenhoff) [21:07:00] !log updating eventgate-analytics to envoy 1.13.1 T246868 [21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:06] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [21:07:07] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:07:07] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:13] (03CR) 10Herron: [C: 03+1] icinga: relax check interval for selected checks [puppet] - 10https://gerrit.wikimedia.org/r/580985 (https://phabricator.wikimedia.org/T247538) (owner: 10Filippo Giunchedi) [21:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:55] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:10:55] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[21:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:17] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:11:17] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:13] (03PS4) 10Guozr.im: CuminExecution: Capture Exception cumin.transports.WorkerError [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) [21:16:31] !log holding off on updating eventgate-analytics until EU time, to check on unexpected helmfile diffs T246868 [21:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:37] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [21:17:00] rlazarus: am here [21:17:23] where is the unexpected diff? [21:17:35] ottomata: didn't even think to ask you, sorry! pastebinning, one sec [21:19:18] (03CR) 10Guozr.im: "> Patch Set 3:" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [21:21:16] ottomata: bleh, now I can't repro it -- I've been doing a bunch of these and I must have missed a "source .hfenv" when I switched directories [21:21:30] nothing to see here, thanks for your help :) [21:23:15] ah ok! great! [21:27:11] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:27:11] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[21:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:21] (03PS2) 10Jforrester: Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [21:38:23] (03PS1) 10Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583459 [21:39:19] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [21:39:19] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [21:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:28] Jdlrobson: There, that should work. [21:42:38] (03CR) 10Jdlrobson: "neat." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: 10Jdlrobson) [21:44:36] !log updating eventgate-analytics-external to envoy 1.13.1 T246868 [21:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:41] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [21:44:56] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [21:44:56] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[21:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:30] !log dropping unused Cassandra keyspaces -- T248018 [21:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:35] T248018: Drop Cassandra keyspaces for /page/references - https://phabricator.wikimedia.org/T248018 [21:47:49] 10Operations, 10SRE-Access-Requests: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (10MusikAnimal) [21:53:58] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [21:53:58] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [21:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:35] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) is CRITICAL: Test Get references from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:55:39] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) is CRITICAL: Test Get references from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:56:05] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/references/{title} (Get references from storage) is CRITICAL: Test Get references from storage returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:56:56] !log 
ppchelko@deploy1001 Started deploy [restbase/deploy@a1c3be4] (dev-cluster): Remove experimental PCS endpoints [21:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:07] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:59:39] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:59:45] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:59:52] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@a1c3be4] (dev-cluster): Remove experimental PCS endpoints (duration: 02m 57s) [21:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:27] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [22:00:27] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [22:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:13] !log updating eventgate-logging-external to envoy 1.13.1 T246868 [22:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:18] T246868: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 [22:05:20] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [22:05:21] !log rzl@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[22:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:29] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [22:10:29] !log rzl@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [22:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:28] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [22:16:28] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [22:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:32] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [22:21:32] !log rzl@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . 
[22:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:18] (PS1) Papaul: Add new cp nodes cp2027 to cp2042 to site.pp [puppet] - https://gerrit.wikimedia.org/r/583469 (https://phabricator.wikimedia.org/T247340) [22:35:28] Operations, Performance-Team, Traffic: Review socket balancing in ATS/Varnish traffic layers - https://phabricator.wikimedia.org/T248522 (Krinkle) [22:35:45] Operations, Prod-Kubernetes, serviceops: `helmfile --interactive apply` logs to SAL even if cancelled - https://phabricator.wikimedia.org/T248523 (RLazarus) p:Triage→Low [22:39:21] Operations, Research, The-Wikipedia-Library, Traffic, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276 (Krinkle) [22:39:31] Operations, Patch-For-Review, Security: envoyproxy: CVE-2020-8664 CVE-2020-8661 CVE-2020-8660 CVE-2020-8659 - https://phabricator.wikimedia.org/T246868 (RLazarus) All kubernetes services are updated in all clusters. (T246868#6000068 turned out to be operator error, there were no unexpected diffs.) [22:51:07] (PS2) Jforrester: Construct wgLogos in CommonSettings so that projects can inherit values [mediawiki-config] - https://gerrit.wikimedia.org/r/583459 [22:51:09] (PS3) Jforrester: Preparation for removal of $wgMobileFrontendLogo [mediawiki-config] - https://gerrit.wikimedia.org/r/583408 (https://phabricator.wikimedia.org/T248500) (owner: Jdlrobson) [22:52:02] Operations, Wikimedia-Mailing-lists, Patch-For-Review: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (Aklapper) >>! In T235550#5992779, @MBinder_WMF wrote: > I have updated that Herald. Would appreciate a quick review of it, just in case. :) LGTM > The mul...
[23:00:05] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT(Max 6 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200325T2300). [23:00:05] davidwbarratt: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:02:06] Operations, LDAP-Access-Requests: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (Volans) p:Triage→Medium I've added the user `uid=suecarmol, cn=Scardenasmolinar` to the `wmf` LDAP group after verifying their account on the corp LDAP. Please verify that everyth... [23:08:57] I'm here! sorry I'm late! [23:09:32] ping RoanKattouw Niharika Urbanecm [23:09:39] Hey [23:09:55] I just got here, I'll do the SWAT [23:12:02] sweet! [23:14:46] Do we normally do full scap in swat? [23:14:55] Cause one of those patches adds new messages :( [23:16:42] (PS3) DannyS712: Enable Special:Investigate on testwiki, and add `investigate` right [mediawiki-config] - https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) [23:17:00] We don't normally, and well spotted [23:17:04] Did not notice that [23:17:11] It's only a right/action message [23:17:16] Yeah we could skip it [23:17:28] So it'll be a broken message on Special:UserGroupRights and if anyone tries to use the page without the right [23:17:32] Not great, but not horrible [23:17:34] Also let's see if it passes Jenkins, it's V-1ed because of an unrelated/frivolous failure [23:18:04] Yeah, we've upset github by using it too much [23:18:11] No, it failed phan again [23:18:12] PHP Fatal error: Cannot use the final modifier on an abstract class in vendor/microsoft/tolerant-php-parser/tests/cases/parser/abstractMethodDeclaration7.php on line 3 [23:18:42] gah!
[23:18:43] WHYYYY [23:19:25] it's a bloody test case [23:19:40] Are we using some old release of it? [23:19:40] https://github.com/microsoft/tolerant-php-parser/blob/master/.gitattributes [23:19:58] .gitattributes Update .gitattributes 3 years ago [23:20:02] Oh you're right, it's a *test case* in *vendor* [23:20:02] * Reedy squints [23:20:23] davidwbarratt: Do you know what's including that? [23:20:30] Why the hell is phan looking at that [23:20:46] RoanKattouw: Bad phan config [23:20:46] Reedy it shouldn't be anything in that patch.... [23:20:46] Moment [23:22:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:25:40] Or maybe not [23:26:16] https://github.com/wikimedia/mediawiki-tools-phan/blob/master/src/config.php#L125 [23:26:18] :/ [23:26:40] Uhh what [23:26:51] I've become a non-phan of phan [23:26:59] lol [23:27:00] If it's already in the exclusion list, why is it not being excluded [23:27:10] Yeah [23:27:10] And why isn't this breaking in master, only in the wmf branch [23:27:13] It's in for the last two releases [23:28:13] Thankfully this is a pretty simple patch so I'll just force-merge it [23:28:16] And I'll file a task [23:28:35] RoanKattouw is the hero we needed [23:28:41] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=90%): /tmp 302 MB (3% inode=90%): /var/tmp 302 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops [23:30:01] OK, CheckUser patch is now on mwdebug1001, but we won't be able to test it because it's not enabled anywhere [23:30:43] (CR) Catrope: [C: +2] Enable Special:Investigate on testwiki, and add
`investigate` right [mediawiki-config] - https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: DannyS712) [23:30:51] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [23:31:44] (Merged) jenkins-bot: Enable Special:Investigate on testwiki, and add `investigate` right [mediawiki-config] - https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: DannyS712) [23:32:01] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [23:32:12] And now the config patch is also in mwdebug1001 [23:32:16] davidwbarratt: Please test there [23:32:20] on it! [23:33:59] it's perfect. I get no special page on English, and a Permission Error (expected) on test.wikipedia.org [23:34:31] And I see investigate listed at https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/staff as well [23:34:55] awesome! yeah that will be given to "staff" [23:34:57] I won't grant it myself, but whoever should do it can [23:35:03] OK great, I'll deploy [23:35:22] you see it? How? it's not deployed yet... / I can't see it [23:35:32] DannyS712: special extension [23:35:38] I already posted at the steward noticeboard to grant the right following deployment [23:35:41] What extension?
[23:35:49] The WikimediaDebug browser extension [23:35:51] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [23:36:29] Operations, LDAP-Access-Requests: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (Scardenasmolinar) I can now see the Code-Review+2 button. Thank you very much! [23:37:12] cool; me too [23:37:34] Operations, LDAP-Access-Requests: Add Scardenasmolinar to WMF LDAP group - https://phabricator.wikimedia.org/T248521 (Volans) Open→Resolved a:Volans [23:38:50] https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/staff returns [XnvrfwpAMFsAAEVStQoAAACI] 2020-03-25 23:38:40: Fatal exception of type "MWException" [23:38:55] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/CheckUser/: Add new investigate right (T247645) (duration: 03m 17s) [23:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:00] T247645: CU 2.0: Enable Special:Investigate on testwiki [small] - https://phabricator.wikimedia.org/T247645 [23:39:42] DannyS712: WFM [23:39:51] * Reedy looks [23:39:55] also returned `[XnvrjwpAIDEAAEHNLFsAAAAJ] 2020-03-25 23:38:55: Fatal exception of type "MWException"` when I reproduced in incognito [23:39:58] but now it works fine [23:40:11] 2020-03-25 23:38:40 [XnvrfwpAMFsAAEVStQoAAACI] mw1256 metawiki 1.35.0-wmf.25 exception ERROR: [XnvrfwpAMFsAAEVStQoAAACI] /wiki/Special:GlobalGroupPermissions/staff MWException from line 164 of /srv/mediawiki/php-1.35.0-wmf.25/includes/Hooks.php: Invalid callback CheckUserHooks::onUserGetAllRights in hooks for UserGetAllRights [23:40:11] {"exception_id":"XnvrfwpAMFsAAEVStQoAAACI","exception_url":"/wiki/Special:GlobalGroupPermissions/staff","caught_by":"mwe_handler"} [23:40:20] I think you hit it before things had fully synced [23:41:54] so extension.json added the hook, but CheckUserHooks didn't have the method yet?
[23:42:11] Probably [23:42:25] You posted the error 5 seconds before Roan's sync finished (which took over 3 minutes) [23:42:37] So yeah, you were lucky and hit a server half synced [23:42:53] well, it needs to be synced again (I think - T236104) so I'll try and reproduce [23:42:54] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 [23:42:56] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.25/extensions/CheckUser/: Retry because mw1251 timed out, and it is a proxy (duration: 03m 15s) [23:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:03] Ughhhh mw1251 timed out again this time [23:43:07] And it's a proxy, so that's bad [23:44:10] I think that's known [23:44:11] T248501 [23:44:12] T248501: mw1251 down (no ssh) but still in dsh group? - https://phabricator.wikimedia.org/T248501 [23:45:11] RoanKattouw: I'd presume if servers can't sync from a proxy, they'll look at another one [23:45:15] So shouldn't be an issue [23:45:17] Yeah checking now [23:45:40] You're right, they're all in sync [23:48:35] Reedy why did you want rollback? [23:48:48] T248501 [23:48:54] Oh you found it already [23:49:21] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Add investigate to $wgAvailableRights (T247645) (duration: 03m 16s) [23:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:26] T247645: CU 2.0: Enable Special:Investigate on testwiki [small] - https://phabricator.wikimedia.org/T247645 [23:49:35] DannyS712: Because clicking undo is too many clicks when there's spam on Tech posted on #wikimedia-tech [23:49:48] Also, T248306 [23:49:49] T248306: CI error on WMF branches: Cannot use the final modifier on an abstract class in vendor/microsoft/tolerant-php-parser/tests/cases/parser/abstractMethodDeclaration7.php on line 3 - https://phabricator.wikimedia.org/T248306 [23:51:16] Investigate granted to staff [23:56:16] RoanKattouw
should investigate be up on testwiki?