[00:08:03] Hey guys, trying to view a diff from a deleted page is returning [XdMxlApAAEYAAJddOUQAAADU] 2019-11-19 00:04:36: Fatal exception of type "InvalidArgumentException" [00:08:35] any known problems? [00:15:09] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 883.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:17:44] (03PS4) 10Ayounsi: Add vlan support for asw [homer/public] - 10https://gerrit.wikimedia.org/r/550376 [00:22:53] (03CR) 10Ayounsi: "> Patch Set 3: Code-Review-1" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/550376 (owner: 10Ayounsi) [00:24:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) @Nuria Hi Nuria! Yes, I read the instructions. My update was to reply to the previous user that I updated the SSH ke... [00:30:26] !log rebooting phab2001 [00:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Nuria) We need approval from your manager in phab, per my comment above. 
[00:39:41] !log phab2001 - rsyncing /srv/repos data from phab1003 (T190568) [00:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:46] T190568: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 [00:52:40] (03PS1) 10Dzahn: add xhgui1001 and xhgui2001 [dns] - 10https://gerrit.wikimedia.org/r/551691 (https://phabricator.wikimedia.org/T238098) [00:58:46] 10Operations, 10vm-requests, 10Patch-For-Review, 10Performance-Team (Radar): vm request for xhgui - https://phabricator.wikimedia.org/T238098 (10Dzahn) Added "xhgui" as a new type of server on the [[ https://wikitech.wikimedia.org/w/index.php?title=Infrastructure_naming_conventions&type=revision&diff=18450... [01:03:45] (03PS1) 10Dzahn: install_server: add xhgui to flat partman for VMs [puppet] - 10https://gerrit.wikimedia.org/r/551692 (https://phabricator.wikimedia.org/T238098) [01:05:09] (03CR) 10Dzahn: [C: 03+2] install_server: add xhgui to flat partman for VMs [puppet] - 10https://gerrit.wikimedia.org/r/551692 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [01:06:33] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:10:11] (03PS1) 10Dzahn: site/webperf: remove superfluous lint-ignore comments [puppet] - 10https://gerrit.wikimedia.org/r/551693 [01:12:18] (03PS1) 10Dzahn: site/centrallog: unify nodes into a single stanza [puppet] - 10https://gerrit.wikimedia.org/r/551694 [01:14:21] (03PS1) 10Dzahn: site/xhgui: add xhgui eqiad and codfw nodes as spares [puppet] - 10https://gerrit.wikimedia.org/r/551695 (https://phabricator.wikimedia.org/T238098) [01:19:41] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [01:42:25] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [01:53:27] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:59:29] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [02:16:31] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [02:23:31] (03PS5) 10DannyS712: Partial cleanup of InitializeSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) [02:27:55] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [02:53:02] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) >>! In T211881#5615716, @kaldari wrote: >>I guess its not clear to me what exactly the decisio... [03:08:57] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/551703 (https://phabricator.wikimedia.org/T231627) [03:08:59] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/551704 (https://phabricator.wikimedia.org/T231627) [03:13:23] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:37:29] !log Move cp2013 from nginx to ats-tls - T231627 [03:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:36] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [03:37:46] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/551703 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [03:39:25] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2013 [puppet] - 10https://gerrit.wikimedia.org/r/551704 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [03:41:51] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:45:57] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [03:49:31] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) @Andrew this task has been resolved but both cloudbackup2001 and 2002 are showing "staged" in Netbox for status and both array are... 
[03:51:00] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2016 [puppet] - 10https://gerrit.wikimedia.org/r/551708 (https://phabricator.wikimedia.org/T231627) [03:51:01] !log T208369 ran mwscript extensions/GrowthExperiments/maintenance/deleteOldSurveys.php cswiki --cutoff 350 [03:51:02] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2016 [puppet] - 10https://gerrit.wikimedia.org/r/551709 (https://phabricator.wikimedia.org/T231627) [03:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:06] T208369: Welcome survey: anonymize data after one year - https://phabricator.wikimedia.org/T208369 [03:53:53] !log Move cp2016 from nginx to ats-tls - T231627 [03:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:58] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [03:54:01] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2016 [puppet] - 10https://gerrit.wikimedia.org/r/551708 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [03:55:31] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2016 [puppet] - 10https://gerrit.wikimedia.org/r/551709 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:01:57] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [04:07:30] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [04:14:31] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2019 [puppet] - 10https://gerrit.wikimedia.org/r/551711 (https://phabricator.wikimedia.org/T231627) [04:14:33] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2019 [puppet] - 10https://gerrit.wikimedia.org/r/551712 (https://phabricator.wikimedia.org/T231627) [04:17:07] !log Move cp2019 from nginx to ats-tls - T231627 [04:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:12] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [04:17:21] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp2019 [puppet] - 10https://gerrit.wikimedia.org/r/551711 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:19:10] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp2019 [puppet] - 10https://gerrit.wikimedia.org/r/551712 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:25:26] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [04:33:23] (03PS1) 10Vgutierrez: hiera: Set nginx on port 4443 for text cluster on codfw [puppet] - 10https://gerrit.wikimedia.org/r/551713 (https://phabricator.wikimedia.org/T231627) [04:33:25] (03PS1) 10Vgutierrez: hiera: Set ats-tls on port 443 for text cluster on codfw [puppet] - 10https://gerrit.wikimedia.org/r/551714 (https://phabricator.wikimedia.org/T231627) [04:42:39] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19468/" [puppet] - 10https://gerrit.wikimedia.org/r/551713 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:46:39] (03CR) 10Vgutierrez: "pcc looks good: 
https://puppet-compiler.wmflabs.org/compiler1002/19469/" [puppet] - 10https://gerrit.wikimedia.org/r/551714 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:47:09] !log Move cp2023 from nginx to ats-tls - T231627 [04:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:17] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [04:47:32] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set nginx on port 4443 for text cluster on codfw [puppet] - 10https://gerrit.wikimedia.org/r/551713 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:49:30] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set ats-tls on port 443 for text cluster on codfw [puppet] - 10https://gerrit.wikimedia.org/r/551714 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [04:57:30] PROBLEM - HTTPS Unified ECDSA on cp2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:57:32] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2010 is CRITICAL: connect to address 10.192.16.136 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:57:44] PROBLEM - HTTPS Unified RSA on cp2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:00:42] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [05:02:13] !log Start pre-switchover steps T235469 [05:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:18] T235469: Switchover s6 primary database master db1061 -> db1131 - 19th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T235469 [05:05:07] (03PS3) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 
10https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469) [05:07:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1131 with weight 0 T235469 ', diff saved to https://phabricator.wikimedia.org/P9661 and previous config saved to /var/cache/conftool/dbconfig/20191119-050748-marostegui.json [05:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:54] T235469: Switchover s6 primary database master db1061 -> db1131 - 19th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T235469 [05:13:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1105:3312 after compression', diff saved to https://phabricator.wikimedia.org/P9662 and previous config saved to /var/cache/conftool/dbconfig/20191119-051259-marostegui.json [05:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311 for compression', diff saved to https://phabricator.wikimedia.org/P9663 and previous config saved to /var/cache/conftool/dbconfig/20191119-051344-marostegui.json [05:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:35] !log Compress tables on db1105:3311 [05:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:16] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/551716 (https://phabricator.wikimedia.org/T231627) [05:16:18] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/551717 (https://phabricator.wikimedia.org/T231627) [05:17:19] !log Move cp1085 from nginx to ats-tls - T231627 [05:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:24] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [05:18:24] (03CR) 
10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/551716 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [05:19:30] PROBLEM - Check systemd state on cp2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:19:42] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:19:48] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:20:25] that's not expected [05:20:30] * vgutierrez checking [05:21:58] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp2010 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345573 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:22:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469) (owner: 10Marostegui) [05:22:04] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp2010 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345567 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:22:04] sigh... 
fixed [05:22:08] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp2010 is OK: HTTP OK: HTTP/1.0 200 OK - 19597 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:22:08] RECOVERY - HTTPS Unified ECDSA on cp2010 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345564 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:22:27] I've missed cp2010 in T231627 [05:22:27] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [05:22:40] PROBLEM - HTTPS Unified RSA on cp1085 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:22:42] (03PS2) 10Marostegui: wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/551041 (https://phabricator.wikimedia.org/T235469) [05:22:53] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [05:22:56] PROBLEM - HTTPS Unified ECDSA on cp1085 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [05:22:58] RECOVERY - Check systemd state on cp2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:01] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp1085 [puppet] - 10https://gerrit.wikimedia.org/r/551717 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [05:24:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314 for compression', diff saved to https://phabricator.wikimedia.org/P9664 and previous config saved to 
/var/cache/conftool/dbconfig/20191119-052412-marostegui.json [05:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:18] !log Compress db1097:3314 [05:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P9665 and previous config saved to /var/cache/conftool/dbconfig/20191119-052632-marostegui.json [05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:41] !log Deploy schema change on db1099:3318 [05:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:52] RECOVERY - HTTPS Unified RSA on cp1085 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345560 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:28:10] RECOVERY - HTTPS Unified ECDSA on cp1085 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345541 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:29:53] RECOVERY - HTTPS Unified RSA on cp2010 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345098 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [05:32:41] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [05:35:18] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1087 [puppet] - 
10https://gerrit.wikimedia.org/r/551719 (https://phabricator.wikimedia.org/T231627) [05:35:20] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1087 [puppet] - 10https://gerrit.wikimedia.org/r/551720 (https://phabricator.wikimedia.org/T231627) [05:36:45] !log Move cp1087 from nginx to ats-tls - T231627 [05:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:50] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [05:38:23] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp1087 [puppet] - 10https://gerrit.wikimedia.org/r/551719 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [05:39:41] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp1087 [puppet] - 10https://gerrit.wikimedia.org/r/551720 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [05:40:59] (03CR) 10Marostegui: [C: 03+1] "These DBs are read only, which is expected. Just noting it here." [puppet] - 10https://gerrit.wikimedia.org/r/551285 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [05:45:48] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469) (owner: 10Marostegui) [05:46:02] (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/551041 (https://phabricator.wikimedia.org/T235469) (owner: 10Marostegui) [05:48:19] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [05:49:40] (03PS1) 10Marostegui: dbproxy1011: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/551721 [06:00:04] marostegui and jynus: #bothumor I ♥ Unicode. All rise for s6 database master failover deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T0600). [06:00:07] jynus: ready? [06:00:58] yes [06:01:03] !log Starting s6 failover from db1061 to db1131 - T235469 [06:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:09] T235469: Switchover s6 primary database master db1061 -> db1131 - 19th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T235469 [06:01:23] !log marostegui@cumin2001 dbctl commit (dc=all): 'Set s6 as read-only for maintenance T235469', diff saved to https://phabricator.wikimedia.org/P9666 and previous config saved to /var/cache/conftool/dbconfig/20191119-060122-marostegui.json [06:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:40] ro confirmed [06:01:44] Warning: The database has been locked for maintenance [06:02:04] !log marostegui@cumin2001 dbctl commit (dc=all): 'Promote db1131 to s6 master and remove read-only from s6 T235469', diff saved to https://phabricator.wikimedia.org/P9667 and previous config saved to /var/cache/conftool/dbconfig/20191119-060203-marostegui.json [06:02:08] switchover done [06:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:11] ro off [06:02:24] I can edit [06:02:25] I can edit [06:02:33] checking errors [06:02:56] (03PS1) 10Vgutierrez: hiera: Set nginx on port 4443 globally [puppet] - 10https://gerrit.wikimedia.org/r/551722 (https://phabricator.wikimedia.org/T231627) [06:02:58] (03PS1) 10Vgutierrez: hiera: Set ats-tls on port 443 globally [puppet] - 10https://gerrit.wikimedia.org/r/551723 (https://phabricator.wikimedia.org/T231627) [06:03:48] I see some errors from the jobqueue [06:07:34] I think we are good [06:07:42] (03CR) 10Marostegui: [C: 03+2] wmnet: Update s6-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/551041 (https://phabricator.wikimedia.org/T235469) (owner: 10Marostegui) [06:12:13] 10Operations, 10DBA: Decommission db1061-db1073 - 
https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:15:15] (03PS1) 10Marostegui: db1061: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/551725 [06:18:25] (03CR) 10Marostegui: [C: 03+2] db1061: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/551725 (owner: 10Marostegui) [06:20:05] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:20:13] (03CR) 10Marostegui: [C: 03+2] dbproxy1011: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/551721 (owner: 10Marostegui) [06:20:35] !log Depool labsdb1010 for upgrade [06:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:09] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/19472/" [puppet] - 10https://gerrit.wikimedia.org/r/551723 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:25:12] (03PS2) 10Vgutierrez: hiera: Set nginx on port 4443 for the caching clusters [puppet] - 10https://gerrit.wikimedia.org/r/551722 (https://phabricator.wikimedia.org/T231627) [06:25:14] (03PS2) 10Vgutierrez: hiera: Set ats-tls on port 443 globally [puppet] - 10https://gerrit.wikimedia.org/r/551723 (https://phabricator.wikimedia.org/T231627) [06:26:45] !log Move cp1089 from nginx to ats-tls - T231627 [06:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:50] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 [06:27:55] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set nginx on port 4443 for the caching clusters [puppet] - 10https://gerrit.wikimedia.org/r/551722 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:30:10] (03CR) 10Vgutierrez: [C: 03+2] hiera: Set ats-tls on port 443 globally [puppet] - 10https://gerrit.wikimedia.org/r/551723 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:30:29] (03PS3) 10Vgutierrez: hiera: Set 
ats-tls on port 443 globally [puppet] - 10https://gerrit.wikimedia.org/r/551723 (https://phabricator.wikimedia.org/T231627) [06:33:03] PROBLEM - HTTPS Unified ECDSA on cp1089 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:33:35] (03PS1) 10Marostegui: Revert "dbproxy1011: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/551727 [06:33:37] PROBLEM - HTTPS Unified RSA on cp1089 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [06:33:58] ^^ expected [06:34:45] RECOVERY - HTTPS Unified ECDSA on cp1089 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345596 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:35:19] RECOVERY - HTTPS Unified RSA on cp1089 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345562 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 369 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:35:32] (03PS1) 10Marostegui: db-eqiad,db-codfw.php. Remove db2061 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551728 (https://phabricator.wikimedia.org/T238526) [06:36:19] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php. Remove db2061 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551728 (https://phabricator.wikimedia.org/T238526) (owner: 10Marostegui) [06:37:02] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php. 
Remove db2061 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551728 (https://phabricator.wikimedia.org/T238526) (owner: 10Marostegui) [06:38:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2061 from config T238526 (duration: 00m 53s) [06:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:17] T238526: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 [06:39:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2061 from config T238526 (duration: 00m 52s) [06:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:33] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) 05Open→03Resolved [06:40:38] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance-Team (Radar): Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Vgutierrez) [06:41:20] (03PS1) 10Marostegui: mariadb: Set db2061 to spare [puppet] - 10https://gerrit.wikimedia.org/r/551729 (https://phabricator.wikimedia.org/T238526) [06:41:24] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:41:31] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Vgutierrez) 05Open→03Resolved Marking this as resolved as we don't use nginx anymore to terminate TLS in the caching cluster no... 
[06:43:20] (03CR) 10Marostegui: [C: 03+2] mariadb: Set db2061 to spare [puppet] - 10https://gerrit.wikimedia.org/r/551729 (https://phabricator.wikimedia.org/T238526) (owner: 10Marostegui) [06:44:34] !log Remove db2061 from tendril and zarcillo T238526 [06:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:38] T238526: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 [06:45:06] !log Stop MySQL on db2061 T238526 [06:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:40] 10Operations, 10Traffic: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 (10Vgutierrez) [06:47:50] 10Operations, 10Traffic: Remove nginx puppetization for cache text/text_ats - https://phabricator.wikimedia.org/T238625 (10Vgutierrez) p:05Triage→03Normal [07:24:45] (03PS1) 10Vgutierrez: ATS: Disable websocket remap rules for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) [07:26:46] marostegui: i'll steal deploy1001 for 5 mins, if that's ok [07:27:40] (03PS4) 10Mobrovac: Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: 10Physikerwelt) [07:30:46] (03CR) 10Mobrovac: [C: 03+2] Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: 10Physikerwelt) [07:31:28] (03Merged) 10jenkins-bot: Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: 10Physikerwelt) [07:33:31] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings-labs.php: Enable math links in Beta - T208758 (duration: 00m 53s) [07:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:36] T208758: Extend 
Popups-Extension to render popups for annotated - https://phabricator.wikimedia.org/T208758 [07:34:15] marostegui: ok, i'm done [07:37:01] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [07:47:52] ^ expected [07:47:57] I will get to it later [07:49:29] mobrovac: sorry for not replying, I'm dealing with some paperwork [07:50:25] no worries marostegui, good luck with the bureaucracy [07:50:55] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:03] mobrovac: I will need it, gracias! [08:00:00] (03CR) 10Effie Mouzeli: [C: 03+2] admin,mediawiki: Remove hhvm related sudo privileges [puppet] - 10https://gerrit.wikimedia.org/r/550483 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [08:00:35] (03PS3) 10Effie Mouzeli: admin,mediawiki: Remove hhvm related sudo privileges [puppet] - 10https://gerrit.wikimedia.org/r/550483 (https://phabricator.wikimedia.org/T229792) [08:13:54] (03PS4) 10Effie Mouzeli: mediawiki: Update decommission_appserver.sh [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) [08:19:17] (03CR) 10Muehlenhoff: (WIP) mediawiki: remove all hhvm related files and hieradata (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [08:29:35] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Prtksxna) After a discussion with @LMixter, @JbuattiWMF and @Varnent we'd like to propose: 1. `transparency.wikimedia.org` (an... 
[08:39:36] (03PS2) 10Gehel: [wdqs] Fix event service configuration [puppet] - 10https://gerrit.wikimedia.org/r/550754 (https://phabricator.wikimedia.org/T101013) (owner: 10DCausse) [08:41:29] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Update decommission_appserver.sh [puppet] - 10https://gerrit.wikimedia.org/r/551162 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [08:43:09] (03CR) 10Gehel: [C: 03+2] [wdqs] Fix event service configuration [puppet] - 10https://gerrit.wikimedia.org/r/550754 (https://phabricator.wikimedia.org/T101013) (owner: 10DCausse) [08:43:33] dcausse: ^ [08:44:43] gehel:thanks! [08:48:04] (03PS4) 10Gehel: [wdqs] add logging config for exporting updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T238557) (owner: 10DCausse) [08:50:21] (03CR) 10Gehel: [C: 03+2] [wdqs] add logging config for exporting updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T238557) (owner: 10DCausse) [08:51:20] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10ema) [09:05:01] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1011: Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/551727 (owner: 10Marostegui) [09:05:27] !log Repool labsbdb1010 [09:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P9668 and previous config saved to /var/cache/conftool/dbconfig/20191119-090745-marostegui.json [09:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:53] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy 
[09:08:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 for schema change', diff saved to https://phabricator.wikimedia.org/P9669 and previous config saved to /var/cache/conftool/dbconfig/20191119-090823-marostegui.json [09:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:39] !log Deploy schema change on db1101:3318 [09:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:52] !log installing libxslt security updates [09:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:48] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [09:10:12] (03PS1) 10Elukey: profile::analytics_cluster::coordinator: add geoip database [puppet] - 10https://gerrit.wikimedia.org/r/551776 (https://phabricator.wikimedia.org/T238432) [09:12:31] (03PS1) 10Vgutierrez: prometheus,ATS: track current active http/http2/websocket connections [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) [09:12:33] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [09:12:40] (03PS5) 10Filippo Giunchedi: hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) [09:16:17] (03CR) 10Ema: prometheus,ATS: track current active http/http2/websocket connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [09:20:04] !log rolling restart of nginx on acmechief/puppetdb to pick up libxslt security updates [09:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[09:21:35] !log installing ncurses security updates [09:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:09] (03PS2) 10Vgutierrez: prometheus,ATS: track current active http/http2/websocket connections [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) [09:24:17] (03CR) 10Vgutierrez: prometheus,ATS: track current active http/http2/websocket connections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [09:24:56] (03PS23) 10ArielGlenn: allow storage generated misc cron dump output on secondary nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [09:26:22] (03PS2) 10Effie Mouzeli: prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) [09:26:59] (03PS2) 10Effie Mouzeli: mediawiki: Remove HHVM references and includes [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) [09:27:47] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [09:28:41] (03CR) 10Ema: [C: 03+1] prometheus,ATS: track current active http/http2/websocket connections [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [09:38:51] (03PS1) 10TerraCodes: Complete wmfRealm to wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551778 (https://phabricator.wikimedia.org/T45956) [09:40:12] (03PS4) 10Filippo Giunchedi: rsyslog: setup temporary secure rsync for logs transfer [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) [09:40:12] !log disable puppet on P:mediawiki::php - T229792 [09:40:14] (03PS1) 10Filippo Giunchedi: hieradata: pool centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/551779 
(https://phabricator.wikimedia.org/T224564) [09:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:18] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [09:44:51] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:49:23] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Vgutierrez) so, I've been doing some tests, and ATS doesn't drop the url-encoded version of the semicolon, so `... [09:49:46] (03PS3) 10Vgutierrez: prometheus,ATS: track current active http/http2/websocket connections [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) [10:00:06] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10ema) By going through SAL and the irc logs on #wikimedia-operations I've reconstructed the events as follo... [10:01:30] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Vgutierrez) So: `vgutierrez@cp1075:~$ curl -H 'Host: ban.wikipedia.org' "http://127.0.0.1:3120/wiki/Mal:%3B" -v... 
[10:02:31] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: pool centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/551779 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [10:03:12] (03CR) 10Vgutierrez: [C: 03+2] prometheus,ATS: track current active http/http2/websocket connections [puppet] - 10https://gerrit.wikimedia.org/r/551777 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [10:06:02] !log repool centrallog2001 [10:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:41] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:45] that's the telia maint ^ [10:10:32] !log Upgrade MySQL on db2086 db2087 db2100 [10:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:35] there might be some rsyslog failures alerts firing, expected [10:13:23] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:15:48] (03PS24) 10ArielGlenn: allow storage generated misc cron dump output on secondary nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [10:16:28] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10fgiunchedi) [10:17:12] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10Peachey88) [10:17:13] (03PS4) 10Gehel: query_service: add updater mode option [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [10:17:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, are there equivalent metrics for php-fpm (or needed?)" [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:20:06] (03CR) 10ArielGlenn: [C: 03+2] allow storage generated misc cron dump output on secondary nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) (owner: 10ArielGlenn) [10:20:38] (03CR) 10Gehel: [C: 03+2] query_service: add updater mode option [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [10:21:27] apergos: should I merge your change as well? [10:21:41] ah it's you who has the lock! [10:21:45] yeah go ahead [10:22:25] done! 
[10:22:28] ty [10:22:52] 10Operations, 10ops-codfw, 10observability, 10User-fgiunchedi: Update label and switch port for wezen -> centrallog2001 - https://phabricator.wikimedia.org/T238642 (10fgiunchedi) [10:23:34] (03PS4) 10Gehel: Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [10:23:50] apergos: np! [10:24:16] !log Upgrade db2120 db2121 db2122 [10:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:45] (03PS1) 10Jbond: puppet_compiler: ensure we dont reset the permissions of the repo [puppet] - 10https://gerrit.wikimedia.org/r/551783 [10:24:49] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:25:07] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: Remove HHVM references and includes [puppet] - 10https://gerrit.wikimedia.org/r/550818 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [10:26:15] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Volans) @jbond just to be on the safe side and to verify the theory, if possible make a quick test that the new cert in the CR is able to verify exiting puppet node... [10:26:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add system user analytics-privatedata to the anaytics-privatedata-users group - https://phabricator.wikimedia.org/T238306 (10elukey) During the SRE meeting no strong opposition to this task was raised, but it was su... 
[10:29:11] !log Upgrade db2077 [10:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:30] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) @volans cheers also re keeping the old cert around, for what its worth it will remain in the git history [10:29:52] (03CR) 10Jbond: [C: 03+2] puppet_compiler: ensure we dont reset the permissions of the repo [puppet] - 10https://gerrit.wikimedia.org/r/551783 (owner: 10Jbond) [10:30:31] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:31:12] (03PS1) 10Ema: cache: reimage cp2001 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551785 (https://phabricator.wikimedia.org/T227432) [10:31:14] (03PS1) 10Ema: cache_text codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/551786 (https://phabricator.wikimedia.org/T227432) [10:32:01] (03PS1) 10Gehel: wdqs: fix logging configuration for updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551787 (https://phabricator.wikimedia.org/T238557) [10:32:23] (03PS9) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) [10:33:04] (03PS1) 10Filippo Giunchedi: monitoring: remove nginx availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/551789 (https://phabricator.wikimedia.org/T231627) [10:33:27] (03CR) 10Ema: [C: 03+2] vcl: move XWD pass logic to wm_common [puppet] - 10https://gerrit.wikimedia.org/r/551531 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [10:33:37] (03CR) 10Arturo Borrero Gonzalez: toolforge: proxy: adjust setup for the new k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo 
Borrero Gonzalez) [10:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 for upgrade and compression', diff saved to https://phabricator.wikimedia.org/P9670 and previous config saved to /var/cache/conftool/dbconfig/20191119-103426-marostegui.json [10:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:37] !log Compress and upgrade db1098:3317 [10:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:44] (03CR) 10DCausse: [C: 03+1] wdqs: fix logging configuration for updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551787 (https://phabricator.wikimedia.org/T238557) (owner: 10Gehel) [10:35:23] (03CR) 10Gehel: [C: 03+2] wdqs: fix logging configuration for updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551787 (https://phabricator.wikimedia.org/T238557) (owner: 10Gehel) [10:35:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for upgrade and compression', diff saved to https://phabricator.wikimedia.org/P9671 and previous config saved to /var/cache/conftool/dbconfig/20191119-103540-marostegui.json [10:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:03] !log Compress and upgrade db1098:3316 [10:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:32] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Enable mwdebug routes for noc.wikimedia.org - https://phabricator.wikimedia.org/T233768 (10ema) 05Open→03Resolved a:03ema This is now done: ` $ curl -v https://noc.wikimedia.org/Potato -H "X-Wikimedia-Debug: mwdebug1001.eqiad.wm... 
[10:48:12] 10Operations, 10serviceops: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10ema) [10:48:14] 10Operations, 10Traffic: Trigger envoy reload upon TLS certificate update - https://phabricator.wikimedia.org/T236125 (10ema) [10:50:43] (03CR) 10Vgutierrez: [C: 03+1] monitoring: remove nginx availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/551789 (https://phabricator.wikimedia.org/T231627) (owner: 10Filippo Giunchedi) [10:52:39] 10Operations, 10serviceops: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10ema) @Joe: is there any potential risk in making `profile::tlsproxy::envoy::use_hot_restarter` default to `true` as @cdanis suggested? [10:56:11] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) - https://phabricator.wikimedia.org/T173894 (10MarcoAurelio) [10:56:36] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: https://lists.wikimedia.org/mailman/options/ doesn't set charset header - https://phabricator.wikimedia.org/T172929 (10MarcoAurelio) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:05:57] (03PS5) 10Gehel: Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [11:08:13] (03CR) 10Gehel: [C: 03+2] Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) (owner: 10Mathew.onipe) [11:08:20] (03CR) 10Effie Mouzeli: "Looks ok https://puppet-compiler.wmflabs.org/compiler1001/19477/" [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:08:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] logstash: remove HHVM references [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:08:27] umh, I do have a patch for SWAT, but if there's no deployer available... [11:08:45] Urbanecm: do we have SWAT for today? [11:09:00] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:09:21] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:11:03] (03PS1) 10MarcoAurelio: [WIP] Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 [11:13:18] (03PS1) 10Elukey: kerberos: add syslog logging to kerberos-run-command.py [puppet] - 10https://gerrit.wikimedia.org/r/551794 (https://phabricator.wikimedia.org/T238306) [11:14:10] (03PS2) 10MarcoAurelio: Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) [11:14:33] (03PS2) 10Elukey: kerberos: add syslog logging to kerberos-run-command.py [puppet] - 10https://gerrit.wikimedia.org/r/551794 
(https://phabricator.wikimedia.org/T238306) [11:16:48] (03PS1) 10Jbond: memcache: comment out memcache [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551795 (https://phabricator.wikimedia.org/T233933) [11:16:57] !log restarting wdqs updater on wdqs1004 - T231411 [11:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:03] T231411: Test new Updater service - https://phabricator.wikimedia.org/T231411 [11:18:01] (03CR) 10Jbond: [V: 03+2 C: 03+2] memcache: comment out memcache [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551795 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [11:20:27] (03PS2) 10Alexandros Kosiaris: hhvm: Remove hhvm module from puppet [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:20:51] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:29:17] !log Upgrade dbstore1003 (3311,3315,3317) [11:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:28] (03PS1) 10Gehel: Revert "Switch wdqs1004 to merging updater mode" [puppet] - 10https://gerrit.wikimedia.org/r/551796 [11:30:51] (03PS2) 10Gehel: Revert "Switch wdqs1004 to merging updater mode" [puppet] - 10https://gerrit.wikimedia.org/r/551796 (https://phabricator.wikimedia.org/T231411) [11:32:06] (03CR) 10DCausse: [C: 03+1] Revert "Switch wdqs1004 to merging updater mode" [puppet] - 10https://gerrit.wikimedia.org/r/551796 (https://phabricator.wikimedia.org/T231411) (owner: 10Gehel) [11:32:53] (03CR) 10Gehel: [C: 03+2] Revert "Switch wdqs1004 to merging updater mode" [puppet] - 10https://gerrit.wikimedia.org/r/551796 (https://phabricator.wikimedia.org/T231411) (owner: 10Gehel) [11:32:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, just a pedantic minor comment which could be ignored." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539150 (owner: 10Muehlenhoff) [11:33:43] (03PS3) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [11:34:17] (03CR) 10jerkins-bot: [V: 04-1] (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:34:34] (03CR) 10Phamhi: [C: 03+1] toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [11:35:54] (03PS4) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [11:37:57] !log restarting wdqs blazegraph on wdqs1004 - T231411 [11:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:02] T231411: Test new Updater service - https://phabricator.wikimedia.org/T231411 [11:39:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: add wmcs-k8s-get-cert.sh script [puppet] - 10https://gerrit.wikimedia.org/r/550673 (https://phabricator.wikimedia.org/T215553) (owner: 10Arturo Borrero Gonzalez) [11:39:23] (03CR) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:41:13] !log depooling wdqs1004 - T231411 [11:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] the page on icinga is me, sorry, will be fixed shortly [11:43:28] ok, I just got it [11:44:03] ok should be back [11:45:35] (03CR) 10Faidon Liambotis: Automatically cast network strings to ipaddress objects (031 comment) 
[software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [11:47:00] (03CR) 10Faidon Liambotis: [C: 03+1] mr: add DHCP server support + replace all system {} [homer/public] - 10https://gerrit.wikimedia.org/r/551274 (owner: 10Ayounsi) [11:48:16] (03CR) 10Faidon Liambotis: Automatically cast network strings to ipaddress objects (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [11:49:07] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) I have tested a puppet run and made sure the debmonitor client works with the new CA. Are there any other service worth validating. specificity dose this c... [11:49:40] (03CR) 10Faidon Liambotis: [C: 03+1] "LGTM, with the exception of the sort. It sounds like this doesn't work, so perhaps we need to figure this out first?" [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [11:52:31] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10Aklapper) Hi @Moushira. Please see https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list for required information. Thanks! [11:56:31] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Volans) The debmonitor test didn't test much as the debmonitor client sends the puppet client cert (not the CA) and it's the server that validates it with the CA.... [11:57:32] (03CR) 10Faidon Liambotis: [C: 04-1] "> > Minor comment inline. 
I'm also not sure if we should have special" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/550376 (owner: 10Ayounsi) [12:02:33] (03PS1) 10Arturo Borrero Gonzalez: ssl: add dummy private key for toolforge-k8s-prometheus [labs/private] - 10https://gerrit.wikimedia.org/r/551797 (https://phabricator.wikimedia.org/T237643) [12:03:50] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Aklapper) I wonder why #1 should be performed before #2 is performed, and not the other way round (copy/move historical data o... [12:04:21] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] ssl: add dummy private key for toolforge-k8s-prometheus [labs/private] - 10https://gerrit.wikimedia.org/r/551797 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:08:28] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db1067 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551799 (https://phabricator.wikimedia.org/T238297) [12:09:22] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db1067 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551799 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [12:10:18] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db1067 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551799 (https://phabricator.wikimedia.org/T238297) (owner: 10Marostegui) [12:11:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db1067 from config T238297 (duration: 00m 52s) [12:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:30] T238297: decommission db1067.eqiad.wmnet - https://phabricator.wikimedia.org/T238297 [12:12:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db1067 from config T238297 (duration: 00m 52s) [12:12:27] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:57] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:17:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] logging: add logspam utilities [puppet] - 10https://gerrit.wikimedia.org/r/547777 (owner: 10Brennen Bearnes) [12:21:11] (03PS3) 10Arturo Borrero Gonzalez: toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) [12:22:54] !log enable puppet on P:mediawiki::php and *.codfw.wmnet [12:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:22] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10ssingh) [12:24:18] (03PS4) 10Arturo Borrero Gonzalez: toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) [12:28:29] !log enable puppet on P:mediawiki::php and *.eqiad.wmnet [12:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:33] (03PS1) 10ArielGlenn: move misc crons to dumpsdata1002 nfs server [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563) [12:31:18] 10Operations, 10hardware-requests: Hardware request for Postgres database for censorship monitoring scripts - https://phabricator.wikimedia.org/T238652 (10akosiaris) p:05Triage→03Normal Adding some extra info as we 've discussed with @ssingh before this task was filed. After the discussion it became obviou... 
[12:33:16] (03PS1) 10Arturo Borrero Gonzalez: ssl: move toolforge-k8s-prometheus priv key to a proper location [labs/private] - 10https://gerrit.wikimedia.org/r/551805 (https://phabricator.wikimedia.org/T237643) [12:33:42] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] ssl: move toolforge-k8s-prometheus priv key to a proper location [labs/private] - 10https://gerrit.wikimedia.org/r/551805 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:35:18] (03CR) 10ArielGlenn: [C: 04-2] "DO NOT MERGE until Friday late or Saturday when no misc cron dumps are running." [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn) [12:36:47] (03PS1) 10Volans: Revert "coherence: Check device names for correct formatting" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/551806 [12:38:26] (03CR) 10Volans: [C: 03+2] Revert "coherence: Check device names for correct formatting" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/551806 (owner: 10Volans) [12:41:26] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) The schedule is now: - move misc crons to dumpsdata1002 when misc crons are not running (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551804/... 
[12:42:13] !log add libapache2-mod-auth-cas 1.2-1 to stretch-wikimedia repo [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: enable scraping for the new k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/551191 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [12:48:00] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Joe) The problem is most apache workers ended up being stuck talking to aphlict via `proxy_wstunnel` which... [12:55:47] 10Operations, 10SRE-tools, 10netbox, 10observability, 10User-crusnov: Netbox Alert Cleanups - https://phabricator.wikimedia.org/T224946 (10faidon) What's the latest here? Please keep the task updated :) [12:59:15] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Remove dead HHVM code [puppet] - 10https://gerrit.wikimedia.org/r/551161 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:59:27] (03CR) 10Filippo Giunchedi: [C: 03+2] monitoring: remove nginx availability alerts [puppet] - 10https://gerrit.wikimedia.org/r/551789 (https://phabricator.wikimedia.org/T231627) (owner: 10Filippo Giunchedi) [12:59:47] (03PS1) 10Jbond: apereo_cas: use CAS protocol instead of SAML [puppet] - 10https://gerrit.wikimedia.org/r/551811 [13:10:41] (03CR) 10Muehlenhoff: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [13:10:54] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix syntax in the inlined config yaml [puppet] - 10https://gerrit.wikimedia.org/r/551816 (https://phabricator.wikimedia.org/T237643) [13:13:11] addshore: I know it sound strange but what happens if you try to delete 
https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter again after getting 403? another 403 ? [13:14:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: fix syntax in the inlined config yaml [puppet] - 10https://gerrit.wikimedia.org/r/551816 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:14:15] !log depool cp2001 and reimage as text_ats T227432 [13:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:21] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:15:47] (03CR) 10Ema: [C: 03+2] cache: reimage cp2001 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551785 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [13:18:30] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2001.codfw.wmnet'] ` The log can be found in `/var/log/wm... 
[13:22:28] (03PS1) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix syntax for label in the new-k8s-nodes job [puppet] - 10https://gerrit.wikimedia.org/r/551817 (https://phabricator.wikimedia.org/T237643) [13:25:27] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2001_v4,cp2001_v6} Ema reimaging cp2001 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [13:25:46] (03CR) 10jerkins-bot: [V: 04-1] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job [puppet] - 10https://gerrit.wikimedia.org/r/551817 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:26:02] (03PS2) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix syntax for label in the new-k8s-nodes job [puppet] - 10https://gerrit.wikimedia.org/r/551817 (https://phabricator.wikimedia.org/T237643) [13:28:09] (03PS3) 10Arturo Borrero Gonzalez: toolforge: prometheus: fix syntax for label in the new-k8s-nodes job [puppet] - 10https://gerrit.wikimedia.org/r/551817 (https://phabricator.wikimedia.org/T237643) [13:31:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: prometheus: fix syntax for label in the new-k8s-nodes job [puppet] - 10https://gerrit.wikimedia.org/r/551817 (https://phabricator.wikimedia.org/T237643) (owner: 10Arturo Borrero Gonzalez) [13:33:56] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [13:33:58] (03PS5) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [13:33:59] !log ema@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:25] RECOVERY - Router interfaces on cr2-eqord is OK: OK: 
host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:35:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:37:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3318 after schema change', diff saved to https://phabricator.wikimedia.org/P9672 and previous config saved to /var/cache/conftool/dbconfig/20191119-133704-marostegui.json [13:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1092 for schema change', diff saved to https://phabricator.wikimedia.org/P9673 and previous config saved to /var/cache/conftool/dbconfig/20191119-133850-marostegui.json [13:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:24] !log Deploy schema change on db1092 [13:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:44] (03CR) 10Herron: "FWIW PS16 and onward include offline logstash plugins built by bin/logstash-plugin prepare-offline-pack which handles fetching logstash ge" [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [13:43:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:46:57] !log Deploy schema change on labswiki (wikitech) - T238370 [13:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] T238370: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 [13:49:57] !log Deploy schema change on foundationwiki directly on s3 master - T238370 [13:50:02] !log Deploy 
schema change on foundationwiki directly on s3 master - T238370 [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:53] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2001.codfw.wmnet'] ` and were **ALL** successful. [13:55:12] !log Deploy schema change on mediawikiwiki directly on s3 master T238370 [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] T238370: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370 [13:57:08] !log Deploy schema change on mediawikiwiki directly on s7 master T238370 [13:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:22] !log Deploy schema change on metawiki directly on s7 master T238370 [13:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:33] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Prtksxna) @Aklapper, the legal team doesn't have the resources to do #2 at the moment, and not having #1 is already causing so... [13:58:51] (03CR) 10Ema: [C: 03+2] cache_text codfw: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/551786 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:10:18] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Sotiale) The main account went well. (looks fine) I will handle bot account at 14:20 by UTC. 
[14:13:15] !log pool cp2001 with ATS backend T227432 [14:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:20] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [14:14:41] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Marostegui) Thanks for the heads up. [14:16:13] (03PS1) 10Ema: cache: reimage cp2004 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551827 (https://phabricator.wikimedia.org/T227432) [14:18:07] !log depool cp2004 and reimage as text_ats T227432 [14:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:20] (03CR) 10Ema: [C: 03+2] cache: reimage cp2004 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551827 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:20:56] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Sotiale) >>! In T238512#5675217, @Marostegui wrote: > Thanks for the heads up. Thanks for caring. Can we proceed now? [14:21:03] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2004.codfw.wmnet'] ` The log can be found in `/var/log/wm... [14:21:14] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Marostegui) >>! In T238512#5675233, @Sotiale wrote: >>>! In T238512#5675217, @Marostegui wrote: >> Thanks for the heads up. > > Thanks f... 
[14:28:14] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (101997kB) https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress?username=Lamchuhan-bot already finished, in case you guys still need... [14:28:42] (03CR) 10Ottomata: [C: 03+1] profile::analytics_cluster::coordinator: add geoip database [puppet] - 10https://gerrit.wikimedia.org/r/551776 (https://phabricator.wikimedia.org/T238432) (owner: 10Elukey) [14:28:57] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Marostegui) Excellent - thank you. All good from our side too. [14:29:21] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) @Nuria Hi Nuria. Yes, I am aware that needs to be accomplished. I have the checklist in my ticket. [14:29:28] (03PS2) 10Elukey: profile::analytics_cluster::coordinator: add geoip database [puppet] - 10https://gerrit.wikimedia.org/r/551776 (https://phabricator.wikimedia.org/T238432) [14:30:03] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Sotiale) I was so surprised that it was taken care of so quickly. Thank you very much for your attention!! 
[14:31:22] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (101997kB) 05Open→03Resolved a:03Sotiale [14:32:32] (03CR) 10Elukey: [C: 03+2] profile::analytics_cluster::coordinator: add geoip database [puppet] - 10https://gerrit.wikimedia.org/r/551776 (https://phabricator.wikimedia.org/T238432) (owner: 10Elukey) [14:34:26] (03PS1) 10Elukey: Revert "profile::analytics_cluster::coordinator: add geoip database" [puppet] - 10https://gerrit.wikimedia.org/r/551833 [14:34:58] !log restarting blazegraph with additional logging on wdqs1004 - T231411 [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:03] T231411: Test new Updater service - https://phabricator.wikimedia.org/T231411 [14:35:37] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) Thanks Ricardo, I have now validated all the certificates in the private repo ` lines=7 puppetmaster1001 ~ % openssl x509 -in new_ca.pem -noout -subject -d... [14:36:29] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [14:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] (03CR) 10Elukey: [C: 03+2] Revert "profile::analytics_cluster::coordinator: add geoip database" [puppet] - 10https://gerrit.wikimedia.org/r/551833 (owner: 10Elukey) [14:38:31] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:17] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10Ottomata) Hm, uh oh, @elukey we might want to recreate all those certs with a really long expiry :) > i noticed that all of the folders under /srv/private/modules/... 
[14:44:37] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Patch-For-Review: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10akosiaris) >>! In T238593#5674571, @ema wrote: > By going through SAL and the irc logs on #wikimedia-opera... [14:49:04] (03PS1) 10Jbond: admin: add ezachte ssh key [puppet] - 10https://gerrit.wikimedia.org/r/551836 (https://phabricator.wikimedia.org/T215790) [14:49:20] godog: i tried yesterday and I got another 403, let me try again today [14:49:55] yup, 2x 403s in a row [14:50:22] Request from - via cp3058.esams.wmnet, ATS/8.0.5 [14:50:22] Error: 403, Access Denied at 2019-11-19 14:49:40 GMT [14:50:48] ah yeah same here, thanks addshore [14:52:16] (03CR) 10Ema: [C: 03+1] ATS: Disable websocket remap rules for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [14:52:23] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2004.codfw.wmnet'] ` and were **ALL** successful. [14:52:25] I was asking before deleting myself because the other day it did work after I tried again [14:52:42] I believe it has sth to do with ats migration, I don't know yet what tho [14:53:50] getting 403s instead of something more useful? 
:) [14:54:27] (03CR) 10Jbond: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [14:54:29] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: Migrate EventStreams to k8s deployment pipeline - https://phabricator.wikimedia.org/T238658 (10Ottomata) [14:55:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Ottomata) [14:55:36] ema: hehe indeed, context is T238540 [14:55:36] T238540: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 [14:55:38] 10Operations, 10Traffic, 10Wikidata, 10observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (10fgiunchedi) [14:56:31] 10Operations, 10Traffic, 10Wikidata, 10observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (10ema) p:05Triage→03Normal [14:56:59] godog: thanks, looking [14:58:59] (03PS1) 10BBlack: TLS Analytics: make parsing more robust [puppet] - 10https://gerrit.wikimedia.org/r/551840 [15:04:01] (03PS1) 10Ottomata: [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [15:04:16] (03CR) 10jerkins-bot: [V: 04-1] [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:05:42] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:05:45] (03CR) 10Jbond: puppetmaster,icinga: naggen2 cleanup and update to python3 (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [15:06:17] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5e7f759] (dev-cluster): Switch test.wp and test2.wp to Parsoid/PHP [15:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:37] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable ldap-corp1001/2001 as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 (owner: 10Muehlenhoff) [15:07:19] 10Operations, 10Traffic, 10Wikidata, 10observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (10ema) Interesting, I've observed the request failing as described in this task by using the Chromium d... 
[15:07:54] !log pool cp2004 with ATS backend T227432 [15:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:58] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:08:19] (03PS2) 10Ottomata: [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [15:08:32] (03CR) 10jerkins-bot: [V: 04-1] [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:09:06] (03PS3) 10Ottomata: [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [15:09:15] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5e7f759] (dev-cluster): Switch test.wp and test2.wp to Parsoid/PHP (duration: 02m 58s) [15:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:09:23] (03PS4) 10Ottomata: [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) [15:09:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] eventstreams chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/551843 (https://phabricator.wikimedia.org/T238658) (owner: 10Ottomata) [15:13:09] !log mobrovac@deploy1001 Started deploy [restbase/deploy@5e7f759]: Switch test.wp and test2.wp to Parsoid/PHP - T229015 [15:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:14] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [15:14:27] (03PS1) 10Ema: cache: reimage cp2006 as text_ats [puppet] - 
10https://gerrit.wikimedia.org/r/551846 (https://phabricator.wikimedia.org/T227432) [15:15:55] !log depool cp2006 and reimage as text_ats T227432 [15:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:59] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [15:18:15] (03CR) 10Ema: [C: 03+2] cache: reimage cp2006 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551846 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [15:19:33] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2006.codfw.wmnet'] ` The log can be found in `/var/log/wm... [15:24:16] 10Operations, 10ops-codfw, 10observability, 10User-fgiunchedi: Update label and switch port for wezen -> centrallog2001 - https://phabricator.wikimedia.org/T238642 (10Papaul) p:05Triage→03Normal [15:26:24] 10Operations, 10ops-codfw, 10observability, 10User-fgiunchedi: Update label and switch port for wezen -> centrallog2001 - https://phabricator.wikimedia.org/T238642 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces ge-5/0/13] - description wezen; + description centrallog2001; [15:27:31] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@5e7f759]: Switch test.wp and test2.wp to Parsoid/PHP - T229015 (duration: 14m 22s) [15:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:36] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [15:27:42] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp2006_v4,cp2006_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan 
https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:28:23] 10Operations, 10Traffic, 10Wikidata, 10observability: Delete grafana dashboard, https://grafana.wikimedia.org/d/000000599/wikibase-wb_terms-newitemidformatter - https://phabricator.wikimedia.org/T238540 (10ema) >>! In T238540#5675342, @ema wrote: > I've observed the request failing as described in this tas... [15:28:26] (03CR) 10Dzahn: [C: 03+1] ATS: Disable websocket remap rules for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [15:33:32] (03PS1) 10Ema: ATS: allow DELETE requests [puppet] - 10https://gerrit.wikimedia.org/r/551849 (https://phabricator.wikimedia.org/T238540) [15:34:57] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [15:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:35] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1079:9536,cp1081:9536,cp1087:9536} site=eqiad tunnel={cp2006_v4,cp2006_v6} Ema reimaging cp2006 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:37:04] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:38:23] (03PS1) 10Ema: cache: reimage cp2007 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551850 (https://phabricator.wikimedia.org/T227432) [15:43:03] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [15:44:54] (03PS2) 10Ema: ATS: Disable websocket remap rules for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [15:46:37] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [15:53:08] jouncebot now [15:53:08] No deployments scheduled for the next 0 hour(s) and 6 minute(s) [15:53:20] jouncebot next [15:53:20] In 0 hour(s) and 6 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1600) [15:57:43] !log mobrovac@deploy1001 Started deploy [restbase/deploy@564b2c6]: New Parsoid/PHP config structure [15:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:13] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [15:58:48] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wi [15:58:48] oring/mobileapps [15:59:54] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@564b2c6]: New Parsoid/PHP config structure (duration: 02m 11s) [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:14] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:04:19] (03PS2) 10Dzahn: phabricator: use codfw db servers for codfw server [puppet] - 10https://gerrit.wikimedia.org/r/551285 (https://phabricator.wikimedia.org/T137928) [16:05:03] (03CR) 10Ema: [C: 03+2] ATS: Disable websocket remap rules for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/551731 (https://phabricator.wikimedia.org/T238593) (owner: 10Vgutierrez) [16:05:05] (03CR) 10Dzahn: [C: 03+2] "readonly db in codfw for codfw warm standby-server" [puppet] - 10https://gerrit.wikimedia.org/r/551285 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [16:06:08] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2006.codfw.wmnet'] ` and were **ALL** successful. 
[16:09:48] !log pool cp2006 with ATS backend T227432 [16:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:53] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:10:30] !log depool cp2007 and reimage as text_ats T227432 [16:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:16] (03CR) 10Ema: [C: 03+2] cache: reimage cp2007 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/551850 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [16:12:54] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2007.codfw.wmnet'] ` The log can be found in `/var/log/wm... [16:14:06] (03PS4) 10Ayounsi: Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 [16:14:12] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) >>! In T238593#5674571, @ema wrote: > - 2019-11-15 17:30 SAL: `mutante: phabricator - -started phd service`. @Dzahn It's... 
[16:14:26] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:16] !log reloading data from wdqs1007 to wdqs1004 - after failed test of merging updater - T212826 [16:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:21] T212826: Create dedicated Updater service in Blazegraph - https://phabricator.wikimedia.org/T212826 [16:15:50] (03CR) 10Ayounsi: "> Patch Set 2: Code-Review+1" [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [16:16:24] (03CR) 10jerkins-bot: [V: 04-1] Add virtual-chassis support [software/homer] - 10https://gerrit.wikimedia.org/r/550367 (owner: 10Ayounsi) [16:17:38] !log phab1003 - systemctl stop aphlict (proxy config in apache is disabled as well as disabled in ATS) (T238593) [16:17:42] (03CR) 10Ayounsi: [C: 03+1] Add virtual-chassis support [homer/public] - 10https://gerrit.wikimedia.org/r/550370 (owner: 10Ayounsi) [16:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:44] T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 [16:19:53] (03PS1) 10Elukey: hive: enable QOP by default when SASL is set [puppet] - 10https://gerrit.wikimedia.org/r/551860 [16:20:07] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/media-list/{ti [16:20:07] ist from test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:20:29] 10Operations, 10Phabricator, 
10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10ema) >>! In T238593#5675662, @Dzahn wrote: > > You can entirely disregard that, i was on phab1001 and not phab1003 by accident.... [16:21:21] !log phab1003 - puppet restarts aphlict service even with "phabricator_aphlict_enabled: false" in Hiera. But it does properly remove the proxy config lines from apache. so service is running but not used. (T238593) [16:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:19] !log installing glib2.0 security updates [16:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:34] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10crusnov) p:05Triage→03Normal [16:25:47] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [16:25:55] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) >>! In T238593#5675686, @ema wrote: > Was there any other admin action between the page and when @joe disabled proxy_wstu... 
[16:26:08] re: mobileapps alerts, looks like the internal request rates for a many endpoints tripled beginning at 15:18 UTC [16:26:18] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) - https://phabricator.wikimedia.org/T173894 (10crusnov) p:05Triage→03Normal [16:26:41] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: https://lists.wikimedia.org/mailman/options/ doesn't set charset header - https://phabricator.wikimedia.org/T172929 (10crusnov) p:05Triage→03Normal [16:27:15] https://grafana.wikimedia.org/d/000000183/mobileapps?orgId=1&from=1574170024596&to=1574180824596&var-percentile=p50&panelId=1&fullscreen [16:28:11] mobrovac: Pchelolo: Is this increase in internal mobileapps request rates expected? ^ [16:28:21] !log ema@cumin2001 START - Cookbook sre.hosts.downtime [16:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] !log ema@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:34] (03PS1) 10Elukey: Add fake kerberos keytabs for an1030 and an-coord1001 [labs/private] - 10https://gerrit.wikimedia.org/r/551862 [16:31:49] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake kerberos keytabs for an1030 and an-coord1001 [labs/private] - 10https://gerrit.wikimedia.org/r/551862 (owner: 10Elukey) [16:33:12] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19484/" [puppet] - 10https://gerrit.wikimedia.org/r/551860 (owner: 10Elukey) [16:36:11] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [16:37:09] (03CR) 10Ayounsi: "> Patch Set 3:" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/550376 (owner: 10Ayounsi) [16:38:09] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: 
instance={cp1079:9536,cp1081:9536,cp1087:9536} site=eqiad tunnel={cp2007_v4,cp2007_v6} Ema reimaging 2007 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [16:39:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/551849 (https://phabricator.wikimedia.org/T238540) (owner: 10Ema) [16:42:21] (03CR) 10Jbond: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [16:42:34] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) Regarding the puppetization: There is `hiera('phabricator_aphlict_enabled'.` which is now set to false. What this does:... [16:43:13] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10ema) As per irc conversation with @gilles, we do have frontend servers tagged in navtiming hadoop data. It would be very useful if we could have the info... [16:45:10] 10Operations, 10observability: Make contact group for Netbox report alerts - https://phabricator.wikimedia.org/T230725 (10crusnov) p:05Triage→03Normal [16:45:14] (03PS6) 10Mforns: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) [16:45:35] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2007.codfw.wmnet'] ` and were **ALL** successful. 
[16:48:40] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:48:50] (03PS7) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:51:08] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) @Gilles To see if and to which extent ats-tls is also responsible for some of the performance degradation, you can query hadoop and check the ssl t... [16:52:21] (03PS5) 10Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - 10https://gerrit.wikimedia.org/r/551273 [16:52:35] (03PS1) 10CRusnov: nagios_common: Add dcops irc notification commands [puppet] - 10https://gerrit.wikimedia.org/r/551864 (https://phabricator.wikimedia.org/T230725) [16:53:15] (03CR) 10Ayounsi: Automatically cast network strings to ipaddress objects (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [16:54:16] (03CR) 10Ayounsi: [C: 03+1] "Waiting for I5d4881c63e4c40c74f0397b4ff75bb79139d4ae7" [homer/public] - 10https://gerrit.wikimedia.org/r/551274 (owner: 10Ayounsi) [16:56:40] (03PS8) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:56:59] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10SBisson) My summary of the gerrit discussion: //The patch was written in a way that completely ignored the reality of the... 
[16:57:37] (03CR) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:57:42] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:58:32] !log pool cp2007 with ATS backend T227432 [16:58:36] (03CR) 10Elukey: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [16:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:37] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [16:58:54] jouncebot: now [16:58:55] For the next 0 hour(s) and 1 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1600) [16:58:58] jouncebot: next [16:58:58] In 0 hour(s) and 1 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1700) [16:59:46] * addshore waits to see if that is being used [17:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1700). [17:01:22] subbu: is that deployment slot being used? 
:) [17:02:11] !log volker-e@deploy1001 Started deploy [design/style-guide@d73818a]: Deploy design/style-guide: [17:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:19] !log volker-e@deploy1001 Finished deploy [design/style-guide@d73818a]: Deploy design/style-guide: (duration: 00m 07s) [17:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:00] we may do a deployment, but we normally do it at noon CT. [17:03:36] ack! I'm going to push some of my changes out in this hour then! [17:04:08] ok [17:09:29] (03CR) 10Filippo Giunchedi: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [17:11:30] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag) [[gerrit:551855]] [[gerrit:551856]] (duration: 00m 54s) [17:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:34] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [17:12:34] * addshore now waits for the next one to merge (won't be doing anything for the next 15 mins) [17:15:51] jouncebot: now [17:15:51] For the next 0 hour(s) and 44 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1700) [17:15:55] jouncebot: next [17:15:55] In 0 hour(s) and 44 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1800) [17:17:38] (03CR) 10Dzahn: [C: 03+1] nagios_common: Add dcops irc notification commands [puppet] - 10https://gerrit.wikimedia.org/r/551864 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [17:17:49] (03CR) 10CRusnov: [C: 03+2] nagios_common: Add dcops irc notification commands [puppet] - 
10https://gerrit.wikimedia.org/r/551864 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [17:18:44] (03CR) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [17:18:50] (03PS9) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [17:20:32] (03CR) 10Herron: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [17:21:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) Please note this should have included https://netbox.wikimedia.org/dcim/devices/1405/ in the decom as well. @Jclark-ctr: I'm adding... [17:22:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10RobH) [17:23:12] (03CR) 10Dzahn: [C: 03+1] "now that DB is switched to readonly-codfw, we can do this, right" [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [17:23:21] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [17:23:39] mmm I cannot submit --^ [17:23:47] merge conflict? 
[17:24:04] it seems not, I can rebase on my local repo and on gerrit [17:24:24] elukey: it's marked as WIP [17:24:34] so you can't submit unless you un-WIP it [17:24:45] There should be a "Start Review" button [17:24:55] hauskater: I owe you one, TIL [17:24:57] (03PS19) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [17:25:22] no idea about WIP [17:25:24] thanks a lot! [17:25:33] elukey: heh, I'll make sure to send you my fee minute :P [17:26:06] hauskater: ahhahh sure! [17:26:18] i think the WIP feature might be WIP [17:26:47] (03CR) 10jerkins-bot: [V: 04-1] logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [17:29:06] (03PS1) 10Elukey: role::analytics_test_cluster_coordinator: enable Kerberos for Presto [puppet] - 10https://gerrit.wikimedia.org/r/551870 (https://phabricator.wikimedia.org/T237269) [17:29:35] (03PS20) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [17:30:12] (03PS2) 10Elukey: role::analytics_test_cluster_coordinator: enable Kerberos for Presto [puppet] - 10https://gerrit.wikimedia.org/r/551870 (https://phabricator.wikimedia.org/T237269) [17:31:46] (03CR) 10CDanis: CI - python3: first attempt at adding python3 CI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [17:32:35] 10Operations, 10netops: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (10fgiunchedi) [17:33:01] 10Operations, 10netops: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (10ayounsi) a:03ayounsi [17:33:04] (03PS1) 10CRusnov: nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 
(https://phabricator.wikimedia.org/T230725) [17:35:42] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster_coordinator: enable Kerberos for Presto [puppet] - 10https://gerrit.wikimedia.org/r/551870 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [17:35:59] (03PS2) 10CRusnov: nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 (https://phabricator.wikimedia.org/T230725) [17:39:09] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:34] gehel: ^ [17:41:45] (03PS2) 10Addshore: DNM: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) [17:42:14] (03CR) 10Addshore: [C: 03+1] "Just verified this script line while testing the deployment on mwmaint1002 while backporting the patch for this maint script." [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [17:43:52] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, maint script) [[gerrit:551857]] (duration: 00m 52s) [17:43:53] (03PS3) 10CRusnov: nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 (https://phabricator.wikimedia.org/T230725) [17:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:57] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [17:44:31] (03CR) 10Dzahn: [C: 03+1] nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [17:44:44] (03PS4) 10CRusnov: nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 
(https://phabricator.wikimedia.org/T230725) [17:48:42] (03PS1) 10ArielGlenn: add partman recipe that leaves /data on dump servers alone [puppet] - 10https://gerrit.wikimedia.org/r/551879 (https://phabricator.wikimedia.org/T224563) [17:49:03] (03CR) 10Volans: cables: detect duplicate cable names, and blank cable names (034 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:53:38] (03PS1) 10Elukey: role::analytics_test_cluster::coordinator: fix Presto kerberos config [puppet] - 10https://gerrit.wikimedia.org/r/551881 (https://phabricator.wikimedia.org/T237269) [17:54:15] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::coordinator: fix Presto kerberos config [puppet] - 10https://gerrit.wikimedia.org/r/551881 (https://phabricator.wikimedia.org/T237269) (owner: 10Elukey) [17:55:40] jouncebot next [17:55:40] In 0 hour(s) and 4 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1800) [17:56:27] (03PS1) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [17:56:31] (03CR) 10Ayounsi: cables: detect duplicate cable names, and blank cable names (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [17:58:05] (03PS2) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [17:59:27] (03PS2) 10Dzahn: site/webperf: remove superfluous lint-ignore comments [puppet] - 10https://gerrit.wikimedia.org/r/551693 [17:59:48] (03PS1) 10CRusnov: netbox report alerts: Notify dcops group on failures [puppet] - 10https://gerrit.wikimedia.org/r/551884 (https://phabricator.wikimedia.org/T230725) [17:59:59] (03CR) 10Dzahn: [C: 03+2] "only comments" [puppet] - 10https://gerrit.wikimedia.org/r/551693 (owner: 10Dzahn) [18:00:05] MaxSem, 
RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T1800). [18:00:05] RoanKattouw and hauskater: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] * Urbanecm waves [18:00:29] * addshore waves [18:00:40] * addshore has 1 patch currently merging still to backport pre swat [18:00:58] feel free to do anything that will be faster than that, it still has 5-10 mins on gate [18:01:19] I'm waiting for the elevator, I'll start the SWAT when I'm at my desk [18:01:21] (03CR) 10Dzahn: [C: 03+1] netbox report alerts: Notify dcops group on failures [puppet] - 10https://gerrit.wikimedia.org/r/551884 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [18:01:24] :D [18:01:33] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [18:02:14] (03PS2) 10Dzahn: site/centrallog: unify nodes into a single stanza [puppet] - 10https://gerrit.wikimedia.org/r/551694 [18:03:51] (03PS3) 10Catrope: Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) (owner: 10MarcoAurelio) [18:03:58] (03CR) 10Catrope: [C: 03+2] Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) (owner: 10MarcoAurelio) [18:04:22] Oh wait, that patch is hauskater's, not Urbanecm's [18:04:32] hauskater: Are you here for your SWAT (eowikisource default search namespace)? 
[18:04:37] RoanKattouw: sure [18:04:39] OK great [18:04:42] (03CR) 10CRusnov: [C: 03+2] netbox report alerts: Notify dcops group on failures [puppet] - 10https://gerrit.wikimedia.org/r/551884 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [18:04:46] (03CR) 10Catrope: [C: 03+2] Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) (owner: 10MarcoAurelio) [18:04:55] (03Merged) 10jenkins-bot: Configure default search namespaces for eo.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) (owner: 10MarcoAurelio) [18:05:02] damn ubuntu, the default ping sound is inaudible [18:05:44] (03CR) 10Dzahn: [C: 03+2] site/centrallog: unify nodes into a single stanza [puppet] - 10https://gerrit.wikimedia.org/r/551694 (owner: 10Dzahn) [18:06:32] hauskater: OK it's ready for testing on mwdebug1001 [18:06:35] ffff... commit message wrong, it's eo.wiktionary, not wikisource [18:06:40] code is okay though [18:06:41] checking [18:07:42] RoanKattouw: works as expected [18:08:07] 10Operations, 10netops: "unknown session id" from bird on centrallog hosts - https://phabricator.wikimedia.org/T238677 (10ayounsi) 05Open→03Resolved Clearing the BFD session on the router and restarting bird solved the issue. If it happen again please reopen and I'll investigate it more. 
[18:08:39] (03CR) 10MarcoAurelio: Configure default search namespaces for eo.wikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551793 (https://phabricator.wikimedia.org/T237792) (owner: 10MarcoAurelio) [18:09:02] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Configure default search namespaces for eowikisource (T237792) (duration: 00m 52s) [18:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:10] T237792: Editing "$wgNamespacesToBeSearchedDefault" at EO wiktionary - https://phabricator.wikimedia.org/T237792 [18:10:53] thanks RoanKattouw :) [18:10:59] CUSTOM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:11:44] Oh whoops sorry [18:11:52] I just slavishly repeated what the commit message said [18:12:07] !log That was eowiktionary, not eowikisource [18:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:26] (03PS14) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [18:14:37] (03PS3) 10Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - 10https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269) [18:14:52] CUSTOM - Check the Netbox report coherence for fail status. on netbox1001 is UNKNOWN: CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds. https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:15:12] RoanKattouw: hows swat going? :) [18:16:08] Waiting for Jenkins on my patch [18:16:15] My bad RoanKattouw - I'm not sure on what I was thinking when I wrote the commit message. Luckily I made the change in the right wiki. 
[18:16:30] Oh it's done, I'll do my patch now [18:16:41] RoanKattouw: I believe i just fetched them both [18:16:59] (mine too) [18:17:21] Yes you did [18:17:28] * addshore will let you continue with yours :) [18:17:30] I can do mine after! [18:17:34] (03CR) 10Jbond: CI - python3: first attempt at adding python3 CI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [18:17:51] I've pulled both to mwdebug1001 [18:17:53] So let's test in parallel [18:18:10] RoanKattouw: okay! :D [18:18:32] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [18:18:57] RoanKattouw: mine appears to be good [18:19:11] addshore: Go ahead and sync it, I'm still testing [18:19:12] PROBLEM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:19:13] ack! [18:19:40] syncing [18:20:28] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, hook) [[gerrit:551858]] (duration: 00m 53s) [18:20:30] =] done [18:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:33] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [18:21:27] (03PS3) 10Addshore: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) [18:21:33] (03PS4) 10Addshore: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) [18:21:40] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:46] (03PS1) 10DCausse: [wdqs] attempt to fix updated entity id logs [puppet] - 
10https://gerrit.wikimedia.org/r/551890 [18:22:09] not sure if any ops are around to merge a cron change for me, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551582/ [18:22:30] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:22:40] (03PS15) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [18:22:51] (03PS2) 10Dzahn: add xhgui1001 and xhgui2001 [dns] - 10https://gerrit.wikimedia.org/r/551691 (https://phabricator.wikimedia.org/T238098) [18:22:57] CUSTOM - Check the Netbox report coherence for fail status. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:24:58] mutante: do you have any time to look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551582/ ? :) Would be great to get it in today [18:25:11] If not I'll leave shortly and try to get it scheduled in tomorrow [18:25:41] icinga is temporarily broken, fixing [18:27:15] deploying parsoid shortly. [18:27:15] that must be the sound my phone just made [18:27:22] (03CR) 10Jbond: [C: 04-1] "im gonna -1 this i don't think its a good idea as `/usr/bin/env python` still references python2 on buster and lower" [puppet] - 10https://gerrit.wikimedia.org/r/550437 (owner: 10Jbond) [18:27:36] chaomodus: need a hand with anything? [18:27:50] just some puppet weirdness adding a contact group [18:28:15] ok, I'm around if you need another set of eyes [18:28:24] it's just a missing contact group, it's under control. either puppet adds it now or we just add it ourselves [18:29:05] sudo puppet agent -tv [18:29:11] * jbond42 also here if needed [18:29:24] same here [18:29:31] Still trying to test my change and mwdebug1001 keeps 500ing for some reason [18:29:58] (03PS1) 10CRusnov: icinga: fix contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551892 [18:30:09] RoanKattouw: oh? 
[18:30:12] (03CR) 10CRusnov: [C: 03+2] nagios_common: Add contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551871 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [18:30:29] And I'm trying to test a JS change... [18:30:41] !log icinga config - manually added team-dcops, started icinga [18:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] it's up [18:31:13] (03PS16) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [18:31:26] I wonder if this is more ATS breakage [18:32:31] !log ssastry@deploy1001 Started deploy [parsoid/deploy@6e7cffd]: Updating Parsoid to 1a1105a7 [18:32:32] RoanKattouw: is it a 500 page served by ATS? you should be able to tell [18:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:42] I think it was served by Varnish [18:32:46] What I meant was mwdebug routing issues [18:32:59] icinga should be fixed [18:33:00] hah well [18:33:14] It used to be that mwdebug routing didn't work in ulsfo and I had to manually point my DNS to eqiad to fix it [18:33:31] Now that broke, but commenting out my /etc/hosts rule fixes it, and ulsfo is routing me to mwdebug correctly but eqiad 500s when I try [18:33:59] Specifically, in /etc/hosts I had 208.80.154.224 dyna.wikimedia.org [18:34:02] Uncommenting that fixed it [18:34:34] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10eprodromou) [18:34:36] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@6e7cffd]: Updating Parsoid to 1a1105a7 (duration: 02m 04s) [18:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:24] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 
(10eprodromou) >>! In T236963#5662843, @jijiki wrote: > Currently we have 1.9.0 on releases.w.o and on the servers. Please upload the new... [18:35:43] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/VisualEditor/: Unbreak instrumentation of init events (duration: 00m 53s) [18:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:35] (03CR) 10Volans: [C: 03+1] "LGTM, some comment/caveat inline" (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [18:41:35] (03PS1) 10CRusnov: profile::icinga::ircbot: Add dcops log for dcops channel notifications [puppet] - 10https://gerrit.wikimedia.org/r/551898 [18:42:40] (03Abandoned) 10CRusnov: icinga: fix contactgroup for dcops [puppet] - 10https://gerrit.wikimedia.org/r/551892 (owner: 10CRusnov) [18:42:45] mutante, can we get a php-fpm restart on the canaries? [18:43:05] i aborted deploy and that probably doesn't restart php-fpm [18:43:44] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) I'll be sending the following into ESAMS remote hands: I'll be sending the following to Iron Mountain remote hands request via the portal: Iron Mountain, We are experiencing transi... [18:44:31] (03CR) 10jerkins-bot: [V: 04-1] profile::icinga::ircbot: Add dcops log for dcops channel notifications [puppet] - 10https://gerrit.wikimedia.org/r/551898 (owner: 10CRusnov) [18:44:58] _joe_, you around? [18:45:11] subbu: he's out today [18:45:20] I need a php-fpm restart on the canaries ... looks like an aborted deploy doesn't restart the services. 
[18:45:31] well, any ops that can do it :) [18:46:18] (03PS2) 10CRusnov: profile::icinga::ircbot: Add log for dcops channel notifications [puppet] - 10https://gerrit.wikimedia.org/r/551898 (https://phabricator.wikimedia.org/T230725) [18:46:33] php-fpm restart on wtp1025, wtp1026, wtp2001, wtp2002 [18:46:49] i tried myself and i don't have sudo perms to do that. [18:46:57] (03PS3) 10Catrope: [Beta] Flow: Use Parsoid/PHP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: 10Mobrovac) [18:48:19] akosiaris, ^^ [18:48:21] I do have permissions, and I think I see how, but I'm still learning my way around and don't have sufficient confidence to reach in and try that without backup, sorry :) if anyone else knowledgeable is lurking I'm happy to help [18:48:27] (03PS17) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 [18:48:57] "sudo service php-fpm restart" on wtp1025 and wtp1026 for now is sufficient. [18:49:10] (03PS4) 10Catrope: [Beta] Use Parsoid/PHP for Flow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549875 (https://phabricator.wikimedia.org/T229078) (owner: 10Mobrovac) [18:50:25] (03PS1) 10Elukey: Update TLS crt for yarn.w.o after regeneration [puppet] - 10https://gerrit.wikimedia.org/r/551899 [18:50:31] (03CR) 10Muehlenhoff: "The Elasticsearch plugins are packaged as a deb since a while now, it's fairly elegant and we could adapt that scheme, see https://phabric" [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [18:51:25] (03CR) 10Elukey: [C: 03+2] Update TLS crt for yarn.w.o after regeneration [puppet] - 10https://gerrit.wikimedia.org/r/551899 (owner: 10Elukey) [18:52:06] subbu: I can help [18:52:15] thanks. 
:) [18:52:15] \o/ [18:52:21] (03CR) 10Dzahn: [C: 03+1] "DNS:yarn.wikimedia.org, DNS:hue.wikimedia.org, DNS:superset.wikimedia.org, DNS:pivot.wikimedia.org, DNS:turnilo.wikimedia.org, DNS:stats.w" [puppet] - 10https://gerrit.wikimedia.org/r/551899 (owner: 10Elukey) [18:52:34] akosiaris: I'm happy to handle it, I just wanted to make sure you were around in case I set everything on fire :D [18:52:46] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) @bblack: Do we care when the work is done, other than in EU or US business hours? [18:52:57] rlazarus: cool. You can't really, it's just test wikis for now that have parsoid-php [18:53:08] so feel free to restart php-fpm on those hosts [18:53:19] ack, going ahead [18:53:28] subbu: aborted deploy? Ctrl-c ? [18:53:28] but, we found a bug in the deployment recipe today :) [18:53:33] no. rollback. [18:53:37] ouch [18:53:43] yeah, let's fix that [18:53:49] on rollback, php-fpm doesn't restart it appears. [18:53:52] (03PS5) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 [puppet] - 10https://gerrit.wikimedia.org/r/549222 [18:54:22] subbu: ok, wanna file a task and tag serviceops and release engineering. This is scap related I guess [18:54:36] will do. it may also be a marko thing, but yes, will do. [18:54:48] a marko thing? [18:54:57] Failed to restart php-fpm.service: Unit php-fpm.service not found. [18:54:58] (03CR) 10Cwhite: puppetmaster,icinga: naggen2 cleanup and update to python3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [18:54:59] typo check? [18:55:14] php7.2-fpm.service [18:55:14] I did also find `sudo -i /usr/local/sbin/restart-php7.2-fpm` on https://wikitech.wikimedia.org/wiki/Application_servers/Runbook, should I be running that instead? [18:55:20] ack [18:55:50] akosiaris, marko set up the checks.yaml file for deployment which does the restarts, etc. 
[18:56:04] i wonder if there is something we need for the rollback path and in that sense was wondering if it was a marko thing. [18:56:27] or maybe joe / daniel set it up :) anyway, i'll file the task and the relevant parties can figure it out. [18:56:48] !log restarted php7.2-fpm on wtp1025, wtp1026 [18:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:53] 10Operations, 10serviceops, 10User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (10akosiaris) Should we resolve this? [18:56:59] subbu: done, thanks for your patience :) [18:57:15] subbu: ok, thanks [18:57:23] rlazarus, ty. [18:57:54] subbu: oh, did you want 2001 and 2002 also? [18:58:15] yes please. they don't get any php reqs. but good to be consistent. [18:58:58] done and done [18:59:05] !log restarted php7.2-fpm on wtp2001, wtp2002 [18:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:10] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:00:07] (03CR) 10CDanis: [C: 03+1] "thanks for doing this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:00:30] !log regenerate TLS cert for yarn.wikimedia.org (containing SANs for all analytics UIs) to add datasets.w.o SAN (site was failing due to ATS not being able to contact thorium) [19:00:34] (03CR) 10Dzahn: [C: 03+1] profile::icinga::ircbot: Add log for dcops channel notifications [puppet] - 10https://gerrit.wikimedia.org/r/551898 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [19:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:40] (03PS18) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [19:01:07] (03CR) 10CRusnov: [C: 03+2] profile::icinga::ircbot: Add log for dcops channel notifications [puppet] - 10https://gerrit.wikimedia.org/r/551898 (https://phabricator.wikimedia.org/T230725) (owner: 10CRusnov) [19:01:18] cdanis: shdubsh: thanks for the review i just did a minor update could i get you to +1 again [19:01:37] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286) [19:01:40] (03PS19) 10Jbond: CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 [19:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:42] T229286: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 [19:01:59] cscott, sbailey "!$env->$pageWithOldid" [19:02:08] oops. 
wrong channel :-) [19:02:26] (03CR) 10CDanis: [C: 03+1] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:02:26] jbond42: hah, cute hack, +1 [19:02:46] :) thanks [19:03:11] (03CR) 10Dzahn: [C: 03+2] add xhgui1001 and xhgui2001 [dns] - 10https://gerrit.wikimedia.org/r/551691 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:05:22] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn test https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:06:52] (03CR) 10Faidon Liambotis: [C: 03+2] netbox: Set URL in report alert to the URL of the report [puppet] - 10https://gerrit.wikimedia.org/r/549959 (owner: 10CRusnov) [19:07:41] (03CR) 10Jbond: puppetmaster,icinga: naggen2 cleanup and update to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549222 (owner: 10Cwhite) [19:08:27] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@6e6bd42]: Prevent expensive content transforms from blocking the event loop (T229286) (duration: 06m 49s) [19:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:32] T229286: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration - https://phabricator.wikimedia.org/T229286 [19:09:20] (03PS3) 10CRusnov: netbox: Set URL in report alert to the URL of the report [puppet] - 10https://gerrit.wikimedia.org/r/549959 [19:12:33] (03CR) 10jerkins-bot: [V: 04-1] netbox: Set URL in report alert to the URL of the report [puppet] - 10https://gerrit.wikimedia.org/r/549959 (owner: 10CRusnov) [19:12:59] (03CR) 10Cwhite: [C: 03+1] CI - python3: Add support for python3 syntax checks [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [19:14:16] (03PS4) 10CRusnov: netbox: Set URL in report alert to the URL of the report [puppet] - 
10https://gerrit.wikimedia.org/r/549959 [19:14:22] (03PS2) 10Dzahn: site/xhgui: add xhgui eqiad and codfw nodes as spares [puppet] - 10https://gerrit.wikimedia.org/r/551695 (https://phabricator.wikimedia.org/T238098) [19:14:57] (03CR) 10Dzahn: [C: 03+2] site/xhgui: add xhgui eqiad and codfw nodes as spares [puppet] - 10https://gerrit.wikimedia.org/r/551695 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [19:16:12] (03PS21) 10CRusnov: backends: add Netbox backend [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) [19:18:34] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:40] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [19:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:53] (03CR) 10CRusnov: [C: 03+2] netbox: Set URL in report alert to the URL of the report [puppet] - 10https://gerrit.wikimedia.org/r/549959 (owner: 10CRusnov) [19:20:20] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [19:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:17] (03CR) 10Jbond: logstash: introduce logstash 7 and openjdk-11 support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [19:26:47] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [19:33:18] (03PS1) 10RLazarus: interface: Add a config option to avoid CPU 0 in RPS. [puppet] - 10https://gerrit.wikimedia.org/r/551907 (https://phabricator.wikimedia.org/T236208) [19:35:58] (03CR) 10RLazarus: "Please enjoy this drive-by patch! 
Chris pointed me at this Phab ticket when I was looking for a smallish chunk of code to contribute somew" [puppet] - 10https://gerrit.wikimedia.org/r/551907 (https://phabricator.wikimedia.org/T236208) (owner: 10RLazarus) [19:42:59] (03PS1) 10Dzahn: phabricator: stop aphlict service if disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 [19:46:22] (03CR) 10jerkins-bot: [V: 04-1] phabricator: stop aphlict service if disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 (owner: 10Dzahn) [19:49:35] (03PS6) 10Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - 10https://gerrit.wikimedia.org/r/551273 [19:51:14] (03CR) 10Ayounsi: "Thanks!" (034 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [19:51:23] (03PS2) 10Dzahn: phabricator: stop aphlict service if disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 [19:55:46] (03PS7) 10Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - 10https://gerrit.wikimedia.org/r/551273 [19:56:39] (03CR) 10Ayounsi: [C: 03+2] Automatically cast network strings to ipaddress objects (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [19:57:43] (03PS3) 10Dzahn: phabricator: stop aphlict service if disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 [19:59:58] (03Merged) 10jenkins-bot: Automatically cast network strings to ipaddress objects [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [20:00:49] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/19487/phab1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/551910 (owner: 10Dzahn) [20:02:04] (03PS4) 10Dzahn: phabricator: stop aphlict service if disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 (https://phabricator.wikimedia.org/T238593) [20:03:32] (03CR) 10Dzahn: [C: 03+2] phabricator: stop aphlict service if 
disabled in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/551910 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:05:48] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19487/" [puppet] - 10https://gerrit.wikimedia.org/r/551910 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:07:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [20:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:51] !log phab1003 after merging gerrit:551910 puppet now also stopped the actual aphlict service and removed the systemd unit file. had to manually run 'systemctl reset-failed' though to clean systemd status and avoid icinga alert (T238593) [20:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:56] T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 [20:10:15] !log homer push on mgmt routers [20:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:24] (03CR) 10BBlack: [C: 03+1] "Awesome, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/551907 (https://phabricator.wikimedia.org/T236208) (owner: 10RLazarus) [20:13:46] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] mr: add DHCP server support + replace all system {} [homer/public] - 10https://gerrit.wikimedia.org/r/551274 (owner: 10Ayounsi) [20:14:14] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [20:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:46] !log completed reloading data from wdqs1007 to wdqs1004 - after failed test of merging updater - T212826 [20:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:51] T212826: Create dedicated Updater service in Blazegraph - https://phabricator.wikimedia.org/T212826 [20:18:06] (03PS1) 10Dzahn: phabricator: if aphlict is disabled also unload wstunnel httpd module [puppet] - 10https://gerrit.wikimedia.org/r/551915 (https://phabricator.wikimedia.org/T238593) [20:22:15] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) Ok, synced up with @bblack via irc and he doesn't have a preference for time. My above directions have been submitted for remote hands via the portal, case RITM0115394. [20:24:40] (03PS4) 10Ottomata: eventgate-analytics - stream config for new sparql-query streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/548764 (https://phabricator.wikimedia.org/T101013) [20:25:42] (03PS1) 10Dzahn: planet: clarify feeds are to be Wikimedia related [puppet] - 10https://gerrit.wikimedia.org/r/551916 [20:30:43] 10Operations, 10Traffic, 10Patch-For-Review: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BBlack) So the patch above adds it to the queue distribution logic in interface-rps, but there's another piece of the puzzle here, which is setting the hardware's queu... 
[20:33:55] 10Operations, 10Traffic, 10Patch-For-Review: interface-rps.py should have a flag to avoid CPU0 - https://phabricator.wikimedia.org/T236208 (10BBlack) Adding @RLazarus in hopes of nerd-sniping him further on this topic... [20:35:10] (03CR) 10Ayounsi: Add PIM stanza for CR devices (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/549689 (owner: 10Ayounsi) [20:44:11] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) SCTASK0128980 is new case number, confirmed and opened. (I suppose one is the request, and now we have a confirmed remote hands case?) [20:46:50] (03CR) 10RLazarus: [C: 03+2] interface: Add a config option to avoid CPU 0 in RPS. [puppet] - 10https://gerrit.wikimedia.org/r/551907 (https://phabricator.wikimedia.org/T236208) (owner: 10RLazarus) [20:55:58] (03PS1) 10Dzahn: planet: upgrade a bunch of en blogs from http to https [puppet] - 10https://gerrit.wikimedia.org/r/551919 [20:58:40] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/19488/" [puppet] - 10https://gerrit.wikimedia.org/r/551915 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:59:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [20:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:21] (03CR) 10Dzahn: [C: 03+2] planet: clarify feeds are to be Wikimedia related [puppet] - 10https://gerrit.wikimedia.org/r/551916 (owner: 10Dzahn) [21:04:30] (03PS1) 10CRusnov: netbox: fix alert notes_url [puppet] - 10https://gerrit.wikimedia.org/r/551922 [21:04:41] (03CR) 10Dzahn: [C: 03+2] planet: upgrade a bunch of en blogs from http to https [puppet] - 10https://gerrit.wikimedia.org/r/551919 (owner: 10Dzahn) [21:04:49] (03PS2) 10Dzahn: planet: upgrade a bunch of en blogs from http to https [puppet] - 10https://gerrit.wikimedia.org/r/551919 [21:07:57] (03CR) 10Herron: "> The Elasticsearch plugins are 
packaged as a deb since a while now," [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [21:08:23] (03CR) 10CRusnov: [C: 03+2] netbox: fix alert notes_url [puppet] - 10https://gerrit.wikimedia.org/r/551922 (owner: 10CRusnov) [21:12:28] 10Operations, 10netops, 10observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10herron) Friendly ping to @Volans about @fgiunchedi question above [21:14:40] !log rebooting pfw3-codfw for upgrade - T235150 [21:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:23:00] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10Dzahn) The Hiera key now does all the things, also stopped the service and unloaded the httpd module wstunnel. After that Phabr... [21:30:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:31:21] (03CR) 10Muehlenhoff: "That plan sounds good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) (owner: 10Herron) [21:33:04] (03PS1) 10Dzahn: add xhgui1001/2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/551932 (https://phabricator.wikimedia.org/T238098) [21:35:06] (03CR) 10Dzahn: [C: 03+2] add xhgui1001/2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/551932 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [21:42:46] (03CR) 10RLazarus: "Volans: Not sure why reviewer-bot pinged on this patch but it doesn't need review right now." [puppet] - 10https://gerrit.wikimedia.org/r/549668 (owner: 10RLazarus) [21:43:14] (03Abandoned) 10RLazarus: [poolcounter] (WIP) Add a poolcounter prometheus exporter. [puppet] - 10https://gerrit.wikimedia.org/r/549668 (owner: 10RLazarus) [21:45:11] !log rebooting pfw3-codfw:node1 for upgrade - T235150 [21:45:14] rlazarus: the bot added him because on https://www.mediawiki.org/wiki/Git/Reviewers he is subscribed to anything ".py" [21:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:26] (\.py(\.erb)?$|cumin|failoid|debmonitor|spicerack|cookbook|netbox)/) [21:46:02] you can also add yourself to stuff on that special wiki page if you like [21:46:27] huh, even though the patch was marked WIP? that's a surprise [21:46:31] should update my own stuff (bugzilla, lol) [21:46:50] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10eprodromou) So, I don't know if there's anyone on our team except Tim who has access to the releases server. @tstarling would you mind... [21:46:54] rlazarus: i think the WIP feature itself is a bit WIP [21:47:03] also why suddenly now? 
it'd been sitting idle for a week [21:47:06] haha fair [21:47:25] paladox: ^ reviewer-bot adds reviewers to things that are not "ready for review" but WIP [21:47:26] it'll be because the bot didn't check that the change had WIP [21:47:37] yea. that bot was made before gerrit had WIP [21:47:40] yup [21:47:56] why it happened now after sitting there for a week is interesting [21:48:07] maybe something was fixed where the bot runs [21:48:18] rlazarus: no prob :) [21:48:28] Yeh the bot was fix [21:48:31] see -releng [21:48:34] *fixed [21:48:39] ahh mystery solved [21:48:41] thanks paladox :) [21:48:45] i like it when things are explainable :p [21:48:47] your welcome :) [21:48:52] https://phabricator.wikimedia.org/T222983 [21:49:57] 10Operations, 10netops, 10observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10Volans) @herron @fgiunchedi I don't think that much, I guess you have to do the triggering part, I'm not super clear what you have in mind, a script to... [21:52:49] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10Dzahn) There are different releasers group for different software. The people who can upload releases of wikidiff2 are: members: [dem... [21:54:23] any objection if I send a PR to gerrit-reviewer-bot disabling it for changes marked WIP? does anybody prefer the current behavior? [21:55:26] (completely scientific survey) [21:55:29] rlazarus: i think if it's not marked "ready for review" it should not get reviewers. other behaviour is kind of a bug. 
that difference did not exist when the bot was made [21:55:39] yeah that's my impression too [21:58:00] I like to be aware of WIP things, sometimes I get pinged to do an early review of a WIP CR, but YMMV [21:58:21] it could totally be left to the committer without automation [21:58:27] I'm not against, just to be clear :) [21:58:57] yeah, there are definitely cases where I'd CC you for an early look (as I did cdanis in that case, actually) but I still wouldn't expect you to be added automatically [21:59:41] I agree that Reviewer-bot adding people automatically to a WIP patch seems like a bug [22:02:50] filed https://github.com/valhallasw/gerrit-reviewer-bot/issues/21 [22:03:36] (github is the canonical repo for this project afaict, which is surprising? let me know if there's a gerrit repo I missed) [22:04:45] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:10] (03PS1) 10CRusnov: netbox: Make the report url in report alerts the canonical url [puppet] - 10https://gerrit.wikimedia.org/r/551935 [22:05:58] ^^ wdqs1010 is doing a data import taking longer than expected, extending downtime [22:13:34] (03PS1) 10Dzahn: create analytics-web.discovery.wmnet, point to thorium [dns] - 10https://gerrit.wikimedia.org/r/551938 [22:16:18] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10Jclark-ctr) Replaced Failed Drive [22:18:35] (03PS1) 10Dzahn: ATS/varnish: rename thorium director to analytic-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 [22:19:42] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10Jclark-ctr) Replaced Failed drive [22:22:01] the meaning of WIP for Gerrit in this context is "when it's not ready for review yet", hence the "ready for review" button. 
so pinging people to review while it's WIP would be unexpected. but there is another kind of "WIP" when people just use that term in the commit message but not the actual Gerrit feature [22:23:18] rlazarus: it's correct that it's on github (though maybe unfortunate) but yea.. https://phabricator.wikimedia.org/T153719#2888548 et al [22:24:09] nod [22:24:50] actually the Gerrit term is "draft" [22:25:07] 10Operations, 10Commons, 10Multimedia, 10SRE-swift-storage: File not found - https://phabricator.wikimedia.org/T238695 (10Josve05a) [22:25:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10JHedden) 05Open→03Resolved @Jclark-ctr replaced this today with a new 1.9TB drive. No host errors were seen and the megaraid card looks clean. [22:25:53] I think a draft is something different [22:27:15] (03PS5) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [22:28:12] rlazarus: you are right. but only slightly [22:28:18] "Work in Progress (WIP) is very similar to Draft status. In fact, both result in a change status of “Draft”. The difference is that WIP can be used to toggle a change in a non-draft state back to “Draft” state. Then the Ready button can be used to change it back to a non-draft state." [22:28:51] "The original “Draft” state is a one-shot change: it must start out as “Draft” and can then be switched to a non-draft status. “WIP” allows toggling back to Draft afterwards." [22:29:36] ah, so in either case draft state is the thing we're talking about [22:29:37] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10JHedden) 05Open→03Resolved @Jclark-ctr replaced the drive in slot 4 with a new 1.9TB drive today. I've confirmed that the system and RAID set are both healthy. 
[22:29:38] fair enough [22:29:58] and jeez that's a lot more tangled than I expected [22:30:17] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [22:30:41] git review -D = upload a change as draft (i haven't actually been using it though) [22:30:45] WIP was originally known as "Draft". In gerrit 2.14 the feature was split into Work In Progress and Private changes (we disabled the later one with a config). (Drafts were basically hidden). [22:35:22] (03CR) 10Andrew Bogott: [C: 03+2] Add some standard maintenance access to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/551652 (https://phabricator.wikimedia.org/T238514) (owner: 10Andrew Bogott) [22:35:35] (03PS4) 10Dzahn: phabricator: enable phd service on phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) [22:35:59] (03PS1) 10Andrew Bogott: cloudservices: move from pdns3 to pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) [22:41:26] (03PS2) 10Andrew Bogott: cloudservices: move from pdns3 to pdns4 [puppet] - 10https://gerrit.wikimedia.org/r/551942 (https://phabricator.wikimedia.org/T210715) [22:42:47] (03CR) 10Paladox: [C: 03+1] phabricator: enable phd service on phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [22:46:01] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/551794 (https://phabricator.wikimedia.org/T238306) (owner: 10Elukey) [22:51:03] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (10EBernhardson) shard allocation looks pretty happy since we pooled the new servers and depooled the old ones. I'd be willing to call this complete for now, re-open if it'... 
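Editor's sketch for the reviewer-bot thread above (21:45 to 22:30): the behaviour people agreed on is "match the subscriber's file pattern, but skip changes that are still WIP". The pattern below is the one quoted in the channel at 21:45:26, minus the trailing `/)` left over from the wiki template. Everything else (the function name `should_add_reviewer` and the `work_in_progress` boolean) is a hypothetical illustration, not the bot's actual code.

```python
import re

# File pattern quoted in the channel at 21:45:26 for one subscriber.
FILE_PATTERN = re.compile(
    r"(\.py(\.erb)?$|cumin|failoid|debmonitor|spicerack|cookbook|netbox)"
)


def should_add_reviewer(changed_files, work_in_progress, pattern=FILE_PATTERN):
    """Decide whether a subscriber should be added to a change.

    changed_files: iterable of file paths touched by the change.
    work_in_progress: hypothetical flag the caller derives from the
        change's state in Gerrit (True while it is not ready for review).
    """
    # Proposed behaviour: never add reviewers automatically to WIP changes.
    if work_in_progress:
        return False
    # Otherwise add the subscriber if any touched path matches the pattern.
    return any(pattern.search(path) for path in changed_files)


if __name__ == "__main__":
    files = ["modules/poolcounter/files/exporter.py"]
    print(should_add_reviewer(files, work_in_progress=True))
    print(should_add_reviewer(files, work_in_progress=False))
```

Note that this only covers changes flagged as WIP in Gerrit itself; a change that merely says "(WIP)" in its commit message, the other case raised at 22:22:01, would need a separate check.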
[22:52:31] (03CR) 10Dzahn: [C: 03+1] "https://secure.phabricator.com/book/phabricator/article/cluster_daemons/" [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [22:52:58] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) I marked both servers as active. I'm not sure I know where to find the arrays in netbox, can you direct me? [22:53:36] (03CR) 10Dzahn: [C: 03+2] phabricator: enable phd service on phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [22:55:41] RECOVERY - Check systemd state on phab2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:13] icinga-wm: not really :p [22:57:21] (03PS1) 10Dzahn: Revert "phabricator: enable phd service on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/551943 [22:58:53] (03PS2) 10Dzahn: Revert "phabricator: enable phd service on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/551943 [22:59:19] (03PS3) 10Dzahn: Revert "phabricator: enable phd service on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/551943 (https://phabricator.wikimedia.org/T232883) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191119T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:03:26] I'll do my own SWAT [23:04:39] (03CR) 10Dzahn: [C: 03+2] Revert "phabricator: enable phd service on phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/551943 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [23:25:53] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) https://netbox.wikimedia.org/dcim/devices/2147/ https://netbox.wikimedia.org/dcim/devices/2148/ [23:30:03] (03CR) 10Dzahn: "this would be after Change-Id: I692c3103ba3f9d9" [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [23:31:43] (03PS2) 10Dzahn: ATS/varnish: rename thorium director to analytic-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 [23:31:48] (03PS2) 10Dzahn: create analytics-web.discovery.wmnet, point to thorium [dns] - 10https://gerrit.wikimedia.org/r/551938 [23:33:16] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Amanda Bittaker - https://phabricator.wikimedia.org/T238705 (10Abit) [23:33:53] (03PS3) 10Dzahn: ATS/varnish: rename thorium director to analytics-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 [23:36:02] (03CR) 10Dzahn: [C: 04-1] "parameter 'group' expects a String value, got Undef" [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [23:38:17] (03PS3) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [23:41:01] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [23:42:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [23:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:30] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/WikimediaEvents/: EditAttemptStep: Allow other extensions to trigger 
oversampling (T238249) (duration: 00m 53s) [23:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:55:35] T238249: Propagate editing_session_id and oversampling flag from newcomer homepage to EditAttemptStep - https://phabricator.wikimedia.org/T238249 [23:58:03] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/MobileFrontend/: EditAttemptStep: Allow overriding session ID (T238249) (duration: 00m 53s) [23:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log