[00:02:39] RECOVERY - Varnish HTCP daemon on cp1075 is OK: PROCS OK: 1 process with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish
[00:02:44] !log cp1075 - systemctl status vhtcpd
[00:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:59] !log cp1075 - systemctl restart vhtcpd
[00:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:08] (CR) Dzahn: "could you add a comment why we wanted to do this again :)" [puppet] - https://gerrit.wikimedia.org/r/536359 (owner: Paladox)
[00:19:50] Operations, serviceops: Update component/php72 to 7.2.22 - https://phabricator.wikimedia.org/T230024 (Dzahn) @jijiki Yes, i think they should. upstream just says "newer than 7.1" and @Paladox also confirmed they run a newer version for Phab on Miraheze. cc: @20after4
[00:24:15] Operations, LDAP: Create an LDAP replica in codfw (using LVS) - https://phabricator.wikimedia.org/T227778 (Dzahn) gerrit has also switched over today
[00:24:45] Operations, LDAP: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (Dzahn) Gerrit has been switched to use readonly replicas today.
[00:36:38] (PS3) Krinkle: Drop InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/538341 (owner: Jforrester)
[00:38:55] (CR) Krinkle: [C: +2] "Result of git-grep before this commit:" [mediawiki-config] - https://gerrit.wikimedia.org/r/538341 (owner: Jforrester)
[00:39:25] * Krinkle staging on mwdebug1002
[00:40:13] (Merged) jenkins-bot: Drop InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/538341 (owner: Jforrester)
[00:40:28] (CR) jenkins-bot: Drop InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/538341 (owner: Jforrester)
[00:41:09] (CR) Dzahn: [C: +1] "works (https://puppet-compiler.wmflabs.org/compiler1001/18548/) it also defaults to false but nothing wrong with making it explicit." [puppet] - https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) (owner: Jcrespo)
[00:43:25] !log krinkle@deploy1001 Synchronized tests/: 6dca83a9f6c2c (duration: 01m 05s)
[00:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:42] !log krinkle@deploy1001 Synchronized docroot/noc/: 6dca83a9f6c2c (duration: 01m 05s)
[00:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:33] !log krinkle@deploy1001 Synchronized wmf-config/: 6dca83a9f6c2c (duration: 01m 04s)
[00:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:08] (PS8) Krinkle: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (owner: Jforrester)
[00:48:28] (PS9) Krinkle: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:55:43] (CR) Krinkle: "* Checked there are no more references to VariantSettings." [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:55:48] (CR) Krinkle: [C: +2] Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:56:41] (Merged) jenkins-bot: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:56:56] (CR) jenkins-bot: Move VariantSettings back to InitialiseSettings now that the migration is done [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:57:38] * Krinkle staging on mwdebug1002
[01:00:38] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3373247e123b53 - create new file (duration: 01m 05s)
[01:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:47] !log krinkle@deploy1001 Synchronized README: 3373247e123b53 (duration: 01m 04s)
[01:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:27] (PS1) Krinkle: noc: Refresh conf symlinks following 3373247e123b538 [mediawiki-config] - https://gerrit.wikimedia.org/r/539007 (https://phabricator.wikimedia.org/T223602)
[01:04:36] (CR) Krinkle: [C: +2] noc: Refresh conf symlinks following 3373247e123b538 [mediawiki-config] - https://gerrit.wikimedia.org/r/539007 (https://phabricator.wikimedia.org/T223602) (owner: Krinkle)
[01:05:33] (Merged) jenkins-bot: noc: Refresh conf symlinks following 3373247e123b538 [mediawiki-config] - https://gerrit.wikimedia.org/r/539007 (https://phabricator.wikimedia.org/T223602) (owner: Krinkle)
[01:06:46] (CR) jenkins-bot: noc: Refresh conf symlinks following 3373247e123b538 [mediawiki-config] - https://gerrit.wikimedia.org/r/539007 (https://phabricator.wikimedia.org/T223602) (owner: Krinkle)
[01:07:41] !log krinkle@deploy1001 Synchronized docroot/noc: 3373247e123b53 and 1efc8bd68107877311a749 (duration: 01m 05s)
[01:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:55] !log krinkle@deploy1001 Synchronized tests: 3373247e123b5 (duration: 01m 04s)
[01:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:14] !log krinkle@deploy1001 Synchronized src/WmfClusters.php: 3373247e123b (duration: 01m 04s)
[01:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:47] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:14:18] (CR) Krinkle: [C: +2] "Sync commands for historical record:" [mediawiki-config] - https://gerrit.wikimedia.org/r/537220 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[01:14:45] !log krinkle@deploy1001 Synchronized wmf-config/: 3373247e12 (duration: 01m 04s)
[01:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:47] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[01:14:59] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:14:59] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:15:19] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:15:29] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:15:35] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:15:47] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:01] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:16:02] Well, fuck, that was unfortunate. Should have synced CommonSettings.php before the rest of the directory. Looks like rsync ended up deleting VariantSettings before updating CommonSettings
[01:16:07] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[01:16:25] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[01:16:37] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:16:39] !log stat1007 - restart nagios-nrpe-server, echo "please don't use all of the RAM on this server" | wall
[01:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:57] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[01:17:07] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[01:17:11] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[01:17:23] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:39] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:17:45] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[01:20:35] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:21:06] Operations, Traffic, Performance-Team (Radar): Enable mwdebug routes for noc.wikimedia.org - https://phabricator.wikimedia.org/T233768 (Krinkle)
[01:22:15] Operations, Traffic, Performance-Team (Radar): Enable mwdebug routes for noc.wikimedia.org - https://phabricator.wikimedia.org/T233768 (Krinkle)
[01:24:25] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:52:11] Operations, Traffic: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (Dzahn) ^ Fixed by @papaul. Confirmed it works now to change password via IPMI from remote.
[02:16:56] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn)
[02:23:51] (PS1) Krinkle: Bug: T233458 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233458)
[02:24:20] (PS2) Krinkle: build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771)
[02:25:59] (CR) Krinkle: [C: +2] build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[02:26:49] (PS3) Krinkle: build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771)
[02:26:52] (CR) Krinkle: [C: +2] build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[02:27:48] (Merged) jenkins-bot: build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[02:28:03] (CR) jenkins-bot: build: Upgrade from PHPUnit 6 to PHPUnit 8 [mediawiki-config] - https://gerrit.wikimedia.org/r/539008 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[02:28:52] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn) @Papaul confirmed this looks like it needs onsite to drain the power. I asked @Vgutierrez about depooling this. Could i up the priority a bit due to the relation to T147074?
[02:29:31] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn) p:Normal→High
[02:30:14] !log pool wdqs1006 - it has caught up with lag
[02:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:58] !log depool wdqs1005 to let it catch up with lag
[02:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:23] Asking because of an OTRS ticket: Are there any known issues that may have caused timeouts in Argentina?
[02:39:28] (at around 00:00 UTC)
[02:50:45] AntiComposite: Hi - Not that I can see for 30min before or after 00:00 UTC
[02:51:09] Thanks
[02:51:21] (PS1) Krinkle: noc: Move WmfCluster.php to separate sub-dir. [mediawiki-config] - https://gerrit.wikimedia.org/r/539009
[02:52:01] (PS2) Krinkle: noc: Move WmfCluster.php to separate sub-dir. [mediawiki-config] - https://gerrit.wikimedia.org/r/539009
[02:56:43] (PS1) Krinkle: noc: Make db.php easier to test with a mock fixture in-repo [mediawiki-config] - https://gerrit.wikimedia.org/r/539010
[02:56:49] (PS2) Krinkle: noc: Make db.php easier to test with a mock fixture in-repo [mediawiki-config] - https://gerrit.wikimedia.org/r/539010
[02:57:32] (CR) Krinkle: [C: +2] noc: Move WmfCluster.php to separate sub-dir. [mediawiki-config] - https://gerrit.wikimedia.org/r/539009 (owner: Krinkle)
[02:57:34] (CR) Krinkle: [C: +2] noc: Make db.php easier to test with a mock fixture in-repo [mediawiki-config] - https://gerrit.wikimedia.org/r/539010 (owner: Krinkle)
[02:58:14] !log belatedly promoting wmf.24 to group0 refs T220749
[02:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:18] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749
[02:58:25] (Merged) jenkins-bot: noc: Move WmfCluster.php to separate sub-dir. [mediawiki-config] - https://gerrit.wikimedia.org/r/539009 (owner: Krinkle)
[02:58:28] (Merged) jenkins-bot: noc: Make db.php easier to test with a mock fixture in-repo [mediawiki-config] - https://gerrit.wikimedia.org/r/539010 (owner: Krinkle)
[02:58:41] (CR) jenkins-bot: noc: Move WmfCluster.php to separate sub-dir. [mediawiki-config] - https://gerrit.wikimedia.org/r/539009 (owner: Krinkle)
[02:58:45] Krinkle: let me know when you're done deploying?
[02:59:35] twentyafterfour: k, few minutes, almost done.
[02:59:42] no rush
[03:00:43] (CR) jenkins-bot: noc: Make db.php easier to test with a mock fixture in-repo [mediawiki-config] - https://gerrit.wikimedia.org/r/539010 (owner: Krinkle)
[03:01:22] !log krinkle@deploy1001 Synchronized src/: c7c6c0ee0, 8405bf1c2 (for noc.wm.o) (duration: 01m 09s)
[03:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:27] !log krinkle@deploy1001 Synchronized docroot/noc/: c7c6c0ee0, 8405bf1c2 (duration: 01m 05s)
[03:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:12] (PS1) Krinkle: Enforce PHPUnit global leak strict and fix StaticSettingsTest leak [mediawiki-config] - https://gerrit.wikimedia.org/r/539011 (https://phabricator.wikimedia.org/T233771)
[03:06:24] ^ is the last one
[03:07:00] (CR) Krinkle: [C: +2] Enforce PHPUnit global leak strict and fix StaticSettingsTest leak [mediawiki-config] - https://gerrit.wikimedia.org/r/539011 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[03:07:48] (Merged) jenkins-bot: Enforce PHPUnit global leak strict and fix StaticSettingsTest leak [mediawiki-config] - https://gerrit.wikimedia.org/r/539011 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[03:08:04] (CR) jenkins-bot: Enforce PHPUnit global leak strict and fix StaticSettingsTest leak [mediawiki-config] - https://gerrit.wikimedia.org/r/539011 (https://phabricator.wikimedia.org/T233771) (owner: Krinkle)
[03:08:04] twentyafterfour: all yours
[03:23:18] (Abandoned) CRusnov: profile::authdns: Add automation framework [puppet] - https://gerrit.wikimedia.org/r/537576 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[03:24:45] (PS1) 20after4: group0 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - https://gerrit.wikimedia.org/r/539012
[03:24:47] (CR) 20after4: [C: +2] group0 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - https://gerrit.wikimedia.org/r/539012 (owner: 20after4)
[03:25:54] (Merged) jenkins-bot: group0 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - https://gerrit.wikimedia.org/r/539012 (owner: 20after4)
[03:27:19] (CR) jenkins-bot: group0 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - https://gerrit.wikimedia.org/r/539012 (owner: 20after4)
[03:28:16] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.24 refs T220749
[03:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:20] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749
[04:02:37] Operations, Traffic, HTTPS, Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (Shizhao)
[04:10:54] (PS1) CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183)
[04:43:44] !log Deploy schema change on s3 with replication - T231172
[04:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:48] T231172: Alter gbw_reason/gb_reason/gbw_by_text on WMF production - https://phabricator.wikimedia.org/T231172
[05:06:42] !log Run a data check on labsdb1011 - T233766
[05:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:46] T233766: labsdb1011 mariadb crashed - https://phabricator.wikimedia.org/T233766
[05:11:08] (PS1) Marostegui: mariadb: Remove db2035 [puppet] - https://gerrit.wikimedia.org/r/539014 (https://phabricator.wikimedia.org/T229784)
[05:11:25] (PS1) Marostegui: wmnet: Remove db2035 production entries [dns] - https://gerrit.wikimedia.org/r/539015 (https://phabricator.wikimedia.org/T229784)
[05:11:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:53] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2035.codfw.wmnet` - db2035.codfw.wmnet (**PASS**) - Downtimed...
[05:15:09] (CR) Marostegui: [C: +2] mariadb: Remove db2035 [puppet] - https://gerrit.wikimedia.org/r/539014 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[05:16:14] (CR) Marostegui: [C: +2] wmnet: Remove db2035 production entries [dns] - https://gerrit.wikimedia.org/r/539015 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[05:16:44] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[05:17:13] Operations, ops-codfw, DC-Ops, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui) a:RobH→Papaul Host ready for @Papaul after running the decommissioning script
[05:21:50] (PS1) Marostegui: mariadb: Remove entries for db2037-db2041 [puppet] - https://gerrit.wikimedia.org/r/539016 (https://phabricator.wikimedia.org/T224720)
[05:22:55] (PS1) Marostegui: wmnet: Remove production entries db2037-db2041 [dns] - https://gerrit.wikimedia.org/r/539017 (https://phabricator.wikimedia.org/T224720)
[05:23:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:23:23] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2037.codfw.wmnet` - db2037.codfw.wmnet (**PASS**) - Downtimed host on Icin...
[05:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:45] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission db2038 - https://phabricator.wikimedia.org/T227565 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2038.codfw.wmnet` - db2038.codfw.wmnet (**PASS**) - Downtimed...
[05:23:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:08] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2040.codfw.wmnet` - db2040.codfw.wmnet (**PASS**) - Downtimed host on Icin...
[05:24:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:32] Operations, ops-codfw, decommission, Patch-For-Review: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2041.codfw.wmnet` - db2041.codfw.wmnet (**PASS**) - Downtimed host on Icin...
[05:24:48] (CR) Marostegui: [C: +2] mariadb: Remove entries for db2037-db2041 [puppet] - https://gerrit.wikimedia.org/r/539016 (https://phabricator.wikimedia.org/T224720) (owner: Marostegui)
[05:25:56] (CR) Marostegui: [C: +2] wmnet: Remove production entries db2037-db2041 [dns] - https://gerrit.wikimedia.org/r/539017 (https://phabricator.wikimedia.org/T224720) (owner: Marostegui)
[05:26:47] Operations, ops-codfw, decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui) a:RobH→Papaul
[05:27:04] Operations, ops-codfw, decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (Marostegui) Host ready for @Papaul to decommission
[05:27:31] Operations, ops-codfw, DC-Ops, decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (Marostegui) a:RobH→Papaul
[05:27:35] Operations, ops-codfw, DC-Ops, decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (Marostegui) Host ready for @Papaul to decommission
[05:27:57] Operations, ops-codfw, decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (Marostegui) a:RobH→Papaul
[05:28:05] Operations, ops-codfw, decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (Marostegui) Host ready for @Papaul to decommission
[05:28:32] Operations, ops-codfw, decommission: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (Marostegui) a:RobH→Papaul
[05:28:38] Operations, ops-codfw, decommission: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (Marostegui) Host ready for @Papaul to decommission
[05:33:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False)
[05:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:45] Operations, ops-eqiad, DC-Ops, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1072.eqiad.wmnet` - db1072.eqiad.wmnet (**PASS**) - Downtimed host on Ic...
[05:34:44] (PS1) Marostegui: mariadb: Remove db1072 from puppet [puppet] - https://gerrit.wikimedia.org/r/539018 (https://phabricator.wikimedia.org/T228956)
[05:35:15] (PS1) Marostegui: wmnet: Remove production entries for db1072 [dns] - https://gerrit.wikimedia.org/r/539019 (https://phabricator.wikimedia.org/T228956)
[05:35:34] (CR) Marostegui: [C: +2] mariadb: Remove db1072 from puppet [puppet] - https://gerrit.wikimedia.org/r/539018 (https://phabricator.wikimedia.org/T228956) (owner: Marostegui)
[05:36:14] (CR) Marostegui: [C: +2] wmnet: Remove production entries for db1072 [dns] - https://gerrit.wikimedia.org/r/539019 (https://phabricator.wikimedia.org/T228956) (owner: Marostegui)
[05:37:22] Operations, ops-eqiad, DC-Ops, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui) a:RobH→None
[05:37:44] Operations, ops-eqiad, DC-Ops, decommission: decommission db1072.eqiad.wmnet - https://phabricator.wikimedia.org/T228956 (Marostegui) Host ready for onsite steps
[05:56:13] (PS1) Elukey: reportupdater::job: use python3 in the systemd timer [puppet] - https://gerrit.wikimedia.org/r/539021 (https://phabricator.wikimedia.org/T204736)
[06:01:51] (CR) Elukey: [C: +2] reportupdater::job: use python3 in the systemd timer [puppet] - https://gerrit.wikimedia.org/r/539021 (https://phabricator.wikimedia.org/T204736) (owner: Elukey)
[06:10:38] (PS2) Elukey: Remove Python 2 packages from Analytics Client nodes [puppet] - https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734)
[06:11:55] (Abandoned) Elukey: statistics::gpu: add missing group [puppet] - https://gerrit.wikimedia.org/r/526656 (owner: Jbond)
[06:15:14] (PS1) Elukey: role::statistics::explorer::gpu: allow analytics users to log in [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843)
[06:20:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2085:3311 T233625', diff saved to https://phabricator.wikimedia.org/P9171 and previous config saved to /var/cache/conftool/dbconfig/20190925-062036-marostegui.json
[06:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:41] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625
[06:21:09] !log Deploy schema change on db2085:3311 T233625
[06:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:39] (PS2) Giuseppe Lavagetto: lvs: do not check hhvm/php7 at the same time anymore. [puppet] - https://gerrit.wikimedia.org/r/538864 (https://phabricator.wikimedia.org/T219127)
[06:28:14] (CR) Giuseppe Lavagetto: [C: +2] lvs: do not check hhvm/php7 at the same time anymore. [puppet] - https://gerrit.wikimedia.org/r/538864 (https://phabricator.wikimedia.org/T219127) (owner: Giuseppe Lavagetto)
[06:29:01] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' .
[06:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:17] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'codfw' .
[06:29:19] (PS3) Jcrespo: mariadb: make core_test hosts not page on replication/process issues [puppet] - https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782)
[06:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:33] (PS1) Alexandros Kosiaris: Remove old deprecated helper scripts [deployment-charts] - https://gerrit.wikimedia.org/r/539023
[06:31:34] (CR) Jcrespo: [C: +2] mariadb: make core_test hosts not page on replication/process issues [puppet] - https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) (owner: Jcrespo)
[06:33:27] <_joe_> !log restarting pybal on all low-traffic lbs
[06:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:54] (CR) Jcrespo: "At some point in the future, these should be merged into a single core profile, then parametrized based on the role only." [puppet] - https://gerrit.wikimedia.org/r/538837 (https://phabricator.wikimedia.org/T177782) (owner: Jcrespo)
[06:34:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:35:50] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:37:32] (CR) Alexandros Kosiaris: [C: -1] hiera: update ores to pass statsd through statsd_exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[06:39:34] (PS1) Elukey: base: remove md5 from gen_fingerprints' output [puppet] - https://gerrit.wikimedia.org/r/539025
[06:41:46] (PS2) Elukey: base: remove md5 from gen_fingerprints' output [puppet] - https://gerrit.wikimedia.org/r/539025
[06:42:06] (PS1) Giuseppe Lavagetto: conftool-data: temporarily comment out mw1298 [puppet] - https://gerrit.wikimedia.org/r/539027
[06:46:50] (PS2) Elukey: role::statistics::explorer::gpu: allow analytics users to log in [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843)
[06:53:21] (CR) Giuseppe Lavagetto: [C: +2] conftool-data: temporarily comment out mw1298 [puppet] - https://gerrit.wikimedia.org/r/539027 (owner: Giuseppe Lavagetto)
[06:55:34] (PS1) Alexandros Kosiaris: restrouter: Stop passing the image parameter [deployment-charts] - https://gerrit.wikimedia.org/r/539028
[06:55:56] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:58:53] (PS10) Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209)
[06:59:19] (CR) Muehlenhoff: role::statistics::explorer::gpu: allow analytics users to log in (1 comment) [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[07:02:46] (CR) Elukey: role::statistics::explorer::gpu: allow analytics users to log in (1 comment) [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[07:09:01] (CR) Muehlenhoff: [C: +1] role::statistics::explorer::gpu: allow analytics users to log in (1 comment) [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[07:10:44] (CR) Elukey: [C: +2] role::statistics::explorer::gpu: allow analytics users to log in [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843) (owner: Elukey)
[07:10:51] (PS3) Elukey: role::statistics::explorer::gpu: allow analytics users to log in [puppet] - https://gerrit.wikimedia.org/r/539022 (https://phabricator.wikimedia.org/T148843)
[07:12:45] Operations, serviceops, Patch-For-Review, Performance-Team (Radar), User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe) Open→Resolved
[07:12:48] Operations, serviceops: SRE FY19-20 Q1 goal: complete the transition to PHP7 - https://phabricator.wikimedia.org/T219127 (Joe)
[07:12:58] Operations, CPT Initiatives (PHP7 (TEC4)), HHVM, MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe)
[07:16:27] (PS1) Urbanecm: Fully close bgwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/539029 (https://phabricator.wikimedia.org/T233322)
[07:17:37] !log allow analytics users to log in to stat1005
[07:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:42] !log pool wdqs1005 to allow depooling wdqs1004 to handle lag issues
[07:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:46] after a year we made it :)
[07:21:33] Operations, netbox: Netbox: tracking of hardware errors / grouping servers in order/batches - https://phabricator.wikimedia.org/T233774 (MoritzMuehlenhoff)
[07:22:30] jouncebot: now
[07:22:30] No deployments scheduled for the next 3 hour(s) and 37 minute(s)
[07:22:32] jouncebot: next
[07:22:32] In 3 hour(s) and 37 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1100)
[07:22:53] (CR) Urbanecm: [C: +2] Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: Jdlrobson)
[07:23:06] (CR) jerkins-bot: [V: -1] Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config]
- 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [07:24:16] (03PS5) 10Urbanecm: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [07:24:59] (03CR) 10Urbanecm: [C: 03+2] Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [07:26:31] (03Abandoned) 10Urbanecm: Typo: Add a slash to szlwiki for wgMinervaCustomLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538911 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [07:26:33] (03CR) 10jenkins-bot: Revert "Add localized Wikipedia wordmark for szlwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538930 (https://phabricator.wikimedia.org/T233104) (owner: 10Jdlrobson) [07:27:20] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c761ec1: Revert "Add localized Wikipedia wordmark for szlwiki" (T233104) (duration: 01m 16s) [07:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:23] T233104: Add localized Wikipedia wordmark to the Silesian (szl) mobile frontend - https://phabricator.wikimedia.org/T233104 [07:28:40] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: c761ec1: Revert "Add localized Wikipedia wordmark for szlwiki" (T233104) (duration: 01m 04s) [07:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:29] (03PS1) 10Urbanecm: Add wgMinervaCustomLogos for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) [07:38:03] !log installing emacs updates for buster (from SUA update, extended ELPA repository key) [07:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:12] 
jouncebot: now [07:38:12] No deployments scheduled for the next 3 hour(s) and 21 minute(s) [07:38:29] PROBLEM - mediawiki-installation DSH group on mw1298 is CRITICAL: Host mw1298 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:42:02] (03PS1) 10DCausse: Revert "[cirrus] temp disable sanity check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539053 [07:42:04] I need to deploy this change ^, please let me know if you have objections [07:43:31] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) [07:44:32] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline. Also please add a related bug/task" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (owner: 10Herron) [07:46:07] (03CR) 10Filippo Giunchedi: Set up scap target for deploying the phatality plugin into kibana (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [07:46:32] twentyafterfour: ^ in case you are here now [07:46:33] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [07:46:56] godog on it [07:47:25] (03CR) 10DCausse: [C: 03+2] Revert "[cirrus] temp disable sanity check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539053 (owner: 10DCausse) [07:47:55] (03PS4) 10Muehlenhoff: openldap_corp: Hierarize existing setup to allow adding a second server pair [puppet] - 10https://gerrit.wikimedia.org/r/538588 [07:48:09] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[07:48:14] (03PS3) 1020after4: Set up scap target for deploying the phatality plugin into kibana [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) [07:48:16] (03Merged) 10jenkins-bot: Revert "[cirrus] temp disable sanity check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539053 (owner: 10DCausse) [07:48:31] (03CR) 1020after4: Set up scap target for deploying the phatality plugin into kibana (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [07:49:09] (03CR) 10jenkins-bot: Revert "[cirrus] temp disable sanity check" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539053 (owner: 10DCausse) [07:49:43] twentyafterfour: thanks! running pcc now, lgtm though. I think the target groups for scap are still missing ? [07:50:19] I mean the puppet/hiera config that will generate the dsh group list from conftool data [07:51:00] hmmm [07:51:20] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T233584 revert: [cirrus] temp disable sanity check (duration: 01m 05s) [07:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:24] T233584: Re-adjust cirrusSearchLinksUpdate vs cirrusSearchLinksUpdatePrioritized concurrency - https://phabricator.wikimedia.org/T233584 [07:52:20] twentyafterfour: can be in another patch too, I'll merge for now [07:52:34] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/18553/" [puppet] - 10https://gerrit.wikimedia.org/r/538858 (https://phabricator.wikimedia.org/T230752) (owner: 1020after4) [07:52:58] (03PS5) 10Muehlenhoff: openldap_corp: Hierarize existing setup to allow adding a second server pair [puppet] - 10https://gerrit.wikimedia.org/r/538588 [07:54:36] godog: yeah I have no idea about the dsh target list or how to do that. 
I included a blank target list in the deployment repo [07:56:21] (03CR) 10Muehlenhoff: [C: 03+2] openldap_corp: Hierarize existing setup to allow adding a second server pair [puppet] - 10https://gerrit.wikimedia.org/r/538588 (owner: 10Muehlenhoff) [07:57:05] twentyafterfour: ok I'll give it a try [07:58:36] (03PS1) 10Filippo Giunchedi: hieradata: add phatality dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/539054 (https://phabricator.wikimedia.org/T230752) [07:58:48] twentyafterfour: sth like the above, then changing scap.cfg to point dsh_targets to it [07:59:01] ok I'll change the scap.cfg [07:59:34] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Sorry, I don't understand the logic." [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 (owner: 10Cwhite) [07:59:51] (03Abandoned) 10Vgutierrez: Release 8.0.5-1wm9 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/538857 (https://phabricator.wikimedia.org/T233667) (owner: 10Vgutierrez) [08:00:32] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add phatality dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/539054 (https://phabricator.wikimedia.org/T230752) (owner: 10Filippo Giunchedi) [08:00:40] (03PS2) 10Filippo Giunchedi: hieradata: add phatality dsh groups [puppet] - 10https://gerrit.wikimedia.org/r/539054 (https://phabricator.wikimedia.org/T230752) [08:01:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Specifically, unless you have to specify a different metrics collector for different endpoints, it makes sense for this to be a class vari" [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 (owner: 10Cwhite) [08:03:51] twentyafterfour: ok once done please pull the repo on deploy1001 so puppet can finish [08:04:03] or rather, scap deploy --init [08:04:19] ok on it [08:12:59] (03PS1) 10Vgutierrez: ATS: Use the main NIC instead of the loopback interface to reach varnish [puppet] - 10https://gerrit.wikimedia.org/r/539056 (https://phabricator.wikimedia.org/T233667) [08:15:20] of 
course I copied the wrong example for scap::dsh::groups, fixing )o) [08:15:22] godog: weird, I set dsh_targets to phatality but scap deploy --init isn't working [08:15:25] (03PS1) 10Filippo Giunchedi: hieradata: fix scap::dsh::groups config for logstash/phatality [puppet] - 10https://gerrit.wikimedia.org/r/539057 [08:15:27] oh [08:15:30] twentyafterfour: yeah fixed by that ^ [08:16:49] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix scap::dsh::groups config for logstash/phatality [puppet] - 10https://gerrit.wikimedia.org/r/539057 (owner: 10Filippo Giunchedi) [08:16:58] (03PS2) 10Filippo Giunchedi: hieradata: fix scap::dsh::groups config for logstash/phatality [puppet] - 10https://gerrit.wikimedia.org/r/539057 [08:17:16] (03CR) 10Vgutierrez: "PCC seems to be happy: https://puppet-compiler.wmflabs.org/compiler1001/18554/" [puppet] - 10https://gerrit.wikimedia.org/r/539056 (https://phabricator.wikimedia.org/T233667) (owner: 10Vgutierrez) [08:21:03] godog: ok I think it might be ready to deploy :) [08:21:24] indeed scap deploy --init worked [08:21:59] twentyafterfour: ack! testing puppet on logstash hosts [08:22:32] (03PS4) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [08:23:01] (03CR) 10Elukey: "Marcel: I added country_ip_src,country_ip_dst to the list of fields, let me know if it is ok!" 
[puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [08:23:08] (03CR) 10Vgutierrez: [C: 03+2] ATS: Use the main NIC instead of the loopback interface to reach varnish [puppet] - 10https://gerrit.wikimedia.org/r/539056 (https://phabricator.wikimedia.org/T233667) (owner: 10Vgutierrez) [08:23:16] (03PS2) 10Vgutierrez: ATS: Use the main NIC instead of the loopback interface to reach varnish [puppet] - 10https://gerrit.wikimedia.org/r/539056 (https://phabricator.wikimedia.org/T233667) [08:24:58] twentyafterfour: ok scap is failing to kibana-plugin install with permission denied, indeed because I think the command needs to be ran as 'kibana' [08:26:35] which is actually even better for the sudo rules [08:26:51] ok [08:27:14] I'll change the puppet part, could you take care of the scap/checks.yaml part ? [08:27:30] yep [08:27:36] so the full command will be [08:27:39] sudo -u kibana /usr/share/kibana/bin/kibana-plugin install /srv/deployment/releng/phatality/deploy/phatality-5.6.14.zip [08:28:44] that's correct yeah [08:28:57] done [08:29:02] (03PS4) 10Abijeet Patro: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) [08:32:51] (03PS1) 10Filippo Giunchedi: kibana: switch to user kibana for phatality plugin-install [puppet] - 10https://gerrit.wikimedia.org/r/539059 [08:32:57] (03CR) 10Nikerabbit: [C: 03+1] Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [08:32:59] (03PS3) 10Muehlenhoff: Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 [08:33:11] (03CR) 10Filippo Giunchedi: [C: 03+2] kibana: switch to user kibana for phatality plugin-install [puppet] - 10https://gerrit.wikimedia.org/r/539059 (owner: 10Filippo 
Giunchedi) [08:33:22] (03PS1) 10Elukey: profile::prometheus::alerts: add monitor for netflow realtime druid data [puppet] - 10https://gerrit.wikimedia.org/r/539060 (https://phabricator.wikimedia.org/T229682) [08:34:01] (03CR) 10Muehlenhoff: [C: 03+2] Set IDP access strategy for debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/538907 (owner: 10Muehlenhoff) [08:34:28] (03PS2) 10Filippo Giunchedi: kibana: switch to user kibana for phatality plugin-install [puppet] - 10https://gerrit.wikimedia.org/r/539059 [08:34:30] (03PS2) 10Elukey: profile::prometheus::alerts: add monitor for netflow realtime druid data [puppet] - 10https://gerrit.wikimedia.org/r/539060 (https://phabricator.wikimedia.org/T229682) [08:34:34] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] kibana: switch to user kibana for phatality plugin-install [puppet] - 10https://gerrit.wikimedia.org/r/539059 (owner: 10Filippo Giunchedi) [08:35:04] (03PS1) 10Giuseppe Lavagetto: Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 [08:35:34] (03CR) 10jerkins-bot: [V: 04-1] Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 (owner: 10Giuseppe Lavagetto) [08:36:52] twentyafterfour: ok should work now, please try again [08:36:56] (03CR) 10Elukey: [C: 03+2] profile::prometheus::alerts: add monitor for netflow realtime druid data [puppet] - 10https://gerrit.wikimedia.org/r/539060 (https://phabricator.wikimedia.org/T229682) (owner: 10Elukey) [08:37:03] (03PS3) 10Elukey: profile::prometheus::alerts: add monitor for netflow realtime druid data [puppet] - 10https://gerrit.wikimedia.org/r/539060 (https://phabricator.wikimedia.org/T229682) [08:37:05] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::prometheus::alerts: add monitor for netflow realtime druid data [puppet] - 10https://gerrit.wikimedia.org/r/539060 (https://phabricator.wikimedia.org/T229682) (owner: 10Elukey) [08:39:17] (03PS5) 10Mathew.onipe: query_service: rename wdqs module to query_service 
[puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [08:39:19] (03PS7) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [08:39:21] (03PS3) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [08:39:24] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [08:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:54] godog: error on scap deploy [08:39:55] (03PS1) 10Giuseppe Lavagetto: Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 [08:39:55] sudo: no tty present and no askpass program specified [08:40:36] ah yeah my bad, forcing puppet run [08:40:40] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [08:40:42] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 (owner: 10Giuseppe Lavagetto) [08:41:04] (03PS11) 10Jcrespo: backups: Change file owner of bacula storage&director config [puppet] - 10https://gerrit.wikimedia.org/r/538239 (https://phabricator.wikimedia.org/T229209) [08:41:05] in the meantime I updated the README.md with deployment instructions [08:41:06] (03PS1) 10Jcrespo: mariadb: Disable paging for mariadb disk space on core test hosts [puppet] - 10https://gerrit.wikimedia.org/r/539064 (https://phabricator.wikimedia.org/T177782) [08:41:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: drop jessie code [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) [08:43:01] twentyafterfour: should be good, please 
try again [08:44:03] !log repooling cp4027 - T233667 [08:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:06] T233667: varnish-fe is handling X-Forwarded-For differently when ats is in front of it - https://phabricator.wikimedia.org/T233667 [08:44:09] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/18555/" [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [08:44:21] (03CR) 10Jcrespo: "I am going to deploy this one left: https://puppet-compiler.wmflabs.org/compiler1001/18556/db1114.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/539064 (https://phabricator.wikimedia.org/T177782) (owner: 10Jcrespo) [08:44:36] (03PS2) 10Jcrespo: mariadb: Disable paging for mariadb disk space on core test hosts [puppet] - 10https://gerrit.wikimedia.org/r/539064 (https://phabricator.wikimedia.org/T177782) [08:44:51] 10Operations, 10Anti-Harassment, 10CheckUser, 10MediaWiki-User-management, 10Traffic: Users editing from 127.0.0.1 (due to experimenting with ATS terminating TLS) - https://phabricator.wikimedia.org/T233657 (10Vgutierrez) [08:44:54] 10Operations, 10Traffic, 10Patch-For-Review: varnish-fe is handling X-Forwarded-For differently when ats is in front of it - https://phabricator.wikimedia.org/T233667 (10Vgutierrez) 05Open→03Resolved [08:44:57] 10Operations, 10Traffic, 10Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [08:45:37] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable paging for mariadb disk space on core test hosts [puppet] - 10https://gerrit.wikimedia.org/r/539064 (https://phabricator.wikimedia.org/T177782) (owner: 10Jcrespo) [08:48:50] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 09m 26s) [08:48:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:52] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [08:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:57] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 00m 05s) [08:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:07] godog: seems to have worked [08:49:36] but I don't see the plugin live [08:49:48] I suppose kibana might have to be manually restarted? [08:49:56] possibly, I'm trying that [08:50:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:50:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] (03Abandoned) 10Jayprakash12345: Enable $wgAllowCopyUploads for pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/495446 (https://phabricator.wikimedia.org/T217486) (owner: 10Jayprakash12345) [08:52:58] !log roll-restart kibana [08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Stop passing the image parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539028 (owner: 10Alexandros Kosiaris) [08:55:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove old deprecated helper scripts [deployment-charts] - 10https://gerrit.wikimedia.org/r/539023 (owner: 10Alexandros Kosiaris) [08:55:22] (03Merged) 10jenkins-bot: Remove old deprecated helper scripts [deployment-charts] - 10https://gerrit.wikimedia.org/r/539023 (owner: 10Alexandros Kosiaris) [08:55:22] (03Merged) 10jenkins-bot: restrouter: Stop passing the image parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539028 
(owner: 10Alexandros Kosiaris) [08:55:40] twentyafterfour: completed [09:01:18] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [09:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:21] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:33] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [09:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:58] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'restrouter' for release 'production' . [09:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:28] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'codfw' . [09:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:34] (03PS2) 10Giuseppe Lavagetto: Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 [09:07:36] (03PS2) 10Giuseppe Lavagetto: Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 [09:10:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 (owner: 10Giuseppe Lavagetto) [09:12:05] (03CR) 10Vgutierrez: [C: 03+1] Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 (owner: 10Giuseppe Lavagetto) [09:12:48] (03CR) 10Jbond: Fix maintain_dbusers class lookup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [09:14:27] hmm, still no phatality, I wonder what went wrong [09:15:13] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [09:15:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:35] (03PS1) 10Elukey: profile::zookeeper::server: use openjkd-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/539069 (https://phabricator.wikimedia.org/T217057) [09:16:04] ah ha! n [09:16:07] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 00m 54s) [09:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:24] hmm Check 'install_zip' exceeded 30.0s timeout [09:17:00] twentyafterfour: mhhh indeed [09:17:13] (03PS3) 10Giuseppe Lavagetto: Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 [09:17:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/539065 (https://phabricator.wikimedia.org/T212302) (owner: 10Arturo Borrero Gonzalez) [09:18:24] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [09:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:40] godog: when I ran this on beta the install command took a while so I'm increasing the timeout [09:20:43] hmm nope, 120s timeout still failed [09:20:48] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 02m 24s) [09:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/18557/" [puppet] - 10https://gerrit.wikimedia.org/r/539069 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [09:21:28] I don't see anything useful in the logs [09:24:48] (03CR) 10Jbond: "I would also be curious of to see a filing report to make sure nothing else is going wrong" [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [09:24:51] godog: do you see the plugin in /usr/share/kibana/plugins/ ? 
[09:25:08] twentyafterfour@deployment-logstash2:~$ ls /usr/share/kibana/plugins/ [09:25:10] phatality [09:26:34] 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [09:26:53] ohhh I know what's wrong - kibana wants file:// urls not just file paths [09:27:04] (03CR) 10Vgutierrez: [C: 03+1] Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 (owner: 10Giuseppe Lavagetto) [09:27:24] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [09:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:37] (03CR) 10Volans: "Much better, thanks! Some comment/reply inline." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [09:27:56] godog: can we adjust the sudoers rule to allow file:/// in the kibana-plugin install arg list? 
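Note on the file:// finding above: a minimal sketch, assuming the plugin archive path quoted earlier in this log and the `sudo -u kibana` form of the command. The helper names `to_file_url` and `install_command` are hypothetical and are not part of scap or Kibana; the sketch only shows how the same absolute path could be handed to `kibana-plugin install` as a file:// URL instead of a bare path.

```python
from typing import List
from urllib.parse import quote

# Archive path exactly as quoted in the log; everything else here is illustrative.
PLUGIN_ZIP = "/srv/deployment/releng/phatality/deploy/phatality-5.6.14.zip"


def to_file_url(path: str) -> str:
    """Turn an absolute POSIX path into a file:// URL (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    # quote() percent-encodes unsafe characters and leaves '/' untouched.
    return "file://" + quote(path)


def install_command(path: str) -> List[str]:
    """Build the argument list for the install step, run as user 'kibana'."""
    return [
        "sudo", "-u", "kibana",
        "/usr/share/kibana/bin/kibana-plugin", "install",
        to_file_url(path),
    ]


if __name__ == "__main__":
    print(" ".join(install_command(PLUGIN_ZIP)))
```

Since a sudo rule that pins the command's arguments has to match the final argument as well, the rule would need to allow the file:// form too, which is what the request above asks for.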
[09:29:02] (03CR) 10Jbond: [C: 03+1] base: remove md5 from gen_fingerprints' output [puppet] - 10https://gerrit.wikimedia.org/r/539025 (owner: 10Elukey) [09:29:35] (03PS1) 1020after4: Allow file:// in the kibana-plugin install command sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/539070 [09:29:54] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539070/ [09:30:13] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539071 (https://phabricator.wikimedia.org/T231433) [09:30:17] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 [puppet] - 10https://gerrit.wikimedia.org/r/539072 (https://phabricator.wikimedia.org/T231433) [09:31:49] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539072 (https://phabricator.wikimedia.org/T231433) [09:32:59] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [09:33:01] (03Abandoned) 10Jbond: ipmi: relax password minimum length [software/spicerack] - 10https://gerrit.wikimedia.org/r/536616 (https://phabricator.wikimedia.org/T147074) (owner: 10Jbond) [09:33:03] twentyafterfour: in a meeting, sec [09:33:12] godog: no rush thanks! 
[09:35:02] (03PS2) 10Mforns: Rsync analytics mediawiki history dumps to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) [09:35:49] (03CR) 10Mforns: Rsync analytics mediawiki history dumps to dumps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [09:37:08] (03PS1) 10Alexandros Kosiaris: Bump number of replicas for restrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539075 [09:42:06] (03PS5) 10Mforns: analytics::refinery::job::druid_load: Add sanitization for netflow [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) [09:43:07] (03CR) 10Mforns: analytics::refinery::job::druid_load: Add sanitization for netflow (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [09:44:42] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1298 is CRITICAL: Host mw1298 is not in mediawiki-installation dsh group Giuseppe Lavagetto needs to be reinstalled. https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:47:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump number of replicas for restrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539075 (owner: 10Alexandros Kosiaris) [09:48:11] (03Merged) 10jenkins-bot: Bump number of replicas for restrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539075 (owner: 10Alexandros Kosiaris) [09:50:30] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'restrouter' for release 'codfw' . [09:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:03] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . 
[09:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:10] (03CR) 10Elukey: analytics::refinery::job::druid_load: Add sanitization for netflow (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [09:55:55] (03PS5) 10Jbond: ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 [09:57:34] (03CR) 10Jbond: ipmi: use run instead of checkouput (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [09:58:44] (03PS2) 10Alexandros Kosiaris: Assign restrouter LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/526448 (https://phabricator.wikimedia.org/T223953) [09:58:46] (03PS2) 10Alexandros Kosiaris: Activate restrouter discovery records [dns] - 10https://gerrit.wikimedia.org/r/526449 (https://phabricator.wikimedia.org/T223953) [09:59:12] (03CR) 10jerkins-bot: [V: 04-1] Assign restrouter LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/526448 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:00:09] (03PS3) 10Alexandros Kosiaris: Assign restrouter LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/526448 (https://phabricator.wikimedia.org/T223953) [10:00:11] (03PS3) 10Alexandros Kosiaris: Activate restrouter discovery records [dns] - 10https://gerrit.wikimedia.org/r/526449 (https://phabricator.wikimedia.org/T223953) [10:01:21] (03CR) 10jerkins-bot: [V: 04-1] ipmi: use run instead of checkouput [software/spicerack] - 10https://gerrit.wikimedia.org/r/537468 (owner: 10Jbond) [10:03:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] LVS for RESTRouter. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [10:07:20] jbond42: yay, the issue is fixed :) now sphinx is failing because unable to retrieve an external file, but that's unrelated [10:07:29] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [10:08:50] 10Operations, 10LDAP-Access-Requests: Turnilo access for Jerrie Kumalah and Erin Yener (fundraising analysts) - https://phabricator.wikimedia.org/T233780 (10MarcoAurelio) [10:09:10] 10Operations, 10Wikidata, 10Wikidata-Termbox, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10akosiaris) 05Open→03Resolved a:03akosiaris The service has long been deployed and even has nice dashboards in grafana, resolving. [10:09:25] volans: yes i noticed that locally as well, it only started in the last hour so it may pass [10:09:46] yeah seems an issue they're having [10:09:56] hopefully will get resolved soon, or we can merge also without it [10:10:54] (03CR) 10Elukey: "do you want me to merge? Ready to go?" [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [10:11:35] (03CR) 10Filippo Giunchedi: [C: 03+2] Allow file:// in the kibana-plugin install command sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/539070 (owner: 1020after4) [10:13:10] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) BBU has arrived at the DC, I am trying to coordinate with @Cmjohnson and @Jclark-ctr to see if we can replace this asap.
[10:13:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Assign restrouter LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/526448 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [10:13:19] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 45m 54s) [10:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:51] (03PS2) 10Alexandros Kosiaris: rsyslog: Support adding metadata to input, default to off [puppet] - 10https://gerrit.wikimedia.org/r/538626 (https://phabricator.wikimedia.org/T207200) [10:13:56] (03PS2) 10Alexandros Kosiaris: rsyslog: populate kubernetes configuration [puppet] - 10https://gerrit.wikimedia.org/r/538627 (https://phabricator.wikimedia.org/T207200) [10:14:40] (03CR) 10Mforns: [C: 04-1] "@elukey I think we should wait!" [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [10:15:01] twentyafterfour: lol @ 45m [10:15:08] twentyafterfour: sudo is fixed now btw [10:16:20] twentyafterfour: I have to go shortly, you should be set tho [10:20:28] (03CR) 10Elukey: "Nono makes sense, I asked since to know if you were waiting for an ops to merge or for something else. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/535924 (https://phabricator.wikimedia.org/T229674) (owner: 10Mforns) [10:21:20] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [10:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:49] godog: hmm still not working but thank you for your help [10:21:59] it's saying: sudo: no tty present and no askpass program specified [10:22:03] !log twentyafterfour@deploy1001 deploy aborted: (no justification provided) (duration: 00m 42s) [10:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:50] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) [10:24:04] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [10:24:47] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) [10:25:27] (03PS3) 10Giuseppe Lavagetto: Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 [10:25:30] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [10:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:42] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 00m 12s) [10:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] * twentyafterfour shrugs [10:26:18] !log switch cp5002 from nginx to ats-tls - T231433 [10:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:26:21] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [10:26:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "cache_text: Vary for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539062 (owner: 10Giuseppe Lavagetto) [10:26:50] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: (no justification provided) [10:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:57] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539071 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:27:06] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539071 (https://phabricator.wikimedia.org/T231433) [10:27:07] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: (no justification provided) (duration: 00m 16s) [10:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:00] (03PS3) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539071 (https://phabricator.wikimedia.org/T231433) [10:28:03] sigh... [10:28:59] (03PS1) 10Elukey: profile::analytics::refinery::job::data_purge: remove absent timer [puppet] - 10https://gerrit.wikimedia.org/r/539084 [10:29:23] (03PS2) 10Elukey: profile::analytics::refinery::job::data_purge: remove absent timer [puppet] - 10https://gerrit.wikimedia.org/r/539084 [10:31:09] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/539084 (owner: 10Elukey) [10:31:38] (03CR) 10Elukey: [C: 03+2] profile::analytics::refinery::job::data_purge: remove absent timer [puppet] - 10https://gerrit.wikimedia.org/r/539084 (owner: 10Elukey) [10:33:01] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539072 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:33:10] (03PS3) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5002 [puppet] - 10https://gerrit.wikimedia.org/r/539072 (https://phabricator.wikimedia.org/T231433) [10:34:55] PROBLEM - HTTPS Unified ECDSA on cp5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:35:03] ^^ expected [10:35:03] PROBLEM - HTTPS Unified RSA on cp5002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:38:07] RECOVERY - HTTPS Unified ECDSA on cp5002 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345544 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:38:15] RECOVERY - HTTPS Unified RSA on cp5002 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345537 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:45:18] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [10:46:26] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:46:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:46:28] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:50] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp4022 [puppet] - 10https://gerrit.wikimedia.org/r/539085 (https://phabricator.wikimedia.org/T231433) [10:47:52] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp4022 [puppet] - 10https://gerrit.wikimedia.org/r/539086 (https://phabricator.wikimedia.org/T231433) [10:49:18] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp4022 [puppet] - 10https://gerrit.wikimedia.org/r/539085 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:52:52] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp4022 [puppet] - 10https://gerrit.wikimedia.org/r/539086 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [10:53:21] (03PS1) 10Arturo Borrero Gonzalez: toolforge: update nginx-ingress configuration [puppet] - 10https://gerrit.wikimedia.org/r/539087 (https://phabricator.wikimedia.org/T228500) [10:53:55] (03PS1) 1020after4: Phatality: Escape the colon in the sudoers rule [puppet] - 10https://gerrit.wikimedia.org/r/539088 [10:54:17] PROBLEM - HTTPS Unified RSA on cp4022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [10:54:33] (03CR) 10jerkins-bot: [V: 04-1] Phatality: Escape the colon in the sudoers rule [puppet] - 10https://gerrit.wikimedia.org/r/539088 (owner: 1020after4) [10:54:38] ^^ expected [10:55:39] RECOVERY - HTTPS Unified RSA on cp4022 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345579 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [10:55:46] <_joe_> vgutierrez: we know by now :) [10:55:52] _joe_: <3 
[11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1100). [11:00:04] Ammarpad: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:29] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [11:00:49] I can SWAT today! [11:01:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:01:50] (03PS2) 10Urbanecm: Fully close bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539029 (https://phabricator.wikimedia.org/T233322) [11:01:50] vgutierrez: ^ [11:02:00] (03CR) 10Urbanecm: [C: 03+2] Fully close bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539029 (https://phabricator.wikimedia.org/T233322) (owner: 10Urbanecm) [11:02:15] (03PS4) 10Giuseppe Lavagetto: Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 [11:02:57] (03PS2) 10Arturo Borrero Gonzalez: toolforge: update nginx-ingress configuration [puppet] - 10https://gerrit.wikimedia.org/r/539087 (https://phabricator.wikimedia.org/T228500) [11:03:01] (03Merged) 10jenkins-bot: Fully close bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539029 (https://phabricator.wikimedia.org/T233322) (owner: 10Urbanecm) [11:03:03] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet 
are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:03:05] <_joe_> onimisionipe: what has that to do with vgutierrez? [11:03:12] <_joe_> it's wdqs having issues [11:03:14] PROBLEM - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:03:20] oh not again [11:03:23] crap [11:03:41] (03CR) 10jenkins-bot: Fully close bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539029 (https://phabricator.wikimedia.org/T233322) (owner: 10Urbanecm) [11:03:54] <_joe_> is anyone acting on that cluster? [11:04:05] yes [11:04:33] <_joe_> I don't see anything in SAL, so please expand :) [11:04:42] * jbond42 here if needed [11:05:02] so I've been depooling/repooling recently to solve lag issues [11:05:06] but this is new [11:05:14] <_joe_> jbond42: can you help onimisionipe find out the problem? I have to go to lunch [11:05:19] I recently pooled wdqs1005 and depooled 1004 [11:05:36] RECOVERY - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:05:38] (03PS4) 10Urbanecm: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [11:05:55] <_joe_> ook, please stop and try to assess what happened [11:05:58] _joe_: sure, onimisionipe what specific issues are you seeing? [11:06:41] so wdqs1005 is hanging on the app side with curl -H 'Host: localhost' http://wdqs1005.eqiad.wmnet/readiness-probe [11:06:48] from lvs1016 [11:07:14] and that's why lvs complains about it [11:07:19] (03CR) 10Volans: [C: 03+1] "LGTM, hard to test in isolation."
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [11:07:52] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 127485c: Fully close bgwikinews (T233322) (duration: 01m 06s) [11:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:56] T233322: Close and Delete/Redirect Bulgarian Wikinews - https://phabricator.wikimedia.org/T233322 [11:07:58] ok let me take a quick look [11:08:06] * volans around too [11:09:28] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [11:10:28] there's a lot of malformed queries from. This might be related. It's causing blazegraph to respond slowly [11:10:50] *from user-agent: Toolforge - legacy code [11:11:54] impact seems to be hitting wdqs1005 the most and could be related to the lvs stuff [11:11:57] yes i see lots of ' o.w.q.r.b.t.ThrottlingFilter - A request is being throttled' [11:12:23] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [11:12:38] I'm depooling 1005 [11:12:54] that's going to move the issue to another host...
[11:12:57] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/539090 (https://phabricator.wikimedia.org/T231433) [11:12:59] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/539091 (https://phabricator.wikimedia.org/T231433) [11:13:22] true [11:13:40] !log switch cp3035 from nginx to ats-tls - T231433 [11:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:43] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [11:13:50] there seem to be a lot of GC logs [11:13:54] I will repool wdqs1004 instead [11:14:01] might reduce the impact [11:14:21] it was depooled before to catch up on lag and it seems to have caught up [11:14:25] just as a temp solution [11:15:38] !log EU SWAT done [11:15:38] !log repooled wdqs1004 to reduce load on the wdqs public cluster [11:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:03] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/539090 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [11:16:35] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:17:09] jbond42, onimisionipe from https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=wdqs1005&var-datasource=eqiad%20prometheus%2Fops&var-cluster=wdqs it looks like the load decreased at 11:00 UTC [11:17:38] but I'm assuming that's because pybal depooled the server [11:17:53] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:18:04] (03CR) 10Gergő Tisza: 
[C: 03+1] GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [11:18:10] i still see [11:18:11] lvs1015 ~ % curl -H 'Host: localhost' http://wdqs1005.eqiad.wmnet/readiness-probe [11:16:52] [11:18:14] Service load too high, please come back later% [11:18:55] right [11:18:59] same from lvs1016 [11:19:05] this problem is not really new. Someone has been pushing the cluster to its limits and our request filters might not be handling it well [11:19:17] onimisionipe: someone from our toolforge? [11:19:23] yes [11:19:30] maybe it's worth the effort talking to them :) [11:20:36] might also be heavy queries related [11:20:38] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:20:44] I pasted that :) [11:20:49] not an alert [11:21:35] vgutierrez: Ok [11:22:19] PROBLEM - WDQS HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:22:32] (03PS2) 1020after4: Phatality: Escape the colon in the sudoers rule [puppet] - 10https://gerrit.wikimedia.org/r/539088 [11:22:38] not again [11:22:38] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp3035 [puppet] - 10https://gerrit.wikimedia.org/r/539091 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [11:23:33] onimisionipe: i see this error with what seems to be the same query a lot in the nginx error logs https://phabricator.wikimedia.org/P9175 [11:25:57] PROBLEM - HTTPS Unified ECDSA on cp3035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused 
https://wikitech.wikimedia.org/wiki/HTTPS [11:26:01] PROBLEM - HTTPS Unified RSA on cp3035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:26:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "My very basic and incipient test suite works with this change:" [puppet] - 10https://gerrit.wikimedia.org/r/539087 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [11:27:04] (03CR) 10Volans: "There is a rebase conflict, I cannot run the compiler. Some comment inline in the meanwhile. When you get a chance rebase it locally resol" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534538 (owner: 10Ayounsi) [11:27:14] (03PS2) 10Alexandros Kosiaris: LVS for RESTRouter. [puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [11:27:35] RECOVERY - HTTPS Unified ECDSA on cp3035 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345550 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:27:39] RECOVERY - HTTPS Unified RSA on cp3035 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345546 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:28:43] RECOVERY - WDQS HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:28:45] onimisionipe: did you just restart  wdqs-blazegraph [11:29:01] jbond42: I reloaded blazegraph on 1005. 
I'm sure the issue will persist [11:29:11] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:29:33] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:29:35] !log restarted wdqs-blazegraph on wdqs1005 [11:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:18] is it normal for java GC to occur every couple of seconds? [11:33:04] definitely not normal [11:34:33] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [11:34:58] PROBLEM - SSH mw1290.mgmt on mw1290.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:37:00] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/539092 (https://phabricator.wikimedia.org/T231433) [11:37:02] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/539093 (https://phabricator.wikimedia.org/T231433) [11:37:34] !log switch cp2005 from nginx to ats-tls - T231433 [11:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:38] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [11:38:34] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/539092 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [11:41:28] (03PS1) 10Mforns: analytics::search::jobs.pp: Move last deletion timers to drop-older-than [puppet] - 10https://gerrit.wikimedia.org/r/539094 (https://phabricator.wikimedia.org/T204735) [11:41:45] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls
from port 8443 to 443 on cp2005 [puppet] - 10https://gerrit.wikimedia.org/r/539093 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [11:42:20] (03CR) 10Effie Mouzeli: [C: 03+1] Remove tmpreaper from mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/538884 (https://phabricator.wikimedia.org/T151304) (owner: 10Muehlenhoff) [11:44:08] PROBLEM - HTTPS Unified RSA on cp2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:44:12] PROBLEM - HTTPS Unified ECDSA on cp2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:45:28] RECOVERY - HTTPS Unified RSA on cp2005 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345531 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:45:32] RECOVERY - HTTPS Unified ECDSA on cp2005 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345528 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [11:47:19] (03PS3) 10Alexandros Kosiaris: LVS for RESTRouter. [puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [11:47:33] (03CR) 10Mforns: analytics::search::jobs.pp: Move last deletion timers to drop-older-than (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/539094 (https://phabricator.wikimedia.org/T204735) (owner: 10Mforns) [11:48:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] LVS for RESTRouter. 
[puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [11:54:15] (03PS3) 10Arturo Borrero Gonzalez: toolforge: update nginx-ingress configuration [puppet] - 10https://gerrit.wikimedia.org/r/539087 (https://phabricator.wikimedia.org/T228500) [11:54:18] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [11:54:45] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28154 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [11:55:24] * onimisionipe is looking [11:56:47] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539097 (https://phabricator.wikimedia.org/T231433) [11:56:49] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539098 (https://phabricator.wikimedia.org/T231433) [11:57:01] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 44 connections established with conf2001.codfw.wmnet:2379 (min=45) https://wikitech.wikimedia.org/wiki/PyBal [11:57:08] that's me ^ [11:57:16] !log switch cp1078 from nginx to ats-tls - T231433 [11:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:19] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [11:57:22] akosiaris: ack :) [11:58:11] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.48:7231]) https://wikitech.wikimedia.org/wiki/PyBal [11:58:16] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539097 
(https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [11:59:34] shards are relocating [11:59:48] (03PS1) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to 4443 on cp1078" [puppet] - 10https://gerrit.wikimedia.org/r/539099 [12:00:47] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.48:7231]) https://wikitech.wikimedia.org/wiki/PyBal [12:00:48] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move nginx from port 443 to 4443 on cp1078" [puppet] - 10https://gerrit.wikimedia.org/r/539099 (owner: 10Vgutierrez) [12:02:54] (03PS1) 10Alexandros Kosiaris: restrouter: Fix typo with DC names [dns] - 10https://gerrit.wikimedia.org/r/539101 [12:03:33] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Fix typo with DC names [dns] - 10https://gerrit.wikimedia.org/r/539101 (owner: 10Alexandros Kosiaris) [12:05:34] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=kubernetes1001.* [12:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:54] !log depool kubernetes1001 and disable puppet on it for rsyslog mmkubernetes testing [12:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:05] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539102 (https://phabricator.wikimedia.org/T231433) [12:07:13] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539102 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [12:07:21] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539102 (https://phabricator.wikimedia.org/T231433) [12:07:37] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:08:41] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:08:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:08:55] uh... [12:08:57] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:08:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:09:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:09:32] <_joe_> we just had a huge spike on the api cluster [12:09:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:09:44] _joe_: L7 spike? [12:09:51] aka valid requests [12:09:52] or just traffic? [12:10:03] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:10:05] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:10:18] basically all the eqiad varnish bes maxed out on connections to the api cluster [12:10:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:10:27] ouch [12:10:29] we had something similar, but slightly less in magnitude, happen yesterday [12:10:29] <_joe_> yeah the api was lagging [12:10:30] L7 then [12:10:42] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now [12:10:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:11:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:11:23] <_joe_> it's gone now but we should dig into it [12:11:29] _joe_: do you think expensive requests, or a lagging backend? [12:11:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:11:45] <_joe_> cdanis: I would not make a bet at this moment [12:12:05] CPU usage on api_appserver cluster dropped during that interval [12:12:20] <_joe_> yeah that seems to point to either a deadlock or a backend lagging [12:12:45] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 45 connections established with conf2001.codfw.wmnet:2379 (min=45) https://wikitech.wikimedia.org/wiki/PyBal [12:13:09] PROBLEM - HTTPS Unified ECDSA on cp1078 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:13:18] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539098 (https://phabricator.wikimedia.org/T231433) [12:13:25] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286) [12:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:29] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [12:13:45] PROBLEM - HTTPS Unified RSA on cp1078 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:13:51] RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops [12:14:05] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [12:14:05] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:14:25] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 443 on cp1078 [puppet] - 10https://gerrit.wikimedia.org/r/539098 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [12:14:56] (03PS1) 10Jbond: kibana::phatality: fix sudo line [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) [12:15:30] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) a:03Phamhi [12:15:36] <_joe_> cdanis: let's move to #sre ? 
[12:16:00] (03CR) 10jerkins-bot: [V: 04-1] kibana::phatality: fix sudo line [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [12:16:55] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:18:01] RECOVERY - HTTPS Unified RSA on cp1078 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345529 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:18:39] RECOVERY - HTTPS Unified ECDSA on cp1078 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345490 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:18:43] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@241b284]: Performance tweaks: domUtil + addSectionEditButtons (T229286) (duration: 05m 17s) [12:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:46] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [12:19:20] (03PS2) 10Jbond: kibana::phatality: fix sudo line [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) [12:19:59] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp1078 is CRITICAL: connect to address 10.64.0.133 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:20:05] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp1078 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:20:34] icinga being slow on getting the new checks [12:21:03] 
(03CR) 10Jbond: [C: 03+2] kibana::phatality: fix sudo line [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [12:21:12] (03PS3) 10Jbond: kibana::phatality: fix sudo line [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) [12:23:32] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp1078 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345197 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 57 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:24:43] (03PS1) 10Alexandros Kosiaris: Rename codfw releases to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/539108 [12:24:45] (03PS1) 10Alexandros Kosiaris: restrouter: Fix the parsoid port in the configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/539109 (https://phabricator.wikimedia.org/T223953) [12:25:14] 10Operations, 10Traffic, 10Patch-For-Review: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [12:27:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Rename codfw releases to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/539108 (owner: 10Alexandros Kosiaris) [12:27:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] restrouter: Fix the parsoid port in the configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/539109 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [12:28:23] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'production' . [12:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:01] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'restrouter' for release 'production' . 
[12:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:22] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - restrouter_7231: Servers kubernetes2001.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:31:36] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:32:04] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'restrouter' for release 'staging' . [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:42] !log depool wdqs1005 to allow it catch up on lag [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:10] RECOVERY - SSH mw1290.mgmt on mw1290.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:36:36] jouncebot, NotASpy|away: [12:36:38] oops, sorry [12:36:40] jouncebot: now [12:36:40] No deployments scheduled for the next 3 hour(s) and 23 minute(s) [12:37:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for BBU replacement T233534', diff saved to https://phabricator.wikimedia.org/P9176 and previous config saved to /var/cache/conftool/dbconfig/20190925-123736-marostegui.json [12:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:41] T233534: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 [12:37:51] !log Stop MySQL on db1075 for BBU replacement T233534 [12:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:56] (03PS1) 10Marostegui: Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539112 [12:40:29] (03PS1) 10Marostegui: db1075: Change binlog format to 
STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/539113 (https://phabricator.wikimedia.org/T233569) [12:41:11] (03CR) 10Marostegui: [C: 03+2] db1075: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/539113 (https://phabricator.wikimedia.org/T233569) (owner: 10Marostegui) [12:41:25] (03PS1) 10Alexandros Kosiaris: calico: Add port 8000 (parsoid) to restrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539115 (https://phabricator.wikimedia.org/T223953) [12:41:33] !log Shutdown db1075 for onsite maintenance T233534 [12:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Add port 8000 (parsoid) to restrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/539115 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [12:42:53] (03PS2) 10Marostegui: Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539112 [12:43:54] (03CR) 10Marostegui: [C: 03+2] Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539112 (owner: 10Marostegui) [12:44:34] !log Repool labsdb1011 T233766 [12:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:37] T233766: labsdb1011 mariadb crashed - https://phabricator.wikimedia.org/T233766 [12:44:37] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:01] !log akosiaris@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[12:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] !log akosiaris@ helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:24] !log akosiaris@ helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [12:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:42] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:47:48] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:47:54] !log akosiaris@ helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[12:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:12] 10Operations, 10LDAP-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Verena) [12:48:48] (03PS2) 10Elukey: profile::zookeeper::server: use openjkd-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/539069 (https://phabricator.wikimedia.org/T217057) [12:50:10] (03CR) 10Elukey: [C: 03+2] profile::zookeeper::server: use openjkd-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/539069 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [12:50:51] (03PS4) 10Alexandros Kosiaris: Activate restrouter discovery records [dns] - 10https://gerrit.wikimedia.org/r/526449 (https://phabricator.wikimedia.org/T223953) [12:50:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] Activate restrouter discovery records [dns] - 10https://gerrit.wikimedia.org/r/526449 (https://phabricator.wikimedia.org/T223953) (owner: 10Alexandros Kosiaris) [12:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): ' Depool for schema change on the logging table: db2088:3312 db2084:3315 db2087:3316 db2086:3317 T233625', diff saved to https://phabricator.wikimedia.org/P9177 and previous config saved to /var/cache/conftool/dbconfig/20190925-125140-marostegui.json [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:45] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [12:53:46] (03PS1) 10DCausse: [cirrus] Disable instant indexing on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539117 [12:55:12] PROBLEM - Host db1075.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2085:3311 T233625', diff saved to https://phabricator.wikimedia.org/P9178 and previous config saved to /var/cache/conftool/dbconfig/20190925-125601-marostegui.json [12:56:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:22] db1075 mgmt down is expected [12:57:10] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [13:00:54] RECOVERY - Host db1075.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [13:01:58] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) I can see the battery now after @Jclark-ctr has installed the new one: ` Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ` [13:05:52] RECOVERY - HP RAID on db1075 is OK: OK: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [13:05:58] 10Operations, 10DBA: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230783 (10Jclark-ctr) [13:06:01] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr replaced battery. 
resolving ticket [13:06:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1075 after replacing its BBU', diff saved to https://phabricator.wikimedia.org/P9179 and previous config saved to /var/cache/conftool/dbconfig/20190925-130613-marostegui.json [13:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:32] (03PS1) 10Elukey: Revert "profile::zookeeper::server: use openjkd-8 on Buster" [puppet] - 10https://gerrit.wikimedia.org/r/539119 [13:11:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1075 after BBU replacement', diff saved to https://phabricator.wikimedia.org/P9180 and previous config saved to /var/cache/conftool/dbconfig/20190925-131149-marostegui.json [13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:59] (03CR) 10Elukey: [C: 03+2] Revert "profile::zookeeper::server: use openjkd-8 on Buster" [puppet] - 10https://gerrit.wikimedia.org/r/539119 (owner: 10Elukey) [13:21:12] (03PS1) 10Elukey: Move the Hadoop test cluster to the Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/539120 (https://phabricator.wikimedia.org/T217057) [13:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1075 after BBU replacement', diff saved to https://phabricator.wikimedia.org/P9181 and previous config saved to /var/cache/conftool/dbconfig/20190925-132147-marostegui.json [13:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:31] (03CR) 10Filippo Giunchedi: "Superseded by I4717874405" [puppet] - 10https://gerrit.wikimedia.org/r/539088 (owner: 1020after4) [13:25:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [13:25:49] (03CR) 10Filippo Giunchedi: "Also I'd have expected puppet validate_cmd to fail here :|" [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [13:25:54] (03CR) 10Elukey: [C: 03+2] Move the Hadoop test cluster to the Analytics Zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/539120 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [13:27:11] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [13:31:31] !log installing remaining expat security updates [13:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1075 after BBU replacement', diff saved to https://phabricator.wikimedia.org/P9182 and previous config saved to /var/cache/conftool/dbconfig/20190925-133146-marostegui.json [13:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] (03PS1) 10Gilles: Remove origin trials config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539121 (https://phabricator.wikimedia.org/T230817) [13:33:34] (03CR) 10Filippo Giunchedi: initial commit (032 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [13:37:29] (03PS5) 10Giuseppe Lavagetto: Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 [13:37:43] (03CR) 10Gilles: [C: 03+2] Remove origin trials config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539121 (https://phabricator.wikimedia.org/T230817) (owner: 10Gilles) [13:38:22] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:49] (03Merged) 10jenkins-bot: Remove origin trials config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539121 (https://phabricator.wikimedia.org/T230817) (owner: 10Gilles) [13:39:10] (03CR) 10jenkins-bot: Remove origin trials config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539121 (https://phabricator.wikimedia.org/T230817) (owner: 10Gilles) [13:41:00] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T230817 Remove origin trials config (duration: 01m 05s) [13:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:03] T230817: Clean up origin trial code - https://phabricator.wikimedia.org/T230817 [13:41:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "ATS: Vary-slotting for PHP7" [puppet] - 10https://gerrit.wikimedia.org/r/539063 (owner: 10Giuseppe Lavagetto) [13:42:09] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10akosiaris) 05Open→03Resolved a:03akosiaris restrouter is up and running, LVS is setu... 
[13:42:13] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10akosiaris) [13:42:20] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [13:42:47] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC) - https://phabricator.wikimedia.org/T227538 (10Marostegui) [13:44:30] (03PS1) 10Elukey: role::analytics_cluster::zookeeper: enable prometheus metrics by default [puppet] - 10https://gerrit.wikimedia.org/r/539122 (https://phabricator.wikimedia.org/T217057) [13:45:49] <_joe_> !log restarting trafficserver on cp1075 to pick up the change [13:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:00] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18561/" [puppet] - 10https://gerrit.wikimedia.org/r/539122 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [13:47:06] (03PS2) 10Elukey: role::analytics_cluster::zookeeper: enable prometheus metrics by default [puppet] - 10https://gerrit.wikimedia.org/r/539122 (https://phabricator.wikimedia.org/T217057) [13:47:33] (03PS3) 10Elukey: profile::zookeeper::server: enable prometheus metrics by default [puppet] - 10https://gerrit.wikimedia.org/r/539122 (https://phabricator.wikimedia.org/T217057) [13:47:50] (03CR) 10Elukey: [C: 03+2] profile::zookeeper::server: enable prometheus metrics by default [puppet] - 10https://gerrit.wikimedia.org/r/539122 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [13:47:56] PROBLEM - Check systemd state on debmonitor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:20] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10Marostegui) [13:49:10] (03CR) 10Anomie: Add Draft and Draft_talk aliases for wikis that define draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [13:49:50] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Gilles) [13:49:53] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10Marostegui) [13:50:32] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh (Thursday 10/31 @11am UTC) - https://phabricator.wikimedia.org/T227543 (10Marostegui) [13:51:14] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission [13:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:16] !log filippo@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:32] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission [13:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:25] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=False) [13:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:29] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `ms-be[2013-2015].codfw.wmnet` - 
ms-be2013.codfw.wmnet (**PASS**... [13:53:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1075 after BBU replacement', diff saved to https://phabricator.wikimedia.org/P9183 and previous config saved to /var/cache/conftool/dbconfig/20190925-135317-marostegui.json [13:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:56] 10Operations, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Marostegui) db1075 is now fully pooled back. Thanks John! [13:54:11] (03PS6) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [13:54:13] (03PS8) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [13:54:15] (03PS4) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [13:54:40] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10fgiunchedi) [13:55:17] !log rolling restart of apache on webperf* to pick up Expat security update [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:26] (03CR) 10jerkins-bot: [V: 04-1] query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [14:00:36] !log Rolling restart thumbor for expat updat [14:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:09] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) [14:02:40] 
10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10jrobell) Hey all, I spoke to Erin and Jerrie about this and there seems to be some confusion around which groups are... [14:02:42] !log Deploy schema change on db2086:3318 [14:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:56] (03PS1) 10Filippo Giunchedi: Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) [14:08:24] (03CR) 10jerkins-bot: [V: 04-1] Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) (owner: 10Filippo Giunchedi) [14:09:16] (03PS2) 10Filippo Giunchedi: Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) [14:09:18] (03CR) 10Muehlenhoff: [C: 03+1] Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) (owner: 10Filippo Giunchedi) [14:09:55] (03CR) 10jerkins-bot: [V: 04-1] Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) (owner: 10Filippo Giunchedi) [14:10:52] I'm not understanding gerrit, the patch is updated [14:10:54] This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset. 
[14:11:09] or rather jenkins/zuul [14:11:24] ah there we go [14:11:32] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom ms-be201[345] [puppet] - 10https://gerrit.wikimedia.org/r/539126 (https://phabricator.wikimedia.org/T221068) (owner: 10Filippo Giunchedi) [14:11:57] (03PS1) 10Elukey: role::prometheus::analytics: add Analytics Zookeeper cluster's metrics [puppet] - 10https://gerrit.wikimedia.org/r/539129 (https://phabricator.wikimedia.org/T217057) [14:12:26] (03PS2) 10Elukey: role::prometheus::analytics: add Analytics Zookeeper cluster's metrics [puppet] - 10https://gerrit.wikimedia.org/r/539129 (https://phabricator.wikimedia.org/T217057) [14:14:10] !log restarting apache on various services to pick up Expat security update (releases, netmon, miscweb, graphite, planet,puppetboard) [14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:22] (03PS1) 10Filippo Giunchedi: Decom ms-be201[345] [dns] - 10https://gerrit.wikimedia.org/r/539130 (https://phabricator.wikimedia.org/T221068) [14:15:59] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom ms-be201[345] [dns] - 10https://gerrit.wikimedia.org/r/539130 (https://phabricator.wikimedia.org/T221068) (owner: 10Filippo Giunchedi) [14:16:55] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: add Analytics Zookeeper cluster's metrics [puppet] - 10https://gerrit.wikimedia.org/r/539129 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [14:18:19] 10Operations, 10ops-codfw, 10decommission, 10media-storage, and 2 others: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10fgiunchedi) [14:19:07] 10Operations, 10ops-codfw, 10decommission, 10media-storage, and 2 others: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10fgiunchedi) a:05RobH→03Papaul This is ready for you to take over @Papaul, thanks! 
[14:22:26] (03PS1) 10Filippo Giunchedi: hieradata: remove decom legacy ms-be partitions exception [puppet] - 10https://gerrit.wikimedia.org/r/539131 [14:24:48] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove decom legacy ms-be partitions exception [puppet] - 10https://gerrit.wikimedia.org/r/539131 (owner: 10Filippo Giunchedi) [14:24:56] (03PS2) 10Filippo Giunchedi: hieradata: remove decom legacy ms-be partitions exception [puppet] - 10https://gerrit.wikimedia.org/r/539131 [14:25:45] 10Operations, 10ops-codfw, 10media-storage: rack/setup/install ms-be205[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T233638 (10Papaul) [14:25:56] (03CR) 10Petar.petkovic: Add Draft and Draft_talk aliases for wikis that define draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [14:29:11] !log restarting apache on grafana1001 to pick up Expat security update [14:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:47] (03PS9) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [14:33:49] (03PS5) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [14:34:58] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission [14:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:14] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=True) [14:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:18] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for 
hosts: `ms-be1027.eqiad.wmnet` - ms-be1027.eqiad.wmnet (**FAIL**) - Host steps raised exception:... [14:35:48] (03CR) 10Mathew.onipe: query_service: prepare query_service for reusbility (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [14:36:26] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10fgiunchedi) Indeed the decom script failed on this host that's powered down already, the full trace is ` root@cumin1001:~# cookbook sre.hosts.decommission -t T233289 ms-be1027.eqiad.wmne... [14:38:03] !log restarting apache on analytics-tool/an-tool to pick up Expat security update [14:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:29] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10fgiunchedi) I tested the cookbook on ms-be1027 in T233289, the host is powered down and not coming back (faulty hw) and the cookbook stopped when trying to get to the host, whereas IMHO it shoul... [14:40:10] (03PS1) 10Petar.petkovic: Fix Draft namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) [14:44:37] (03CR) 10Anomie: [C: 03+1] "Other than one nitpick over a trailing comma, this looks good to me." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) (owner: 10Petar.petkovic) [14:44:48] (03PS3) 10Alexandros Kosiaris: rsyslog: Support adding metadata to input, default to off [puppet] - 10https://gerrit.wikimedia.org/r/538626 (https://phabricator.wikimedia.org/T207200) [14:44:50] (03PS3) 10Alexandros Kosiaris: rsyslog: populate kubernetes configuration [puppet] - 10https://gerrit.wikimedia.org/r/538627 (https://phabricator.wikimedia.org/T207200) [14:45:45] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10Nuria) [14:45:58] (03PS2) 10Petar.petkovic: Fix Draft namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) [14:46:39] (03CR) 10Petar.petkovic: Fix Draft namespace aliases (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) (owner: 10Petar.petkovic) [14:46:55] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:48:31] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:59] (03CR) 10Anomie: [C: 03+2] Fix Draft namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) (owner: 10Petar.petkovic) [14:49:20] (03PS1) 10Volans: sre.hosts.decommission: fix typo in method name [cookbooks] - 10https://gerrit.wikimedia.org/r/539135 (https://phabricator.wikimedia.org/T231066) [14:49:42] 10Operations, 10DC-Ops, 10SRE-tools, 10Patch-For-Review: Host decommission improvements - 
https://phabricator.wikimedia.org/T231066 (10Volans) >>! In T231066#5522792, @fgiunchedi wrote: > I tested the cookbook on ms-be1027 in T233289, the host is powered down and not coming back (faulty hw) and the cookboo... [14:51:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/539135 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [14:51:34] (03Merged) 10jenkins-bot: Fix Draft namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) (owner: 10Petar.petkovic) [14:52:11] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix typo in method name [cookbooks] - 10https://gerrit.wikimedia.org/r/539135 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [14:52:33] (03CR) 10jenkins-bot: Fix Draft namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539134 (https://phabricator.wikimedia.org/T233770) (owner: 10Petar.petkovic) [14:52:54] !log pool wdqs1005 - lag issues have minimized. [14:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:32] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix Draft namespace aliases (T233770) (duration: 01m 04s) [14:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:35] T233770: "ApiQuerySiteinfo.php: PHP Notice: Undefined offset: 118" on ko.wikisource.org - https://phabricator.wikimedia.org/T233770 [14:54:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/539135 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [14:54:28] (03CR) 10Jdlrobson: "Can you rebase this? I believe VariantSettings recently replaced Initialisesettings. Will take closer look in a couple of hours. 
Thanks fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [14:55:12] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/Wikibase/view/lib/resources.php: Revert "Merge valueview modules": T233800 (duration: 01m 04s) [14:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:16] T233800: Adding or editing monolingual text fails on beta and test Wikidata - https://phabricator.wikimedia.org/T233800 [14:55:31] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix typo in method name [cookbooks] - 10https://gerrit.wikimedia.org/r/539135 (https://phabricator.wikimedia.org/T231066) (owner: 10Volans) [14:56:12] godog: I've merged and deployed on cumin1001 the fix, could you retry the cookbook on the failed host? It should warn that some steps did fail and the cookbook will be considered failed but it should indeed perform all the rest of the steps. [14:56:17] thanks for reporting it! [14:57:11] volans: that was quick, thanks! yeah I'll try again now [14:57:14] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission [14:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:27] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=True) [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:31] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Unable to power on ms-be1027 - https://phabricator.wikimedia.org/T233289 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: `ms-be1027.eqiad.wmnet` - ms-be1027.eqiad.wmnet (**FAIL**) - Downtimed host on Icinga -... 
[14:57:33] \o/ worked great [14:57:33] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10MoritzMuehlenhoff) [14:58:44] :) [15:02:42] (03PS1) 10Filippo Giunchedi: Decom ms-be1027 [puppet] - 10https://gerrit.wikimedia.org/r/539136 (https://phabricator.wikimedia.org/T233289) [15:03:48] (03CR) 10Filippo Giunchedi: [C: 03+1] Decom ms-be1027 [puppet] - 10https://gerrit.wikimedia.org/r/539136 (https://phabricator.wikimedia.org/T233289) (owner: 10Filippo Giunchedi) [15:07:39] !log imported jenkins 2.176.4 for jessie/stretch T233214 [15:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:41] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) [15:12:50] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10Papaul) ` papaul@asw-c-codfw# run show interfaces ge-6/0/15 descriptions Interface Admin Link Description ge-6/0/15 down down DISABLED [15:14:57] (03CR) 10Ayounsi: "> Patch Set 7:" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:16:46] (03CR) 10Ayounsi: "mccabe: MC0001 / run is too complex (12) is back..." [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:16:59] (03PS8) 10Ayounsi: Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) [15:17:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] - member ge-1/0/4; [edit interfaces interface-range disabled] member ge-3/0/2... 
[15:17:48] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (10Papaul) [15:18:25] (03PS1) 10Elukey: profile::analytics::cluster::packages::statistics: add python3-mock [puppet] - 10https://gerrit.wikimedia.org/r/539138 [15:18:44] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to update Sentry PDUs passwords [cookbooks] - 10https://gerrit.wikimedia.org/r/537486 (https://phabricator.wikimedia.org/T233053) (owner: 10Ayounsi) [15:21:31] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::packages::statistics: add python3-mock [puppet] - 10https://gerrit.wikimedia.org/r/539138 (owner: 10Elukey) [15:24:08] 10Operations, 10ops-codfw, 10decommission: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/3; [edit interfaces interface-range disabled] member ge-6/0/15 { ... } +... [15:24:51] 10Operations, 10ops-codfw, 10decommission: Decommission db2036 - https://phabricator.wikimedia.org/T223885 (10Papaul) [15:24:55] (03CR) 10Ayounsi: [C: 03+2] Homer deploy repo init [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [15:25:32] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Homer deploy repo init [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [15:27:53] 10Operations, 10ops-codfw, 10decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/4; [edit interfaces interface-range disabled] member ge-6/0/3 { ... } +... 
[15:28:22] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Cmjohnson) we're on the schedule to get the board swapped for 9/26 [15:28:41] 10Operations, 10ops-codfw, 10decommission: Decommission db2037 - https://phabricator.wikimedia.org/T224720 (10Papaul) [15:30:24] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) Cool, I will have the host down for you tomorrow. Thanks for the heads up [15:32:06] Urbanecm: if you have time, I'm planning to do this today T230359 [15:32:07] T230359: Create N'Ko Wikipedia - https://phabricator.wikimedia.org/T230359 [15:32:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/5; [edit interfaces interface-range disabled] member ge-6/0/4... [15:32:43] but Amir1, I thought creating new wikis was broken! tell me more about this fascinating new development [15:32:50] (: [15:32:55] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Papaul) [15:35:11] (03CR) 10Gehel: [C: 04-1] query_service: prepare query_service for reusbility (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [15:35:12] Lucas_WMDE: :D but a superhero reviewed the fixing patch [15:36:19] the real superhero is the one who wrote the patch though ^^ [15:36:54] 10Operations, 10ops-codfw, 10decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] - member ge-3/0/27; [edit interfaces interface-range disabled] member ge-5/0/32 { ... 
} +... [15:37:18] 10Operations, 10ops-codfw, 10decommission: Decommission db2040 - https://phabricator.wikimedia.org/T224079 (10Papaul) [15:38:15] !log installing php5 security updates [15:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:47] (03CR) 10Cwhite: "> Patch Set 3:" [software/service-checker] - 10https://gerrit.wikimedia.org/r/538711 (owner: 10Cwhite) [15:39:03] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10herron) Looping in @RStallman-legalteam [15:40:11] (03CR) 10Cwhite: [C: 03+2] profile: add mmutf8fix to kafka output actions [puppet] - 10https://gerrit.wikimedia.org/r/538642 (https://phabricator.wikimedia.org/T233662) (owner: 10Cwhite) [15:40:18] (03PS3) 10Cwhite: profile: add mmutf8fix to kafka output actions [puppet] - 10https://gerrit.wikimedia.org/r/538642 (https://phabricator.wikimedia.org/T233662) [15:40:45] 10Operations, 10Traffic, 10Performance-Team (Radar): Enable mwdebug routes for noc.wikimedia.org - https://phabricator.wikimedia.org/T233768 (10herron) p:05Triage→03Normal [15:41:07] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia-bd-regional mailing list - https://phabricator.wikimedia.org/T233742 (10herron) p:05Triage→03Normal [15:42:58] 10Operations, 10LDAP-Access-Requests: Turnilo access for Jerrie Kumalah and Erin Yener (fundraising analysts) - https://phabricator.wikimedia.org/T233780 (10herron) p:05Triage→03Normal [15:43:11] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: NDA Request from WMDE employee Verena - https://phabricator.wikimedia.org/T233807 (10herron) p:05Triage→03Normal [15:43:35] (03PS1) 10Muehlenhoff: Switch auth1002/auth2001 to role::test [puppet] - 10https://gerrit.wikimedia.org/r/539145 [15:43:37] 10Operations, 10netbox: Netbox: tracking of hardware errors / grouping servers in order/batches - https://phabricator.wikimedia.org/T233774 (10herron) 
p:05Triage→03Normal [15:43:57] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10herron) p:05Triage→03Normal [15:44:11] 10Operations, 10ops-codfw, 10decommission: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) ` papaul@asw-c-codfw# show |compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/8; [edit interfaces interface-range disabled] member ge-6/0/5 { ... } +... [15:44:31] 10Operations, 10serviceops: Make the parsoid cluster to support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10herron) p:05Triage→03Normal [15:44:36] 10Operations, 10ops-codfw, 10decommission: Decommission db2041 - https://phabricator.wikimedia.org/T223950 (10Papaul) [15:47:04] 10Operations, 10Analytics, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10herron) p:05Triage→03Normal [15:48:03] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/12; [edit interfaces interface-range disable... 
[15:48:22] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2043.codfw.wmnet - https://phabricator.wikimedia.org/T230311 (10Papaul) [15:48:30] (03CR) 10Cwhite: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [15:49:47] (03PS3) 10KartikMistry: Use ContentTranslationEnableMT to disable MT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/538867 (https://phabricator.wikimedia.org/T232986) [15:50:52] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/13; [edit interfaces interface-r... [15:51:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: Decommission db2044.codfw.wmnet - https://phabricator.wikimedia.org/T230761 (10Papaul) [15:53:57] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) ` papaul@asw-c-codfw# show | compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/18; [edit interfaces interface-range disabled] me... 
[15:54:21] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2049.codfw.wmnet - https://phabricator.wikimedia.org/T230721 (10Papaul) [15:55:20] (03PS1) 10Muehlenhoff: Enable ldap-corp1001/2001 as additional replicas [puppet] - 10https://gerrit.wikimedia.org/r/539150 [15:56:56] (03PS1) 10Mforns: analytics::refinery::job::data_purge: Add timer to delete old MWH dumps [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) [15:57:14] (03PS5) 10KartikMistry: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [15:58:12] (03PS2) 10Mforns: analytics::refinery::job::data_purge: Add timer to delete old MWH dumps [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) [15:59:47] (03CR) 10Mforns: "Deletion command tested carefully, and corresponding checksum added." [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1600). Please do the needful. [16:00:04] kart_ and tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:22] OK! I'm here. [16:00:33] o/ [16:01:10] (03CR) 10KartikMistry: [C: 03+2] "SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [16:02:13] (03Merged) 10jenkins-bot: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [16:02:29] (03CR) 10jenkins-bot: Fix incorrect channel name for TranslationNotifications extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537628 (https://phabricator.wikimedia.org/T144780) (owner: 10Abijeet Patro) [16:03:49] Deploying.. [16:06:40] !log kartik@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit|537628|Fix incorrect channel name for TranslationNotifications extension (T144780)]] (duration: 01m 06s) [16:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:46] T144780: Translation Notification Bot sending the same message multiple times to every translator - https://phabricator.wikimedia.org/T144780 [16:06:51] tgr: I'm done. Go ahead. 
[16:06:56] thx [16:09:17] (03PS5) 10Gergő Tisza: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [16:10:35] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [16:11:29] (03Merged) 10jenkins-bot: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [16:11:47] (03CR) 10jenkins-bot: GrowthExperiments: Enable WelcomeSurvey for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/537801 (https://phabricator.wikimedia.org/T233063) (owner: 10Kosta Harlan) [16:15:42] Amir1: Okay, I'll send a config patch soon then :) [16:15:59] (03CR) 10Krinkle: "Might wanna tag with T91474 and T113114." [puppet] - 10https://gerrit.wikimedia.org/r/513266 (owner: 10Giuseppe Lavagetto) [16:16:48] (03CR) 10Dzahn: [C: 03+2] base: remove md5 from gen_fingerprints' output [puppet] - 10https://gerrit.wikimedia.org/r/539025 (owner: 10Elukey) [16:16:58] (03PS3) 10Dzahn: base: remove md5 from gen_fingerprints' output [puppet] - 10https://gerrit.wikimedia.org/r/539025 (owner: 10Elukey) [16:18:56] Urbanecm: nice. thanks. 
[16:19:12] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:537801|GrowthExperiments: Enable WelcomeSurvey for euwiki (T233063)]] (duration: 01m 04s) [16:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:17] T233063: Deploy welcome survey to Basque Wikipedia - https://phabricator.wikimedia.org/T233063 [16:20:47] (03PS3) 10Elukey: Remove Python 2 packages from Analytics Client nodes [puppet] - 10https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734) [16:21:49] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thanks! that is harmless to merge because usage is for humans to paste output on wikitech server pages" [puppet] - 10https://gerrit.wikimedia.org/r/539025 (owner: 10Elukey) [16:25:17] (03CR) 10Dzahn: "works as expected on puppetmaster1001" [puppet] - 10https://gerrit.wikimedia.org/r/539025 (owner: 10Elukey) [16:26:07] (03PS4) 10Elukey: Remove Python 2 packages from Analytics Client nodes [puppet] - 10https://gerrit.wikimedia.org/r/538750 (https://phabricator.wikimedia.org/T204734) [16:32:39] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) 05Stalled→03Declined @JanZerebecki as the original reporter. Please see T99531#5406014 and all the other comments above. This has b... 
[16:33:59] (03PS3) 10Herron: logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) [16:38:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "> This should work because implicit hiera lookups take priority over default value assignment" [puppet] - 10https://gerrit.wikimedia.org/r/538976 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:38:10] (03CR) 10jerkins-bot: [V: 04-1] logstash: throttle duplicate normalized_message with level:ERR* [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [16:38:48] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 49.39 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:39:36] (03PS1) 10Elukey: prometheus::node_puppet_agent: use Python3 and its deps [puppet] - 10https://gerrit.wikimedia.org/r/539156 [16:42:00] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 83.08 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:44:22] (03CR) 10Filippo Giunchedi: "LGTM, are all of these deps available in jessie as well ?" [puppet] - 10https://gerrit.wikimedia.org/r/539156 (owner: 10Elukey) [16:48:01] (03CR) 10Elukey: "> LGTM, are all of these deps available in jessie as well ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/539156 (owner: 10Elukey) [16:50:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "Though I'd wait tomorrow for merging" [puppet] - 10https://gerrit.wikimedia.org/r/539156 (owner: 10Elukey) [16:51:09] (03PS3) 10Dzahn: Phatality: Escape the colon in the sudoers rule [puppet] - 10https://gerrit.wikimedia.org/r/539088 (owner: 1020after4) [16:51:51] (03CR) 10Dzahn: [C: 04-1] "duplicate. rebased to nothing because it's already done." [puppet] - 10https://gerrit.wikimedia.org/r/539088 (owner: 1020after4) [16:51:56] (03CR) 10Jdlrobson: [C: 04-1] "Sorry this is taking so long." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [16:55:28] (03CR) 10Masumrezarock100: "The PNG looks a bit blurry if I zoom in. Possibly because of the low resolution." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [16:57:14] (03CR) 10Urbanecm: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [16:57:30] (03PS2) 10Urbanecm: Add wgMinervaCustomLogos for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) [17:00:05] Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Creating N'ko wikipedia deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1700). [17:00:16] o/ [17:00:40] yet another? lol [17:00:51] good luck! [17:01:01] We create five N'ko Wikipedias every day :D [17:01:16] Lucas_WMDE: Thanks. For now, let's backport the thing [17:02:33] Amir1: I'm starting to make the configuration [17:02:52] Urbanecm: cool. I deploy the backports in the mean time [17:03:50] (03CR) 10Jdlrobson: [C: 04-1] "> Even more recently, VariantSettings.php was migrated back to InitialiseSettings.php :)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:04:34] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) >>! In T215411#5024717, @RobH wrote: > So this has a memory error and is out of warranty. > > This means we should look at decommissioning this host an... [17:05:58] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) [17:08:07] thanks Amir1 [17:08:56] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.24/extensions/WikimediaMaintenance/addWiki.php: Redefine RevisionStore service for the wiki being created (T212881) (duration: 01m 04s) [17:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:00] T212881: addWiki.php broken creating ES tables - https://phabricator.wikimedia.org/T212881 [17:11:17] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.23/extensions/WikimediaMaintenance/addWiki.php: Redefine RevisionStore service for the wiki being created (T212881) (duration: 01m 05s) [17:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:06] 10Operations, 10observability, 10serviceops: Some syslog messages - https://phabricator.wikimedia.org/T233828 (10jijiki) [17:12:18] (03CR) 10Masumrezarock100: "> Patch Set 1: Code-Review-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:12:30] (03PS1) 10Urbanecm: Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) [17:13:43] Amir1: ^^ waiting on jenkins, hoping it is correct ^^ [17:14:25] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - 
https://phabricator.wikimedia.org/T215411 (10jijiki) @Dzahn I do not know yet when this server will be decommissioned, we have quite some work ahead of us before moving thumbor to k8s [17:15:30] Urbanecm: new languages is always more fun, nqo is not in dns [17:15:45] Amir1: good remark [17:15:55] * Urbanecm is going to upload a DNS patch [17:15:59] (03PS1) 10Ladsgroup: Add nqo to langlist [dns] - 10https://gerrit.wikimedia.org/r/539163 (https://phabricator.wikimedia.org/T230543) [17:16:01] Urbanecm: already done [17:16:06] you're fast :) [17:16:17] do we have an op to deploy that around? [17:16:20] (03CR) 10jerkins-bot: [V: 04-1] Add nqo to langlist [dns] - 10https://gerrit.wikimedia.org/r/539163 (https://phabricator.wikimedia.org/T230543) (owner: 10Ladsgroup) [17:16:51] (03PS2) 10Urbanecm: Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) [17:16:55] (03CR) 10Krinkle: logstash: throttle duplicate normalized_message with level:ERR* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538931 (https://phabricator.wikimedia.org/T233739) (owner: 10Herron) [17:17:28] (03PS2) 10Ladsgroup: Add nqo to langlist [dns] - 10https://gerrit.wikimedia.org/r/539163 (https://phabricator.wikimedia.org/T230543) [17:18:01] bblack: ema_ vgutierrez Can you take a look at https://gerrit.wikimedia.org/r/539163 ? It should be straightforward [17:18:13] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for andrew-wmde - https://phabricator.wikimedia.org/T233202 (10Dzahn) @Andrew-WMDE We will need an SSH key from you and then make the needed change in the operations/puppet repo in modules/admin/data/data.yaml. Could you please make a new k... [17:18:35] (03PS3) 10Urbanecm: Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) [17:18:48] Amir1: do you know how InterwikiSortOrders.php works? 
[17:18:57] I've added it where nrm, the following lang code, was [17:19:01] but not sure if that's okay [17:19:08] also already included wikiversions entry [17:19:23] Urbanecm: it doesn't matter anymore, given the compact interwiki list [17:19:33] it has to be there for one or two wikis though [17:19:44] ok [17:19:50] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) @jijiki I assumed it is broken anyways. Can it run despite the memory error? [17:20:55] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) [17:21:27] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) Nevermind then, i declined the decom ticket again. [17:21:33] (03PS3) 10Urbanecm: Add nqo to langlist [dns] - 10https://gerrit.wikimedia.org/r/539163 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [17:25:21] now we need to get this merged before we can move forward [17:25:48] yup [17:25:58] Amir1: Can I mess with mwdebug1002? No problem if not. [17:26:00] i got that. 
hold on [17:26:27] Krinkle: It's fine a ten or more minutes [17:26:59] right now, we need to get dns thing worked out and I don't need it until after running addWiki.php [17:27:08] Krinkle: Make it twenty [17:27:10] OK [17:27:25] * Krinkle debugging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511078/ [17:27:57] (03CR) 10Dzahn: [C: 03+2] "approved by langcom per https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_N'Ko" [dns] - 10https://gerrit.wikimedia.org/r/539163 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [17:29:08] Krinkle: nice [17:29:48] (03PS3) 10Urbanecm: Add wgMinervaCustomLogos for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) [17:29:52] !log DNS - adding nqo (N'Ko) to langlist for new nqo.wikipedia, approved by langcom https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_N'Ko (T230359) [17:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:56] T230359: Create N'Ko Wikipedia - https://phabricator.wikimedia.org/T230359 [17:30:20] Amir1: Urbanecm: nqo.wikipedia.org is an alias for dyna.wikimedia.org. [17:30:26] thanks mutante [17:30:32] should be unblocked [17:30:33] yw [17:30:53] Amir1: the old instructions for the extra commands are not needed anymore nowadays :) [17:31:03] a long running ticket about that has been resolved [17:31:16] (03CR) 10Krinkle: [C: 04-1] "Whatever the case, HHVM is (almost) no longer used. 
And it seems the PHP7 version has this bug fixed as of d4c5a5d5b4ff7b8b4, it does the " [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [17:31:23] so it's also "just merge and authdns-update" for entirely new languages now, which is nice [17:31:57] mutante: niiice, thanks [17:32:31] welcome Amir1, dont forget to tell dba about labs replicas [17:32:34] (03CR) 10Krinkle: [C: 04-1] "Unpicked from Beta Cluster puppetmaster03" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [17:32:36] Amir1: done with mwdebug1002 [17:33:04] It probably takes some time to propagate through cache. nqo.wikipedia.org errors to me right now [17:33:14] Amir1: wfm [17:33:27] mutante: T230543 :D [17:33:28] T230543: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 [17:33:31] i can take a little bit but not too long [17:33:34] Amir1: ok :) [17:33:37] (03PS2) 10Ayounsi: Deploy homer [puppet] - 10https://gerrit.wikimedia.org/r/534538 (https://phabricator.wikimedia.org/T228388) [17:34:25] (03CR) 10Ayounsi: "> Patch Set 1:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/534538 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [17:35:01] I might need to jump back in to flush my dns cache [17:35:49] Amir1: maybe holding shift while clicking reload [17:36:21] no luck yet :( [17:36:55] let me check by phone [17:37:24] (03PS1) 10Jcrespo: Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539165 [17:37:28] (03CR) 10Krinkle: "An older version of this patch is still cherry-picked on Beta Cluster. It should probably be removed there or renewed." 
[puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4) [17:37:33] (03PS2) 10Jcrespo: Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539165 [17:37:34] Amir1: dig A nqo.wikipedia.org @ns0.wikimedia.org [17:37:53] (03Abandoned) 10Jcrespo: Revert "wiki replicas: depool lasbdb1011 just in case of issues" [puppet] - 10https://gerrit.wikimedia.org/r/539165 (owner: 10Jcrespo) [17:38:28] got it on my phone :D [17:38:48] (03CR) 10Krinkle: "Does not appear to be cherry-picked on Beta anymore." [puppet] - 10https://gerrit.wikimedia.org/r/439774 (owner: 10Alex Monk) [17:38:50] let's proceed then [17:39:35] i am predicting escaping issues in mediawiki config or some fun like that later on.. with a language name: N'Ko [17:39:49] (03CR) 10Krinkle: "Does not appear to be cherry-picked on Beta anymore." [puppet] - 10https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) (owner: 10Thcipriani) [17:40:01] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) (owner: 10Urbanecm) [17:41:00] (03Merged) 10jenkins-bot: Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) (owner: 10Urbanecm) [17:41:18] (03CR) 10jenkins-bot: Initial configuration for nqowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) (owner: 10Urbanecm) [17:41:51] (03CR) 10Krinkle: "This is still cherry-picked on Beta. Please remove it from there or re-open the patchset so that the hashtag reflects what is live." [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [17:42:54] (03CR) 10Krinkle: "This is still cherry-picked on Beta. 
Please remove it from there or re-open the patchset so that the hashtag reflects what is live." [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [17:44:36] (03Restored) 10Dzahn: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [17:44:56] (03Restored) 10Dzahn: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [17:45:00] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) We have not noticed anything weird so far, I reckon it should be ok for a little longer [17:45:55] (03CR) 10Volans: "Compiler fails with:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534538 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [17:45:58] 10Operations, 10ops-eqiad, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10jcrespo) Reminder to move sanitarium (T231638#5453802) back here (or somewhere else on eqiad) before closing this ticket. [17:46:00] (03CR) 10Krinkle: "actually, it is. missed it in the grep" [puppet] - 10https://gerrit.wikimedia.org/r/439774 (owner: 10Alex Monk) [17:46:03] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: (OoW) thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) Ok, i misunderstood then. [17:46:05] (03PS4) 10Urbanecm: Add wgMinervaCustomLogos for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) [17:46:51] (03CR) 10Krinkle: "This is still cherry-picked on Beta Cluster (possibly cause for merge conflicts, not sure). 
Please re-open and tag "beta-cherry-picked" or" [puppet] - 10https://gerrit.wikimedia.org/r/488593 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [17:47:49] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 01m 04s) [17:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:11] (03CR) 10Krinkle: "This is still cherry-picked on beta Cluster. Might need to be rebased properly so that it no longer shows up. Might be adding to the curre" [puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [17:48:39] (03CR) 10Jdlrobson: [C: 03+1] Add wgMinervaCustomLogos for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:48:44] (03CR) 10Krinkle: "Still cherry-picked on Beta Cluster, but also merged. needs to be resolved somehow." [puppet] - 10https://gerrit.wikimedia.org/r/538642 (https://phabricator.wikimedia.org/T233662) (owner: 10Cwhite) [17:48:57] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:49:27] thanks Jdlrobson [17:49:59] (03CR) 10CRusnov: "THis could probably be removed from the beta server since afaik it is merged in production." 
[puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [17:50:17] (03PS3) 10Ayounsi: Deploy homer [puppet] - 10https://gerrit.wikimedia.org/r/534538 (https://phabricator.wikimedia.org/T228388) [17:50:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:51:01] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [17:51:02] (03CR) 10Bstorm: "Added a first diagram to the documentation page. It's frustratingly fuzzy because Wikitech doesn't trust the namespace used by draw.io fo" [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [17:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:18] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:51:20] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:52:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:52:25] (03CR) 10Masumrezarock100: [C: 03+1] "A +1 from me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/539052 (https://phabricator.wikimedia.org/T233104) (owner: 10Urbanecm) [17:52:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:52:58] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:53:00] usual scary spike that happens during deployment [17:53:56] (03PS1) 10Ayounsi: Add fake SSH keypair for user homer [labs/private] - 10https://gerrit.wikimedia.org/r/539169 (https://phabricator.wikimedia.org/T228388) [17:54:00] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create nqowiki T230359 (duration: 01m 04s) [17:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:04] T230359: Create N'Ko Wikipedia - https://phabricator.wikimedia.org/T230359 [17:54:31] (03CR) 10Ladsgroup: Initial configuration for nqowiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539162 (https://phabricator.wikimedia.org/T230359) (owner: 10Urbanecm) [17:55:13] Amir1: well, the wikipedia-english.dblist is automatically computed [17:55:21] `%% wikipedia.dblist - wikipedia-e-acute.dblist - wikipedia-devanagari.dblist - wikipedia-cyrillic.dblist` [17:55:51] interesting [17:56:05] Amir1: could you also review&merge https://gerrit.wikimedia.org/r/#/c/539166/, please? 
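(Editor's sketch, not part of the channel log.) The computed-dblist expression quoted at 17:55:21 reads as a chain of set subtractions: start from `wikipedia.dblist` and remove every wiki that appears in the other three lists. Below is a minimal Python illustration of that idea. The function name, the `EXAMPLE_LISTS` contents, and the decision to support only the `-` operator are assumptions made for this example; it is not the evaluator mediawiki-config actually uses.

```python
def eval_dblist_expression(expr, lists):
    """Evaluate a '%% a.dblist - b.dblist - ...' expression left to right.

    `lists` maps a dblist file name to a set of wiki database names.
    Only the '-' operator seen in the log is handled here.
    """
    tokens = expr.split()
    if len(tokens) < 2 or tokens[0] != "%%":
        raise ValueError("not a computed dblist expression")
    result = set(lists[tokens[1]])
    rest = tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError("dangling operator at end of expression")
    # The remaining tokens come in (operator, list name) pairs.
    for op, name in zip(rest[0::2], rest[1::2]):
        if op != "-":
            raise ValueError(f"unsupported operator: {op}")
        result -= lists[name]
    return sorted(result)


# Made-up list contents, purely for illustration.
EXAMPLE_LISTS = {
    "wikipedia.dblist": {"enwiki", "frwiki", "hiwiki", "ruwiki"},
    "wikipedia-e-acute.dblist": {"frwiki"},
    "wikipedia-devanagari.dblist": {"hiwiki"},
    "wikipedia-cyrillic.dblist": {"ruwiki"},
}
EXPR = (
    "%% wikipedia.dblist - wikipedia-e-acute.dblist"
    " - wikipedia-devanagari.dblist - wikipedia-cyrillic.dblist"
)

english = eval_dblist_expression(EXPR, EXAMPLE_LISTS)
print(english)
```

Under this reading, a new wiki added to `wikipedia.dblist` lands in the computed list automatically unless it is also added to one of the subtracted lists.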
[17:56:10] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: Create nqowiki T230359 (duration: 01m 05s) [17:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:50] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) ` {master:7}[edit] papaul@asw-a-codfw# show | compare [edit interfaces interface-range disabled] member ge-3/0/27 { ... } + member ge-6/0/15;... [17:57:20] Urbanecm: thanks. Should we deploy it too? [17:57:25] !log ladsgroup@deploy1001 Synchronized langlist: Create nqowiki T230359 (duration: 01m 02s) [17:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:35] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [17:57:46] I think it can wait for the train [17:58:42] cool [17:58:51] Did we add this to RTL wikis? [17:59:06] Amir1: what is this in this context? 
[17:59:10] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539170 [17:59:12] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539170 (owner: 10Ladsgroup) [17:59:21] dblists [17:59:36] I don't get your q Amir1 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1800) [18:00:04] (03CR) 10jerkins-bot: [V: 04-1] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539170 (owner: 10Ladsgroup) [18:00:07] dblists/rtl.dblist [18:00:10] (03CR) 10jerkins-bot: [V: 04-1] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539170 (owner: 10Ladsgroup) [18:00:45] yup [18:00:53] it exploded [18:00:58] let me fix it [18:01:10] !log creating nqowiki is going to take five more minutes [18:01:12] oh, there's rtl.dblist? [18:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:28] that should be noted in https://wikitech.wikimedia.org/wiki/Add_a_wiki... 
[18:01:46] Don't get me started [18:02:06] now I have undeployed code that can't be merged or deployed [18:02:46] I don't understand what's going on, Amir1 [18:03:42] https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-test-docker/514/console [18:03:51] This is automatic: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/539170 [18:04:12] (03PS1) 10Ladsgroup: Add nqowiki to rtl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539172 (https://phabricator.wikimedia.org/T230359) [18:04:47] (03CR) 10Urbanecm: [C: 03+1] Add nqowiki to rtl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539172 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [18:04:52] (03CR) 10Ladsgroup: [C: 03+2] Add nqowiki to rtl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539172 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [18:05:51] The good thing is that this automatic thing brought the deploy repo back to its original state, so nothing needs to be done [18:05:54] (03Merged) 10jenkins-bot: Add nqowiki to rtl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539172 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [18:06:10] (By automatic thing I mean "scap update-interwiki-cache") [18:06:22] good [18:06:57] (03Abandoned) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539170 (owner: 10Ladsgroup) [18:07:53] !log ladsgroup@deploy1001 Synchronized dblists/rtl.dblist: Create nqowiki T230359 (duration: 01m 05s) [18:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:56] T230359: Create N'Ko Wikipedia - https://phabricator.wikimedia.org/T230359 [18:08:15] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add fake SSH keypair for user homer [labs/private] - 10https://gerrit.wikimedia.org/r/539169 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [18:08:19] (03PS1) 10Ladsgroup: Update interwiki
cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539173 [18:08:21] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539173 (owner: 10Ladsgroup) [18:09:23] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539173 (owner: 10Ladsgroup) [18:10:39] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 39s) [18:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:49] Urbanecm: We need to do https://wikitech.wikimedia.org/wiki/Add_a_wiki#RESTBase and onwards [18:10:56] xyup [18:11:00] !log creating nqowiki is finished now [18:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:09] That can wait [18:11:09] (03CR) 1020after4: "Thank you!!!!" [puppet] - 10https://gerrit.wikimedia.org/r/539104 (https://phabricator.wikimedia.org/T230752) (owner: 10Jbond) [18:11:14] Amir1: I've already uploaded a patch for parsoid fyi [18:11:21] but we are done for now [18:11:22] . [18:11:38] thank you Amir1 ! 
[18:11:56] (03Abandoned) 1020after4: Phatality: Escape the colon in the sudoers rule [puppet] - 10https://gerrit.wikimedia.org/r/539088 (owner: 1020after4) [18:13:31] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@8f05ba9]: Deploy phatality [18:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:55] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10Ladsgroup) @Marostegui The wiki is up, please do what needs to be done 🔨 [18:13:55] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@8f05ba9]: Deploy phatality (duration: 00m 24s) [18:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:02] (03CR) 10jenkins-bot: Add nqowiki to rtl wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539172 (https://phabricator.wikimedia.org/T230359) (owner: 10Ladsgroup) [18:14:06] (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539173 (owner: 10Ladsgroup) [18:14:20] congrats Amir. addwiki is always a hassle [18:15:30] (03CR) 10Ayounsi: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534538 (https://phabricator.wikimedia.org/T228388) (owner: 10Ayounsi) [18:19:06] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@42ba003]: deploy for version 5.6.15 [18:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:56] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@42ba003]: deploy for version 5.6.15 (duration: 00m 50s) [18:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:04] ugh [18:20:16] Kibana did not load properly. Check the server output for more information. [18:20:25] I need to take break and eat lunch. 
Ping me if things are exploded with nqowiki [18:20:58] (03PS14) 10Alex Monk: swift: use implicit /dev/swift prefix for swift devices [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [18:21:22] !log twentyafterfour@deploy1001 Started deploy [releng/phatality@42ba003]: trying again [18:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:53] !log twentyafterfour@deploy1001 Finished deploy [releng/phatality@42ba003]: trying again (duration: 03m 31s) [18:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] ok I think I broke kibana [18:25:34] (03CR) 10Alex Monk: "Does not appear used anymore, dropped from cherry-picks" [puppet] - 10https://gerrit.wikimedia.org/r/488593 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [18:25:38] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f150cbe5358: Failed to establish a new connection: [Errno 111] Connection [18:25:38] ://wikitech.wikimedia.org/wiki/Search%23Administration [18:26:00] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:48] :( [18:27:08] oh it seems to be coming back on it's own ... 
hmm [18:27:14] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_data_nodes: 3, number_of_nodes: 6, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, active_primary_shards: 211, delayed_unassigned_shards: 0, initializing_shards: 0, cluster_name: production-logstash-eqiad, active_shards: 484, relocating_shards: 0, timed_out: False, activ [18:27:14] as_number: 100.0, status: green, unassigned_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [18:27:36] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:12] (03CR) 10Alex Monk: "Replaced with I7bf3d6f77f7495aae7352c2727c13487300cfe33" [puppet] - 10https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [18:30:52] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) a:05RobH→03Papaul [18:31:03] (03CR) 10Alex Monk: "Yeah it's duplicated in there, we have netbox_attachments, phabricator_files, then netbox_attachments again. Removing cherry-pick" [puppet] - 10https://gerrit.wikimedia.org/r/515058 (owner: 10Filippo Giunchedi) [18:31:18] (03Abandoned) 10Paladox: Update plugins to 2.15.13 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/507521 (owner: 10Paladox) [18:31:57] Krinkle: I've sorted out merged/abandoned patches on https://gerrit.wikimedia.org/r/#/q/hashtag:%22beta-cherry-picked%22 [18:32:02] (03PS11) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:32:10] thanks for making that list reflect reality [18:38:44] Krenair: cool, so the merged ones are no longer cherry-picked? 
[18:39:01] I know it's icky to have to untag merged ones, but I did that for now because some were merged *and* picked. [18:39:07] eh [18:39:12] one of the merged ones had technically been merged [18:39:21] except it then got reverted [18:39:28] (03PS12) 10Paladox: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - 10https://gerrit.wikimedia.org/r/520295 [18:39:29] and so I shifted the cherry-pick status to the revert of the revert [18:39:42] ok :) [18:39:46] fun [18:39:56] another one had somehow been rebased into just duplicating the thing that got merged, so I just dropped that one [18:40:34] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] - member ge-5/0/16; [edit interfaces interface-range disabled] membe... [18:41:05] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10Papaul) [18:42:35] (03CR) 10BBlack: [C: 03+1] "LGTM, but I'm not around to manage deployment this week!"
[puppet] - 10https://gerrit.wikimedia.org/r/537974 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [18:43:34] (03PS1) 10Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 [18:44:39] (03CR) 10Alex Monk: Fix maintain_dbusers class lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [18:45:23] (03PS2) 10Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 [18:47:16] (03PS3) 10Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) [18:47:18] 10Operations, 10Traffic: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 (10ayounsi) @BBlack @faidon let me know when is a good time to remove that MSS hack on the routers. To be done one router at a time with time in between for the sessions to re-establish. Will also drain NTT/Teli... [18:48:48] 10Operations, 10Icinga, 10observability: scs monitoring missing in Icinga - https://phabricator.wikimedia.org/T233318 (10ayounsi) [18:51:20] (03PS4) 10Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - 10https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) [18:54:10] (03PS1) 10Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 [18:54:15] (03PS1) 10CRusnov: netbox: Setup automated DNS generation [puppet] - 10https://gerrit.wikimedia.org/r/539182 [18:56:25] (03CR) 10jerkins-bot: [V: 04-1] netbox: Setup automated DNS generation [puppet] - 10https://gerrit.wikimedia.org/r/539182 (owner: 10CRusnov) [18:56:59] (03PS1) 10Isaac Johnson: Enable reader demographic surveys in English, Polish, and Russian. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/539183 (https://phabricator.wikimedia.org/T232525) [18:57:48] (03CR) 10jerkins-bot: [V: 04-1] Enable reader demographic surveys in English, Polish, and Russian. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539183 (https://phabricator.wikimedia.org/T232525) (owner: 10Isaac Johnson) [18:59:58] (03PS2) 10CRusnov: netbox: Setup automated DNS generation [puppet] - 10https://gerrit.wikimedia.org/r/539182 [19:00:05] twentyafterfour: That opportune time is upon us again. Time for a MediaWiki train - American version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T1900). [19:01:11] 10Operations, 10Puppet: Puppet systemd::mask is an anti pattern that has unwanted side effect - https://phabricator.wikimedia.org/T233839 (10hashar) [19:02:12] (03CR) 10jerkins-bot: [V: 04-1] netbox: Setup automated DNS generation [puppet] - 10https://gerrit.wikimedia.org/r/539182 (owner: 10CRusnov) [19:02:42] 10Operations, 10Puppet, 10Traffic: Puppet systemd::mask is an anti pattern that has unwanted side effect - https://phabricator.wikimedia.org/T233839 (10hashar) Adding #traffic team since `systemd::mask` has been introduced for trafficserver / tlsproxy. [19:04:26] (03PS2) 10Isaac Johnson: Enable reader demographic surveys in English, Polish, and Russian. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539183 (https://phabricator.wikimedia.org/T232525) [19:18:16] It looks like the train is unblocked. 
[19:22:12] (03PS1) 1020after4: group1 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539189 [19:22:16] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539189 (owner: 1020after4) [19:23:13] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539189 (owner: 1020after4) [19:23:30] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.24 refs T220749 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/539189 (owner: 1020after4) [19:27:25] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.24 refs T220749 [19:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:51] T220749: 1.34.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T220749 [19:28:29] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.34.0-wmf.24 refs T220749 (duration: 01m 03s) [19:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:00] 10Operations, 10OTRS, 10Wikimedia-Mailing-lists: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Astinson) [19:44:43] (03PS1) 1020after4: Allow deployer to run other kibana-plugin commands [puppet] - 10https://gerrit.wikimedia.org/r/539191 [19:45:51] (03PS3) 10Isaac Johnson: Enable reader demographic surveys in English, Polish, and Russian. With proper links now. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/539183 (https://phabricator.wikimedia.org/T232525) [19:46:28] (03CR) 1020after4: "One alternative to this would be to have a script on the server which removes the old and installs the new version, then just have a sudo " [puppet] - 10https://gerrit.wikimedia.org/r/539191 (owner: 1020after4) [19:50:17] (03PS1) 10Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 [19:50:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 (owner: 10Ayounsi) [19:53:46] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) >>! In T233636#5517703, @EYener wrote: > Tools / Data Sources > Turnilo > Superset Afaict LDAP... [19:53:57] 10Operations, 10LDAP-Access-Requests: Turnilo access for Jerrie Kumalah and Erin Yener (fundraising analysts) - https://phabricator.wikimedia.org/T233780 (10herron) [19:54:00] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) [19:54:24] (03PS2) 10Jbond: Fix maintain_dbusers class lookup [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [19:57:25] (03CR) 10Jbond: "PCC https://puppet-compiler.wmflabs.org/compiler1001/18570/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [19:57:56] (03PS2) 10Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps 
/ ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T2000). [20:00:40] no parsoid deploy today [20:01:33] (03CR) 10Alex Monk: Fix maintain_dbusers class lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [20:01:39] (03CR) 10Hashar: "Pff and that fails :-\" [puppet] - 10https://gerrit.wikimedia.org/r/538938 (owner: 10Hashar) [20:05:38] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/537732 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [20:07:06] (03CR) 10Bstorm: toolforge-k8s: proposed role for all tools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [20:11:43] (03CR) 10Bstorm: "It may be possible to carefully re-order this for easier reading. I'm also working on a diagram of it." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537755 (https://phabricator.wikimedia.org/T227290) (owner: 10Bstorm) [20:13:14] (03PS2) 10Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 [20:17:28] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286) [20:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:35] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [20:20:27] !log Upgrading CI Jenkins [20:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:26] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia-bd-regional mailing list - https://phabricator.wikimedia.org/T233742 (10herron) 05Open→03Resolved a:03herron Hello, this list has been created as requested, and list details should have been sent by the system to the first list admin listed (nahid... 
[20:22:40] 10Operations, 10OTRS, 10Office-IT, 10Wikimedia-Mailing-lists: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10MarcoAurelio) [20:23:00] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@dbf4e7e]: Speed up querySelectors in domUtil (T229286) (duration: 05m 32s) [20:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:04] T229286: "worker died, restarting" mobileapps issue - https://phabricator.wikimedia.org/T229286 [20:28:38] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 6 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [20:30:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10jijiki) 05Resolved→03Open Host is alerting again, I will take a look tomorrow [20:33:52] 10Operations, 10ops-eqiad: apply hostname labels for krb1001/WMF5173 - https://phabricator.wikimedia.org/T233642 (10wiki_willy) a:03Cmjohnson [21:22:17] (03PS1) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) [21:29:27] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [21:31:06] (03CR) 10Jbond: [C: 03+2] Fix maintain_dbusers class lookup [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [21:31:14] (03PS3) 10Jbond: Fix maintain_dbusers class lookup [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [21:34:28] (03PS3) 10Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 [21:36:02] (03CR) 10Jbond: Fix maintain_dbusers class lookup (032 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/538979 (owner: 10Alex Monk) [21:39:10] (03CR) 10Ladsgroup: "> Patch Set 11: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [21:42:22] (03PS2) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) [21:44:20] (03PS4) 10Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/539192 [21:44:34] (03PS3) 10Ladsgroup: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) [21:46:50] (03CR) 10Ladsgroup: "PCC is weird: https://puppet-compiler.wmflabs.org/compiler1001/18584/mw1234.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [21:56:48] !log remove GRE MTU hacks on eqsin caches (cp5xxx) - T232602 [21:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:53] T232602: GRE MTU mitigations - Tracking - https://phabricator.wikimedia.org/T232602 [21:57:40] !log remove GRE MTU hacks on esams caches (cp3xxx) - T232602 [21:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:15] !log remove GRE MTU hacks on eqiad caches (cp1xxx) - T232602 [21:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:19] !log remove GRE MTU hacks on archiva1001 gerrit2001 cobalt install1002 - T232602 [21:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:33] (03CR) 10Krinkle: "Hm. the diff is not showing the removed lines. I think a header is missing? e.g. setting up vars and sending http header before DOCTYPE." 
[puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [22:11:49] (03PS1) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 [22:11:57] (03CR) 10Krinkle: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [22:12:18] (03PS2) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 [22:12:45] (03PS3) 10Paladox: gerrit: add role on gerrit1001 and remove spare [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) [22:12:50] (03CR) 10Krinkle: mediawiki: Use mediawiki::errorpage instead of a php7-fatal-error.php.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup) [22:13:07] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/539204 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [22:14:23] (03CR) 10Krinkle: "I have a gut feeling the HTTP500 bit is redundant. Perhaps try on a debug server to remove it and try the /w/fatal-error.php scenarios?" 
[puppet] - https://gerrit.wikimedia.org/r/539203 (https://phabricator.wikimedia.org/T113114) (owner: Ladsgroup) [22:17:24] (PS4) Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/536359 [22:18:11] (PS5) Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/536359 [22:19:43] (PS5) Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - https://gerrit.wikimedia.org/r/539192 [22:26:48] (Abandoned) Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/536359 (owner: Paladox) [22:30:49] (Restored) Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/536359 (owner: Paladox) [22:31:20] (PS6) Paladox: gerrit: override gerrit::server::slave_hosts under gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/536359 [22:45:25] (PS6) Ayounsi: [WIP] Netbox Juniper installed base report [software/netbox-reports] - https://gerrit.wikimedia.org/r/539192 [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190925T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:05:46] (CR) Nuria: [C: +1] profile::analytics::refinery::job::druid_load: add dims to netflow [puppet] - https://gerrit.wikimedia.org/r/538603 (https://phabricator.wikimedia.org/T229682) (owner: Elukey) [23:06:48] (CR) Nuria: [C: +1] "virtual +2 on my end" [puppet] - https://gerrit.wikimedia.org/r/538312 (https://phabricator.wikimedia.org/T208612) (owner: Mforns) [23:33:49] (PS3) Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/539181 [23:40:41] (PS4) Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/539181 [23:42:45] (CR) jerkins-bot: [V: -1] parsoid: introduce parameter to use parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/539181 (owner: Dzahn) [23:50:08] (PS1) Paladox: Gerrit: Allow configuring accountPattern and groupBase [puppet] - https://gerrit.wikimedia.org/r/539211 [23:53:31] (PS1) Jhedden: openstack: Use WMF style apache logs [puppet] - https://gerrit.wikimedia.org/r/539212 (https://phabricator.wikimedia.org/T223907) [23:54:46] (CR) Jhedden: [C: +2] openstack: Use WMF style apache logs [puppet] - https://gerrit.wikimedia.org/r/539212 (https://phabricator.wikimedia.org/T223907) (owner: Jhedden) [23:55:03] (PS2) Paladox: Gerrit: Allow configuring accountPattern [puppet] - https://gerrit.wikimedia.org/r/539211 [23:55:08] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox) [23:55:51] (PS3) Paladox: Gerrit: Allow configuring accountPattern [puppet] - https://gerrit.wikimedia.org/r/539211 [23:55:57] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/539211 (owner: Paladox)