[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191218T0000).
[00:00:04] mooeypoo: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:02:00] o/ here
[00:02:06] anyone deploying?
[00:02:27] (PS2) Cwhite: scb: add graphoid matching rules and deploy statsd exporter to scb cluster [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870)
[00:02:29] @Niharika ? <3
[00:02:46] Yep, I'm here. Let's do it.
[00:02:50] \o/
[00:03:13] (PS2) Niharika29: Enable $wgAllowRequiringEmailForResets on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/558341 (https://phabricator.wikimedia.org/T240736) (owner: Samwilson)
[00:03:38] (CR) Niharika29: [C: +2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/558341 (https://phabricator.wikimedia.org/T240736) (owner: Samwilson)
[00:03:47] \o
[00:04:33] (Merged) jenkins-bot: Enable $wgAllowRequiringEmailForResets on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/558341 (https://phabricator.wikimedia.org/T240736) (owner: Samwilson)
[00:06:01] mooeypoo: The patch is on mwdebug1001.
[00:07:15] Niharika: lookin good!
[00:07:30] Cool!
[00:07:40] looks great!
[00:08:32] (CR) Niharika29: [C: +2] [cirrus] Disable Glent M0 A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[00:08:41] (PS4) Niharika29: [cirrus] Disable Glent M0 A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[00:09:14] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable on test wikis - T240736 (duration: 01m 02s)
[00:09:16] mooeypoo: musikanimal: Deployed.
[00:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:22] T240736: PRU: Enable PRU Functionality via UI in Test Wiki - https://phabricator.wikimedia.org/T240736
[00:09:26] Thank you!
[00:09:43] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[00:10:20] (PS1) Arlolra: Bump Parsoid/PHP cluster memory_limit [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806)
[00:11:18] (Merged) jenkins-bot: [cirrus] Disable Glent M0 A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/548750 (https://phabricator.wikimedia.org/T237363) (owner: DCausse)
[00:11:18] (CR) Krinkle: [C: -1] "Move it closer to where wgLocalisationCacheConf['storeClas is normally set. Otherwise, I think it just gets overwritten again?" [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: Ladsgroup)
[00:12:42] ebernhardson: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/548750/ is on mwdebug1001
[00:12:58] looking
[00:13:39] Niharika: looks good
[00:13:47] Okay.
[00:14:47] (PS4) Niharika29: [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: DCausse)
[00:15:28] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable Glent M0 A/B test - T237363 (duration: 01m 02s)
[00:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:34] T237363: Undeploy Glent M0 A/B test - https://phabricator.wikimedia.org/T237363
[00:15:42] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: DCausse)
[00:16:38] (Merged) jenkins-bot: [cirrus] Enable Glent M0 for dewiki, enwiki and frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/548751 (https://phabricator.wikimedia.org/T237365) (owner: DCausse)
[00:18:00] ebernhardson: Your second patch is on mwdebug1001 too.
[00:20:52] Niharika: looks good as well
[00:22:35] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Glent M0 for dewiki, enwiki and frwiki - T237365 (duration: 01m 02s)
[00:22:39] ebernhardson: Both deployed. And that concludes the swat.
[00:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:41] T237365: Enable Glent M0 on de, en and fr wikipedias - https://phabricator.wikimedia.org/T237365
[00:56:37] (PS2) Ladsgroup: Add a bit for forcing LC caching backend in cli mode [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683)
[00:57:57] (CR) Ladsgroup: "> Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: Ladsgroup)
[01:00:04] (CR) Krinkle: [C: +1] Add a bit for forcing LC caching backend in cli mode [mediawiki-config] - https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: Ladsgroup)
[01:14:13] (PS1) Krinkle: varnish: Remove duplicate 'Content-Type: text/html' statement [puppet] - https://gerrit.wikimedia.org/r/558752
[01:44:03] (PS1) Krinkle: CommonSettings.php: Remove CLI 'display_errors=stderr' setting [mediawiki-config] - https://gerrit.wikimedia.org/r/558758
[01:50:35] (PS2) Krinkle: CommonSettings.php: Remove CLI 'display_errors=stderr' setting [mediawiki-config] - https://gerrit.wikimedia.org/r/558758
[01:57:58] (PS1) Krinkle: CommonSettings.php: Remove very old 'error_append_string' INI override [mediawiki-config] - https://gerrit.wikimedia.org/r/558761
[01:58:47] (CR) Krinkle: "Without patch:" [mediawiki-config] - https://gerrit.wikimedia.org/r/558761 (owner: Krinkle)
[01:59:48] (CR) Krinkle: "mwdeploy@mwdebug1001:/srv/mediawiki/w$ cat krinkle.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/558761 (owner: Krinkle)
[02:04:06] (PS1) Krinkle: Follows-up 164a3ac1f099 which removed IEUrlExtension from MediaWiki and has been deployed to all wikis since. [mediawiki-config] - https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563)
[02:04:32] (PS2) Krinkle: CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563)
[02:14:56] (CR) VolkerE: [C: +1] CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: Krinkle)
[02:26:00] (PS1) Krinkle: CommonSettings.php: Move core DB/SQL-related config closer together [mediawiki-config] - https://gerrit.wikimedia.org/r/558768
[02:26:02] (PS1) Krinkle: CommonSettings.php: Remove the disabled "temporary" code for T232613 [mediawiki-config] - https://gerrit.wikimedia.org/r/558769 (https://phabricator.wikimedia.org/T232613)
[02:38:24] (CR) Subramanya Sastry: [C: +1] "Joe: bump to 1G ok with you?" [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[02:39:09] (CR) Subramanya Sastry: [C: +1] Bump Parsoid/PHP cluster memory_limit (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[02:41:11] (PS1) Krinkle: etcd: Set globals explicitly in CommonSettings instead of etcd.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558774
[02:41:13] (PS1) Krinkle: etcd: Set $wmfEtcdLastModifiedIndex from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558775
[02:41:15] (PS1) Krinkle: etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - https://gerrit.wikimedia.org/r/558776
[02:41:17] (PS1) Krinkle: etcd: Set wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558777
[02:42:44] (CR) jerkins-bot: [V: -1] etcd: Add $etcdHost parameter to wmfSetupEtcd() [mediawiki-config] - https://gerrit.wikimedia.org/r/558776 (owner: Krinkle)
[02:42:57] (CR) jerkins-bot: [V: -1] etcd: Set wmfSetupEtcd($etcdHost) from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558777 (owner: Krinkle)
[02:43:05] (PS1) KartikMistry: Update cxserver to 2019-12-11-144337-production [deployment-charts] - https://gerrit.wikimedia.org/r/558778 (https://phabricator.wikimedia.org/T233405)
[02:55:01] (PS2) C. Scott Ananian: Bump Parsoid/PHP cluster memory_limit [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[02:55:30] (CR) C. Scott Ananian: [C: +1] Bump Parsoid/PHP cluster memory_limit [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[02:55:47] (PS3) C. Scott Ananian: Bump Parsoid/PHP cluster memory_limit [mediawiki-config] - https://gerrit.wikimedia.org/r/558737 (https://phabricator.wikimedia.org/T239806) (owner: Arlolra)
[03:28:47] (PS1) CRusnov: Import various tools from netbox-deploy as part of unification [software/netbox-extras] - https://gerrit.wikimedia.org/r/558791
[03:34:01] (CR) CRusnov: [C: +2] "Self merging because this is a code import." [software/netbox-extras] - https://gerrit.wikimedia.org/r/558791 (owner: CRusnov)
[04:54:40] !log add static routes for cloud's 185.15.57.0/29 on cr1/2-codfw - T239347
[04:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:47] T239347: create a 'normal' network for codf1dev neutron w/public IPs - https://phabricator.wikimedia.org/T239347
[04:59:42] !log advertise 185.15.57.0/24 from [co|eq]dfw - T239347
[04:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:48] T239347: create a 'normal' network for codf1dev neutron w/public IPs - https://phabricator.wikimedia.org/T239347
[05:11:21] (PS1) Ayounsi: Start advertising 185.15.57.0/24 from codfw/eqdfw [homer/public] - https://gerrit.wikimedia.org/r/558821 (https://phabricator.wikimedia.org/T239347)
[05:12:10] (CR) Ayounsi: [V: +2 C: +2] Start advertising 185.15.57.0/24 from codfw/eqdfw [homer/public] - https://gerrit.wikimedia.org/r/558821 (https://phabricator.wikimedia.org/T239347) (owner: Ayounsi)
[05:31:35] !log Deploy schema change on commonswiki.image on s4 primary master (db1138) - T233135
[05:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:41] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:38:50] (PS1) Ammarpad: Add new namespace and aliases for zh.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/558827 (https://phabricator.wikimedia.org/T241023)
[05:40:07] (PS2) Ammarpad: Add new namespace and aliases for zh.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/558827 (https://phabricator.wikimedia.org/T241023)
[05:41:39] (PS3) Ammarpad: Add new namespace and aliases for zh.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/558827 (https://phabricator.wikimedia.org/T241023)
[05:43:04] (PS4) Ammarpad: Add new namespace and aliases for zh.wiktionary
[mediawiki-config] - https://gerrit.wikimedia.org/r/558827 (https://phabricator.wikimedia.org/T241023)
[05:46:36] (PS1) Marostegui: db1136: Fix typo [puppet] - https://gerrit.wikimedia.org/r/558829
[05:48:21] (CR) Marostegui: [C: +2] db1136: Fix typo [puppet] - https://gerrit.wikimedia.org/r/558829 (owner: Marostegui)
[05:55:20] !log Upgrade db2071 and db2072
[05:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:48] !log Upgrade db2088, db2092
[05:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:35] !log Upgrade db2112 db2116 db2130
[06:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:36] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m
[06:07:08] PROBLEM - High average GET latency for mw requests on appserver in codfw on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[06:12:32] RECOVERY - High average GET latency for mw requests on appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[06:12:48] (PS1) Marostegui: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/558836
[06:12:50] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[06:14:55] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/558836 (owner: Marostegui)
[06:15:42] (Merged) jenkins-bot: db-eqiad.php: Depool pc1007 [mediawiki-config] - https://gerrit.wikimedia.org/r/558836 (owner: Marostegui)
[06:17:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1007 for upgrade (duration: 01m 11s)
[06:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/558838
[06:21:48] PROBLEM - MariaDB Slave IO: pc1 on pc2010 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1007.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1007.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:22:11] ^ expected
[06:22:30] PROBLEM - MariaDB Slave IO: pc1 on pc2007 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1007.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on pc1007.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:22:33] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/558838 (owner: Marostegui)
[06:23:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/558838 (owner: Marostegui)
[06:23:36] RECOVERY - MariaDB Slave IO: pc1 on pc2010 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:24:16] RECOVERY - MariaDB Slave IO: pc1 on pc2007 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:24:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 after upgrade (duration: 01m 00s)
[06:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3311, db1105:3312', diff saved to https://phabricator.wikimedia.org/P9922 and previous config saved to /var/cache/conftool/dbconfig/20191218-062759-marostegui.json
[06:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:06] !log Upgrade db1105
[06:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311, db1105:3312', diff saved to https://phabricator.wikimedia.org/P9923 and previous config saved to /var/cache/conftool/dbconfig/20191218-063652-marostegui.json
[06:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:57] !log upgrading debdeploy-client to 0.2.0 fleet-wide
[06:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:13] !log upgrading debmonitor-client to 0.2.0 fleet-wide
[06:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311, db1105:3312', diff saved to https://phabricator.wikimedia.org/P9924 and previous config saved to /var/cache/conftool/dbconfig/20191218-064510-marostegui.json
[06:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:41] !log running replicate-osm on maps1004 after failed osm sync - T239728
[06:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:46] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[06:48:03] !log volker-e@deploy1001 Started deploy [design/style-guide@d13b55d]: Deploy design/style-guide:
[06:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:10] !log volker-e@deploy1001 Finished deploy [design/style-guide@d13b55d]: Deploy design/style-guide: (duration: 00m 07s)
[06:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:32] !log Upgrade db2135
[06:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:50] !log Upgrade db2132, db2133, db2134
[06:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:16] (PS1) Marostegui: filtered_tables.txt: Remove dropped columns [puppet] - https://gerrit.wikimedia.org/r/558851 (https://phabricator.wikimedia.org/T233135)
[07:08:02] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) (owner: Jcrespo)
[07:14:43] !log andrew@deploy1001 Started deploy [horizon/deploy@f77e91b]: Fix for T240979
[07:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:49] T240979: Unable to create Web Proxy in the "phragile" Cloud VPS project (using Horizon) - https://phabricator.wikimedia.org/T240979
[07:18:07] !log andrew@deploy1001 Finished deploy [horizon/deploy@f77e91b]: Fix for T240979 (duration: 03m 24s)
[07:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:44] (CR) WMDE-Fisch: [C: +1] Phragile: Added PHP extensions needed by PHP 7 dependencies [puppet] - https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) (owner: WMDE-leszek)
[07:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1016 for upgrade', diff saved to https://phabricator.wikimedia.org/P9925 and previous config saved to /var/cache/conftool/dbconfig/20191218-073002-marostegui.json
[07:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:49] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 Ayounsi https://phabricator.wikimedia.org/T240659 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:40:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1016', diff saved to https://phabricator.wikimedia.org/P9926 and previous config saved to /var/cache/conftool/dbconfig/20191218-074032-marostegui.json
[07:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1105:3311, db1105:3312', diff saved to https://phabricator.wikimedia.org/P9927 and previous config saved to /var/cache/conftool/dbconfig/20191218-074642-marostegui.json
[07:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[07:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:21] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[07:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:48] !log run helmfile sync for all staging deployments T239835
[07:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:53] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835
[07:53:55] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[07:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'echostore' for release 'staging' .
[07:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:48] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics' for release 'analytics' .
[07:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:12] (CR) Alexandros Kosiaris: [C: +2] Remove old redundant rbac/ dir [deployment-charts] - https://gerrit.wikimedia.org/r/558704 (owner: Alexandros Kosiaris)
[07:55:18] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'logging-external' .
[07:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:29] (Merged) jenkins-bot: Remove old redundant rbac/ dir [deployment-charts] - https://gerrit.wikimedia.org/r/558704 (owner: Alexandros Kosiaris)
[07:55:53] (CR) Alexandros Kosiaris: [C: +2] RBAC: Add system:nodes group to system:node [deployment-charts] - https://gerrit.wikimedia.org/r/558705 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[07:56:08] (Merged) jenkins-bot: RBAC: Add system:nodes group to system:node [deployment-charts] - https://gerrit.wikimedia.org/r/558705 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[07:56:24] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-main' for release 'main' .
[07:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:53] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[07:56:54] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[07:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:28] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'restrouter' for release 'staging' .
[07:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[07:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1105:3311, db1105:3312', diff saved to https://phabricator.wikimedia.org/P9928 and previous config saved to /var/cache/conftool/dbconfig/20191218-075828-marostegui.json
[07:58:31] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'test' .
[07:58:32] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[07:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:15] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[07:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1016', diff saved to https://phabricator.wikimedia.org/P9929 and previous config saved to /var/cache/conftool/dbconfig/20191218-075919-marostegui.json
[07:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:43] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[07:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:19] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:43] !log Upgrade db2109
[08:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:53] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:40] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:12] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:57] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:12:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1016', diff saved to https://phabricator.wikimedia.org/P9930 and previous config saved to /var/cache/conftool/dbconfig/20191218-081256-marostegui.json
[08:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:04] (PS1) Muehlenhoff: Make the images proxy configurable and add boron [puppet] - https://gerrit.wikimedia.org/r/558886
[08:20:05] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:03] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:13] (PS2) Jcrespo: admin: Provide access to kzimmerman (kzeta) to production analytics [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732)
[08:21:40] (CR) Jcrespo: admin: Provide access to kzimmerman (kzeta) to production analytics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) (owner: Jcrespo)
[08:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1016', diff saved to https://phabricator.wikimedia.org/P9931 and previous config saved to /var/cache/conftool/dbconfig/20191218-082226-marostegui.json
[08:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:29] (CR) Muehlenhoff: [C: +1] admin: Provide access to kzimmerman (kzeta) to production analytics [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) (owner: Jcrespo)
[08:24:31] (PS2) Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - https://gerrit.wikimedia.org/r/558620
[08:26:46] (CR) jerkins-bot: [V: -1] wmflib: Introduce a more usable data structure to describe services. [puppet] - https://gerrit.wikimedia.org/r/558620 (owner: Giuseppe Lavagetto)
[08:29:18] (PS3) Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - https://gerrit.wikimedia.org/r/558620
[08:31:32] (CR) jerkins-bot: [V: -1] wmflib: Introduce a more usable data structure to describe services. [puppet] - https://gerrit.wikimedia.org/r/558620 (owner: Giuseppe Lavagetto)
[08:44:26] (CR) Gehel: [C: +2] elasticsearch: decommission elastic[1018-1031] [dns] - https://gerrit.wikimedia.org/r/558525 (https://phabricator.wikimedia.org/T239821) (owner: Gehel)
[08:44:33] (PS3) Gehel: elasticsearch: decommission elastic[1018-1031] [dns] - https://gerrit.wikimedia.org/r/558525 (https://phabricator.wikimedia.org/T239821)
[08:46:22] (CR) Filippo Giunchedi: [C: +1] "LGTM from a quick look, CC'ing Alex as he's involved with Graphoid IIRC" [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[08:49:24] (PS2) Muehlenhoff: Make the images proxy configurable and add boron [puppet] - https://gerrit.wikimedia.org/r/558886
[08:51:40] (PS1) Vgutierrez: ATS: Disable debug mode in cp3050 [puppet] - https://gerrit.wikimedia.org/r/558970 (https://phabricator.wikimedia.org/T238494)
[08:53:55] (CR) Ema: [C: +1] ATS: Disable debug mode in cp3050 [puppet] - https://gerrit.wikimedia.org/r/558970 (https://phabricator.wikimedia.org/T238494) (owner: Vgutierrez)
[08:54:05] (CR) Vgutierrez: [C: +2] ATS: Disable debug mode in cp3050 [puppet] - https://gerrit.wikimedia.org/r/558970 (https://phabricator.wikimedia.org/T238494) (owner: Vgutierrez)
[08:58:29] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' .
[08:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:13] !log restarting ats-be on cp3050
[08:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:34] (PS5) Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835)
[09:01:34] (PS5) Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835)
[09:01:36] (PS1) Alexandros Kosiaris: rbac: Add default metadata to system:node RBAC [deployment-charts] - https://gerrit.wikimedia.org/r/558972 (https://phabricator.wikimedia.org/T239835)
[09:03:14] (CR) Alexandros Kosiaris: [C: +1] "graphoid is without a maintainer and scheduled to be undeployed by mid of next quarter. Feel free to proceed with this, but don't spend to" [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[09:04:49] (CR) Alexandros Kosiaris: [C: +1] Make the images proxy configurable and add boron [puppet] - https://gerrit.wikimedia.org/r/558886 (owner: Muehlenhoff)
[09:05:47] (CR) Alexandros Kosiaris: [C: +2] rbac: Add default metadata to system:node RBAC [deployment-charts] - https://gerrit.wikimedia.org/r/558972 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[09:05:56] (Merged) jenkins-bot: rbac: Add default metadata to system:node RBAC [deployment-charts] - https://gerrit.wikimedia.org/r/558972 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[09:10:22] (PS6) Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835)
[09:10:24] (PS6) Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835)
[09:10:26] (PS1) Alexandros Kosiaris: admin: don't rely on coredns for kube-system tiller [deployment-charts] - https://gerrit.wikimedia.org/r/558973 (https://phabricator.wikimedia.org/T239835)
[09:14:30] (PS2) Alexandros Kosiaris: admin: don't rely on coredns for kube-system tiller [deployment-charts] - https://gerrit.wikimedia.org/r/558973 (https://phabricator.wikimedia.org/T239835)
[09:14:32] (PS7) Alexandros Kosiaris: Switch codfw calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835)
[09:14:34] (PS7) Alexandros Kosiaris: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835)
[09:17:38] (CR) Alexandros Kosiaris: [C: -1] "minor pedantic nitpick, otherwise LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) (owner: BryanDavis)
[09:18:36] !log repool cp3050 after ats-be restart
[09:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] (PS1) Volans: CLI: fix typo in help message. [software/debmonitor] - https://gerrit.wikimedia.org/r/558979 (https://phabricator.wikimedia.org/T237978)
[09:20:31] (CR) Alexandros Kosiaris: [C: +2] admin: don't rely on coredns for kube-system tiller [deployment-charts] - https://gerrit.wikimedia.org/r/558973 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[09:20:44] (Merged) jenkins-bot: admin: don't rely on coredns for kube-system tiller [deployment-charts] - https://gerrit.wikimedia.org/r/558973 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[09:22:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1011 for upgrade', diff saved to https://phabricator.wikimedia.org/P9933 and previous config saved to /var/cache/conftool/dbconfig/20191218-092228-marostegui.json
[09:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:53] (CR) Alexandros Kosiaris: [C: +2] Be more verbose about backup1001, backup2001 [dns] - https://gerrit.wikimedia.org/r/547537 (owner: Alexandros Kosiaris)
[09:23:57] (PS2) Alexandros Kosiaris: Be more verbose about backup1001, backup2001 [dns] - https://gerrit.wikimedia.org/r/547537
[09:24:12] (Abandoned) Alexandros Kosiaris: Be more verbose about backup1001, backup2001 [dns] - https://gerrit.wikimedia.org/r/547537 (owner: Alexandros Kosiaris)
[09:24:52] !log execute 'megacli -LDSetProp WT -LAll -aAll' on analytics1057 - T239045
[09:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:57] T239045: analytics1057's BBU is faulty - https://phabricator.wikimedia.org/T239045
[09:26:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1011', diff saved to https://phabricator.wikimedia.org/P9934 and previous config saved to /var/cache/conftool/dbconfig/20191218-092625-marostegui.json
[09:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:55] (PS3) Muehlenhoff: Make the
images proxy configurable and add boron [puppet] - 10https://gerrit.wikimedia.org/r/558886 [09:28:03] (03PS1) 10Elukey: Fix /mnt/hdfs check to use Kerberos on the Hadoop test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/558982 [09:29:23] (03CR) 10Elukey: [C: 03+2] Fix /mnt/hdfs check to use Kerberos on the Hadoop test coordinator [puppet] - 10https://gerrit.wikimedia.org/r/558982 (owner: 10Elukey) [09:30:54] (03PS1) 10Ema: ATS: increase keep_alive_no_activity_timeout_out on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/558984 (https://phabricator.wikimedia.org/T238494) [09:30:56] (03PS1) 10Ema: mediawiki::webserver: increase TLS termination keepalive_timeout [puppet] - 10https://gerrit.wikimedia.org/r/558985 (https://phabricator.wikimedia.org/T238494) [09:31:55] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/20048/" [puppet] - 10https://gerrit.wikimedia.org/r/558886 (owner: 10Muehlenhoff) [09:33:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/558886 (owner: 10Muehlenhoff) [09:34:32] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1001/20049/mw1266.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/558985 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [09:37:12] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler1003/20050/cp3050.esams.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/558984 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [09:37:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1011', diff saved to https://phabricator.wikimedia.org/P9935 and previous config saved to /var/cache/conftool/dbconfig/20191218-093720-marostegui.json [09:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:08] (03CR) 10Vgutierrez: "IMHO the timeout value should be aligned with the rest of the timeouts in the stack (see https://wikitech.wikimedia.org/wiki/HTTP_timeouts" [puppet] - 
10https://gerrit.wikimedia.org/r/558985 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [09:45:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1011', diff saved to https://phabricator.wikimedia.org/P9936 and previous config saved to /var/cache/conftool/dbconfig/20191218-094540-marostegui.json [09:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/558979 (https://phabricator.wikimedia.org/T237978) (owner: 10Volans) [09:53:57] (03CR) 10Volans: [C: 03+2] CLI: fix typo in help message. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/558979 (https://phabricator.wikimedia.org/T237978) (owner: 10Volans) [09:55:35] (03PS4) 10Muehlenhoff: Make the images proxy configurable and add boron [puppet] - 10https://gerrit.wikimedia.org/r/558886 [09:56:34] (03Merged) 10jenkins-bot: CLI: fix typo in help message. [software/debmonitor] - 10https://gerrit.wikimedia.org/r/558979 (https://phabricator.wikimedia.org/T237978) (owner: 10Volans) [09:57:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1011', diff saved to https://phabricator.wikimedia.org/P9937 and previous config saved to /var/cache/conftool/dbconfig/20191218-095710-marostegui.json [09:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:07] !log populate new calico stores for codfw T239835 [10:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:12] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [10:03:27] (03CR) 10DCausse: "async import is currently running to catchup updates on wdqs1010 (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1" [puppet] - 10https://gerrit.wikimedia.org/r/558526 (https://phabricator.wikimedia.org/T238045) (owner: 
10DCausse) [10:04:35] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:04:37] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:50] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:04:53] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:05] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:05:05] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:10] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:05:11] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:16] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:05:18] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:31] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime [10:06:31] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:42] (03CR) 10DCausse: [cirrus] add elastic mapping for ores drafttopics (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: 10DCausse) [10:09:35] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/GrowthExperiments/includes: T240444 Make PageViewInfo a soft dependency (duration: 01m 04s) [10:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:40] T240444: GrowthExperiments homepage requires PageViewInfo even it is declared as soft dependency - https://phabricator.wikimedia.org/T240444 [10:10:44] !log Upgrade db2083 [10:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:16] (03CR) 10Muehlenhoff: [C: 03+2] Make the images proxy configurable and add boron [puppet] - 10https://gerrit.wikimedia.org/r/558886 (owner: 10Muehlenhoff) [10:12:35] heads up, kubernetes codfw cluster reinit beginning in a few, I've already downtimed stuff and I am currently depooling all services, but I may have forgotten something [10:12:49] (03PS5) 10Alexandros Kosiaris: k8s: Migrate codfw to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) [10:12:50] (03PS5) 10Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) [10:12:52] (03PS1) 10Alexandros Kosiaris: cache::text: Depool k8s services [puppet] - 10https://gerrit.wikimedia.org/r/559002 (https://phabricator.wikimedia.org/T239835) [10:12:57] Good luck. 
[10:16:02] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=(eventgate.*|mathoid|citoid|restrouter|sessionstore|echostore|zotero|termbox|wikifeeds|cxserver|blubberoid) [10:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:15] !log depooling (eventgate.*|mathoid|citoid|restrouter|sessionstore|echostore|zotero|termbox|wikifeeds|cxserver|blubberoid) from codfw kubernetes [10:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/558726 (https://phabricator.wikimedia.org/T241008) (owner: 10BryanDavis) [10:28:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Also, worth double checking that the ingress admission controller will allow this new setting in the ingress objects:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/558726 (https://phabricator.wikimedia.org/T241008) (owner: 10BryanDavis) [10:31:46] (03CR) 10Alexandros Kosiaris: "@ottomata, a change in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/master/helmfile.d/services/eqiad/even" [puppet] - 10https://gerrit.wikimedia.org/r/558117 (owner: 10Ottomata) [10:31:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] Change eventgate-logging-external TLS port to 4392 [puppet] - 10https://gerrit.wikimedia.org/r/558117 (owner: 10Ottomata) [10:34:35] 10Operations, 10netops, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10akosiaris) I guess parsing past alerts can help answer this question? If the number of exceptions that need to be defined is small enough it might not be worth it... 
[10:35:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch codfw calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [10:35:53] (03Merged) 10jenkins-bot: Switch codfw calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558472 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [10:36:12] (03PS3) 10Alexandros Kosiaris: mathoid: Remove mwapi_req/restbase_req [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 [10:37:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Multiple services have now been moved over, we can now merge this. It should be a noop and will make it to the next mathoid chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [10:37:28] (03Merged) 10jenkins-bot: mathoid: Remove mwapi_req/restbase_req [deployment-charts] - 10https://gerrit.wikimedia.org/r/488800 (owner: 10Alexandros Kosiaris) [10:41:54] 10Operations, 10netops, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10ayounsi) Good point! Looking at past alerts only that one was a false positive. 2) would allow us to have stricter thresholds, but I agree that it's outside the... 
[10:47:26] (03PS3) 10Andrew Bogott: cloud base images: enable passwordless login on serial0 [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) [10:47:48] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/VisualEditor/includes/ApiVisualEditor.php: T240961: Fix unchecked array access in ApiVisualEditor (duration: 01m 02s) [10:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:54] T240961: VisualEditor throwing "PHP Notice: Undefined index: etag" on officewiki as of wmf.11 - https://phabricator.wikimedia.org/T240961 [10:47:58] (03PS4) 10Andrew Bogott: cloud base images: enable passwordless login on serial0 [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) [10:48:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Update toolviews.py nginx log parser [puppet] - 10https://gerrit.wikimedia.org/r/558676 (https://phabricator.wikimedia.org/T238641) (owner: 10BryanDavis) [10:48:59] (03CR) 10Andrew Bogott: [C: 03+2] cloud base images: enable passwordless login on serial0 [puppet] - 10https://gerrit.wikimedia.org/r/558296 (https://phabricator.wikimedia.org/T240660) (owner: 10Andrew Bogott) [10:49:22] (03CR) 10Arturo Borrero Gonzalez: "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/558676 (https://phabricator.wikimedia.org/T238641) (owner: 10BryanDavis) [10:51:31] (03PS4) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [10:51:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please collect +1 from Hieu before merging." [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:53:25] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [10:56:22] (03PS1) 10Andrew Bogott: bootstrap-vz buster: rename puppet-overrides.conf to match the stretch filename [puppet] - 10https://gerrit.wikimedia.org/r/559014 (https://phabricator.wikimedia.org/T240660) [10:57:16] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz buster: rename puppet-overrides.conf to match the stretch filename [puppet] - 10https://gerrit.wikimedia.org/r/559014 (https://phabricator.wikimedia.org/T240660) (owner: 10Andrew Bogott) [10:59:57] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) @Jclark-ctr if you feel this will work better, we are happy. Either way, this racking is still better than the original one (30 servers in D 5). Thank you! [11:00:21] jouncebot: next [11:00:21] In 0 hour(s) and 59 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191218T1200) [11:01:04] CFisch_WMDE: are you running the upcoming SWAT? [11:04:39] effie: If I'm alone with my patch I might take the opportunity to practice my deployment skills, so yes. [11:04:56] (I've got backup in the office to accompany me on that) [11:09:18] I want to swap the scap proxies in eqiad and codfw [11:09:33] so ping me when you are about to run scap [11:09:45] 10Operations, 10User-jbond: Collects metrics for CAS - https://phabricator.wikimedia.org/T233934 (10jbond) 05Open→03Resolved a:03jbond This is completed https://grafana-next.wikimedia.org/d/spring_boot_21/spring-boot-statistics [11:09:47] 10Operations, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond) [11:09:52] I can delay merging my patch if needed anyway [11:13:50] (03PS5) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [11:14:55] !log installing spamassassin security updates on mendelevium/OTRS [11:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:06] !log Upgrade db2081, db2082 [11:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:54] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [11:23:06] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-m [11:24:54] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [11:25:55] (03PS1) 10Effie Mouzeli: scap:dsh.yaml switch scap proxies so to reimage them [puppet] - 10https://gerrit.wikimedia.org/r/559021 (https://phabricator.wikimedia.org/T239054) [11:26:05] !log installing spamassassin security updates on fermium/lists [11:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:28] (03PS6) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 [11:27:18] !log installing apache update on bastion servers [11:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, and all subs in the same rack." [puppet] - 10https://gerrit.wikimedia.org/r/559021 (https://phabricator.wikimedia.org/T239054) (owner: 10Effie Mouzeli) [11:28:23] !log installing ruby2.3 security updates [11:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:38] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [11:31:26] 10Operations: Add a second CPU to debmonitor hosts - https://phabricator.wikimedia.org/T241046 (10MoritzMuehlenhoff) [11:32:50] (03PS7) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [11:34:25] 10Operations, 10vm-requests: Add a second CPU to debmonitor hosts - https://phabricator.wikimedia.org/T241046 (10MoritzMuehlenhoff) [11:34:39] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [11:51:26] (03PS8) 10Giuseppe Lavagetto: wmflib: Introduce a more usable data structure to describe services. [puppet] - 10https://gerrit.wikimedia.org/r/558620 [11:53:20] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Introduce a more usable data structure to describe services. 
[puppet] - 10https://gerrit.wikimedia.org/r/558620 (owner: 10Giuseppe Lavagetto) [11:54:39] (03PS7) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [11:56:03] (03CR) 10Phamhi: [C: 03+1] Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191218T1200). [12:00:05] CFisch_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:36] I actually have a patch to deploy, will add it :D [12:00:51] CFisch_WMDE: let me know when you're around [12:00:57] CFisch_WMDE and I were going to deploy some Popups backports, Amir1 feel free to go first! [12:01:20] +1 [12:01:34] cooool [12:04:54] (03PS4) 10Muehlenhoff: Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) [12:07:53] (03CR) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:07:55] (03CR) 10Muehlenhoff: [C: 03+2] Switch cloudmetrics to the new unified partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/558597 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [12:10:12] Amir1: wondering if you are deploying? [12:10:27] awight: I'm waiting for you :D [12:10:35] sorry I didn't say it explicitly [12:10:48] Amir1: go first, if you don't mind? 
[12:11:01] sure sure [12:11:01] Ours might be slower than usual... [12:11:04] ty! [12:11:07] oh okay [12:11:27] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: 10Ladsgroup) [12:12:49] (03Merged) 10jenkins-bot: Add a bit for forcing LC caching backend in cli mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558239 (https://phabricator.wikimedia.org/T105683) (owner: 10Ladsgroup) [12:16:50] effie: ^^^ FYI. [12:17:15] thanks james! [12:17:26] I will merge my patch later, no need to delay swat [12:19:59] effie: OK. Is it OK for me to do the train in 40 minutes' time, too? [12:21:08] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:558239|Add a bit for forcing LC caching backend in cli mode (T105683)]] (duration: 01m 03s) [12:21:12] (03PS1) 10Effie Mouzeli: common::mcrouter.yaml switch mcrouter proxies so to reimage them [puppet] - 10https://gerrit.wikimedia.org/r/559033 (https://phabricator.wikimedia.org/T239054) [12:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:14] T105683: Add Scap support for static-array format of LCStore - https://phabricator.wikimedia.org/T105683 [12:22:02] awight: I'm done [12:22:37] James_F: yeah, I will wait for a +1 anyway [12:22:48] Kk. [12:22:54] thank you! [12:23:00] Amir1: ack [12:30:41] (03PS6) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [12:32:46] effie: I'm about to scap now. 
[12:32:54] scap it [12:35:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] cache::text: Depool k8s services [puppet] - 10https://gerrit.wikimedia.org/r/559002 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [12:35:33] !log wmde-fisch@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/Popups: SWAT: [[gerrit:559010|Fix initial preferences for newly created user accounts (T240947)]] (duration: 01m 03s) [12:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:40] T240947: ReferencePreviews accidentally enabled for new users even if in Beta - https://phabricator.wikimedia.org/T240947 [12:35:44] And there will be another one... [12:35:51] (03PS7) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [12:43:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:43:37] effie: About to scap again, still good to go? 
;-) [12:43:44] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update routing_source_ip [puppet] - 10https://gerrit.wikimedia.org/r/559036 (https://phabricator.wikimedia.org/T239347) [12:43:48] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes2001.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: echostore_8082: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: eventgate-logging-external_43192: Servers kubernetes2002.codfw.wmnet, kub [12:43:48] .wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: eventgate-analytics_31192: Servers kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: blubberoid_8748: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes2002.codfw.wmnet, kubernetes2006.codfw.wmnet, k [12:43:48] fw.wmnet are marked down but pooled: restrouter_7231: Servers kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: eventgate-ma https://wikitech.wikimedia.org/wiki/PyBal [12:43:58] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2001.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:44:06] akosiaris: ^^ Is that your acks expiring? [12:44:30] nope, I forgot about scheduling downtime for these as well [12:44:39] they are valid, but not worrisome [12:44:42] Ah. :-) [12:44:46] * akosiaris scheduling downtime for them as well [12:44:46] Yeah, codfw. 
[12:45:10] (03PS8) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [12:45:10] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_8889: Servers kubernetes2002.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2006.codfw.wmnet are marked down but pooled: echostore_8082: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: eventgate-logging-external_43192: Servers kubernetes2001.codfw.wmnet, kub [12:45:11] .wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: eventgate-analytics_31192: Servers kubernetes2002.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled: blubberoid_8748: Servers kubernetes2004.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, k [12:45:11] fw.wmnet are marked down but pooled: restrouter_7231: Servers kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled: eventgate-ma https://wikitech.wikimedia.org/wiki/PyBal [12:45:12] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:45:21] restbase however... 
it was not expected [12:45:26] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:40] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:44] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:45:58] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:00] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[12:46:00] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:24] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:26] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:28] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:28] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:28] PROBLEM - 
restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:52] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:01] ouch [12:47:11] yeah me looking [12:47:25] effie: I'll just scap now [12:47:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: update routing_source_ip [puppet] - 10https://gerrit.wikimedia.org/r/559036 (https://phabricator.wikimedia.org/T239347) (owner: 10Arturo Borrero Gonzalez) [12:48:50] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.179e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:48:52] !log wmde-fisch@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/Popups: SWAT: [[gerrit:559010|Fix initial preferences for newly created user accounts (T240947)]] (duration: 01m 02s) [12:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:58] T240947: ReferencePreviews accidentally enabled for new users even if in Beta - https://phabricator.wikimedia.org/T240947 [12:49:16] (03PS9) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 
10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [12:49:18] OK I'm done effie [12:49:41] this seems to be just wikifeeds, which should be depooled however [12:50:44] (03PS6) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [12:50:54] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(wikifeeds) [12:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:05] dammit, eqiad was depooled? sigh [12:51:28] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes2004.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2001.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:51:54] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:52:09] !log pool wikifeeds eqiad. 
For some reason it was depooled [12:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:38] !log disable puppet fleet wide to restart apache on puppetmasters [12:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:08] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:25] (03PS10) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [12:53:34] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:36] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:36] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:46] CFisch_WMDE: thank you! [12:53:52] jouncebot: nex [12:53:54] jouncebot: next [12:53:54] In 0 hour(s) and 6 minute(s): Mediawiki train - European Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191218T1300) [12:54:01] cool [12:54:20] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:22] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:41] OK, train departing to group1 in five minutes. Last call for any blockers. 
[12:56:10] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:10] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:10] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:10] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:10] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:56:11] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:56:22] !log enable puppet fleet wide [12:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:29] (03CR) 10Phamhi: "> Patch Set 5:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [12:56:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:58:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:59:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Migrate codfw to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558354 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [13:00:04] James_F and longma: Your horoscope predicts another unfortunate Mediawiki train - European Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191218T1300). [13:00:30] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: introduce FQDN for routing_source_ip in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/559040 (https://phabricator.wikimedia.org/T239347) [13:00:53] 10Operations, 10Data-Services, 10Discovery-Search, 10Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10Gehel) So it looks like for the foreseeable future, using external dumps mirror will still be the way to go to retrieve full dumps intern... 
[13:01:10] (03PS1) 10Jforrester: group1 wikis to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559041 [13:01:12] (03CR) 10Jforrester: [C: 03+2] group1 wikis to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559041 (owner: 10Jforrester) [13:02:10] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559041 (owner: 10Jforrester) [13:02:49] !log installing ruby2.5 security updates [13:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: introduce FQDN for routing_source_ip in codfw1dev [dns] - 10https://gerrit.wikimedia.org/r/559040 (https://phabricator.wikimedia.org/T239347) (owner: 10Arturo Borrero Gonzalez) [13:03:16] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10jbond) [13:03:23] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) [13:03:34] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.11 [13:03:35] Burst of DB errors, as normal. 
[13:03:37] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) This information has been documented in https://wikitech.wikimedia.org/wiki/User:Jbond/Encryption [13:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:46] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10jbond) 05Open→03Resolved [13:03:48] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) [13:04:26] Hmm, almost all on Commons… [13:04:36] !log jforrester@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.11 (duration: 01m 01s) [13:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:42] 10Operations, 10Puppet, 10User-jbond: Clean up SSL configueration - https://phabricator.wikimedia.org/T240941 (10jbond) [13:05:16] Big burst of "No working replica DB server" errors. [13:05:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:06:48] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:06:52] could be the train? [13:07:07] I will be back in a bit [13:07:14] Yeah, error rate is not falling back. [13:07:25] Looks like Wikibase on Commons is unhappy. 
[13:09:29] (03PS1) 10Jforrester: train: Rolling Commons back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559043 (https://phabricator.wikimedia.org/T233859) [13:09:31] (03CR) 10Jforrester: [C: 03+2] train: Rolling Commons back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559043 (https://phabricator.wikimedia.org/T233859) (owner: 10Jforrester) [13:10:29] (03Merged) 10jenkins-bot: train: Rolling Commons back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559043 (https://phabricator.wikimedia.org/T233859) (owner: 10Jforrester) [13:11:24] PROBLEM - Prometheus k8s cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [13:11:33] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: train: Rolling Commons back to 1.35.0-wmf.10 T233859 [13:11:36] 10Operations, 10Traffic, 10User-jbond: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031 (10jbond) a:05Volans→03jbond [13:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:39] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [13:11:44] PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [13:11:51] heh, interesting, I did not expect a prometheus alarm... 
although it makes sense [13:11:54] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message produce rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:12:38] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:12:48] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad average message consume rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:13:14] and it's gone now [13:13:55] ACKNOWLEDGEMENT - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack andrew bogott I updated base images -- this will recover shortly. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:14:26] jynus: Yeah, train rolled back for Commons. Filing task now. [13:14:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:14:26] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:16:15] 10Operations, 10Data-Services, 10Discovery-Search, 10Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (10ArielGlenn) How fast a download do folks want? Can we schedule rsyncs for the specific use cases with a higher bandwidth cap? [13:16:39] James_F: looking at queries I've seen a query, unrelated to this, to file a task on, too [13:20:20] 10Operations, 10netops, 10cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10CDanis) +1 to doing #1 and revisiting if it becomes a problem again. [13:20:40] jynus: Different issue or same issue but different query triggering it? [13:24:04] James_F: I thought it was connected, but it started a long time ago and continues now, so I learned that it is unrelated [13:24:07] James_F: is the rollback complete or still in progress? [13:24:16] I am filing it [13:24:23] I'm seeing weird behavior on wikitech, maybe related? [13:25:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:25:27] * andrewbogott backs away [13:25:36] back [13:25:42] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:44] andrewbogott: Rolled back but only on Commons. Might be. [13:26:02] the consistent thing I see is crashes in search [13:26:13] https://wikitech.wikimedia.org/w/index.php?search=virsh+console&title=Special%3ASearch&go=Go&wprov=acrw1_-1 [13:26:14] Yeah, looking. [13:26:17] thanks [13:26:40] Searching looks OK to me, though? [13:27:02] does that page I linked above load for you? [13:27:14] For me I get "(Cannot access the database: Cannot access the database: Unknown error (10.64.32.72))" [13:27:23] that's from running a search for 'virsh console' [13:27:28] andrewbogott: Yes, instantly. Hence my confusion. [13:27:30] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:27:33] hm [13:27:37] well, I'm in Asia…? [13:27:44] when you search again it breaks for me [13:27:45] But there are lots of errors, you're right. [13:27:51] * andrewbogott tries again with a different browser [13:27:57] Oh, huh, on refresh it broke. [13:28:03] I'll roll back Wikitech too. [13:28:46] search worked in a second browser but then I reloaded in the original browser and am missing static content [13:28:52] so it's all over the place :( [13:28:58] Yeah. [13:28:58] https://phabricator.wikimedia.org/T241058 [13:29:07] Did something exciting change in core's handling of DB connections? [13:29:23] Yes. [13:29:25] (03PS1) 10Jforrester: train: Rolling Wikitech back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559052 (https://phabricator.wikimedia.org/T233859) [13:29:27] (03CR) 10Jforrester: [C: 03+2] train: Rolling Wikitech back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559052 (https://phabricator.wikimedia.org/T233859) (owner: 10Jforrester) [13:29:30] I did some dbctl changes the other day. 
[13:29:56] But that would be expected to blow up at the time, not as part of the train only. [13:30:01] True. [13:30:25] (03Merged) 10jenkins-bot: train: Rolling Wikitech back to 1.35.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559052 (https://phabricator.wikimedia.org/T233859) (owner: 10Jforrester) [13:30:45] (03PS1) 10Ema: ATS: disable compress plugin on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/559053 (https://phabricator.wikimedia.org/T238494) [13:30:46] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: train: Rolling Wikitech back to 1.35.0-wmf.10 T233859 [13:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:52] T233859: 1.35.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T233859 [13:31:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] network: data: cleanup unused WMCS ranges [puppet] - 10https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez) [13:32:33] No changes this week to includes/db or anything else obvious from a quick glance. 
[13:32:56] (03PS2) 10Arturo Borrero Gonzalez: networks: cleanup unused WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/556995 (https://phabricator.wikimedia.org/T240670) [13:34:19] andrewbogott: Hmm, I'm also seeing those errors on wmf.10 on Wikitech now I've rolled it back… [13:35:37] Filed as T241059 [13:35:38] T241059: Cannot access the database: Unknown error (10.64.32.72) - https://phabricator.wikimedia.org/T241059 [13:36:09] (03PS2) 10Ema: ATS: disable compress plugin on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/559053 (https://phabricator.wikimedia.org/T238494) [13:36:28] James_F: I couldn't tell you when I last did a search on wikitech but probably almost every day… I doubt it's been broken for a week [13:37:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] networks: cleanup unused WMCS ranges [dns] - 10https://gerrit.wikimedia.org/r/556995 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez) [13:37:04] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:28] (03CR) 10Vgutierrez: [C: 03+1] ATS: disable compress plugin on ats-be [puppet] - 10https://gerrit.wikimedia.org/r/559053 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:38:27] oh, now wikitech is down for me entirely. [13:38:34] !log installing dbus security updates for stretch [13:38:40] Maybe this is all jynus's ddos and nothing to do with release version [13:38:50] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (10aborrero) Patches are merged. I can do the Netbox cleanup (delete all those objects) by myself if you confirm @ayounsi [13:38:51] Maybe. 
[13:39:17] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104 [13:40:06] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:16] I can roll back the train everywhere, but theoretically wikitech should be pretty isolated from the rest of the train. [13:40:24] I think it's an actual problem with db1133? [13:40:38] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:41:02] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:41:04] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [13:41:06] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [13:41:14] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers labweb1002.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:41:30] Is 10.64.32.72 db1133? [13:41:34] Yes. [13:41:38] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:42:02] that's the labswiki master [13:42:14] Right. [13:42:44] At least one of the haproxy alerts there, dbproxy1017, is also re: db1133 [13:43:06] Same for dbproxy1021. [13:43:12] Also has db1133 as its backing server. 
[13:43:23] nova uses the same db cluster (M5) and seems to be working OK [13:45:18] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:45:24] OK, ignoring Wikitech, everything else seems OK. [13:45:38] What do we want to do about Wikitech? [13:46:01] the only thing I can see in tendril about db1133 is many aborted connections, but that does not help much [13:46:12] jynus: are you looking at the wikitech / db1133 issue? [13:46:15] has anyone restarted php there? [13:46:22] if not, we can start from there [13:46:25] I can do that now [13:46:41] ok [13:46:42] but why would the dbproxy complain then [13:46:46] effie: does 'restarting php' == 'service apache2 reload'? [13:47:08] andrewbogott: depool ; systemctl restart php7.2-fpm; pool [13:47:14] ok, doing [13:47:24] although depooling and pooling does not matter much [13:47:42] done on both wikitech servers [13:48:00] !log depool ; systemctl restart php7.2-fpm; pool on labweb1001 and labweb1002 [13:48:10] I grabbed a 'show full processlist' and stats output from db1133, although I don't know what to make of it myself: https://phabricator.wikimedia.org/P9939 [13:48:19] there are a fair number of long-running queries though [13:49:26] that looks pretty normal to me but I'll restart some openstack things and see if that clears up the number of connections [13:50:02] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:14] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:50:24] (03PS11) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [13:50:26] ok wikitech is back [13:50:37] at least from my end [13:50:47] Confirmed. 
[13:50:50] for me too [13:50:52] search works [13:50:56] RECOVERY - Check systemd state on labweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:56] anyone know what changed? [13:51:04] RECOVERY - Prometheus k8s cache not updating on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [13:51:24] RECOVERY - Prometheus k8s cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [13:51:30] RECOVERY - Check systemd state on labweb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:35] the monitoring graphs for db1133 still look wonky to me, but what do I know https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104 [13:52:12] andrewbogott: Well, we deployed the train. [13:52:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:52:35] But it stayed when we rolled back, so unless it was ultra-long-running queries that didn't get killed, somehow… [13:52:49] James_F: yeah, I was wondering what fixed it [13:53:00] The restart? [13:53:35] Presumably the sawtooth load on db1133 is unusual? [13:53:42] (Prior to everything blowing up.) [13:54:05] Something cron-based every 10 minutes? 
[13:54:30] it's weirder than just sawtooth -- if you hover over the graphs, you'll see that during a bunch of that interval, there _aren't_ data points -- as in, stats couldn't be scraped for some reason [13:54:34] (03PS12) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [13:54:45] cdanis: Oh, so it's locking up every 10? [13:55:09] no, just from ~13:30 onwards [13:56:22] Oh, I see what you mean. During the downtime there were no data points. [13:57:33] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:03] anyway, it looks like there were also a bunch of the "Cannot access the database: Unknown error" for other wikis in wmf.11? [13:58:11] (03PS1) 10IAmNetx: Add ng.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) [13:58:11] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] (03CR) 10jerkins-bot: [V: 04-1] Add ng.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) (owner: 10IAmNetx) [13:58:50] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:24] cdanis: do you have a logstash url in hand ? [13:59:33] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'echostore' for release 'production' . 
[13:59:35] https://logstash.wikimedia.org/goto/7ff0eceadac6b8f5aaeecafdc3fb0fe2 [13:59:36] (03PS2) 10IAmNetx: Add ng.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) [13:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] mostly commonswiki [13:59:51] cdanis: There's always a small spate of those during the train rollout (sadly). [14:00:12] cdanis: Yeah, I rolled back on Commons. See T241057. Possibly related, but was isolated to there at the time. [14:00:12] T241057: Cannot access the database: No working replica DB server: Unknown error (10.64.32.113) - https://phabricator.wikimedia.org/T241057 [14:00:15] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-analytics' for release 'analytics' . [14:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:18] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'logging-external' . [14:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:54] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/2083/console [14:02:01] Gerrit going down? [14:02:17] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'main' . [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] PROBLEM - DPKG on analytics1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:02:53] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [14:02:55] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[14:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:09] Amir1: I don't think so. gerrit works fine with me [14:03:12] for* [14:03:22] Amir1: No, but occasionally it'll go away for CI and you'll need to re-run. [14:03:22] some rsync failed over there [14:03:48] what is that rsync doing there? does it have to do something with gerrit? [14:03:53] do I want to know? [14:04:02] James_F: I'm going to deploy the fix right now, Can you try again after it? Please keep me posted [14:04:08] gerrit's up [14:04:16] several jobs failed [14:04:17] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'restrouter' for release 'production' . [14:04:17] akosiaris: I can describe in detail, but not when there are production issues. [14:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:23] https://integration.wikimedia.org/ci/job/wikibase-repo-docker/9684/console [14:04:34] James_F: ok, sorry for the interrupt [14:04:38] akosiaris: :-) [14:05:02] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [14:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:05:25] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' . [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:31] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' . [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:44] RECOVERY - PyBal IPVS diff check on lvs2003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:06:53] Amir1: How are you planning to test the patch? TestCommons? [14:07:25] James_F: it's not possible, this only happens where there are in different hosts [14:07:31] wikidata and commons [14:08:24] nope, I was able to reproduce it [14:10:09] https://test-commons.wikimedia.org/wiki/File:4050443322980010273.png [14:10:22] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:10:56] Amir1: TestCommons and TestWikidata are on different hosts, right? [14:11:08] they should be both on s3? [14:11:18] TestCommons is on s4 for this reason. [14:11:23] testwikidawiki is on s3 [14:11:27] nice idea [14:11:30] :-) [14:11:32] * James_F bows. [14:11:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:11:51] (More generally, it's there to be as prod-like as possible.) [14:17:18] Can I force merge it? 😈 [14:17:33] Amir1: No. [14:17:40] Amir1: There's no great rush. [14:17:54] okay :D [14:22:12] Operations, SRE-tools: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 (MoritzMuehlenhoff) Open→Resolved This is rolled out to production, all further refinements can happen via followup tasks/commits. [14:22:16] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/20056/ makes sense. 
I'll fix the CI errors and the patch can be considered good to go as " [puppet] - https://gerrit.wikimedia.org/r/558620 (owner: Giuseppe Lavagetto) [14:23:26] Operations: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (MoritzMuehlenhoff) Open→Resolved This is installed (as a test system for now), closing the task. [14:24:38] (PS1) IAmNetx: Add ng.wikimedia.org as chapter site [puppet] - https://gerrit.wikimedia.org/r/559073 (https://phabricator.wikimedia.org/T240771) [14:25:42] Operations, Traffic: /sec-warning page: please add a helpful XML comment explaining why it's being delivered. - https://phabricator.wikimedia.org/T240794 (DavidBrooks) Suggest: near the top of the /sec-warning file, add an HTML comment: