[00:03:25] (PS2) CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362)
[00:04:29] (CR) CRusnov: "Latest patch as discussed on IRC, does a checkout of the generated DNS repository if no checkout is set in the environment variable." [dns] - https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: CRusnov)
[00:11:49] (CR) Jforrester: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: Brian Wolff)
[00:14:57] (PS2) Jforrester: Make Beta labs CSP settings be same as prod but with beta urls [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: Brian Wolff)
[00:16:47] There's a SWAT window now, I'm taking it.
[00:17:17] (CR) Jforrester: [C: +2] Make Beta labs CSP settings be same as prod but with beta urls [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: Brian Wolff)
[00:18:48] (Merged) jenkins-bot: Make Beta labs CSP settings be same as prod but with beta urls [mediawiki-config] - https://gerrit.wikimedia.org/r/574405 (https://phabricator.wikimedia.org/T245983) (owner: Brian Wolff)
[00:21:54] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T245983 Set wmgApprovedContentSecurityPolicyDomains (duration: 00m 57s)
[00:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:02] T245983: Make beta cluster CSP look identical to prod, except that it uses beta urls - https://phabricator.wikimedia.org/T245983
[00:22:41] Operations, Traffic, netops: Reporting en.wikipedia is down - https://phabricator.wikimedia.org/T246040 (Reedy) Open→Invalid
[00:23:18] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T245983 Read
wmgApprovedContentSecurityPolicyDomains for CSP (duration: 00m 56s)
[00:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:39] (PS4) Jforrester: The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[00:23:43] (CR) Jforrester: [C: +2] The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[00:25:17] (PS3) Jforrester: The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[00:25:17] (Merged) jenkins-bot: The preprocessorClass property in $wgParserConf doesn't do anything any more [mediawiki-config] - https://gerrit.wikimedia.org/r/567155 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[00:25:19] (CR) Jforrester: [C: +2] The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: C. Scott Ananian)
[00:26:05] (Merged) jenkins-bot: The $wgMaxGeneratedPPNodeCount configuration variable no longer has any effect [mediawiki-config] - https://gerrit.wikimedia.org/r/567157 (https://phabricator.wikimedia.org/T204945) (owner: C.
Scott Ananian)
[00:27:47] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgMaxGeneratedPPNodeCount or wgParserConf::preprocessorClass, never read (duration: 00m 56s)
[00:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:45] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T246009 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:29:45] ACKNOWLEDGEMENT - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP Ayounsi https://phabricator.wikimedia.org/T246009 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:29:45] ACKNOWLEDGEMENT - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP Ayounsi https://phabricator.wikimedia.org/T246009 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:32:20] (PS4) Holger Knust: changeprop: New helmfiles for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193)
[00:33:00] (PS2) Jforrester: Delete fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/552549 (https://phabricator.wikimedia.org/T238803)
[00:35:31] (CR) Jforrester: [C: +2] Delete fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/552549 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[00:36:02] (CR) Holger Knust: "Fixed the Kafka server names and removed the rogue/outdated comments" (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust)
[00:36:06] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803
(Jdforrester-WMF)
[00:36:07] (Merged) jenkins-bot: Delete fixcopyrightwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/552549 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[00:39:11] !log jforrester@deploy1001 Scap failed!: Call to mwscript eval.php stderr: not empty
[00:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:07] Fun times.
[00:41:23] (PS1) Jforrester: CS: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign [mediawiki-config] - https://gerrit.wikimedia.org/r/574614
[00:41:23] (CR) Jforrester: [C: +2] CS: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign [mediawiki-config] - https://gerrit.wikimedia.org/r/574614 (owner: Jforrester)
[00:42:20] (Merged) jenkins-bot: CS: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign [mediawiki-config] - https://gerrit.wikimedia.org/r/574614 (owner: Jforrester)
[00:43:25] !log jforrester@deploy1001 Synchronized dblists/all.dblist: T238803: Remove fixcopyrightwiki from all.dblist (duration: 00m 56s)
[00:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:31] T238803: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803
[00:44:59] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: T238803: Remove fixcopyrightwiki from wikiversions
[00:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:13] !log jforrester@deploy1001 Synchronized dblists/: T238803: Remove fixcopyrightwiki from dblists in general (duration: 00m 58s)
[00:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:53] !log jforrester@deploy1001 Synchronized static/images/project-logos/: T238803: Remove fixcopyrightwiki project logos (duration: 00m 56s)
[00:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:50] !log Confirmed not SUL
entries for fixcopyrightwiki as expected T238803
[00:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:55] T238803: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803
[00:48:59] Operations, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (Mooeypoo)
[00:50:08] Operations, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (aezell) I approve these credentials for these staff.
[00:51:01] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop trying to read wmgUseSkinPerPage or wmgUseEUCopyrightCampaign (duration: 00m 55s)
[00:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:53] !log Ran `DELETE FROM globalimagelinks WHERE gil_wiki='fixcopyrightwiki';` - one row removed T238803
[00:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:10] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T238803: Remove all IS config related to the fixcopyrightwiki wiki (duration: 00m 55s)
[00:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:56] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[00:55:25] (PS2) Jforrester: Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803)
[00:55:27] (PS1) Jforrester: Drop i18n load for SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/574616 (https://phabricator.wikimedia.org/T238803)
[00:56:43] (CR) Jforrester: [C:
+2] Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[00:57:40] (Merged) jenkins-bot: Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[00:59:12] (CR) Jforrester: [C: +2] Drop i18n load for SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/574616 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[00:59:48] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T238803: Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin (duration: 00m 56s)
[00:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:55] T238803: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803
[01:00:08] (Merged) jenkins-bot: Drop i18n load for SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - https://gerrit.wikimedia.org/r/574616 (https://phabricator.wikimedia.org/T238803) (owner: Jforrester)
[01:05:17] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[01:06:16] (CR) Ppchelko: "Looks better.
Inlined one ask for help from @Alex" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust)
[01:06:32] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[01:07:40] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[01:09:08] Operations, Cleanup, Traffic, fixcopyright.wikimedia.org, and 4 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF)
[01:11:04] (PS1) Jforrester: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/574618
[01:11:23] (CR) Jforrester: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/574618 (owner: Jforrester)
[01:12:15] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/574618 (owner: Jforrester)
[01:12:37] Operations, Cleanup, Release-Engineering-Team-TODO, Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) Thank you for all of your work on this, @Jdforrester-WMF!
[01:12:42] !log jforrester@deploy1001 Synchronized wmf-config/interwiki.php: T238803: Update interwiki cache (duration: 00m 56s)
[01:12:51] Operations, Cleanup, Release-Engineering-Team-TODO, Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (Jdforrester-WMF) a:Jdforrester-WMF→None This is now done as much as we can at RelEng's side. Assigning back over to CPT for the task tr...
[01:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:53] T238803: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803
[01:13:46] (PS4) Jforrester: Enable password-reset-update on Wikivoyages and Wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/573788 (https://phabricator.wikimedia.org/T245792) (owner: Samwilson)
[01:15:41] Operations, Cleanup, Release-Engineering-Team-TODO, Traffic, and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (CCicalese_WMF) a:CCicalese_WMF
[02:12:59] (CR) Gergő Tisza: [C: +1] NewcomerTasks: Enable guidance on betalabs wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/574427 (https://phabricator.wikimedia.org/T245525) (owner: Kosta Harlan)
[02:13:22] (CR) Gergő Tisza: [C: +1] NewcomerTasks: Disable guidance on all wikis except testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/574428 (https://phabricator.wikimedia.org/T245525) (owner: Kosta Harlan)
[02:25:49] (PS1) Gergő Tisza: Add ORES topics related config for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359)
[02:27:59] (PS1) Gergő Tisza: Enable articletopic: search keyword in CirrusSearch [mediawiki-config] - https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559)
[02:28:51] (CR) Gergő Tisza: "Blocked on the ES index update. Not sure if it's hard- or soft-blocked though."
[mediawiki-config] - https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: Gergő Tisza)
[02:43:23] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1277.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:51:05] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[02:51:09] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[02:53:41] Operations: Install private instance of gnomon for greater SRE team - https://phabricator.wikimedia.org/T246062 (CDanis)
[03:29:30] Operations: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 (Bawolff)
[04:01:49] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[04:01:55] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[04:02:47] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:22:35] PROBLEM - MariaDB Slave Lag: s3 on db1140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1247.91 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:54:58] (PS1) Marostegui: es102[0-5]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/574641 (https://phabricator.wikimedia.org/T243052)
[05:56:31] (CR) Marostegui: [C: +2] es102[0-5]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/574641 (https://phabricator.wikimedia.org/T243052) (owner: Marostegui)
[06:02:05] !log Move labsdb1010 under db2094:3318 - T232446
[06:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:13] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[06:03:25] RECOVERY - MariaDB Slave Lag: s3 on db1140 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to analyze recentchanges table - T242702', diff saved to https://phabricator.wikimedia.org/P10508 and previous config saved to /var/cache/conftool/dbconfig/20200225-065741-marostegui.json
[06:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:49] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702
[07:28:39] (CR) DCausse: [C: +1] "I'd say "soft-blocked", the ES term query used by the keyword will silently ignore that the field is not yet declared in the mapping."
[mediawiki-config] - https://gerrit.wikimedia.org/r/574634 (https://phabricator.wikimedia.org/T240559) (owner: Gergő Tisza)
[07:48:19] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (Vgutierrez)
[07:53:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1107 for 10.4 testing in main API and special groups - T242702', diff saved to https://phabricator.wikimedia.org/P10510 and previous config saved to /var/cache/conftool/dbconfig/20200225-075304-marostegui.json
[07:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:11] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702
[07:57:21] (PS1) Vgutierrez: install_server: Reimage lvs2010 as buster [puppet] - https://gerrit.wikimedia.org/r/574659 (https://phabricator.wikimedia.org/T245984)
[07:58:22] (PS1) Muehlenhoff: Remove access for flemmerich [puppet] - https://gerrit.wikimedia.org/r/574660
[07:59:20] (CR) Vgutierrez: [C: +2] install_server: Reimage lvs2010 as buster [puppet] - https://gerrit.wikimedia.org/r/574659 (https://phabricator.wikimedia.org/T245984) (owner: Vgutierrez)
[08:00:12] (PS1) Filippo Giunchedi: WIP: hwraid-1dev recipe [puppet] - https://gerrit.wikimedia.org/r/574661
[08:00:14] (PS1) Filippo Giunchedi: install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955)
[08:00:51] (CR) jerkins-bot: [V: -1] Remove access for flemmerich [puppet] - https://gerrit.wikimedia.org/r/574660 (owner: Muehlenhoff)
[08:02:55] Operations: Deprecate msdos partition scheme in favor of GPT - https://phabricator.wikimedia.org/T239321 (fgiunchedi) Open→Declined Parent task will be taking care of moving to GPT everywhere
[08:02:59] Operations, Patch-For-Review, User-fgiunchedi: Standardizing our partman recipes -
https://phabricator.wikimedia.org/T156955 (fgiunchedi)
[08:03:14] (PS2) Muehlenhoff: Remove access for flemmerich [puppet] - https://gerrit.wikimedia.org/r/574660
[08:03:56] Operations, Traffic, Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20...
[08:05:20] (CR) Muehlenhoff: "You can also remove druid-4ssd-raid10.cfg along" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:06:38] (CR) Muehlenhoff: [C: +2] Remove access for flemmerich [puppet] - https://gerrit.wikimedia.org/r/574660 (owner: Muehlenhoff)
[08:06:49] (CR) Nikerabbit: [C: +1] "Test plan:" [mediawiki-config] - https://gerrit.wikimedia.org/r/416973 (owner: KartikMistry)
[08:10:58] (PS2) Filippo Giunchedi: install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955)
[08:11:48] (PS3) Filippo Giunchedi: install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955)
[08:12:08] (CR) Filippo Giunchedi: "> Patch Set 1:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:16:14] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:22:31] (PS1) Elukey: Add spark LD_LIBRARY_PATH hints to Yarn NM in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/574666 (https://phabricator.wikimedia.org/T244499)
[08:22:55] !log vgutierrez@cumin1001 START - Cookbook
sre.hosts.downtime
[08:22:55] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:47] (CR) Elukey: [C: +2] Add spark LD_LIBRARY_PATH hints to Yarn NM in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/574666 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[08:26:38] Operations, Traffic: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` Of which those **FAILED**: ` ['lvs2010.codfw.wmnet'] `
[08:37:50] (CR) Muehlenhoff: "Ah, ok. That's custom by us and not in reprepro, so all fine." [puppet] - https://gerrit.wikimedia.org/r/574584 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[08:39:48] (PS1) Muehlenhoff: Add tschumann to absent_ldap table [puppet] - https://gerrit.wikimedia.org/r/574694
[08:47:27] (CR) Muehlenhoff: [C: +2] Add tschumann to absent_ldap table [puppet] - https://gerrit.wikimedia.org/r/574694 (owner: Muehlenhoff)
[08:49:23] (PS1) Marostegui: db-eqiad,db-codfw.php: Enable es4 as new external store [mediawiki-config] - https://gerrit.wikimedia.org/r/574696
[08:49:45] (CR) Marostegui: [C: -2] "Needs discussion" [mediawiki-config] - https://gerrit.wikimedia.org/r/574696 (owner: Marostegui)
[08:51:35] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/574597 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[08:51:43] (PS2) Marostegui: db-eqiad,db-codfw.php: Enable es4 as new external store [mediawiki-config] - https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072)
[08:52:28] (CR) Marostegui: [C: -2] "Please see https://phabricator.wikimedia.org/T246072 for context"
[mediawiki-config] - https://gerrit.wikimedia.org/r/574696 (https://phabricator.wikimedia.org/T246072) (owner: Marostegui)
[08:57:06] in abundance of caution I'll be disabling puppet across the A:swift-be cluster to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/574426 cc godog
[08:58:01] volans: +1 thank you for taking care of that
[09:00:13] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (MoritzMuehlenhoff) >>! In T245997#5912965, @Ottomata wrote: > @Muehlenhoff just double checking: Fsalutari has an NDA, can I just add to `nda` LDAP group? Y...
[09:03:21] (CR) Volans: [C: +2] swift: optimize ferm rules [puppet] - https://gerrit.wikimedia.org/r/574426 (owner: Volans)
[09:06:20] godog: anything special to check on swift logs to make sure everything is ok? I've applied in on ms-be2056
[09:07:21] s/in on/it on/
[09:09:44] !log addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=10to20holes-24feb1345
[09:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:30] !log addshore@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=1 --file=10to20holes-24feb1345 # T219123
[09:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:36] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[09:14:53] (PS4) Muehlenhoff: Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404
[09:17:25] (PS1) Elukey: Add specific settings for libcrypto in Hadoop Test [puppet] - https://gerrit.wikimedia.org/r/574702 (https://phabricator.wikimedia.org/T244499)
[09:17:29] volans: you'll see errors for sure if swift can't talk to
other hosts, also if the ulogd drop logs are silent we're likely ok
[09:17:41] relocating, bbiab
[09:17:47] so far all looks good
[09:17:57] eggcelent!
[09:18:18] the drop logs are in /var/log/ulog/syslogemu.log ?
[09:22:48] (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574702 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[09:23:40] (PS2) Elukey: Add specific settings for libcrypto in Hadoop Test [puppet] - https://gerrit.wikimedia.org/r/574702 (https://phabricator.wikimedia.org/T244499)
[09:23:55] in syslog, prefixed with [fw-in-drop]
[09:24:05] thanks moritzm
[09:24:41] last one 08:11:05 UTC, before the merge
[09:25:22] (CR) Elukey: Add specific settings for libcrypto in Hadoop Test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/574702 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[09:25:29] sounds good, for internal hosts, dropped packages are rare anyway, there's the odd PXE packet broadcases (which gets filtered out by default)
[09:28:39] (CR) Muehlenhoff: [C: +2] Enable CAS authentication for tendril/dbmonitor [puppet] - https://gerrit.wikimedia.org/r/574404 (owner: Muehlenhoff)
[09:29:18] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21029/" [puppet] - https://gerrit.wikimedia.org/r/574702 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[09:30:04] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[09:30:06] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:46] !log re-enabling puppet on A:swift-be-codfw
[09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:41] (CR) Nikerabbit: [C: +1] Enable CX out of beta in eu, sw and ta WPs
[mediawiki-config] - https://gerrit.wikimedia.org/r/574469 (https://phabricator.wikimedia.org/T245446) (owner: KartikMistry)
[09:32:53] (CR) Elukey: [C: +1] install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:41:42] moritzm: are you taking care of dbmonitor2001?
[09:42:43] yeah, patch incoming
[09:42:45] (PS1) Muehlenhoff: Fix Hiera variables names in CAS tendril template [puppet] - https://gerrit.wikimedia.org/r/574704
[09:42:52] moritzm: ok, no rush
[09:43:03] just in case it wasn't noticed :-D
[09:43:09] It has 0 priority
[09:43:45] thanks :-) I'm using dbmonitor2001 to smoketest the puppet changes before enabling puppet on 1001
[09:44:18] the web service doesn't work there anyway
[09:44:37] as in, the application returns a blank page at the moment
[09:45:24] (before deployment for a few month already)
[09:45:38] yeah, but it was at least useful to spot the apache syntax error e.g.
[09:45:43] indeed
[09:45:53] that is the thing that I was asking :-D
[09:46:16] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21030/" [puppet] - https://gerrit.wikimedia.org/r/574704 (owner: Muehlenhoff)
[09:46:18] (CR) Muehlenhoff: [C: +2] Fix Hiera variables names in CAS tendril template [puppet] - https://gerrit.wikimedia.org/r/574704 (owner: Muehlenhoff)
[09:47:02] (CR) Filippo Giunchedi: [C: +2] install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:47:14] (PS4) Filippo Giunchedi: install_server: move druid to standard partman recipe [puppet] - https://gerrit.wikimedia.org/r/574662 (https://phabricator.wikimedia.org/T156955)
[09:52:00] (PS1) Muehlenhoff: Fix syntax for LimitExcept block [puppet] - https://gerrit.wikimedia.org/r/574706
[09:54:56] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21031/" [puppet] - https://gerrit.wikimedia.org/r/574706 (owner: Muehlenhoff)
[09:54:59] (CR) Muehlenhoff: [C: +2] Fix syntax for LimitExcept block [puppet] - https://gerrit.wikimedia.org/r/574706 (owner: Muehlenhoff)
[09:56:11] Operations, Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (Aklapper) @jbond: Assuming this task is about #Operations, hence adding that project tag.
[09:57:17] Operations, Performance Issue: Investigate CAS performance - https://phabricator.wikimedia.org/T246010 (jbond) p:Triage→Medium
[10:07:50] (PS1) Jbond: idp: allow people.wikimedia.org to authenticate against apereo_cas [puppet] - https://gerrit.wikimedia.org/r/574709
[10:10:30] (PS15) Effie Mouzeli: mediawiki: send apache logs to logstash [puppet] - https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472)
[10:11:59] !log re-enabling puppet on A:swift-be-eqiad
[10:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:30] (CR) Elukey: mcrouter: add gutter pool servers in configuration (8 comments) [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: Effie Mouzeli)
[10:40:52] (CR) Alexandros Kosiaris: [C: -1] "A number of comments inline." (7 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust)
[10:47:02] (PS1) Alexandros Kosiaris: Showcase redis pass population for changeprop [labs/private] - https://gerrit.wikimedia.org/r/574713 (https://phabricator.wikimedia.org/T213193)
[10:47:59] (CR) Alexandros Kosiaris: [C: -1] "The labs/private change I was talking about is at https://gerrit.wikimedia.org/r/#/c/labs/private/+/574713" [deployment-charts] - https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: Holger Knust)
[10:54:24] (PS17) Effie Mouzeli: mcrouter: add gutter pool servers in configuration [puppet] - https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089)
[10:56:11] (CR) Alexandros Kosiaris: [V: +2 C: +2] Fix parent permissions; inherit from `operations/software` [software/nss-dnsdc] (refs/meta/config) - https://gerrit.wikimedia.org/r/574147 (owner: MarcoAurelio)
[10:56:17] (CR) Alexandros Kosiaris: [V: +2 C: +2]
"Thanks!" [software/nss-dnsdc] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/574147 (owner: 10MarcoAurelio) [10:57:45] (03CR) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [10:58:24] (03PS3) 10Effie Mouzeli: hieradata: fix new lines in monitoring.yaml [puppet] - 10https://gerrit.wikimedia.org/r/570062 [11:04:59] (03CR) 10KartikMistry: [C: 03+1] cxserver: Remove logstash logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/573240 (https://phabricator.wikimedia.org/T219921) (owner: 10Alexandros Kosiaris) [11:05:41] 10Operations, 10Release-Engineering-Team, 10serviceops: mcrouter proxies and scap proxies - https://phabricator.wikimedia.org/T245841 (10jijiki) If there are no objections, I would like to proceed with this [11:23:31] (03PS1) 10Vgutierrez: lvs: Rename ifaces for lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/574714 (https://phabricator.wikimedia.org/T245984) [11:32:20] (03CR) 10Vgutierrez: [C: 03+2] lvs: Rename ifaces for lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/574714 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [11:36:51] (03PS1) 10Vgutierrez: lvs: Set txqlen for the proper ifaces on lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/574715 (https://phabricator.wikimedia.org/T245984) [11:39:51] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set txqlen for the proper ifaces on lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/574715 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [11:44:04] (03PS1) 10Holger Knust: changeprop: Change names of redis keys in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 [11:46:25] (03CR) 10Kosta Harlan: [C: 03+1] Add ORES topics related config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359) (owner: 
10Gergő Tisza) [11:53:09] (03PS1) 10Vgutierrez: pybal: Allow overriding BGP med [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) [11:54:38] (03PS1) 10Hnowlan: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 [11:55:03] (03PS1) 10Urbanecm: New throttle rule for arwiki WikiGap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574720 (https://phabricator.wikimedia.org/T246092) [11:55:18] (03PS2) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [11:55:49] (03PS1) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [12:00:01] (03CR) 10Vgutierrez: "pcc is happy and shows a NOOP: https://puppet-compiler.wmflabs.org/compiler1001/21033/" [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T1200). [12:00:04] kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:19] \o [12:00:27] hi kostajh ! [12:00:30] o/ [12:00:39] I can SWAT today! 
[12:00:40] hi Urbanecm :) [12:01:02] I have something to backport but I will go last [12:01:04] (03PS2) 10Urbanecm: NewcomerTasks: Enable guidance on betalabs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574427 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:01:09] ack, I'll ping you once done [12:01:11] (03PS2) 10Vgutierrez: pybal: Allow overriding BGP med [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) [12:01:17] (03CR) 10Urbanecm: [C: 03+2] "noop, beta-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574427 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:01:27] jbond42: ^^ may I get a review for that one? [12:01:40] 10Operations, 10Wikimedia-Incident: Investigate whether we can automatically share incident status docs with WMDE - https://phabricator.wikimedia.org/T244395 (10Ladsgroup) >>! In T244395#5914004, @RLazarus wrote: > "On Monday" turned out to be two weeks later -- sorry about that. Conclusions from today's SRE m... [12:02:18] (03Merged) 10jenkins-bot: NewcomerTasks: Enable guidance on betalabs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574427 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:02:38] kostajh: ^ should land on beta soon [12:02:43] Urbanecm: cool [12:02:52] (03PS2) 10Urbanecm: NewcomerTasks: Disable guidance on all wikis except testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574428 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:02:55] it's a no-op; no code is using it yet [12:02:59] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574428 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:03:31] kostajh: all of them? [12:03:38] Urbanecm: yes [12:03:45] thanks, I'll skip mwdebug then [12:03:53] Urbanecm: well [12:03:56] one sec [12:03:59] yes? 
[12:04:03] (03Merged) 10jenkins-bot: NewcomerTasks: Disable guidance on all wikis except testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574428 (https://phabricator.wikimedia.org/T245525) (owner: 10Kosta Harlan) [12:04:10] Urbanecm: let me double-check https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/574633 via mwdebug [12:04:15] the other two are no-ops [12:04:44] kostajh: okay [12:04:47] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359) (owner: 10Gergő Tisza) [12:04:52] (03PS2) 10Urbanecm: Add ORES topics related config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359) (owner: 10Gergő Tisza) [12:05:00] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359) (owner: 10Gergő Tisza) [12:06:04] btw, this isn't really swat, but when you guys are done, I was wondering if you could look something up for me on mwmaint1002? 
(I don't have access anymore) [12:06:06] (03Merged) 10jenkins-bot: Add ORES topics related config for GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574633 (https://phabricator.wikimedia.org/T243359) (owner: 10Gergő Tisza) [12:06:26] bawolff: sure :-) [12:06:27] It would be really helpful to debug something to know if /var/log/mediawiki/updateSpecialPages/cron-updatequerypages-wantedtemplates-s1-WantedTemplates.log on mwmaint1002 had any errors in it [12:06:38] This is for T246063 [12:06:39] T246063: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 [12:07:16] kostajh: both https://gerrit.wikimedia.org/r/c/574428/ and https://gerrit.wikimedia.org/r/c/574633/ are at mwdebug1002 [12:07:35] Urbanecm: thanks, looking [12:08:10] bawolff: no such file [12:08:17] Urbanecm: yep, all good [12:08:23] kostajh: syncing [12:08:24] (03CR) 10Jbond: "lgtm just some nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [12:09:01] (03PS5) 10Holger Knust: changeprop: New helmfiles for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) [12:09:14] vgutierrez: done, if you get a second could i get a review from you on https://gerrit.wikimedia.org/r/c/operations/puppet/+/574009 & 574010 :) [12:09:55] damn, it was a trap! [12:10:00] ;P sure [12:10:06] bawolff: /var/log/mediawiki/updateSpecialPages/updatequerypages-enwiki-only-WantedTemplates.log exists but is very short (“Fatal error: no version entry for `#`”) [12:10:08] bawolff: did you mean /var/log/mediawiki/updateSpecialPages/s1@11-WantedPages.log? 
[12:10:35] same for the file Urbanecm mentioned [12:10:40] :) [12:10:42] Ah, maybe i misinterpreted what $name was in https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/mediawiki/manifests/maintenance/updatequerypages/enwiki/cronjob.pp#L42 [12:10:45] thanks [12:10:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cdde3a2: db90d22 (T245525, T243359) (duration: 00m 58s) [12:10:57] kostajh: here you are! [12:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:01] T245525: Newcomer tasks: add feature flag for guidance - https://phabricator.wikimedia.org/T245525 [12:11:04] T243359: Define configuration for ORES articletopic search - https://phabricator.wikimedia.org/T243359 [12:11:07] Urbanecm: thanks! [12:11:11] happy to help [12:11:24] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574720 (https://phabricator.wikimedia.org/T246092) (owner: 10Urbanecm) [12:12:38] bawolff: nearly all the files in /var/log/mediawiki/updateSpecialPages/* are identical (same sha256sum), with that “no version entry” error [12:12:42] (03Merged) 10jenkins-bot: New throttle rule for arwiki WikiGap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574720 (https://phabricator.wikimedia.org/T246092) (owner: 10Urbanecm) [12:13:06] only exception is s8@18-MostLinked.log, which says something about it being killed instead [12:13:19] Hmm, i guess there is both an s1 job and enwiki specific jobs [12:13:24] So i think both those are relavent. [12:13:33] Thanks, that gives me something to look at :) [12:14:15] ok :) [12:14:27] 10Operations: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 (10Bawolff) ` 04:10 < Lucas_WMDE> bawolff: /var/log/mediawiki/updateSpecialPages/updatequerypages-enwiki-only-WantedTemplates.log exists but is very short (“Fatal error: no version... 
[12:14:29] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 1f58d9a: New throttle rule for arwiki WikiGap (T246092) (duration: 00m 56s) [12:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:38] T246092: Temporary lift IP cap for WikiGap edit-a-thon at Khawarizmi College in 5 March 2020 - https://phabricator.wikimedia.org/T246092 [12:14:44] Amir1: I'm done [12:15:12] Thanks. My patch is not merged yet [12:16:11] 10Operations: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 (10Lucas_Werkmeister_WMDE) I can also show you the whole file, it doesn’t look sensitive: ` ------------------------------------- # ------------------------------------- no version entry for `#`. Fata... [12:17:42] (03PS1) 10Urbanecm: make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 [12:17:57] bawolff: ^ I think this fixes the issue you're debugging [12:18:31] (03PS2) 10Urbanecm: make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 (https://phabricator.wikimedia.org/T246063) [12:18:31] Urbanecm: awesome! [12:19:10] 10Operations, 10Patch-For-Review: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 (10Urbanecm) The issue is mwscriptwikiset being broken, because it doesn't respect comments recently introduced in dblists. My patch should fix that. 
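Editorial sketch of the failure described in T246063 above. The real mwscriptwikiset script and Urbanecm's patch are not shown in this log, so this is not that code; it only illustrates, under stated assumptions, how a reader of dblist files that does not skip `#` comments can end up treating `#` as a wiki name (consistent with the "no version entry for `#`" error quoted earlier). The function name `read_dblist`, the sample contents, and the handling of trailing comments are all hypothetical.

```python
def read_dblist(text):
    """Return wiki database names from dblist-style text, ignoring comments.

    Assumption: one database name per line, with '#' starting a comment
    that runs to the end of the line.
    """
    wikis = []
    for raw_line in text.splitlines():
        # Drop everything from '#' onwards, then trim whitespace.
        name = raw_line.split("#", 1)[0].strip()
        if name:  # skip blank and comment-only lines
            wikis.append(name)
    return wikis


# Hypothetical dblist contents, for illustration only.
sample = """\
# Example header comment
enwiki
dewiki  # trailing comment
"""

# A naive whitespace split keeps '#' and the comment words as "wikis".
naive = sample.split()

# The comment-aware reader keeps only the database names.
cleaned = read_dblist(sample)
```

With the sample above, `naive` starts with `#`, while `cleaned` contains only the two database names.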
[12:19:20] (03CR) 10Holger Knust: "Added private folders for each cluster, added the statsd image tag to the values files, and created 574713 to keep the config.yaml changes" (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [12:19:28] (03CR) 10Brian Wolff: [C: 03+1] "This looks like it would solve the issue" [puppet] - 10https://gerrit.wikimedia.org/r/574726 (https://phabricator.wikimedia.org/T246063) (owner: 10Urbanecm) [12:20:55] (03PS3) 10Vgutierrez: pybal: Allow overriding BGP med [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) [12:23:00] 10Operations, 10Patch-For-Review: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 (10Bawolff) Timestamps of when this start failing match up with the patch that introduced comments on nov 26 fa9812e31415e, so I think this is indeed the cause. [12:24:07] (03CR) 10Lucas Werkmeister (WMDE): make mwscriptwikiset respect comments set in dblists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574726 (https://phabricator.wikimedia.org/T246063) (owner: 10Urbanecm) [12:24:30] (03CR) 10Urbanecm: make mwscriptwikiset respect comments set in dblists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574726 (https://phabricator.wikimedia.org/T246063) (owner: 10Urbanecm) [12:25:10] (03PS3) 10Urbanecm: make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 [12:25:56] bawolff: if you want, I can run the script manually, so we don't need to wait for the next month [12:26:32] Sure. 
I think that would keep the enwiki folks happy [12:26:54] Although it did take them 3 months to even notice, so its probably not that critical [12:26:55] okay, I'll do that :-) [12:28:51] (03PS4) 10Urbanecm: make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 (https://phabricator.wikimedia.org/T246063) [12:30:07] (03CR) 10Vgutierrez: [C: 04-1] "looks good, please fix the cert path for the acme_chief case" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [12:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1107 for 10.4 on special groups 10 -> 50 - T242702', diff saved to https://phabricator.wikimedia.org/P10511 and previous config saved to /var/cache/conftool/dbconfig/20200225-123222-marostegui.json [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:30] T242702: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 [12:33:04] !log Run mwscript updateSpecialPages.php --wiki=enwiki --override --only=Wantedtemplates, cron didn't do that for several months (T246063) [12:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:10] T246063: Special:WantedTemplates cronjob not running on enwiki - https://phabricator.wikimedia.org/T246063 [12:33:57] (03CR) 10Vgutierrez: [C: 03+2] "pcc is still happy: https://puppet-compiler.wmflabs.org/compiler1002/21034/" [puppet] - 10https://gerrit.wikimedia.org/r/574718 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [12:36:37] 10Operations: Have monitoring of updatequerypages cronjobs - https://phabricator.wikimedia.org/T246097 (10Bawolff) [12:37:07] Well I filed a bug for monitoring because this is like the fourth time its broke and nobody noticed for several months [12:38:09] +1 [12:38:11] I don't think I can make it to this SWAT, will do it later [12:40:33] (03PS3) 
10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [12:41:58] (03PS1) 10Vgutierrez: lvs: Set lvs2010 as a secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/574737 (https://phabricator.wikimedia.org/T245984) [12:42:59] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) [12:44:09] !log dns1002 - downtimed, disabled puppet, and depool (stop BGP adverts) for hardware work - T241770 [12:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:20] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) [12:45:51] 10Operations, 10DBA, 10OTRS, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) p:05Triage→03Medium [12:46:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1019 for on-site maintenance - T243963', diff saved to https://phabricator.wikimedia.org/P10512 and previous config saved to /var/cache/conftool/dbconfig/20200225-124650-marostegui.json [12:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:58] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [12:49:01] (03PS1) 10Jbond: microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 [12:49:04] !log dns1002 - shutdown for hardware work after confirming drain of live requests - T241770 [12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:44] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[12:49:46] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:50:07] this I guess is dns1002 :) [12:51:47] !log Stop mysql on es1019 - T243963 [12:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:20] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1, feel free to +2 and merge. Let me know if you don't have rights and we 'll add them." [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [12:52:36] elukey: yes, it is [12:52:42] will ack [12:54:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdnsrec site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:54:02] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 Brandon Black T241770 - dns1002 hardware work causes loss of BGP sessions to local routers https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:02] ACKNOWLEDGEMENT - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 Brandon Black T241770 - dns1002 hardware work causes loss of BGP sessions to local routers https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:57:04] PROBLEM - Host dns1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:57:19] heh that too, obviously [12:57:29] (03PS2) 10Jbond: microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 [13:00:17] !log Run mwscript updateSpecialPages.php --wiki=enwiki --override --only=Uncategorizedcategories, cron didn't do that for several months (T246063) [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:23] (03PS3) 10Jbond: microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 [13:00:24] T246063: Special:WantedTemplates cronjob not running on enwiki - 
https://phabricator.wikimedia.org/T246063 [13:01:27] (03CR) 10Holger Knust: [C: 03+1] "Merging based on review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [13:02:52] (03PS4) 10Hnowlan: Admin: Add changeprop namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) [13:03:16] RECOVERY - Host dns1002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [13:03:39] (03CR) 10Holger Knust: "+1 and +1 does not equal +2, so I don't think I have the rights" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [13:03:44] 10Operations, 10Patch-For-Review: Special:WantedTemplates cronjob not running on all wikis - https://phabricator.wikimedia.org/T246063 (10Urbanecm) [13:05:24] (03PS4) 10Jbond: microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 [13:07:30] (03PS5) 10Jbond: microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 [13:10:23] (03PS1) 10Cmjohnson: updating mac for dns1002 to match new nic card [puppet] - 10https://gerrit.wikimedia.org/r/574740 (https://phabricator.wikimedia.org/T241770) [13:11:15] (03CR) 10Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/21039" [puppet] - 10https://gerrit.wikimedia.org/r/574738 (owner: 10Jbond) [13:13:37] (03CR) 10Cmjohnson: [C: 03+2] updating mac for dns1002 to match new nic card [puppet] - 10https://gerrit.wikimedia.org/r/574740 (https://phabricator.wikimedia.org/T241770) (owner: 10Cmjohnson) [13:20:33] (03PS1) 10Filippo Giunchedi: smokeping: temp cr2-esams disable [puppet] - 10https://gerrit.wikimedia.org/r/574741 (https://phabricator.wikimedia.org/T246009) [13:21:20] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:20] (03CR) 10Ayounsi: [C: 03+2] smokeping: temp cr2-esams disable [puppet] - 10https://gerrit.wikimedia.org/r/574741 (https://phabricator.wikimedia.org/T246009) (owner: 10Filippo Giunchedi) [13:22:28] (03CR) 10Jcrespo: "+1, let's create the revert immediately so we remember to deploy it as soon as hw is backup up." [puppet] - 10https://gerrit.wikimedia.org/r/574741 (https://phabricator.wikimedia.org/T246009) (owner: 10Filippo Giunchedi) [13:22:34] (03CR) 10Jcrespo: [C: 03+1] smokeping: temp cr2-esams disable [puppet] - 10https://gerrit.wikimedia.org/r/574741 (https://phabricator.wikimedia.org/T246009) (owner: 10Filippo Giunchedi) [13:23:36] chaomodus: that's a timeout on the google api call, we should probably add some retry logic there ^^^ [13:27:22] jynus: good idea re: revert, will do [13:27:57] also a reminder on the relevant ticket [13:28:02] !log mwscript updateSpecialPages.php --wiki=enwiki --override --only=Mostcategories [13:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:39] (03PS5) 10Jbond: profile::tlsproxy::envoy: add support for acme certs [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) [13:28:45] (03PS1) 10Filippo Giunchedi: Revert "smokeping: temp cr2-esams disable" [puppet] - 10https://gerrit.wikimedia.org/r/574742 (https://phabricator.wikimedia.org/T246009) [13:28:58] (03PS1) 10Lucas Werkmeister (WMDE): Reinstate wgULSLanguageDetection setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574743 (https://phabricator.wikimedia.org/T246071) [13:29:15] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574010 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:30:09] (03CR) 10Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1001/21040/" [puppet] - 10https://gerrit.wikimedia.org/r/574737 
(https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [13:35:42] (03CR) 10Volans: [C: 04-1] "One minor detail to fix, see inline" (035 comments) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [13:37:23] (03PS1) 10Cmjohnson: Adding production dns for cloudvirt-wdqs100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/574745 (https://phabricator.wikimedia.org/T235685) [13:37:32] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set lvs2010 as a secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/574737 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [13:39:18] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for cloudvirt-wdqs100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/574745 (https://phabricator.wikimedia.org/T235685) (owner: 10Cmjohnson) [13:39:57] jouncebot: next [13:39:57] In 3 hour(s) and 20 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T1700) [13:41:23] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: write kafka inputs to es by default [puppet] - 10https://gerrit.wikimedia.org/r/574467 (https://phabricator.wikimedia.org/T227080) (owner: 10Filippo Giunchedi) [13:41:50] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2010.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... 
[13:42:05] !log roll-restart logstash in eqiad/codfw - T227080 [13:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] T227080: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 [13:43:29] (03Abandoned) 10Vgutierrez: lvs: Replace lvs2006 with lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/473734 (https://phabricator.wikimedia.org/T209337) (owner: 10Vgutierrez) [13:44:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:03] (03CR) 10Volans: [C: 04-1] "Some comment inline, I don't think it works fine as is." (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/568683 (owner: 10CRusnov) [13:58:35] (03CR) 10Nikerabbit: [C: 03+1] "It shouldn't have been removed indeed!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574743 (https://phabricator.wikimedia.org/T246071) (owner: 10Lucas Werkmeister (WMDE)) [13:58:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:00:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] (03PS1) 10Muehlenhoff: Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 [14:03:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:15] lovely [14:03:28] (03PS1) 10Aklapper: phabricator weekly changes 
email: List Herald rules by inactive users [puppet] - 10https://gerrit.wikimedia.org/r/574751 [14:04:01] (03PS2) 10Aklapper: phabricator weekly changes email: List Herald rules by inactive users [puppet] - 10https://gerrit.wikimedia.org/r/574751 (https://phabricator.wikimedia.org/T246105) [14:06:12] (03CR) 10Lucas Werkmeister (WMDE): "I suppose this could be moved into CommonSettings.php if we don’t need it to vary by wiki, but I’ll leave that for later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574743 (https://phabricator.wikimedia.org/T246071) (owner: 10Lucas Werkmeister (WMDE)) [14:06:45] jouncebot: now [14:06:45] No deployments scheduled for the next 2 hour(s) and 53 minute(s) [14:07:18] (03CR) 10Aklapper: "Probably also needs some MariaDB permissions fiddling to allow accessing the "phabricator_herald" DB. I have no clue how to do that. :-/" [puppet] - 10https://gerrit.wikimedia.org/r/574751 (https://phabricator.wikimedia.org/T246105) (owner: 10Aklapper) [14:07:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:08:24] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/574743 is a high-impact fix, any objections to me deploying it now? [14:08:58] ping esp. vgutierrez since it looks like you’re also doing things at the moment [14:09:32] I'm not messing with anything in production right now :) [14:09:38] ok thanks :) [14:09:55] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2010.codfw.wmnet'] ` and were **ALL** successful. 
[14:10:13] Lucas_WMDE: we know what caused it? [14:11:12] effie: the patch to remove fixcopyrightwiki completely removed a setting which had only a default and a fixcopyrightwiki override [14:11:19] under the assumption that the default matched the code default [14:11:21] but it doesn’t [14:11:30] so now it falls back to the code default instead of the old IS.php default [14:11:34] (hope that makes sense) [14:11:47] hehe, if we have a fix, it is all good [14:12:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s deploy this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574743 (https://phabricator.wikimedia.org/T246071) (owner: 10Lucas Werkmeister (WMDE)) [14:13:24] (03Merged) 10jenkins-bot: Reinstate wgULSLanguageDetection setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574743 (https://phabricator.wikimedia.org/T246071) (owner: 10Lucas Werkmeister (WMDE)) [14:13:26] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [14:13:57] change is on mwdebug1001. 
testing [14:14:13] yup, interface language is back to some chinese at https://zh-yue.wikipedia.org/wiki/%E9%A0%AD%E7%89%88 in private window [14:14:47] !log add bgp session to 10.192.49.7 (lvs2010) on cr1/cr2-codfw [14:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:57] syncing [14:15:52] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:574743|Reinstate wgULSLanguageDetection setting (T246071)]] (duration: 01m 03s) [14:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:58] T246071: Interface language using Accept-Language header value instead of $wgLanguageCode - https://phabricator.wikimedia.org/T246071 [14:16:00] !log dns1002 - start reimage - T241770 [14:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:22] (03CR) 10Elukey: [C: 03+1] Enable CASValidateSAML for tendril [puppet] - 10https://gerrit.wikimedia.org/r/574747 (owner: 10Muehlenhoff) [14:17:50] (03PS1) 10Vgutierrez: lvs: Set BGP peers for lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/574753 (https://phabricator.wikimedia.org/T196560) [14:20:17] !log update puppet compiler facts [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:04] (03PS1) 10BBlack: update dns1002 macaddr [puppet] - 10https://gerrit.wikimedia.org/r/574754 (https://phabricator.wikimedia.org/T241770) [14:23:31] (03CR) 10BBlack: [V: 03+2 C: 03+2] update dns1002 macaddr [puppet] - 10https://gerrit.wikimedia.org/r/574754 (https://phabricator.wikimedia.org/T241770) (owner: 10BBlack) [14:25:41] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (10Ottomata) @Fsalutari try now. @Muehlenhoff this user already has a shell entry in data.yaml...is that what you mean? 
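Editorial sketch of the regression Lucas_WMDE explained at 14:11 (T246071): a setting had only a default and a fixcopyrightwiki override in InitialiseSettings.php, the whole entry was removed, and lookups then fell through to the code default, which differed from the old IS.php default. The lookup function, dictionary layout, and placeholder values below are assumptions made for illustration. They are not MediaWiki's actual configuration machinery, and the real values of the setting are not stated in this log.

```python
# Placeholder for whatever the extension code uses when nothing is configured.
CODE_DEFAULT = {"wgULSLanguageDetection": "code-default"}


def resolve(setting, wiki, site_config):
    """Resolve a setting for one wiki (hypothetical lookup order).

    1. per-wiki override in site_config, if present
    2. site_config's 'default' entry
    3. the code default, when the setting is absent from site_config
    """
    per_wiki = site_config.get(setting)
    if per_wiki is None:
        return CODE_DEFAULT[setting]
    return per_wiki.get(wiki, per_wiki["default"])


# Before the cleanup: a site default plus one override.
before = {
    "wgULSLanguageDetection": {
        "default": "is-default",
        "fixcopyrightwiki": "fixcopyright-override",
    }
}

# The cleanup removed the whole entry, not just the override.
after_removal = {}

# The fix reinstates the site default.
after_fix = {"wgULSLanguageDetection": {"default": "is-default"}}

for label, cfg in [("before", before), ("removed", after_removal), ("fixed", after_fix)]:
    print(label, resolve("wgULSLanguageDetection", "zh_yuewiki", cfg))
```

The middle case is the regression: once the entry is gone, every wiki silently gets the code default rather than the site default.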
[14:26:42] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1003/21042/" [puppet] - 10https://gerrit.wikimedia.org/r/574753 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [14:28:10] (03CR) 10Volans: [C: 04-1] "@godog, would appreciate your input on the exposed metrics and cardinality" (038 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [14:28:19] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:208:80:155:108) [14:28:27] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> +1 and +1 does not equal +2, so I don't think I have the rights" [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [14:28:48] bblack: is that you? ^^ [14:29:21] (03CR) 10Holger Knust: [C: 03+2] changeprop: Change names of redis keys in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [14:29:42] (03Merged) 10jenkins-bot: changeprop: Change names of redis keys in config.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/574716 (owner: 10Holger Knust) [14:30:02] !log restart pybal with BGP enabled on lvs2010 - T245984 T196560 [14:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [14:30:10] T196560: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [14:30:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics-external: Add k8s token [puppet] - 10https://gerrit.wikimedia.org/r/573602 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [14:30:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:31:47] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: Return code of 255 is out of bounds https://wikitech.wikimedia.org/wiki/DNS [14:33:38] okay it's working in mwdebug1002 without any issues, preparing to deploy [14:33:54] ok [14:34:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-analytics-external: Add the namespace and calico rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [14:34:23] (03PS2) 10Alexandros Kosiaris: eventgate-analytics-external: Add the namespace and calico rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) [14:34:43] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: CRITICAL - Destination Unreachable (2620:0:861:4:208:80:155:108) [14:34:50] going live [14:34:56] addshore: marostegui jynus [14:35:36] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib: [[gerrit:574746|wbterms: only select entity terms that are requested (T246005)]] (duration: 01m 02s) [14:35:41] Amir1: ping only me for today [14:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:43] T246005: Wikibase client description API module results in 15k selected rows with new term storage - https://phabricator.wikimedia.org/T246005 [14:35:45] I am around [14:35:51] Thanks [14:35:53] sure [14:37:19] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10ayounsi) p:05Triage→03High [14:37:38] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Cmjohnson) I attempted to update the idrac f/w for es1019 but the update failed several times for not being able to verify package signature. 
The update was downloaded directly from dell's portal... [14:37:39] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [14:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] number of get requests is going down, I don't know if it's related or a bot just gave up [14:38:24] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=26 [14:38:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:39:05] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:22] bblack: we have a situation here https://phabricator.wikimedia.org/T246071 [14:39:26] addshore: Look at this beauty: https://logstash.wikimedia.org/goto/039ea1e327733188066c7b80deb53918 [14:39:42] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:55] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:01] (03PS1) 10Filippo Giunchedi: netbox: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574760 (https://phabricator.wikimedia.org/T245511) [14:40:16] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:52] PROBLEM - Host 2620:0:861:4:b226:28ff:fed9:e070 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:06] Amir1: GET requests probably connected to https://phabricator.wikimedia.org/T246071 [14:43:34] (03CR) 10Filippo Giunchedi: [C: 03+2] netbox: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574760 (https://phabricator.wikimedia.org/T245511) (owner: 10Filippo Giunchedi) [14:43:36] PROBLEM - Host 2620:0:861:4:b226:28ff:fed9:e070 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:41] okay [14:43:54] IPv6 IP ? [14:43:57] no DNS ? [14:44:24] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:10] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:13] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10akosiaris) @Ottomata, tokens created and being propagaged across the cluster... 
[14:46:07] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) Thank you! [14:46:19] akosiaris: our infra never ceases to surprise me [14:46:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (10Fsalutari) I can now log in! Thanks! [14:46:50] !log roll restart netbox uwsgi - T245511 [14:46:51] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [14:46:53] ah, new rec DNS check [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:56] T245511: Move netbox uwsgi logs to logging pipeline - https://phabricator.wikimedia.org/T245511 [14:46:59] (03PS1) 10Cmjohnson: Adding cloudvirt-wdqs servers to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/574761 (https://phabricator.wikimedia.org/T235685) [14:47:02] jynus: I'm planning to move ahead with read new on the term store but beforehand https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-1h&to=now [14:47:25] the number of rows read hasn't changed but MySQL traffic took a nosedive [14:47:30] and open connections [14:48:20] Amir1: just fyi, it seems like your deploy caused the previous deploy for not splitting varnish on accept-language to take effect [14:48:36] bawolff: oh yeah the IS.php cache thingy [14:48:51] hmmm it switched back to the correct IP now, funny [14:48:51] (03PS2) 10Cmjohnson: Adding cloudvirt-wdqs servers to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/574761
(https://phabricator.wikimedia.org/T235685) [14:48:52] Lucas_WMDE: do you know about it? [14:49:07] T236104 [14:49:08] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 [14:49:08] Amir1: See backscroll in #wikimedia-releng [14:49:10] I wonder ... [14:49:18] Amir1: ugh [14:49:22] that task description sounds horrendous [14:49:24] * Lucas_WMDE reads [14:49:46] what in the name of [14:50:02] That's why I sync IS.php always twice [14:50:16] ugh [14:50:25] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [14:50:27] That's totally not going to cause downtime some day :S [14:50:31] (03PS1) 10Jbond: tendril: disable cas and enable ldap [puppet] - 10https://gerrit.wikimedia.org/r/574762 [14:50:34] I’ll… try to remember that [14:50:35] (03CR) 10Cmjohnson: [C: 03+2] Adding cloudvirt-wdqs servers to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/574761 (https://phabricator.wikimedia.org/T235685) (owner: 10Cmjohnson) [14:50:36] thank you [14:50:37] (03PS2) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [14:51:31] should I sync IS.php again just in case? [14:51:36] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) [14:51:47] (03PS6) 10Holger Knust: changeprop: New helmfiles for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) [14:52:05] Lucas_WMDE: no, I'm deploying something with IS.php [14:52:10] I do it [14:52:11] ok [14:52:14] thanks [14:53:03] any idea if this affects other parts of wmf-config/ as well? 
[14:53:10] I’ll add a bit to wikitech [14:53:26] latency went down to normal now [14:53:37] (03CR) 10Jbond: [C: 03+2] tendril: disable cas and enable ldap [puppet] - 10https://gerrit.wikimedia.org/r/574762 (owner: 10Jbond) [14:55:02] sent to ops as a reminder [14:55:46] (03PS1) 10Ladsgroup: Revert "Revert "Increase the reads for term store for clients for up to Q256K"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574764 [14:56:31] Lucas_WMDE: I never encountered it with other parts [14:56:56] added at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#operations/mediawiki-config_2 [14:57:51] (03PS2) 10Ladsgroup: Revert "Revert "Increase the reads for term store for clients for up to Q256K"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574764 [14:57:58] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Increase the reads for term store for clients for up to Q256K"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574764 (owner: 10Ladsgroup) [14:58:05] where are we on the accept-language thing? [14:58:25] (03CR) 10Hnowlan: [C: 03+1] changeprop: New helmfiles for deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [14:58:42] bblack: I think it's resolved. Lucas_WMDE ? [14:58:43] PROBLEM - Check systemd state on dbmonitor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:47] 10Operations, 10Beta-Cluster-Infrastructure, 10observability, 10serviceops: Stream a subset of mediawiki apache logs to logstash - https://phabricator.wikimedia.org/T244472 (10jijiki) After a lot of fiddling with @herron, we are finally at this https://phabricator.wikimedia.org/P10513 ! The resource field...
[14:59:05] (03Merged) 10jenkins-bot: Revert "Revert "Increase the reads for term store for clients for up to Q256K"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574764 (owner: 10Ladsgroup) [14:59:07] (03PS3) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [15:01:04] (03PS1) 10Filippo Giunchedi: debmonitor: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574765 (https://phabricator.wikimedia.org/T245512) [15:01:06] (03PS1) 10Filippo Giunchedi: puppetboard: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574766 (https://phabricator.wikimedia.org/T245512) [15:02:14] Amir1, bblack: I think so as well, I’m working on the incident documentation at the moment [15:02:37] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]] (duration: 00m 56s) [15:02:37] (03PS1) 10KartikMistry: Update cxserver to 2020-02-24-110149-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/574768 (https://phabricator.wikimedia.org/T227183) [15:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:03:32] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/21044/" [puppet] - 10https://gerrit.wikimedia.org/r/574766 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [15:03:48] (03CR) 10Filippo Giunchedi: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/574765" [puppet] - 10https://gerrit.wikimedia.org/r/574765 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [15:04:33] (03PS4) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 
(https://phabricator.wikimedia.org/T243934) [15:04:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] changeprop: New helmfiles for deployment (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [15:06:03] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q256K (T219123)]], take II (duration: 00m 55s) [15:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:10] (03PS16) 10Effie Mouzeli: mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [15:06:25] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:08:07] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q512K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574769 (https://phabricator.wikimedia.org/T219123) [15:10:53] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Marostegui) Thanks Chris for tackling this. Let's not spend more time on this host, it has a big history of failing idrac :( So let's just make sure it is available and if it fails in a few months... 
[15:11:20] jynus: all dbs look healthy, I'm moving forward now [15:11:50] (03PS2) 10Ladsgroup: Increase the reads for term store for clients for up to Q512K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574769 (https://phabricator.wikimedia.org/T219123) [15:12:01] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q512K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574769 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:12:49] +1 [15:13:00] :)) [15:13:00] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q512K [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574769 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:13:14] * addshore is watching but in a meeting [15:13:40] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10jcrespo) Will close this, then, once the host is fully back into production. [15:15:08] (03PS1) 10Vgutierrez: lvs: Increase BGP MED on lvs2010 to 102 [puppet] - 10https://gerrit.wikimedia.org/r/574771 (https://phabricator.wikimedia.org/T196560) [15:15:10] (03PS1) 10Vgutierrez: lvs: Set up lvs2009 as a low-traffic LVS [puppet] - 10https://gerrit.wikimedia.org/r/574772 (https://phabricator.wikimedia.org/T245984) [15:15:23] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q512K (T219123)]] (duration: 00m 56s) [15:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:30] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:16:33] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q512K (T219123)]], take II (duration: 00m 55s) [15:16:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:38] the cache hit rate for memcached is 99% :)) we should have it in APCu too [15:16:48] (03CR) 10Vgutierrez: [C: 03+2] lvs: Increase BGP MED on lvs2010 to 102 [puppet] - 10https://gerrit.wikimedia.org/r/574771 (https://phabricator.wikimedia.org/T196560) (owner: 10Vgutierrez) [15:17:53] (03PS4) 10Andrew Bogott: nova.conf: replace rabbit_hosts with transport_url [puppet] - 10https://gerrit.wikimedia.org/r/574593 [15:20:31] Moving to 1M [15:20:38] (03PS17) 10Effie Mouzeli: mediawiki: send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [15:21:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: New helmfiles for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [15:22:13] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q1Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574773 (https://phabricator.wikimedia.org/T219123) [15:22:26] Amir1: maybe leave it 5-10 mins and see if anything surfaces?
[15:22:37] (03PS5) 10Andrew Bogott: nova.conf: replace rabbit_hosts with transport_url [puppet] - 10https://gerrit.wikimedia.org/r/574593 [15:22:43] sure, I go get a coffee in the mean time [15:23:49] (03PS1) 10Andrew Bogott: nova.conf: remove obsolete dhcp_domain setting [puppet] - 10https://gerrit.wikimedia.org/r/574774 [15:24:04] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: replace rabbit_hosts with transport_url [puppet] - 10https://gerrit.wikimedia.org/r/574593 (owner: 10Andrew Bogott) [15:25:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/574766 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [15:25:54] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/574765 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [15:26:15] Amir1: for when you get back, seems good to continue, i just had my eye on 1 spiek [15:26:17] spiek [15:26:24] spike (dam new keyboard) [15:26:46] https://usercontent.irccloud-cdn.com/file/fknOsXI4/image.png [15:27:14] (03PS1) 10Muehlenhoff: Enable SAML 1.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574776 [15:28:33] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: remove obsolete dhcp_domain setting [puppet] - 10https://gerrit.wikimedia.org/r/574774 (owner: 10Andrew Bogott) [15:28:47] (03PS2) 10Vgutierrez: lvs: Set up lvs2009 as a low-traffic LVS [puppet] - 10https://gerrit.wikimedia.org/r/574772 (https://phabricator.wikimedia.org/T245984) [15:30:48] addshore: back now, I continue, I don't see the spike having an affect on the database reads so far [15:30:56] (03CR) 10Jbond: [C: 03+1] Enable SAML 1.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574776 (owner: 10Muehlenhoff) [15:30:56] indeed [15:30:57] :) [15:31:44] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/21047/" [puppet] - 10https://gerrit.wikimedia.org/r/574772 
(https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [15:32:16] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q1Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574773 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:32:42] addshore: what about the holes on Q10M-Q20M? [15:33:07] *checks* [15:33:21] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q1Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574773 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:33:29] My first pass over 10-20 mill is done, i was gonna generate a list 1 more time and re run it once more [15:33:31] {{doing now}} [15:33:53] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set up lvs2009 as a low-traffic LVS [puppet] - 10https://gerrit.wikimedia.org/r/574772 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [15:34:51] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q1Mio (T219123)]] (duration: 00m 56s) [15:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:58] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:36:06] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q1Mio (T219123)]], take II (duration: 00m 55s) [15:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:22] (03CR) 10Vgutierrez: [C: 04-2] "superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/574772" [puppet] - 10https://gerrit.wikimedia.org/r/570289 (owner: 10Muehlenhoff) [15:37:19] addshore: finally it's just migration and flipping the switch [15:37:21] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - 
https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` lvs2009.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/20... [15:37:24] :D [15:37:25] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Enable SAML 1.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574776 (owner: 10Muehlenhoff) [15:38:09] Amir1: rows read going up a bit [15:38:21] that's expected [15:38:44] processlist and such are still down: [15:38:45] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1104&var-port=9104 [15:40:30] there was a huge drop in semaphores at 14:40 [15:40:39] (03PS1) 10Andrew Bogott: neutron.conf: update rabbitmq settings [puppet] - 10https://gerrit.wikimedia.org/r/574778 [15:40:41] (03Abandoned) 10Muehlenhoff: Add lvs2009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/570289 (owner: 10Muehlenhoff) [15:40:55] unless that is due to fewer queries run, that means less locking [15:41:22] i believe that was the deployment to fix the issue we identified yesterday (lots of rows selected when not needed) [15:41:32] 14:35 ladsgroup@deploy1001: Synchronized php-1.35.0-wmf.20/extensions/Wikibase/lib: wbterms: only select entity terms that are requested (T246005) (duration: 01m 02s) [15:41:33] T246005: Wikibase client description API module results in 15k selected rows with new term storage - https://phabricator.wikimedia.org/T246005 [15:41:43] buffer pool hits reduced a bit, but nothing worrying and expected if read patterns changed [15:42:10] fewer latency spikes overall [15:42:54] it matches with our deployments [15:43:02] something weird happened, however at 13:37 [15:43:40] maybe it got depooled for some seconds due to lag or something [15:43:56] That's before our deployments [15:44:05] not sure about 13:37, on some server in particular?
[15:44:12] (just commenting everything I am seeing) [15:44:26] ack :) [15:45:23] do these deployments affect client wikis too, or only s8? [15:45:47] jynus: so it affects client wikis but they read from s8 [15:45:52] all client wikis, but only their connections to s8. it could affect pager render times etc, but only connection to s8 [15:46:01] yup [15:46:03] (03PS1) 10Elukey: role::analytics_cluster::launcher: add kerberos settings for hive [puppet] - 10https://gerrit.wikimedia.org/r/574780 (https://phabricator.wikimedia.org/T243934) [15:46:04] *page [15:46:21] I see some decrease in s1 connections overall at the time of deployment [15:46:24] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add kerberos settings for hive [puppet] - 10https://gerrit.wikimedia.org/r/574780 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [15:46:50] see: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All&from=1582624005267&to=1582645605267 [15:47:01] ^that's enwiki [15:47:04] jynus: the problem is that my deployment made a UBN issue take effect as well [15:47:19] the revert of the header thingy [15:47:21] ok [15:48:30] but it also can be because the whole transaction gets wrapped up faster [15:48:46] so something needing both s1 and s8 would finish it faster [15:49:05] that's an interesting achievement [15:50:04] so the bug was introduced during SWAT, if the number of connections went up during the time and recovered, it means it's the header bug but if it was higher before that and now improved, it's because of our work [15:51:26] I'm going to 2Mio now but then I stop for a bit as I have a quick meeting [15:51:29] Amir1: that sounds like you have an explanation for the rise in avg response time here?
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1582617600000&to=1582646400000&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&fullscreen&panelId=9 [15:51:32] ack :) [15:51:42] because that has been mystifying us over at the accept-language issue [15:52:25] yeah, I don't know about the accept language header problem much but if it went up and then the revert brought it back, that would explain it, right? [15:53:24] well in that case it should’ve gone up since ca. midnight UTC [15:53:32] and not only during EU SWAT [15:53:58] I might be wrong but I think it was deployed during EU SWAT [15:54:21] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q2Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574781 (https://phabricator.wikimedia.org/T219123) [15:54:22] no, we have bug reports for that from 8AM UTC [15:54:26] https://phabricator.wikimedia.org/T246071 [15:54:39] unless you’re talking about something else? [15:54:40] okay then, I don't know [15:54:46] ok [15:54:53] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q2Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574781 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:55:07] is that the scap sync cache issue? [15:55:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/574738 (owner: 10Jbond) [15:55:29] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Cmjohnson) @andrewbogott I may have chosen the wrong partman recipe, all 3 have started installing but failed. Please check a... [15:55:31] ie the fix was in swat? and amirs sync actually synced it everywhere? Or am i missreading the whole situation ? 
[15:55:44] the fix for accept-language was out of SWAT [15:55:50] amirs sync was only 20mins later [15:55:59] see also https://wikitech.wikimedia.org/wiki/Incident_documentation/20200225-mediawiki_interface_language [15:56:11] (03PS2) 10Andrew Bogott: neutron.conf: update rabbitmq settings [puppet] - 10https://gerrit.wikimedia.org/r/574778 [15:56:23] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:42] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q2Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574781 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [15:57:04] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10Cmjohnson) 05Open→03Resolved The reseat was completed but idrac f/w updated failed. resolving the task and will do flea power drains if or when idrac freezes again. 
[15:58:01] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q2Mio (T219123)]] (duration: 00m 56s) [15:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:07] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [15:58:18] (03CR) 10Andrew Bogott: [C: 03+2] neutron.conf: update rabbitmq settings [puppet] - 10https://gerrit.wikimedia.org/r/574778 (owner: 10Andrew Bogott) [15:58:24] (03CR) 10Jbond: [C: 03+2] idp: allow people.wikimedia.org to authenticate against apereo_cas [puppet] - 10https://gerrit.wikimedia.org/r/574709 (owner: 10Jbond) [15:58:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:47] (03CR) 10Ppchelko: Admin: Add changeprop namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:59:04] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q2Mio (T219123)]], take II (duration: 00m 55s) [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:09] (03PS18) 10Effie Mouzeli: mediawiki: stream apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [16:00:12] (03CR) 10Jbond: [C: 03+2] microsite::people: add cas configueration [puppet] - 10https://gerrit.wikimedia.org/r/574738 (owner: 10Jbond) [16:00:25] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (10MoritzMuehlenhoff) If the user already has shell access, no additional entry is needed. 
[16:00:27] (03CR) 10Hnowlan: Admin: Add changeprop namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:01:27] (03CR) 10Ppchelko: [C: 03+1] Admin: Add changeprop namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/574719 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [16:01:55] Amir1: spike in proc list with that last deploy? [16:02:15] and open connections for a bit too [16:02:15] !log jynus@cumin1001 dbctl commit (dc=all): 'repool es1019 with low load after maintenance T243963', diff saved to https://phabricator.wikimedia.org/P10516 and previous config saved to /var/cache/conftool/dbconfig/20200225-160215-jynus.json [16:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:22] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [16:02:32] Amir1: and "worst response times" [16:03:15] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10MoritzMuehlenhoff) Are these following the same setup as the main production wdqs servers? These are using "partman/standard.... [16:04:05] addshore: meeting atm [16:04:16] ack [16:04:29] open connections went back down, worst response time still elevated currently [16:04:54] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs2009.codfw.wmnet'] ` and were **ALL** successful. [16:05:48] 10Operations, 10ops-eqiad, 10Traffic: cp1088 - https://phabricator.wikimedia.org/T245645 (10Cmjohnson) 05Open→03Resolved @RobH cp1088 does not have a front LCD display but the front LED is working. 
For good measure, I did a racreset and cleared the syslog [16:06:12] jynus: just checking in, everything looking okay to you (except that elevated worst response time) ? [16:06:26] that went away, didn't it? [16:06:29] (03PS1) 10Jbond: people: update proxy uri [puppet] - 10https://gerrit.wikimedia.org/r/574782 [16:06:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] people: update proxy uri [puppet] - 10https://gerrit.wikimedia.org/r/574782 (owner: 10Jbond) [16:07:02] addshore: yes, look: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&fullscreen&panelId=9&from=1582474016124&to=1582646816124&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200 [16:07:15] 10Operations, 10ops-eqiad, 10Traffic: cp1088 - https://phabricator.wikimedia.org/T245645 (10RobH) Cool, it was an odd error I've never seen before, so if it doesn't happen twice it didn't happen! [16:07:58] (03PS1) 10Vgutierrez: lvs: Enable BGP in lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/574783 (https://phabricator.wikimedia.org/T245984) [16:08:03] !log add BGP to lvs2009 on cr1/2-codfw [16:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:12] (03CR) 10Muehlenhoff: admin: add CI checks to ensure users and group have the correct gid/uid (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:09:08] !log update puppet compiler facts [16:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:13] jynus: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1582636546211&to=1582646936619&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&fullscreen&panelId=11 is where i was looking [16:10:09] I see, let me double check the individual server latency [16:11:54] not worried at the time, only db1109 seems high [16:12:46] we are back to the ups and downs [16:13:33] ack1 
[16:14:20] (03CR) 10Ppchelko: "Looks ok, except the fact we're still not specifying where to connect to Redis. the secrets will get us the redis path, but the redis uri " [deployment-charts] - 10https://gerrit.wikimedia.org/r/574094 (https://phabricator.wikimedia.org/T213193) (owner: 10Holger Knust) [16:15:22] (03CR) 10Filippo Giunchedi: [C: 03+2] debmonitor: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574765 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [16:15:43] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetboard: log to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/574766 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [16:17:41] !log installing pillow security updates [16:17:41] !log restart debmonitor / puppetboard - T245512 [16:17:42] back now, let me check [16:17:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:51] T245512: Move service::uwsgi logs to logging pipeline - https://phabricator.wikimedia.org/T245512 [16:19:31] addshore: It's already recovering [16:20:10] :) [16:20:20] Amir1: going to try another bump today? 
[16:20:33] I'll be off soon [16:20:43] yup, my plan is to get it to be the same as repo [16:20:59] and then bump both together [16:21:02] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21051/" [puppet] - 10https://gerrit.wikimedia.org/r/574783 (https://phabricator.wikimedia.org/T245984) (owner: 10Vgutierrez) [16:21:05] this is exhausting [16:21:10] addshore: Don't worry [16:21:15] (03PS2) 10Vgutierrez: lvs: Enable BGP in lvs2009 [puppet] - 10https://gerrit.wikimedia.org/r/574783 (https://phabricator.wikimedia.org/T245984) [16:22:05] (03PS1) 10Cmjohnson: Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) [16:22:29] (03CR) 10jerkins-bot: [V: 04-1] Adding mgmt dns for mw185-1413 [dns] - 10https://gerrit.wikimedia.org/r/574785 (https://phabricator.wikimedia.org/T241849) (owner: 10Cmjohnson) [16:22:31] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-12) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10Cmjohnson) [16:22:46] (03CR) 10CRusnov: "> Patch Set 2: Code-Review-1" (038 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [16:24:10] (03CR) 10Alexandros Kosiaris: configmaster: Add DNS Discovery disrepancy check (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573963 (owner: 10Alexandros Kosiaris) [16:24:43] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): (Need by: 2020-03-02) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) @MoritzMuehlenhoff these will be used as cloudvirts -- they need one small OS volume and one big raid10 volume. I'll... 
[16:24:56] (03PS6) 10Alexandros Kosiaris: configmaster: Add DNS Discovery discrepancy check [puppet] - 10https://gerrit.wikimedia.org/r/573963 [16:25:06] !log enable BGP in lvs2009 - T196560 T245984 [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:14] T245984: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 [16:25:15] T196560: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [16:27:47] (03CR) 10Jbond: "Thanks updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:28:03] (03PS6) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [16:28:39] (03CR) 10jerkins-bot: [V: 04-1] admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:29:55] (03CR) 10Vgutierrez: [C: 04-1] "we should sync this timeout with its homonym in the applayer, increasing it on ats-be is just going to make it worse" [puppet] - 10https://gerrit.wikimedia.org/r/558984 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [16:30:21] (03PS5) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [16:30:24] (03PS1) 10Elukey: reportupdate::job: use kerberos when needed [puppet] - 10https://gerrit.wikimedia.org/r/574786 (https://phabricator.wikimedia.org/T243934) [16:30:26] (03CR) 10Herron: [C: 03+1] "I'm not yet familiar with how the CAS integration works behind the scenes, but very much +1 to trying this with Kibana on a separate vhost" [puppet] - 10https://gerrit.wikimedia.org/r/574499 (owner: 10Muehlenhoff) [16:30:35] let's go 2Mio [16:30:38] no 4 [16:30:42] we are already 
on 2 [16:30:48] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:31:10] (03PS7) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [16:31:19] (03PS1) 10Jhedden: toolschecker: update k8s node ready check for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/574787 [16:31:31] (03PS4) 10Jbond: admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 [16:31:45] (03PS8) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [16:32:29] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574788 (https://phabricator.wikimedia.org/T219123) [16:32:54] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574788 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [16:33:15] I have 11 tabs open, grafana, logstash, etc. 
[16:33:52] im still watching too ;) [16:34:02] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q4Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574788 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [16:34:34] (03CR) 10jerkins-bot: [V: 04-1] admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 (owner: 10Jbond) [16:34:54] addshore: we don't need annotations, the InnoDB I/O is so perfect https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1092&var-port=9104&fullscreen&panelId=22 [16:34:58] :D [16:35:00] (03CR) 10jerkins-bot: [V: 04-1] toolschecker: update k8s node ready check for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/574787 (owner: 10Jhedden) [16:35:12] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Vgutierrez) @BBlack @Papaul lvs2009 and lvs2010 are now online as secondary load balancers and ready to take over lvs2003 and lvs2006 respectively. This should be enough... 
[16:35:14] (03PS2) 10Elukey: reportupdate::job: use kerberos when needed [puppet] - 10https://gerrit.wikimedia.org/r/574786 (https://phabricator.wikimedia.org/T243934) [16:35:16] (03CR) 10jerkins-bot: [V: 04-1] admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:36:07] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade load balancers to buster - https://phabricator.wikimedia.org/T245984 (10Vgutierrez) [16:36:44] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q4Mio (T219123)]] (duration: 00m 56s) [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:51] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [16:37:18] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:37:27] (03PS2) 10Jhedden: toolschecker: update k8s node ready check for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/574787 [16:37:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! 
Filippo also intends to use this for swift going forward, BTW" [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [16:38:10] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q4Mio (T219123)]], take II (duration: 00m 56s) [16:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:25] (03PS6) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [16:38:58] (03PS5) 10Jbond: admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 [16:39:00] (03PS3) 10Elukey: reportupdate::job: use kerberos when needed [puppet] - 10https://gerrit.wikimedia.org/r/574786 (https://phabricator.wikimedia.org/T243934) [16:39:10] (03PS9) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [16:39:22] (03PS2) 10Phuedx: [prod] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [16:40:09] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [16:40:26] 10Operations, 10netops: PyBal BGP group prefix-limit 50 teardown - https://phabricator.wikimedia.org/T246110 (10Vgutierrez) Option number three looks good, but IMHO I'd decrease the teardown percentage. [16:41:41] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/574786 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:42:12] (03CR) 10Jhedden: [C: 03+2] toolschecker: update k8s node ready check for new cluster [puppet] - 10https://gerrit.wikimedia.org/r/574787 (owner: 10Jhedden) [16:42:32] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21054/an-launcher1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/574786 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:43:25] (03PS3) 10Muehlenhoff: Enable CAS endpoint for Kibana [puppet] - 10https://gerrit.wikimedia.org/r/574499 [16:44:17] jynus: worst response time and open connections sky rocketed [16:44:25] maybe I should stop doubling now [16:44:32] go at a slower pace [16:45:11] (03CR) 10Filippo Giunchedi: "AFAIK this has been tested in beta and works as intended, see inline too" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [16:45:48] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . 
[16:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:54] (03CR) 10Jcrespo: [C: 03+2] WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232 (owner: 10Jcrespo) [16:49:12] (03PS1) 10Andrew Bogott: nova api-paste.ini.erb: remove some extraneous settings [puppet] - 10https://gerrit.wikimedia.org/r/574792 [16:49:19] (03Abandoned) 10Filippo Giunchedi: service: logging pipeline support for uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/573937 (https://phabricator.wikimedia.org/T245512) (owner: 10Filippo Giunchedi) [16:50:13] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10fgiunchedi) [16:50:35] worst response time has recovered now [16:50:56] yeah, I am checking it with some internal tooling [16:51:02] (03PS1) 10Jforrester: Move wgULSLanguageDetection to the 'must not' section of CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574794 [16:51:21] from now until 10M I go with 2M, 4-> 6 and then 6-> 8 and then 8->10 [16:51:27] that would be enough for today [16:51:44] so remind me, this should be due to memcache warmup on new entities, right? [16:51:58] (03CR) 10CRusnov: [C: 03+1] "Looks good, as discussed." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/571780 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [16:52:09] or even db warmup too [16:52:11] nope, the memcached bit doesn't tie into this ramp up at all [16:52:16] probably just db warmup [16:52:18] ok, so only db warmup [16:52:18] I think it's db warmup [16:52:29] yeah, the table being so big [16:52:43] it is a lot of bytes that suddenly are cached that are no longer used [16:52:51] and the new ones are uncached [16:52:52] (03PS1) 10Elukey: role::statistics::private: remove rsync to /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/574795 (https://phabricator.wikimedia.org/T243934) [16:53:12] jynus: indeed, would you prefer us to go at 1million increments or 2 million? [16:53:16] (03PS1) 10Muehlenhoff: Switch restbase-dev* to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/574796 (https://phabricator.wikimedia.org/T156955) [16:53:20] or, no preference :P [16:53:32] I don't have any specific advice except doing it carefully :-D [16:54:22] (03CR) 10Jhedden: [C: 03+1] nova api-paste.ini.erb: remove some extraneous settings [puppet] - 10https://gerrit.wikimedia.org/r/574792 (owner: 10Andrew Bogott) [16:54:35] it's quite nice knowing that there are now million of rows in the wb_terms table not being read by anything :P [16:54:44] *millions [16:55:42] (03PS1) 10Ladsgroup: Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574798 (https://phabricator.wikimedia.org/T219123) [16:55:45] :-D [16:55:52] addshore: it feels awesome, one year of work is finally fruitful [16:55:53] little by little [16:56:30] (03CR) 10Ladsgroup: [C: 03+2] Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574798 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [16:56:56] (03CR) 10Elukey: mcrouter: add gutter pool servers in configuration (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [16:57:07] Amir1: indeed, but remember, we are not there yet :P 6 million out of 80 or whatever [16:57:16] (03CR) 10CRusnov: "In general this seems reasonable, is there a particular specific use case for this configuration in mind or is this future proofing?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/574152 (owner: 10Volans) [16:57:27] (03Merged) 10jenkins-bot: Increase the reads for term store for clients for up to Q6Mio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574798 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [16:58:09] (03CR) 10Elukey: [C: 03+2] role::statistics::private: remove rsync to /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/574795 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:58:53] (03CR) 10Nikerabbit: [C: 03+1] "Not sure whether this should be in the ULS block or here, both could be justified. Warning makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574794 (owner: 10Jforrester) [16:59:08] (03PS1) 10Ottomata: eventgate-analytics-external - Use proper TLS Kafka port [deployment-charts] - 10https://gerrit.wikimedia.org/r/574800 (https://phabricator.wikimedia.org/T233629) [16:59:35] addshore: I know but most of reads on the lower Qids [16:59:56] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [17:00:05] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T1700). [17:00:05] Urbanecm: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:14] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - Use proper TLS Kafka port [deployment-charts] - 10https://gerrit.wikimedia.org/r/574800 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [17:00:18] * Urbanecm here [17:00:57] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q6Mio (T219123)]] (duration: 00m 56s) [17:01:00] I'm almost done with a deployment but it should not affect puppet stuff, let me know if I should stop [17:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:04] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:01:10] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [17:01:10] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:38] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [17:01:38] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [17:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:48] (03CR) 10CRusnov: [C: 03+1] "Looks good!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:02:00] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Increase the reads for term store for clients for up to Q6Mio (T219123)]], take II (duration: 00m 55s) [17:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:13] (03CR) 10CRusnov: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/574155 (owner: 10Volans) [17:02:58] (03CR) 10CRusnov: [C: 03+1] "continues to look good, good change as per conversation." [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:04:48] godog: _joe_: is any of you taking the puppet SWAT? 🙂 [17:04:56] <_joe_> Urbanecm: why ^ *# as a regex? [17:05:30] <_joe_> any line either starting with a comment or spaces and a comment? [17:06:20] PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 4447 MB (3% inode=77%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [17:06:34] _joe_: good note. Should I remove the space? 
[17:07:04] <_joe_> I was just curious why not just the simpler ^# [17:07:24] (03PS1) 10Muehlenhoff: Switch to /srv/druid [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/574803 [17:07:27] not a blocker here [17:07:30] _joe_: I think I saw the space somewhere, but it might be an oversight [17:07:41] <_joe_> nah it's ok [17:07:41] but it would be nice to reuse the same logic for all of mw in the future [17:07:45] 10Operations, 10Puppet, 10Patch-For-Review, 10User-crusnov, 10User-jbond: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) [17:07:49] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" (038 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [17:08:08] we have now 2 php implementations, 1 python and one in bash :-D [17:08:18] to deal with dblists [17:08:23] <_joe_> yeah [17:08:24] addshore: https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?orgId=1&from=1582547775947&to=1582650459815 :D [17:08:33] <_joe_> but this is not introducing a new one [17:08:39] I know I know [17:08:47] and probably will never change again [17:09:02] (03PS5) 10Urbanecm: make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 [17:09:03] just it is a bit lol [17:09:11] _joe_: removed the space [17:09:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] make mwscriptwikiset respect comments set in dblists [puppet] - 10https://gerrit.wikimedia.org/r/574726 (owner: 10Urbanecm) [17:09:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [17:09:51] _joe_: thank you re: puppet swat [17:09:53] <_joe_> jynus: indeed [17:09:59] to be fair, I think original patch was closer to the canonical php implementation [17:10:21] which does trim() as first function on the line [17:10:53] but not worth wasting time on that, just I had to deal with this a few weeks ago [17:11:50] <_joe_> jynus: open a task :) [17:12:02] nope [17:12:04] :-P [17:12:13] <_joe_> Urbanecm: the patch is merged, it will be applied in a few minutes [17:12:26] thank you _joe_ [17:12:44] I will implement a k8s service that does dblist parsing as a rest api, _joe_ [17:12:56] <_joe_> jynus: dblistoid [17:13:02] :D [17:13:04] _joe_: you got it! [17:13:38] <_joe_> for db in $(curl https://dblistoid.discover.wmnet/list/s1); do ... [17:13:44] <_joe_> handy! [17:14:11] after all, I don't think anyone has the js implementation yet [17:14:40] <_joe_> oh that service clearly needs performance. Rust or GTFO [17:14:49] is there a reason why that script doesn’t use the PHP implementation btw? [17:14:52] I was going to say that :-) [17:15:09] as _joe_ said, Lucas_WMDE file a task [17:15:33] to get a question answered? [17:15:45] !log restart ats-tls on cp1075 - T244538 [17:15:49] no no, to ask the refactoring :-) [17:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:52] T244538: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 [17:15:59] well maybe there’s a reason not to refactor it :) [17:16:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10JHedden) [17:16:17] and anyways if I take one more look at that script I’ll start doing all *kinds* of refactorings to it, I should stay away from it [17:16:30] (e. g. 
killing itself instead of breaking, using more bash syntax instead of sh, etc etc) [17:17:22] (03PS1) 10Hnowlan: hierdata: Add stub values for changeprop [labs/private] - 10https://gerrit.wikimedia.org/r/574806 (https://phabricator.wikimedia.org/T213193) [17:19:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) [17:19:24] (03CR) 10Herron: [C: 03+1] "LGTM for trying in prod. Let's disable puppet on logstash::collector and deploy to a single collector as a canary, just in case. Happy t" [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [17:19:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) Please note that there is a [[ https://phabricator.wikimedia.org/maniphest/task/edit/form/55/ | hardware failure form here ]] that has a full checklist... [17:21:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) [17:21:34] !log jynus@cumin1001 dbctl commit (dc=all): 'increase es1019 load to 50% T243963', diff saved to https://phabricator.wikimedia.org/P10519 and previous config saved to /var/cache/conftool/dbconfig/20200225-172133-jynus.json [17:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:41] T243963: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 [17:21:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) This system is over 5 years old as of November 2019. Typically when a host is 5+ years old and fails, it is simply decommissioned. Has this host been... 
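[editor's note] The dblist exchange above (the `^ *#` versus `^#` regex question, and the remark that the PHP implementation calls trim() first on each line) can be illustrated with a minimal sketch. It assumes a dblist is plain text with one database name per line and `#` starting a comment line; `parse_dblist` is a hypothetical name for illustration only, not the mwscriptwikiset script or the MediaWiki implementation.

```python
def parse_dblist(text):
    """Return database names from dblist-style text, skipping comment lines.

    Hypothetical helper: assumes one name per line and '#' comment lines.
    """
    names = []
    for raw_line in text.splitlines():
        # Trim first, so an indented comment ("   # note") is treated the
        # same as one starting in column 0, and blank lines become empty.
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


sample = "# s1 wikis\nenwiki\n   # indented comment\n\ndewiki\n"
print(parse_dblist(sample))  # prints ['enwiki', 'dewiki']
```

Trimming before the comment check makes the leading-space question moot: a pattern anchored at `^#` alone would miss the indented comment in the sample, while trim-then-check catches both forms.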
[17:22:14] We should rename WMDE termbox to termboxoid I think [17:23:56] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) This should be watched for during the upgrade process. [17:24:00] 10Operations, 10Patch-For-Review, 10User-Urbanecm: Special:WantedTemplates cronjob not running on all wikis - https://phabricator.wikimedia.org/T246063 (10Urbanecm) 05Open→03Resolved a:03Urbanecm The issue was fixed by changing the mwscriptwikiset script to respect the comments. [17:24:24] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls performance issues under production load - https://phabricator.wikimedia.org/T244538 (10Vgutierrez) it looks like ats-tls performance gets degraded over time, cp1075 was showing similar values to the ones exhibited before disabling DNS on ats-tls, a servic... [17:29:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! thank you for working on this" [debs/druid] (debian) - 10https://gerrit.wikimedia.org/r/574803 (owner: 10Muehlenhoff) [17:30:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, cc'ing Hugh" [puppet] - 10https://gerrit.wikimedia.org/r/574796 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [17:36:11] (03PS5) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [17:36:40] (03PS1) 10Jbond: templates: update so that CSS and JS files come from CF CDN [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574808 (https://phabricator.wikimedia.org/T246010) [17:36:42] (03PS1) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574809 [17:37:18] (03CR) 10Volans: [C: 03+2] ganeti: use canonical cluster names [software/spicerack] - 10https://gerrit.wikimedia.org/r/571780 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:37:44] (03CR) 
10Volans: [C: 03+2] ganeti: add logging for GntInstance actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:38:35] (03CR) 10Volans: [C: 03+2] ganeti: add VM creation capability [software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:39:02] (03CR) 10Volans: "> Patch Set 4:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/574152 (owner: 10Volans) [17:40:42] (03PS1) 10Hnowlan: changeprop: add hierdata k8s entries and LVS entry [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) [17:40:46] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10Papaul) @Vgutierrez thanks for the update I will wait on @BBlack when he is done decommissioning lvs2003 and lvs2006 to proceed [17:40:58] I go eat something, will be back later to go higher, InnoDB/MySQL will have time to recover [17:42:10] (03Merged) 10jenkins-bot: ganeti: use canonical cluster names [software/spicerack] - 10https://gerrit.wikimedia.org/r/571780 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:42:48] (03Merged) 10jenkins-bot: ganeti: add logging for GntInstance actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:44:07] (03Merged) 10jenkins-bot: ganeti: add VM creation capability [software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [17:44:32] 10Operations, 10Cloud-VPS, 10observability, 10cloud-services-team (Kanban): Have a paging check for Nova API accessible - https://phabricator.wikimedia.org/T133656 (10Bstorm) 05Open→03Declined Nova fullstack is good enough for the team at this time. 
[17:46:03] (03CR) 10Ppchelko: "Question: nothing really contacts change-prop via HTTP, except maybe service-checker that does a simple health check. Do we even want to e" [puppet] - 10https://gerrit.wikimedia.org/r/574811 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [17:46:26] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:48:50] (03CR) 10CRusnov: [C: 03+1] "Looks good!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/574152 (owner: 10Volans) [17:48:51] it's recovering, I don't know what's causing it tbh [17:49:00] it might be our term store stuff, might not be [17:49:23] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10WDoranWMF) Moving this to feature requests for PMs to review, we'll need to investigate what appropriate limits woul... [17:49:31] (03PS2) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574809 (https://phabricator.wikimedia.org/T233939) [17:50:02] (03CR) 10Volans: [C: 03+2] spicerack: add support for HTTP proxy [software/spicerack] - 10https://gerrit.wikimedia.org/r/574152 (owner: 10Volans) [17:50:13] I'll be afk but if things go really bad, revert my mediawiki config patches [17:50:44] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:51:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: es1019: reseat IPMI - https://phabricator.wikimedia.org/T243963 (10jcrespo) es1019 is just pending the last config push back to normal traffic weights (and reducing the master's). [17:54:56] (03Merged) 10jenkins-bot: spicerack: add support for HTTP proxy [software/spicerack] - 10https://gerrit.wikimedia.org/r/574152 (owner: 10Volans) [17:55:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:56:03] nope, I'm reverting [17:58:36] (03PS1) 10Ladsgroup: Revert reading from the new term store back to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574812 [17:59:03] (03CR) 10Ladsgroup: [C: 03+2] Revert reading from the new term store back to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574812 (owner: 10Ladsgroup) [17:59:18] (03PS3) 10Jbond: style: remove branding [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/574809 (https://phabricator.wikimedia.org/T233939) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T1800). 
[18:00:06] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, 10Wikimedia-Incident: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10Krinkle) [18:00:09] (03Merged) 10jenkins-bot: Revert reading from the new term store back to Q2M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574812 (owner: 10Ladsgroup) [18:00:16] !log jynus@cumin1001 dbctl commit (dc=all): 'increase s8 special replica weight', diff saved to https://phabricator.wikimedia.org/P10520 and previous config saved to /var/cache/conftool/dbconfig/20200225-180016-jynus.json [18:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:01:48] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Decrease the reads for term store for clients back to Q2Mio (T219123)]] (duration: 00m 56s) [18:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:54] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [18:02:10] PROBLEM - PHP opcache health on scandium is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:02:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:03:02] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574454|Decrease the reads for term store for clients back to Q2Mio (T219123)]], take II (duration: 00m 56s) [18:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:46] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:04:05] If it happens again, then it's not the term store [18:05:28] it is, when it overloads, the mysql query killer kicks in [18:05:51] confirmed on my stats- which ends up being an exception or an error [18:06:45] I go eat, will do the rest (if possible) tomorrow [18:08:28] (03CR) 10Andrew Bogott: [C: 03+2] nova api-paste.ini.erb: remove some extraneous settings [puppet] - 10https://gerrit.wikimedia.org/r/574792 (owner: 10Andrew Bogott) [18:13:33] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw:fundraising single-cpu misc servers - https://phabricator.wikimedia.org/T244950 (10Papaul) [18:15:45] (03CR) 10Jdlrobson: "Should we enable this preference in beta cluster as well to aid testing in a shared environment?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [18:23:06] RECOVERY - Check systemd state on dbmonitor2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:36] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [18:23:39] (03PS7) 10Jbond: admin: add support for system users and groups [puppet] - 10https://gerrit.wikimedia.org/r/573990 (https://phabricator.wikimedia.org/T235162) [18:24:06] (03PS6) 10Jbond: admin: add rerprepo system user [puppet] - 10https://gerrit.wikimedia.org/r/573991 [18:24:28] (03PS10) 10Jbond: admin: add CI checks to ensure users and group have the correct gid/uid [puppet] - 10https://gerrit.wikimedia.org/r/574412 (https://phabricator.wikimedia.org/T235162) [18:39:01] (03PS6) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [18:39:06] (03PS1) 10Jhedden: icinga: add contact group to base host prometheus checks [puppet] - 10https://gerrit.wikimedia.org/r/574822 (https://phabricator.wikimedia.org/T246130) [18:46:08] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/21058/" [puppet] - 10https://gerrit.wikimedia.org/r/574822 (https://phabricator.wikimedia.org/T246130) (owner: 10Jhedden) [18:54:42] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/21060/" [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [18:58:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 
(10wiki_willy) a:03Jclark-ctr [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T1900) [19:04:39] (03PS7) 10Elukey: Move all Report Updater Jobs to an-launcher1001 [puppet] - 10https://gerrit.wikimedia.org/r/574722 (https://phabricator.wikimedia.org/T243934) [19:04:41] (03PS1) 10Elukey: Add an-launcher1001 to the list of statistics servers [puppet] - 10https://gerrit.wikimedia.org/r/574843 (https://phabricator.wikimedia.org/T243934) [19:05:35] (03PS1) 10Andrew Bogott: Openstack control nodes: add a local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/574845 [19:05:37] (03PS1) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [19:05:40] (03PS1) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 [19:07:22] (03PS6) 10Volans: Add cookbook to control CF BGP advertisements [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [19:09:40] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:14:04] (03PS1) 10Ottomata: eventgate-analytics-external - Use api.svc to get stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/574848 (https://phabricator.wikimedia.org/T233629) [19:14:11] (03PS2) 10Andrew Bogott: Openstack control nodes: add a local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/574845 [19:14:13] (03PS2) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [19:14:15] (03PS2) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 
10https://gerrit.wikimedia.org/r/574847 [19:14:37] (03CR) 10Phuedx: [C: 03+1] [prod] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [19:16:19] (03PS2) 10Ottomata: eventgate-analytics-external - Use api.svc to get stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/574848 (https://phabricator.wikimedia.org/T233629) [19:18:18] (03PS3) 10CRusnov: Add support for getting Device status breakdowns [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) [19:18:32] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - Use api.svc to get stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/574848 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [19:19:05] (03PS3) 10Andrew Bogott: Openstack control nodes: add a local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/574845 [19:19:07] (03PS3) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [19:19:09] (03PS3) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 [19:20:27] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:20:27] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:20:28] (03PS4) 10CRusnov: Add support for getting Device status breakdowns [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) [19:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] !log 1.35.0-wmf.21 was branched at ed65726f0dcaf2b163ba44426d5e780bc7f8895d for T233869 [19:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:42] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 [19:24:21] (03PS1) 10Ottomata: eventgate-analytics-external - use http://api.svc to get stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/574850 (https://phabricator.wikimedia.org/T233629) [19:25:07] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - use http://api.svc to get stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/574850 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [19:26:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:26:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:46] (03PS3) 10Niedzielski: [prod] [beta] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) [19:30:36] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:02] !log otto@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:49] (03CR) 10Niedzielski: "> Should we enable this preference in beta cluster as well to aid testing in a shared environment?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [19:34:00] (03CR) 10Jhedden: nova: use memcache for keystone_authtoken cache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574847 (owner: 10Andrew Bogott) [19:36:21] (03CR) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/574847 (owner: 10Andrew Bogott) [19:36:45] (03PS3) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (https://phabricator.wikimedia.org/T243362) [19:37:06] (03CR) 10jerkins-bot: [V: 04-1] gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:38:45] (03PS4) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (https://phabricator.wikimedia.org/T243362) [19:39:32] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:40] (03PS4) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [19:40:42] (03PS4) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 [19:42:35] (03PS1) 10Ottomata: eventgate-analytics-external - remove extra type ? in stream_config_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/574851 (https://phabricator.wikimedia.org/T233629) [19:42:59] (03CR) 10Jdlrobson: [C: 03+1] [prod] [beta] [Vector] Set skin version defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572039 (https://phabricator.wikimedia.org/T242381) (owner: 10Niedzielski) [19:43:13] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - remove extra type ? in stream_config_uri [deployment-charts] - 10https://gerrit.wikimedia.org/r/574851 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [19:43:40] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1002/21065/cloudcontrol1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/574847 (owner: 10Andrew Bogott) [19:43:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10Bstorm) This host is on a list for replacement with a thin cloudvirt via {T243471} [19:44:24] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10Bstorm) It was not proactively replaced previously. Cloudvirts have generally been replaced in refresh. 
[19:45:45] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10Bstorm) Actually, strictly speaking, that's only four of the slated refreshes in Q4. This is one of 9 hosts that are refreshing then. [19:47:04] (03PS5) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [19:47:06] (03PS5) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 [19:47:44] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:47:44] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:55] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10thcipriani) 05Open→03Resolved a:03thcipriani Moved all the lfs files to a symlinked path under new disk on `/srv/lfs` (thanks @Dzahn): ` thcipriani@gerrit1002:~$ df -h Filesystem Size Used Avail... 
[19:50:58] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) [19:51:15] (03CR) 10Jhedden: [C: 03+1] Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 (owner: 10Andrew Bogott) [19:51:33] (03PS5) 10CRusnov: gen-zones.py: Add variable insertion [dns] - 10https://gerrit.wikimedia.org/r/568683 (https://phabricator.wikimedia.org/T243362) [19:51:35] (03PS3) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [19:51:40] (03CR) 10Jhedden: [C: 03+1] nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 (owner: 10Andrew Bogott) [19:51:59] (03CR) 10jerkins-bot: [V: 04-1] tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:52:50] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:11] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . 
[19:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:24] (03PS4) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) [19:54:41] (03CR) 10CRusnov: tox: Support DNS_INCLUDE_DIR and generated DNS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/569340 (https://phabricator.wikimedia.org/T243362) (owner: 10CRusnov) [19:54:45] 10Operations, 10netops, 10Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (10Papaul) Next step is to create a single profile named "wmf-mgmt-storm", configure a storm control bandwidth of 15,000Kbps and add all the interfaces except the interface connected t... [19:55:56] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:10] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [19:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:16] jynus: for when you're back, it would be great if I can get a sample of queries killed by the query killer [19:59:24] I have a hunch what's causing it [20:00:04] longma and twentyafterfour: That opportune time is upon us again. Time for a Mediawiki train - American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T2000). 
[20:01:10] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [20:01:38] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.35.0-wmf.19 (duration: 14m 35s) [20:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:33] (03CR) 10Ayounsi: "Not tested, some comments inline. The name is the larger blocker for me." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/572262 (owner: 10Ayounsi) [20:03:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:03:08] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574853 [20:03:10] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574853 (owner: 10Jeena Huneidi) [20:04:18] (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574853 (owner: 10Jeena Huneidi) [20:04:25] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.21 refs T233869 [20:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:32] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 [20:05:55] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [20:06:14] (03CR) 10Dzahn: [C: 03+2] apt: remove (duplicate) OCSP stapling config and RSA cert [puppet] - 10https://gerrit.wikimedia.org/r/574597 (https://phabricator.wikimedia.org/T224576) (owner: 
10Dzahn) [20:06:34] (03CR) 10Effie Mouzeli: mcrouter: add gutter pool servers in configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/569541 (https://phabricator.wikimedia.org/T213089) (owner: 10Effie Mouzeli) [20:07:18] (03PS19) 10Effie Mouzeli: mediawiki: stream apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) [20:08:49] 10Operations, 10DC-Ops, 10cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10Bstorm) The upgrade process I speak of is {T224582} [20:12:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) Ok, summary: * cloudvirt1008 is over 5 years old ** its replacement is on pending task T243471, which was originally slated for Q4 this year, but is b... [20:14:32] RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops [20:19:00] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (10jijiki) @aaron @Krinkle Please let us know if the configuration and the information on this task are enough to proceed with testing... [20:21:52] (03CR) 10Volans: [C: 03+1] "LGTM, we might find a lot of additional things useful to track, but better to start simple, then it's easy to add on top of it." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [20:24:30] !log apt.wikimedia.org (current install* and new apt* roles) - going ECDSA-only and removing RSA certificate from nginx config - to support buster without having to maintain patched nginx for duplicate ssl_stapling_file directive - at the cost of slightly reduced back-compat on the public repo (T224576) [20:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:45] T224576: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 [20:25:29] !log apt.wikimedia.org (current install* and new apt* roles) - going ECDSA-only and removing RSA certificate from nginx config - to support buster without having to maintain patched nginx for duplicate ssl_stapling_file directive - at the cost of slightly reduced back-compat on the public repo (T242602) [20:25:33] !log changing email address for ClioCJS [20:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:35] T242602: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 [20:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) [20:28:16] !log reset password for ClioCJS [20:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:06] (03CR) 10Herron: [C: 03+2] mediawiki: stream apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/571239 (https://phabricator.wikimedia.org/T244472) (owner: 10Effie Mouzeli) [20:54:35] (03CR) 10Dzahn: [C: 03+2] "looks good. 
the base class already has this parameter, the class used as well and per compiler output" [puppet] - 10https://gerrit.wikimedia.org/r/574822 (https://phabricator.wikimedia.org/T246130) (owner: 10Jhedden) [20:54:53] Still running the train: sync now generates php l10n files so with the extra work it takes longer than I was expecting [20:55:02] jouncebot: now [20:55:02] For the next 0 hour(s) and 4 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200225T2000) [20:55:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) [20:56:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (10RobH) [21:02:24] longma: you should be fine on time for the next bit -- nothing on the calendar for the next couple of hours [21:02:46] thanks thcipriani [21:03:23] longma: I think it's because we now build l10n cache for cdb and static arrays, [21:03:28] this will be fixed soon [21:03:52] (03PS1) 10Herron: add kibana-next service records [dns] - 10https://gerrit.wikimedia.org/r/574861 [21:04:03] (03PS1) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [21:05:11] (03CR) 10jerkins-bot: [V: 04-1] add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:05:28] Amir1: 👍 I'll also make sure to start it earlier next time [21:06:54] Amir1: yep, that's what I'm seeing as well. Extra 1.5G worth of stuff, plus gen time. Mostly the syncing seems to be taking time (which makes sense). 
[21:08:13] (03PS2) 10Herron: add load balancing for kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/574862 (https://phabricator.wikimedia.org/T234854) [21:08:30] (03CR) 10Andrew Bogott: [C: 03+2] Openstack control nodes: add a local memcached instance [puppet] - 10https://gerrit.wikimedia.org/r/574845 (owner: 10Andrew Bogott) [21:10:36] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10chasemp) 05Stalled→03Resolved `root@fermium:~# rmlist mediawiki-security Not removing archives. Reinvoke with -a to remove them.... [21:11:17] (03CR) 10CRusnov: [C: 03+2] Add support for getting Device status breakdowns [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/574600 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [21:19:47] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.21 refs T233869 (duration: 75m 21s) [21:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:54] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 [21:26:54] (03PS1) 10Jeena Huneidi: group0 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574866 [21:26:58] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574866 (owner: 10Jeena Huneidi) [21:28:07] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.21 refs T233869 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/574866 (owner: 10Jeena Huneidi) [21:29:26] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.21 refs T233869 [21:29:27] ~. 
[21:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:32] T233869: 1.35.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T233869 [21:29:35] (03PS1) 10SBassett: Deployment group audit [puppet] - 10https://gerrit.wikimedia.org/r/574869 (https://phabricator.wikimedia.org/T237696) [21:31:01] (03PS6) 10Andrew Bogott: Keystone: cache with a memached pool running on each controller [puppet] - 10https://gerrit.wikimedia.org/r/574846 [21:31:03] (03PS6) 10Andrew Bogott: nova: use memcache for keystone_authtoken cache [puppet] - 10https://gerrit.wikimedia.org/r/574847 [21:31:05] (03PS1) 10Andrew Bogott: openstack controller memcached: specify -o slab_reassign [puppet] - 10https://gerrit.wikimedia.org/r/574871 [21:33:24] (03CR) 10Andrew Bogott: [C: 03+2] openstack controller memcached: specify -o slab_reassign [puppet] - 10https://gerrit.wikimedia.org/r/574871 (owner: 10Andrew Bogott) [21:34:59] Okay, I'm declaring the train done for today [21:36:51] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10chasemp) [21:36:53] thanks [21:39:43] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: 2020-02-28) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (10wiki_willy) [21:43:51] 10Operations, 10ops-eqiad, 10DC-Ops: Replace broken BBU on db1084 (HP host) - https://phabricator.wikimedia.org/T245647 (10Jclark-ctr) @Marostegui Received bbu please message me on irc and schedule replacement [21:44:21] 10Operations, 10observability: Have monitoring of updatequerypages cronjobs - https://phabricator.wikimedia.org/T246097 (10Dzahn) [21:45:29] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Dzahn) [21:45:32] 10Operations, 
cloud-services-team: Ferm rules for cloudbackup2001/2001 - https://phabricator.wikimedia.org/T245808 (Dzahn)
[21:49:19] Operations, ops-eqiad, DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (Jclark-ctr) @Marostegui Received replacement bbu. please message me on irc to schedule replacement
[21:50:48] Operations, cloud-services-team (Kanban): Ferm rules for cloudbackup2001/2001 - https://phabricator.wikimedia.org/T245808 (Bstorm)
[21:54:07] (CR) Andrew Bogott: [C: +2] Keystone: cache with a memached pool running on each controller [puppet] - https://gerrit.wikimedia.org/r/574846 (owner: Andrew Bogott)
[21:55:16] Operations, SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053 (Dzahn) Hi all, please see https://wikitech.wikimedia.org/wiki/Production_access#Add_WMF_Staff_to_an_access_group for the next steps. Please read and sign L3...
[21:56:12] I plan to do an additional deploy to fix an issue with the special:version page if no one is deploying right now
[21:59:46] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (Dzahn)
[21:59:56] Operations, ops-eqiad, DBA: db1095 backup source crashed: broken BBU - https://phabricator.wikimedia.org/T244958 (Jclark-ctr) Replaced BBU @jcrespo @Marostegui
[22:02:25] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (Dzahn) Hi all, those of you who have not signed it yet, please read and sign L3. @Sbailey Please create a new SSH keypair a...
[22:03:30] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (Dzahn) @ssastry @Jdforrester-WMF Should sbailey be added to the parsoid-admins group as well so that all people who can deploy...
[22:04:17] (CR) Andrew Bogott: [C: +2] nova: use memcache for keystone_authtoken cache [puppet] - https://gerrit.wikimedia.org/r/574847 (owner: Andrew Bogott)
[22:04:22] Operations, hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (Jclark-ctr)
[22:04:53] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1013: server down for no reason (power issue?) - https://phabricator.wikimedia.org/T241313 (Jclark-ctr)
[22:13:11] (PS2) Ottomata: eventstreams - bump cpu limit to 2000m for benchmarking [deployment-charts] - https://gerrit.wikimedia.org/r/574567 (https://phabricator.wikimedia.org/T238658)
[22:14:42] (CR) Ottomata: [C: +2] eventstreams - bump cpu limit to 2000m for benchmarking [deployment-charts] - https://gerrit.wikimedia.org/r/574567 (https://phabricator.wikimedia.org/T238658) (owner: Ottomata)
[22:15:32] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'production' .
[22:15:32] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
[22:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:24] !log scandium restarting php7.2-fpm
[22:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:45] !log jhuneidi@deploy1001 Synchronized php-1.35.0-wmf.21/includes/Defines.php: Update MW_VERSION to 1.35.0-wmf.21 (duration: 01m 04s)
[22:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:18] Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): cloudvirt1008: SMART disk alert - https://phabricator.wikimedia.org/T245815 (RobH) Please note a replacement disk is being ordered, and the task description has been updated with next steps once it arrives at eqiad.
[22:19:02] RECOVERY - PHP opcache health on scandium is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:22:11] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul)
[22:24:44] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:33:06] PROBLEM - nova-compute proc minimum on cloudvirt1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:35:16] RECOVERY - nova-compute proc minimum on cloudvirt1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:37:20] (Abandoned) Ladsgroup: Increase the read for clients on the new term store up to Q100K [mediawiki-config] - https://gerrit.wikimedia.org/r/572853 (https://phabricator.wikimedia.org/T225057) (owner: Ladsgroup)
[22:40:39] Operations, Analytics, Analytics-Kanban, LDAP-Access-Requests: Add Fsalutari to nda LDAP group - https://phabricator.wikimedia.org/T245997 (Dzahn) Open→Resolved This seems done. Confirmed user is already in data.yaml and he @Fsalutari confirmed he can login. Resolving.
[22:41:57] (PS1) Jbond: templates: add initial template file so we have the git history [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574883
[22:41:59] (PS1) Jbond: templates: update so that CSS and JS files come from CF CDN [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574884 (https://phabricator.wikimedia.org/T246010)
[22:42:06] (PS1) Bstorm: sonofgridengine: accomodate the new domain name [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572)
[22:42:44] (CR) BryanDavis: [C: +1] "LGTM. The PCC diff against tools-elastic-03 makes sense for the changes in profile::elasticsearch::toolforge."
[puppet] - https://gerrit.wikimedia.org/r/574527 (https://phabricator.wikimedia.org/T236606) (owner: Jhedden)
[22:43:39] (Abandoned) Jbond: templates: update so that CSS and JS files come from CF CDN [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574808 (https://phabricator.wikimedia.org/T246010) (owner: Jbond)
[22:45:23] (CR) jerkins-bot: [V: -1] sonofgridengine: accomodate the new domain name [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: Bstorm)
[22:46:10] (PS2) Bstorm: sonofgridengine: accomodate the new domain name [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572)
[22:49:34] (CR) jerkins-bot: [V: -1] sonofgridengine: accomodate the new domain name [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: Bstorm)
[22:54:25] (PS1) Andrew Bogott: nova.conf: use memcached_servers instead of memcache_servers [puppet] - https://gerrit.wikimedia.org/r/574886
[22:54:27] (PS1) Andrew Bogott: neutron.conf: remove/update some deprecated settings [puppet] - https://gerrit.wikimedia.org/r/574887
[22:55:34] (CR) Andrew Bogott: [C: +2] nova.conf: use memcached_servers instead of memcache_servers [puppet] - https://gerrit.wikimedia.org/r/574886 (owner: Andrew Bogott)
[22:56:00] (PS1) Jbond: templates: add initial templates to provide git history [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574888 (https://phabricator.wikimedia.org/T233939)
[22:56:02] (PS1) Jbond: style: remove branding [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574889 (https://phabricator.wikimedia.org/T233939)
[22:56:18] (Abandoned) Jbond: style: remove branding [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/574809 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[22:56:22] (PS3) Bstorm:
sonofgridengine: accomodate the new domain name [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572)
[22:56:30] (CR) Andrew Bogott: [C: +2] neutron.conf: remove/update some deprecated settings [puppet] - https://gerrit.wikimedia.org/r/574887 (owner: Andrew Bogott)
[23:01:31] (CR) Bstorm: [C: -1] "-1-ing this until I test it in toolsbeta. There's always a chance that duplicate records will cause problems somewhere." [puppet] - https://gerrit.wikimedia.org/r/574885 (https://phabricator.wikimedia.org/T245572) (owner: Bstorm)
[23:14:18] (CR) Dzahn: [C: +2] site: add mw2366-mw2376 with spare role [puppet] - https://gerrit.wikimedia.org/r/574124 (https://phabricator.wikimedia.org/T241852) (owner: Dzahn)
[23:14:32] (PS3) Dzahn: site: add mw2366-mw2376 with spare role [puppet] - https://gerrit.wikimedia.org/r/574124 (https://phabricator.wikimedia.org/T241852)
[23:15:48] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1
[23:17:50] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert.
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[23:22:15] (PS7) Volans: Add cookbook to control CF BGP advertisements [cookbooks] - https://gerrit.wikimedia.org/r/572262 (owner: Ayounsi)
[23:22:25] (CR) Volans: "replies inline" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/572262 (owner: Ayounsi)
[23:26:56] PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[23:27:18] PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[23:27:36] PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:27:48] PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[23:28:06] PROBLEM - Check size of conntrack table on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:28:08] !log adding mw2366 through mw2376 to site
[23:28:08] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[23:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:16] PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:28] PROBLEM - DPKG
on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[23:28:58] Amir1: ^ your java job used the remaining disk.. i guess ^
[23:29:17] well, or another user
[23:30:35] !log notebook1004 - disk full once again (T232068)
[23:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:42] T232068: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068
[23:32:01] (CR) Ayounsi: [C: +1] Add cookbook to control CF BGP advertisements [cookbooks] - https://gerrit.wikimedia.org/r/572262 (owner: Ayounsi)
[23:33:52] (CR) Jbond: Add cookbook to control CF BGP advertisements (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/572262 (owner: Ayounsi)
[23:35:32] mutante: let me delete those
[23:36:06] I have a spark job tho
[23:36:28] Amir1: /srv has 5G left, / is full
[23:36:43] not sure if you can tell it where to write to
[23:38:03] also there is a TON of space in /mnt/nfs/...
[23:38:11] mounts labstores
[23:38:21] !log pause mediawiki writes to cloudelastic to let old gc on cloudelastic1001-chi recover
[23:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:08] RECOVERY - DPKG on notebook1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[23:39:34] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[23:39:44] RECOVERY - dhclient process on notebook1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[23:39:47] Amir1: ^ looks good.
thx
[23:40:24] RECOVERY - MD RAID on notebook1004 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:40:36] RECOVERY - configured eth on notebook1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[23:40:52] RECOVERY - Check size of conntrack table on notebook1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:40:54] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[23:41:02] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:51:35] !log cr2-esams> request chassis fpc slot 0 offline - T246009
[23:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:50] (PS1) Dzahn: add codfw appservers in racks A3 and A6 as spares [puppet] - https://gerrit.wikimedia.org/r/574895 (https://phabricator.wikimedia.org/T241852)
[23:54:34] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (ssastry) >>! In T245877#5917965, @Dzahn wrote: > > those of you who have not signed it yet, please read and sign L3. Looks l...
[23:58:03] Operations, Parsoid-PHP, SRE-Access-Requests, serviceops: Give all members of the Parsing team production `deployment` access - https://phabricator.wikimedia.org/T245877 (Dzahn) @ssastry Yea, though it's not tied to the hostname. It's "the puppet role parsoid::testing installs the admin groups...